1
00:00:00,600 --> 00:00:01,800
Instructor: Hey again.
2
00:00:01,800 --> 00:00:04,230
We finished the last lecture with this graph.
3
00:00:04,230 --> 00:00:07,230
It shows 1000 points and their centroid.
4
00:00:07,230 --> 00:00:08,580
In cluster analysis,
5
00:00:08,580 --> 00:00:11,820
that's how a cluster would look in two-dimensional space.
6
00:00:11,820 --> 00:00:14,400
There are two dimensions or two features
7
00:00:14,400 --> 00:00:17,220
based on which we are performing clustering.
8
00:00:17,220 --> 00:00:19,800
For instance, the age and money spent
9
00:00:19,800 --> 00:00:21,333
from our earlier example.
10
00:00:22,200 --> 00:00:25,680
It certainly makes no sense to have only one cluster,
11
00:00:25,680 --> 00:00:28,375
so let me zoom out of this graph.
12
00:00:28,375 --> 00:00:30,840
Here's a nice picture of clusters.
13
00:00:30,840 --> 00:00:33,420
We can clearly see two clusters.
14
00:00:33,420 --> 00:00:35,253
I'll also indicate their centroids.
15
00:00:36,180 --> 00:00:38,850
If we want to identify three clusters,
16
00:00:38,850 --> 00:00:41,250
this is the result we obtain,
17
00:00:41,250 --> 00:00:44,133
and that's more or less how clustering works graphically.
18
00:00:45,000 --> 00:00:49,429
Okay, how do we perform clustering in practice?
19
00:00:49,429 --> 00:00:51,510
There are different methods we can apply
20
00:00:51,510 --> 00:00:53,250
to identify clusters.
21
00:00:53,250 --> 00:00:55,500
The most popular one is K-means,
22
00:00:55,500 --> 00:00:57,630
so that's where we will start.
23
00:00:57,630 --> 00:01:00,300
Let's simplify this scatter to 15 points,
24
00:01:00,300 --> 00:01:03,390
so we can get a better grasp of what happens.
25
00:01:03,390 --> 00:01:04,620
Cool.
26
00:01:04,620 --> 00:01:07,031
Here's how K-means works.
27
00:01:07,031 --> 00:01:11,700
First, we must choose how many clusters we'd like to have.
28
00:01:11,700 --> 00:01:14,490
That's where this method gets its name from.
29
00:01:14,490 --> 00:01:16,440
K stands for the number of clusters
30
00:01:16,440 --> 00:01:17,883
we are trying to identify.
31
00:01:18,750 --> 00:01:20,583
I'll start with two clusters.
32
00:01:21,600 --> 00:01:24,630
The next step is to specify the cluster seeds.
33
00:01:24,630 --> 00:01:27,720
A seed is basically a starting centroid.
34
00:01:27,720 --> 00:01:31,440
It is chosen at random or is specified by the data scientist
35
00:01:31,440 --> 00:01:33,513
based on prior knowledge about the data.
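The seeding step described above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function name `choose_seeds` and the sample points are assumptions, and a fixed random state stands in for the "chosen at random" option.

```python
import random

def choose_seeds(points, k, state=0):
    # A seed is a starting centroid. One simple option is to
    # sample k of the data points at random; a fixed state
    # makes the random draw reproducible.
    rng = random.Random(state)
    return rng.sample(points, k)

# Hypothetical two-feature data, e.g. (age, money spent)
points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0),
          (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
seeds = choose_seeds(points, 2)  # K = 2 clusters
print(seeds)
```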
36
00:01:34,560 --> 00:01:37,050
One of the clusters will be the green cluster.
37
00:01:37,050 --> 00:01:41,700
The other one will be the orange cluster, and these are the seeds.
38
00:01:41,700 --> 00:01:42,630
The following step
39
00:01:42,630 --> 00:01:45,960
is to assign each point on the graph to a seed,
40
00:01:45,960 --> 00:01:47,853
which is done based on proximity.
41
00:01:48,750 --> 00:01:51,420
For instance, this point is closer to the green seed
42
00:01:51,420 --> 00:01:52,980
than to the orange one.
43
00:01:52,980 --> 00:01:55,623
Therefore, it will belong to the green cluster.
44
00:01:56,550 --> 00:01:59,910
This point, on the other hand, is closer to the orange seed.
45
00:01:59,910 --> 00:02:02,460
Therefore, it will be a part of the orange cluster.
46
00:02:03,330 --> 00:02:06,420
In this way, we can color all points on the graph
47
00:02:06,420 --> 00:02:09,630
based on their Euclidean distance from the seeds.
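The assignment step just described, coloring each point by its nearest seed, can be written as a short sketch. The function name `assign_to_seeds` and the example coordinates are illustrative assumptions; the distance measure is the Euclidean distance mentioned above.

```python
import math

def assign_to_seeds(points, seeds):
    # For each point, return the index of the nearest seed,
    # measured by Euclidean distance (math.dist).
    return [min(range(len(seeds)),
                key=lambda i: math.dist(p, seeds[i]))
            for p in points]

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.5)]
seeds = [(1.0, 1.0), (8.0, 8.0)]   # e.g. green seed, orange seed
labels = assign_to_seeds(points, seeds)
print(labels)  # → [0, 0, 1, 1]
```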
48
00:02:09,630 --> 00:02:10,860
Great.
49
00:02:10,860 --> 00:02:13,350
The final step is to calculate the centroid
50
00:02:13,350 --> 00:02:16,290
of the green points and the orange points.
51
00:02:16,290 --> 00:02:18,630
The green seed will move closer to the green points
52
00:02:18,630 --> 00:02:20,190
to become their centroid,
53
00:02:20,190 --> 00:02:23,820
and the orange will do the same for the orange points.
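The centroid step above, moving each seed to the middle of its points, might look like this. The name `update_centroids` and the sample data are assumptions for illustration; the centroid of a cluster is simply the mean of its points along each feature.

```python
def update_centroids(points, labels, k):
    # Move each seed to the mean of the points assigned to it.
    # (Empty clusters are not handled in this sketch.)
    centroids = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        centroids.append(tuple(sum(c) / len(members)
                               for c in zip(*members)))
    return centroids

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centroids = update_centroids(points, labels, 2)
print(centroids)  # → [(1.0, 0.0), (10.0, 11.0)]
```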
54
00:02:23,820 --> 00:02:27,690
From here, we would repeat the last two steps.
55
00:02:27,690 --> 00:02:30,360
Let's recalculate the distances.
56
00:02:30,360 --> 00:02:31,320
All the green points
57
00:02:31,320 --> 00:02:33,600
are obviously closer to the green centroid,
58
00:02:33,600 --> 00:02:36,573
and the orange points are closer to the orange centroid.
59
00:02:37,470 --> 00:02:39,180
What about these two?
60
00:02:39,180 --> 00:02:41,783
Both of them are closer to the green centroid,
61
00:02:41,783 --> 00:02:45,663
so at this step, we will reassign them to the green cluster.
62
00:02:46,530 --> 00:02:50,400
Finally, we must recalculate the centroids.
63
00:02:50,400 --> 00:02:52,470
That's the new result.
64
00:02:52,470 --> 00:02:55,740
Now, all the green points are closest to the green centroid
65
00:02:55,740 --> 00:02:58,170
and all the orange ones to the orange.
66
00:02:58,170 --> 00:03:00,000
We can no longer reassign points,
67
00:03:00,000 --> 00:03:02,820
which completes the clustering process.
68
00:03:02,820 --> 00:03:05,880
This is the two-cluster solution.
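The whole loop, assign points to the nearest centroid, recalculate the centroids, and stop once no point changes cluster, can be put together in one sketch. Again, the function name `kmeans` and the sample points are assumptions, not code from the lecture.

```python
import math

def kmeans(points, seeds, max_iter=100):
    # Repeat the two steps until no point is reassigned:
    # 1) assign each point to its nearest centroid,
    # 2) move each centroid to the mean of its points.
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:      # no reassignments: done
            break
        labels = new_labels
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:               # skip empty clusters in this sketch
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]
labels, centroids = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```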
69
00:03:05,880 --> 00:03:10,320
All right, so that's the whole idea behind clustering.
70
00:03:10,320 --> 00:03:12,540
In order to solidify your understanding,
71
00:03:12,540 --> 00:03:14,493
we will redo the process.
72
00:03:15,390 --> 00:03:18,390
In the beginning, we said that with K-means clustering,
73
00:03:18,390 --> 00:03:20,490
we must specify the number of clusters
74
00:03:20,490 --> 00:03:23,280
prior to clustering, right?
75
00:03:23,280 --> 00:03:26,220
What if we want to obtain three clusters?
76
00:03:26,220 --> 00:03:29,250
The first step involves selecting the seeds.
77
00:03:29,250 --> 00:03:31,050
Let's have another seed.
78
00:03:31,050 --> 00:03:32,973
We'll use red for this one.
79
00:03:33,870 --> 00:03:36,540
Next, we must associate each of the points
80
00:03:36,540 --> 00:03:37,833
with the closest seed.
81
00:03:38,700 --> 00:03:42,033
Finally, we calculate the centroids of the colored points.
82
00:03:43,140 --> 00:03:46,830
We already know that K-means is an iterative process,
83
00:03:46,830 --> 00:03:48,600
so we go back to the step
84
00:03:48,600 --> 00:03:52,440
where we associate each of the points with the closest seed.
85
00:03:52,440 --> 00:03:56,550
All orange points are settled, so no movement there.
86
00:03:56,550 --> 00:03:58,320
What about these two points?
87
00:03:58,320 --> 00:04:00,180
Now they're closer to the red seed,
88
00:04:00,180 --> 00:04:02,670
so they will go into the red cluster.
89
00:04:02,670 --> 00:04:04,863
That's the only change in the whole graph.
90
00:04:05,730 --> 00:04:08,340
In the end, we recalculate the centroids
91
00:04:08,340 --> 00:04:09,390
and reach a situation
92
00:04:09,390 --> 00:04:11,430
where no more adjustments are necessary
93
00:04:11,430 --> 00:04:13,770
using the K-means algorithm.
94
00:04:13,770 --> 00:04:16,173
We have reached a three-cluster solution.
95
00:04:17,010 --> 00:04:18,469
This is the exact algorithm
96
00:04:18,469 --> 00:04:20,490
which was used to find the solution
97
00:04:20,490 --> 00:04:23,190
of the problem you saw at the beginning of the lesson.
98
00:04:24,270 --> 00:04:26,100
Here's a Python-generated graph
99
00:04:26,100 --> 00:04:28,530
with the three clusters colored.
100
00:04:28,530 --> 00:04:31,710
I'm sorry they're not the same, but you get the point.
101
00:04:31,710 --> 00:04:32,850
That's how we would usually
102
00:04:32,850 --> 00:04:35,610
represent the clusters graphically.
103
00:04:35,610 --> 00:04:36,630
Great.
104
00:04:36,630 --> 00:04:39,540
I think we have a good basis to start coding.
105
00:04:39,540 --> 00:04:40,940
See you in the next lecture.