1
00:00:00,000 --> 00:00:01,230
Narrator: We've been juggling
2
00:00:01,230 --> 00:00:03,840
with the number of clusters for too long.
3
00:00:03,840 --> 00:00:05,040
Isn't there a criterion
4
00:00:05,040 --> 00:00:07,860
for setting the proper number of clusters?
5
00:00:07,860 --> 00:00:09,753
Luckily for us, there is.
6
00:00:10,800 --> 00:00:13,230
Probably the most widely adopted criterion
7
00:00:13,230 --> 00:00:15,960
is the so-called elbow method.
8
00:00:15,960 --> 00:00:18,030
What's the rationale behind it?
9
00:00:18,030 --> 00:00:20,070
Well, remember that clustering was about
10
00:00:20,070 --> 00:00:23,040
minimizing the distance between points and a cluster
11
00:00:23,040 --> 00:00:26,280
and maximizing the distance between clusters.
12
00:00:26,280 --> 00:00:28,200
It turns out that for k-means
13
00:00:28,200 --> 00:00:30,870
these two occur simultaneously,
14
00:00:30,870 --> 00:00:32,189
if we minimize the distance
15
00:00:32,189 --> 00:00:33,930
between points and a cluster,
16
00:00:33,930 --> 00:00:34,950
we are automatically
17
00:00:34,950 --> 00:00:38,040
maximizing the distance between clusters,
18
00:00:38,040 --> 00:00:40,170
one less thing to worry about.
19
00:00:40,170 --> 00:00:41,850
Now, the distance between
20
00:00:41,850 --> 00:00:44,970
points and a cluster sounds clumsy, doesn't it?
21
00:00:44,970 --> 00:00:47,001
That distance is measured in sum of squares
22
00:00:47,001 --> 00:00:49,020
and the academic term is
23
00:00:49,020 --> 00:00:53,000
within-cluster sum of squares, or WCSS,
24
00:00:53,000 --> 00:00:57,004
not much better, but at least the abbreviation is nice.
25
00:00:57,004 --> 00:01:02,004
Okay, similar to SST, SSR and SSE from regressions
26
00:01:02,998 --> 00:01:07,998
WCSS is a measure developed within the ANOVA framework,
27
00:01:08,250 --> 00:01:10,002
if we minimize WCSS
28
00:01:10,002 --> 00:01:12,993
we have reached the perfect clustering solution.
29
00:01:14,970 --> 00:01:16,710
Here's the problem,
30
00:01:16,710 --> 00:01:18,810
if we have the same six countries
31
00:01:18,810 --> 00:01:20,820
and each one of them is a different cluster,
32
00:01:20,820 --> 00:01:25,620
so a total of six clusters, then WCSS is zero,
33
00:01:25,620 --> 00:01:28,680
that's because there is just one point in each cluster
34
00:01:28,680 --> 00:01:31,830
and we can't have a within-cluster sum of squares,
35
00:01:31,830 --> 00:01:33,180
furthermore, the clusters
36
00:01:33,180 --> 00:01:35,193
are as far as they can possibly be.
37
00:01:36,750 --> 00:01:39,810
Imagine this with 1,000,000 observations,
38
00:01:39,810 --> 00:01:44,001
a 1,000,000-cluster solution is definitely of no use,
39
00:01:44,001 --> 00:01:47,880
similarly, if all observations are in the same cluster
40
00:01:47,880 --> 00:01:52,000
the solution is useless and WCSS is at its maximum.
41
00:01:52,000 --> 00:01:55,230
There must be some middle ground.
42
00:01:55,230 --> 00:01:56,820
Applying some common sense,
43
00:01:56,820 --> 00:01:58,710
we easily reach the conclusion
44
00:01:58,710 --> 00:02:02,001
that we don't really want WCSS to be minimized,
45
00:02:02,001 --> 00:02:05,001
instead, we want it to be as low as possible
46
00:02:05,001 --> 00:02:08,220
while we can still have a small number of clusters,
47
00:02:08,220 --> 00:02:10,259
so we can interpret them.
48
00:02:10,259 --> 00:02:14,001
All right, if we plot WCSS against the number of clusters
49
00:02:14,001 --> 00:02:16,001
we get this pretty graph.
50
00:02:16,001 --> 00:02:19,023
It looks like an elbow, hence the name.
51
00:02:19,950 --> 00:02:22,002
The point is that the within-cluster sum of squares
52
00:02:22,002 --> 00:02:24,810
is a monotonically decreasing function,
53
00:02:24,810 --> 00:02:28,020
which is lower for a bigger number of clusters.
54
00:02:28,020 --> 00:02:30,270
Here's the big revelation,
55
00:02:30,270 --> 00:02:33,999
in the beginning, WCSS is declining extremely fast
56
00:02:33,999 --> 00:02:36,870
at some point, it reaches the elbow
57
00:02:36,870 --> 00:02:39,690
afterwards we are not reaching a much better solution
58
00:02:39,690 --> 00:02:44,000
in terms of WCSS by increasing the number of clusters.
59
00:02:44,000 --> 00:02:45,690
For our case,
60
00:02:45,690 --> 00:02:47,790
we say that the optimal number of clusters
61
00:02:47,790 --> 00:02:49,998
is three, as this is the elbow,
62
00:02:49,998 --> 00:02:52,530
that's the biggest number of clusters for which
63
00:02:52,530 --> 00:02:55,998
we are still getting a significant decrease in WCSS,
64
00:02:55,998 --> 00:03:00,960
thereafter, there is almost no improvement, cool.
65
00:03:00,960 --> 00:03:02,910
How can we put that to use?
66
00:03:02,910 --> 00:03:05,160
We need two pieces of information,
67
00:03:05,160 --> 00:03:07,260
the number of clusters, k
68
00:03:07,260 --> 00:03:10,999
and the WCSS for a specific number of clusters.
69
00:03:10,999 --> 00:03:14,070
K is set by us at the beginning of the process,
70
00:03:14,070 --> 00:03:17,883
while there is an sklearn method that gives us the WCSS,
71
00:03:18,995 --> 00:03:22,890
for instance, to get the WCSS for our last example,
72
00:03:22,890 --> 00:03:27,890
we just write kmeans.inertia_
73
00:03:29,610 --> 00:03:30,810
to plot the elbow,
74
00:03:30,810 --> 00:03:34,800
we actually need to solve the problem with 1, 2, 3 and so on
75
00:03:34,800 --> 00:03:38,940
clusters and calculate WCSS for each of them.
76
00:03:38,940 --> 00:03:41,250
Let's do that with a loop.
77
00:03:41,250 --> 00:03:45,840
First, I'll declare an empty list called WCSS.
78
00:03:45,840 --> 00:03:47,940
for i in range(1, 7),
79
00:03:47,940 --> 00:03:51,183
as we have a total of six observations, colon.
80
00:03:52,110 --> 00:03:57,110
kmeans equals KMeans, with capital K and M, of i.
81
00:03:58,920 --> 00:04:03,570
Next, I want to fit the input data x using k-means,
82
00:04:03,570 --> 00:04:07,630
so kmeans.fit(x),
83
00:04:09,180 --> 00:04:12,870
then we will calculate the WCSS for the iteration
84
00:04:12,870 --> 00:04:14,820
using the inertia method.
85
00:04:14,820 --> 00:04:19,820
Let wcss_iter be equal to kmeans.inertia_.
86
00:04:21,995 --> 00:04:25,920
Finally, we will add the WCSS for the iteration
87
00:04:25,920 --> 00:04:28,004
to the WCSS list,
88
00:04:28,004 --> 00:04:30,997
a handy method to do that is append,
89
00:04:30,997 --> 00:04:32,996
if you are not familiar with it,
90
00:04:32,996 --> 00:04:36,001
just pick the list dot append
91
00:04:36,001 --> 00:04:38,610
and in brackets you can include the value
92
00:04:38,610 --> 00:04:41,040
you'd like to append to the list,
93
00:04:41,040 --> 00:04:46,040
so wcss.append(wcss_iter),
94
00:04:49,004 --> 00:04:52,200
cool, let's run the code.
95
00:04:52,200 --> 00:04:54,690
WCSS should be a list which contains
96
00:04:54,690 --> 00:04:56,520
the within-cluster sum of squares
97
00:04:56,520 --> 00:05:00,513
for one cluster, two clusters, and so on until six,
98
00:05:01,740 --> 00:05:04,170
as you can see, the sequence is decreasing
99
00:05:04,170 --> 00:05:06,780
with very big leaps in the first two steps
100
00:05:06,780 --> 00:05:09,210
and much smaller ones later on,
101
00:05:09,210 --> 00:05:12,060
finally, when each point is a separate cluster
102
00:05:12,060 --> 00:05:17,060
we have a WCSS equal to zero, let's plot that,
103
00:05:17,220 --> 00:05:21,060
we have WCSS, so let's declare a variable called
104
00:05:21,060 --> 00:05:25,110
number_clusters, which is also a list from one to six.
105
00:05:25,110 --> 00:05:29,996
number_clusters equals range(1, 7), cool.
106
00:05:29,996 --> 00:05:33,000
Then using some conventional plotting code,
107
00:05:33,000 --> 00:05:34,533
we get the graph.
108
00:05:37,200 --> 00:05:40,050
Finally, we will use the elbow method to decide
109
00:05:40,050 --> 00:05:42,180
the optimal number of clusters.
110
00:05:42,180 --> 00:05:44,910
There are two points which can be the elbow,
111
00:05:44,910 --> 00:05:46,997
this one and that one.
112
00:05:46,997 --> 00:05:50,000
A three-cluster solution is definitely the better one
113
00:05:50,000 --> 00:05:52,980
as after it there's not much to gain.
114
00:05:52,980 --> 00:05:56,490
A two-cluster solution in this case would be suboptimal
115
00:05:56,490 --> 00:06:01,023
as the leap from two to three is very big in terms of WCSS.
116
00:06:01,920 --> 00:06:03,998
Okay, let's wrap it up here
117
00:06:03,998 --> 00:06:06,000
and we will practice this new knowledge
118
00:06:06,000 --> 00:06:08,999
on other data sets in our next lessons.
119
00:06:08,999 --> 00:06:10,383
Thanks for watching.