006 How to Choose the Number of Clusters

Narrator: We've been juggling with the number of clusters for too long. Isn't there a criterion for setting the proper number of clusters? Luckily for us, there is. Probably the most widely adopted criterion is the so-called elbow method.

What's the rationale behind it? Well, remember that clustering was about minimizing the distance between points in a cluster and maximizing the distance between clusters. It turns out that for k-means these two occur simultaneously: if we minimize the distance between points in a cluster, we are automatically maximizing the distance between clusters. One less thing to worry about.

Now, "the distance between points in a cluster" sounds clumsy, doesn't it? That distance is measured in sum of squares, and the academic term is within-cluster sum of squares, or WCSS. Not much better, but at least the abbreviation is nice.
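For reference, the quantity the narrator calls WCSS can be written out explicitly. The notation below is an assumption and does not come from the lesson: C_k is the set of points in cluster k, \mu_k is its centroid, and \bar{x} is the mean of all n observations. The second identity is the standard sum-of-squares decomposition. It is one way to see why, for a fixed number of clusters, lowering WCSS raises the between-cluster term by the same amount.

```latex
\[
\mathrm{WCSS} \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 ,
\qquad
\mu_k \;=\; \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
\]

\[
\underbrace{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}_{\text{total sum of squares (fixed by the data)}}
\;=\;
\mathrm{WCSS}
\;+\;
\underbrace{\sum_{k=1}^{K} |C_k| \, \lVert \mu_k - \bar{x} \rVert^2}_{\text{between-cluster sum of squares}}
\]
```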
Okay. Similar to SST, SSR and SSE from regressions, WCSS is a measure developed within the ANOVA framework. If we minimize WCSS, we have reached the perfect clustering solution.

Here's the problem. If we have the same six countries and each one of them is a different cluster, so a total of six clusters, then WCSS is zero. That's because there is just one point in each cluster, so every point coincides with its own cluster center and there is nothing left to sum. Furthermore, the clusters are as far apart as they can possibly be.

Imagine this with 1,000,000 observations: a 1,000,000-cluster solution is definitely of no use. Similarly, if all observations are in the same cluster, the solution is useless and WCSS is at its maximum. There must be some middle ground.

Applying some common sense, we easily reach the conclusion that we don't really want WCSS to be minimized. Instead, we want it to be as low as possible while we can still have a small number of clusters, so we can interpret them.

All right. If we plot WCSS against the number of clusters, we get this pretty graph. It looks like an elbow, hence the name.
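The two extremes described here (one cluster per point, and one cluster for everything) can be checked with a few lines of plain Python. This is only a sketch: the six points are made up for illustration and are not the six countries from the lesson, and the helper name wcss_for_labels is hypothetical.

```python
def wcss_for_labels(points, labels):
    """Within-cluster sum of squares for 2-D points and their cluster labels."""
    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # The centroid is the coordinate-wise mean of the cluster's members.
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        # Add the squared distance from each member to its centroid.
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total


# Hypothetical data: six points forming three tight pairs.
points = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0), (15.0, 1.0), (16.0, 2.0)]

one_per_point = wcss_for_labels(points, [0, 1, 2, 3, 4, 5])  # six clusters
three_groups = wcss_for_labels(points, [0, 0, 1, 1, 2, 2])   # three clusters
all_in_one = wcss_for_labels(points, [0, 0, 0, 0, 0, 0])     # one cluster

print(one_per_point, three_groups, all_in_one)
```

With one point per cluster the result is exactly zero, and the single-cluster value is the largest of the three, which is the middle-ground problem the narrator points out.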
The point is that the within-cluster sum of squares is a monotonically decreasing function, which is lower for a bigger number of clusters. Here's the big revelation: in the beginning, WCSS is declining extremely fast. At some point, it reaches the elbow. Afterwards, we are not reaching a much better solution in terms of WCSS by increasing the number of clusters.

For our case, we say that the optimal number of clusters is three, as this is the elbow. That's the biggest number of clusters for which we are still getting a significant decrease in WCSS. Thereafter, there is almost no improvement. Cool.

How can we put that to use? We need two pieces of information: the number of clusters, k, and the WCSS for a specific number of clusters. K is set by us at the beginning of the process, while there is an sklearn attribute that gives us the WCSS. For instance, to get the WCSS for our last example, we just write kmeans.inertia_.

To plot the elbow, we actually need to solve the problem with 1, 2, 3 and so on clusters and calculate WCSS for each of them. Let's do that with a loop.
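As a minimal sketch of reading the WCSS from a fitted model, the snippet below fits scikit-learn's KMeans and compares inertia_ with a sum of squares computed by hand. The data array x is hypothetical (the lesson's country data is not part of this transcript), and n_init and random_state are added here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input data: six 2-D observations (not the lesson's countries).
x = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0],
              [9.0, 8.0], [15.0, 1.0], [16.0, 2.0]])

# Fit k-means with three clusters; n_init and random_state are assumptions
# added for reproducibility, they are not mentioned in the lesson.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(x)

# inertia_ is an attribute (no parentheses): the WCSS of the fitted solution.
wcss_from_sklearn = kmeans.inertia_

# The same quantity by hand: squared distance of each point to its own centroid.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_by_hand = float(((x - assigned_centers) ** 2).sum())

print(wcss_from_sklearn, wcss_by_hand)
```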
First, I'll declare an empty list called wcss. Then, for i in range(1, 7), as we have a total of six observations, colon: kmeans = KMeans(i), with capital K and M.

Next, I want to fit the input data x using k-means, so kmeans.fit(x).

Then we will calculate the WCSS for the iteration using the inertia_ attribute. Let wcss_iter be equal to kmeans.inertia_.

Finally, we will add the WCSS for the iteration to the wcss list. A handy method to do that is append. If you are not familiar with it, just take the list, write .append, and in brackets include the value you'd like to append to the list. So: wcss.append(wcss_iter).

Cool, let's run the code.
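Written out, the loop the narrator dictates might look like the sketch below. The variable names wcss, wcss_iter, kmeans and x follow the narration. The contents of x are a hypothetical stand-in for the lesson's six observations, and n_init and random_state are assumptions added for repeatable results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the lesson's six observations.
x = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0],
              [9.0, 8.0], [15.0, 1.0], [16.0, 2.0]])

# Empty list that will hold one WCSS value per number of clusters.
wcss = []

# range(1, 7) yields 1..6: with six observations, six clusters is the maximum.
for i in range(1, 7):
    # n_init and random_state are not in the narration; they fix the randomness.
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(x)
    # WCSS of the solution with i clusters.
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

print(wcss)
```

The narration passes the cluster count positionally, as KMeans(i); the keyword form n_clusters=i used here is equivalent and a little easier to read.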
wcss should be a list which contains the within-cluster sum of squares for one cluster, two clusters, and so on until six. As you can see, the sequence is decreasing, with very big leaps in the first two steps and much smaller ones later on. Finally, when each point is a separate cluster, we have a WCSS equal to zero.

Let's plot that. We have wcss, so let's declare a variable called number_clusters, which also runs from one to six: number_clusters = range(1, 7). Cool. Then, using some conventional plotting code, we get the graph.

Finally, we will use the elbow method to decide the optimal number of clusters. There are two points which can be the elbow: this one and that one, at two and three clusters. A three-cluster solution is definitely the better one, as after it there's not much to gain. A two-cluster solution in this case would be suboptimal, as the leap from two to three is very big in terms of WCSS.

Okay, let's wrap it up here, and we will practice this new knowledge on other data sets in our next lessons. Thanks for watching.
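The "conventional plotting code" is not spelled out in the narration. One possible version with matplotlib is sketched below. The helper name plot_elbow, the axis labels, and the title are assumptions, and the wcss values are made-up placeholders with the shape the narrator describes (two big drops, then small ones, ending at zero), not the lesson's actual output.

```python
import matplotlib.pyplot as plt


def plot_elbow(number_clusters, wcss):
    """Plot WCSS against the number of clusters and return the axes."""
    fig, ax = plt.subplots()
    ax.plot(number_clusters, wcss, marker="o")
    ax.set_title("The Elbow Method")
    ax.set_xlabel("Number of clusters")
    ax.set_ylabel("Within-cluster sum of squares")
    return ax


# Placeholder values for illustration only; in practice this is the list
# filled by the loop over 1..6 clusters.
wcss = [42000.0, 13000.0, 300.0, 110.0, 40.0, 0.0]
number_clusters = range(1, 7)

ax = plot_elbow(number_clusters, wcss)
```

Calling plt.show() afterwards displays the figure in an interactive session. The elbow is then read off by eye, as in the lesson.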
