006 How to Choose the Number of Clusters

Narrator: We've been juggling the number of clusters for too long. Isn't there a criterion for setting the proper number of clusters? Luckily for us, there is. Probably the most widely adopted criterion is the so-called elbow method. What's the rationale behind it?

Well, remember that clustering was about minimizing the distance between points within a cluster and maximizing the distance between clusters. It turns out that for k-means these two occur simultaneously: if we minimize the distance between points within a cluster, we are automatically maximizing the distance between clusters. One less thing to worry about.

Now, "the distance between points within a cluster" sounds clumsy, doesn't it? That distance is measured as a sum of squares, and the academic term is within-cluster sum of squares, or WCSS. Not much better, but at least the abbreviation is nice.
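For reference, one common way to write WCSS down, together with the identity that lies behind the "these two occur simultaneously" remark, is sketched below. The symbols (K clusters C_k with centroids \mu_k, n observations x_i with overall mean \bar{x}) are notation chosen here, not taken from the lesson, and "distance between clusters" is read as the between-cluster sum of squares.

```latex
\begin{align*}
\mathrm{WCSS} &= \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i \\
\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2
&= \mathrm{WCSS} + \sum_{k=1}^{K} |C_k| \, \lVert \mu_k - \bar{x} \rVert^2
\end{align*}
```

The left-hand side of the second line depends only on the data, so for a fixed K any assignment that lowers WCSS raises the between-cluster term by the same amount.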
Okay. Similar to SST, SSR, and SSE from regressions, WCSS is a measure developed within the ANOVA framework. If we minimize WCSS, we have reached the perfect clustering solution.

Here's the problem. If we have the same six countries and each one of them is a different cluster, so a total of six clusters, then WCSS is zero. That's because there is just one point in each cluster, so each point coincides with its own cluster center and there is nothing left to sum. Furthermore, the clusters are as far apart as they can possibly be.

Imagine this with 1,000,000 observations: a 1,000,000-cluster solution is definitely of no use. Similarly, if all observations are in the same cluster, the solution is useless and WCSS is at its maximum. There must be some middle ground.

Applying some common sense, we easily reach the conclusion that we don't really want WCSS to be minimized. Instead, we want it to be as low as possible while still having a small number of clusters, so we can interpret them.

All right, if we plot WCSS against the number of clusters, we get this pretty graph. It looks like an elbow, hence the name.
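The two extremes described above can be checked by hand. The following is a minimal sketch in plain Python; the six points are made-up stand-ins for the lesson's six countries (the real data is not in the transcript), and the helper name `wcss` is hypothetical.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster's members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Hypothetical data: three tight pairs of 2-D points (NOT the lesson's data).
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (1.0, 9.0), (2.0, 9.0)]

wcss_one = wcss(points, [0, 0, 0, 0, 0, 0])    # everything in one cluster
wcss_three = wcss(points, [0, 0, 1, 1, 2, 2])  # one cluster per pair
wcss_six = wcss(points, [0, 1, 2, 3, 4, 5])    # every point is its own cluster

print(wcss_one, wcss_three, wcss_six)
```

With every point in its own cluster the result is zero, and with a single cluster it is the total sum of squares of the data, which is why neither extreme is informative.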
The point is that the within-cluster sum of squares is a monotonically decreasing function, which is lower for a bigger number of clusters. Here's the big revelation: in the beginning, WCSS declines extremely fast. At some point, it reaches the elbow. Afterwards, we are not reaching a much better solution in terms of WCSS by increasing the number of clusters.

For our case, we say that the optimal number of clusters is three, as this is the elbow. That's the biggest number of clusters for which we are still getting a significant decrease in WCSS. Thereafter, there is almost no improvement. Cool.

How can we put that to use? We need two pieces of information: the number of clusters, k, and the WCSS for a specific number of clusters. K is set by us at the beginning of the process, while scikit-learn has an attribute that gives us the WCSS. For instance, to get the WCSS for our last example, we just write kmeans.inertia_.

To plot the elbow, we actually need to solve the problem with 1, 2, 3, and so on clusters and calculate WCSS for each of them. Let's do that with a loop.
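As a quick sanity check that inertia_ really is the WCSS, the sketch below fits a single model and recomputes the sum of squares from the fitted labels and centroids. The array `x` is invented toy data (the lesson's data is not in the transcript), and `n_init=10` and `random_state=0` are choices made here for reproducibility, not settings from the lesson.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three tight pairs of 2-D points (NOT the lesson's data).
x = np.array([
    [1.0, 1.0], [2.0, 1.0],
    [8.0, 8.0], [9.0, 8.0],
    [1.0, 9.0], [2.0, 9.0],
])

# n_init and random_state are assumptions added for reproducibility.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(x)

# Look up, for every observation, the centroid of the cluster it was assigned to.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# WCSS by hand: squared distance of each point to its own centroid, summed.
manual_wcss = float(((x - assigned_centroids) ** 2).sum())

print(kmeans.inertia_, manual_wcss)
```

The two printed numbers should agree up to floating-point error.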
First, I'll declare an empty list called wcss. Then: for i in range(1, 7), as we have a total of six observations, colon. kmeans = KMeans(i), with capital K and M.

Next, I want to fit the input data x using k-means, so kmeans.fit(x).

Then we will calculate the WCSS for the iteration using the inertia_ attribute. Let wcss_iter be equal to kmeans.inertia_.

Finally, we will add the WCSS for the iteration to the wcss list. A handy method to do that is append. If you are not familiar with it, just take the list, write .append, and in brackets include the value you'd like to append to the list. So, wcss.append(wcss_iter). Cool, let's run the code.
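Written out, the loop dictated above might look roughly like this. It is a sketch under assumptions: `x` is made-up toy data standing in for the lesson's six observations, and `n_init=10` and `random_state=0` are added here so that repeated runs give the same numbers (the narration only passes i).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: six 2-D observations (NOT the lesson's data).
x = np.array([
    [1.0, 1.0], [2.0, 1.0],
    [8.0, 8.0], [9.0, 8.0],
    [1.0, 9.0], [2.0, 9.0],
])

wcss = []
# range(1, 7) yields 1..6: one run per possible number of clusters.
for i in range(1, 7):
    # n_init and random_state are assumptions added for reproducibility.
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(x)
    # inertia_ holds the within-cluster sum of squares of the fitted solution.
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

print(wcss)
```

The upper bound of the range is the number of observations plus one, because k-means cannot form more clusters than there are points.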
wcss should be a list which contains the within-cluster sum of squares for one cluster, two clusters, and so on until six. As you can see, the sequence is decreasing, with very big leaps in the first two steps and much smaller ones later on. Finally, when each point is a separate cluster, we have a WCSS equal to zero.

Let's plot that. We have wcss, so let's declare a variable called number_clusters, which is also a list from one to six: number_clusters = range(1, 7). Cool. Then, using some conventional plotting code, we get the graph.

Finally, we will use the elbow method to decide the optimal number of clusters. There are two points which can be the elbow: this one and that one. A three-cluster solution is definitely the better one, as after it there's not much to gain. A two-cluster solution in this case would be suboptimal, as the leap from two to three is very big in terms of WCSS.

Okay, let's wrap it up here, and we will practice this new knowledge on other data sets in our next lessons. Thanks for watching.
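The "conventional plotting code" is not spelled out in the narration. One plausible version with matplotlib is sketched below. The wcss values are placeholders invented for illustration (they are not the lesson's output), and the title and axis labels are guesses. The list `drops` is an extra added here to put numbers on the comparison between the two candidate elbows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt

# Placeholder values invented for illustration, NOT the lesson's output.
wcss = [140.0, 50.0, 1.5, 1.0, 0.5, 0.0]
number_clusters = range(1, 7)

fig, ax = plt.subplots()
(line,) = ax.plot(number_clusters, wcss, marker="o")
ax.set_title("The Elbow Method")            # assumed title
ax.set_xlabel("Number of clusters")         # assumed label
ax.set_ylabel("Within-cluster sum of squares")

# Decrease in WCSS gained by going from k to k + 1 clusters.
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
print(drops)
```

In a notebook, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` displays the curve. With these placeholder numbers, the step from two to three clusters is still large while every later step is small, which matches the argument for picking three rather than two.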