1
00:00:00,600 --> 00:00:01,800
Instructor: Hey again.
2
00:00:01,800 --> 00:00:04,230
We finished the last lecture with this graph.
3
00:00:04,230 --> 00:00:07,230
It shows 1000 points and their centroid.
4
00:00:07,230 --> 00:00:08,580
In cluster analysis,
5
00:00:08,580 --> 00:00:11,820
that's how a cluster would look in two-dimensional space.
6
00:00:11,820 --> 00:00:14,400
There are two dimensions or two features
7
00:00:14,400 --> 00:00:17,220
based on which we are performing clustering.
8
00:00:17,220 --> 00:00:19,800
For instance, the age and money spent
9
00:00:19,800 --> 00:00:21,333
from our earlier example.
10
00:00:22,200 --> 00:00:25,680
It certainly makes no sense to have only one cluster,
11
00:00:25,680 --> 00:00:28,375
so let me zoom out of this graph.
12
00:00:28,375 --> 00:00:30,840
Here's a nice picture of clusters.
13
00:00:30,840 --> 00:00:33,420
We can clearly see two clusters.
14
00:00:33,420 --> 00:00:35,253
I'll also indicate their centroids.
15
00:00:36,180 --> 00:00:38,850
If we want to identify three clusters,
16
00:00:38,850 --> 00:00:41,250
this is the result we obtain,
17
00:00:41,250 --> 00:00:44,133
and that's more or less how clustering works graphically.
18
00:00:45,000 --> 00:00:49,429
Okay, how do we perform clustering in practice?
19
00:00:49,429 --> 00:00:51,510
There are different methods we can apply
20
00:00:51,510 --> 00:00:53,250
to identify clusters.
21
00:00:53,250 --> 00:00:55,500
The most popular one is K-means,
22
00:00:55,500 --> 00:00:57,630
so that's where we will start.
23
00:00:57,630 --> 00:01:00,300
Let's simplify this scatter to 15 points,
24
00:01:00,300 --> 00:01:03,390
so we can get a better grasp of what happens.
25
00:01:03,390 --> 00:01:04,620
Cool.
26
00:01:04,620 --> 00:01:07,031
Here's how K-means works.
27
00:01:07,031 --> 00:01:11,700
First, we must choose how many clusters we'd like to have.
28
00:01:11,700 --> 00:01:14,490
That's where this method gets its name from.
29
00:01:14,490 --> 00:01:16,440
K stands for the number of clusters
30
00:01:16,440 --> 00:01:17,883
we are trying to identify.
31
00:01:18,750 --> 00:01:20,583
I'll start with two clusters.
32
00:01:21,600 --> 00:01:24,630
The next step is to specify the cluster seeds.
33
00:01:24,630 --> 00:01:27,720
A seed is basically a starting centroid.
34
00:01:27,720 --> 00:01:31,440
It is chosen at random or is specified by the data scientist
35
00:01:31,440 --> 00:01:33,513
based on prior knowledge about the data.
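The seeding step described above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function name `choose_seeds` and the sample points are assumptions, and a fixed random state stands in for the "chosen at random" option.

```python
import random

def choose_seeds(points, k, state=0):
    # A seed is a starting centroid. One simple option is to
    # sample k of the data points at random; a fixed state
    # makes the random draw reproducible.
    rng = random.Random(state)
    return rng.sample(points, k)

# Hypothetical two-feature data, e.g. (age, money spent)
points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0),
          (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
seeds = choose_seeds(points, 2)  # K = 2 clusters
print(seeds)
```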
36
00:01:34,560 --> 00:01:37,050
One of the clusters will be the green cluster.
37
00:01:37,050 --> 00:01:41,700
The other one will be the orange cluster, and these are the seeds.
38
00:01:41,700 --> 00:01:42,630
The following step
39
00:01:42,630 --> 00:01:45,960
is to assign each point on the graph to a seed,
40
00:01:45,960 --> 00:01:47,853
which is done based on proximity.
41
00:01:48,750 --> 00:01:51,420
For instance, this point is closer to the green seed
42
00:01:51,420 --> 00:01:52,980
than to the orange one.
43
00:01:52,980 --> 00:01:55,623
Therefore, it will belong to the green cluster.
44
00:01:56,550 --> 00:01:59,910
This point, on the other hand, is closer to the orange seed.
45
00:01:59,910 --> 00:02:02,460
Therefore, it will be a part of the orange cluster.
46
00:02:03,330 --> 00:02:06,420
In this way, we can color all points on the graph
47
00:02:06,420 --> 00:02:09,630
based on their Euclidean distance from the seeds.
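The assignment step just described, coloring each point by its nearest seed, can be written as a short sketch. The function name `assign_to_seeds` and the example coordinates are illustrative assumptions; the distance measure is the Euclidean distance mentioned above.

```python
import math

def assign_to_seeds(points, seeds):
    # For each point, return the index of the nearest seed,
    # measured by Euclidean distance (math.dist).
    return [min(range(len(seeds)),
                key=lambda i: math.dist(p, seeds[i]))
            for p in points]

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.5)]
seeds = [(1.0, 1.0), (8.0, 8.0)]   # e.g. green seed, orange seed
labels = assign_to_seeds(points, seeds)
print(labels)  # → [0, 0, 1, 1]
```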
48
00:02:09,630 --> 00:02:10,860
Great.
49
00:02:10,860 --> 00:02:13,350
The final step is to calculate the centroid
50
00:02:13,350 --> 00:02:16,290
of the green points and the orange points.
51
00:02:16,290 --> 00:02:18,630
The green seed will move closer to the green points
52
00:02:18,630 --> 00:02:20,190
to become their centroid,
53
00:02:20,190 --> 00:02:23,820
and the orange will do the same for the orange points.
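The centroid step above, moving each seed to the middle of its points, might look like this. The name `update_centroids` and the sample data are assumptions for illustration; the centroid of a cluster is simply the mean of its points along each feature.

```python
def update_centroids(points, labels, k):
    # Move each seed to the mean of the points assigned to it.
    # (Empty clusters are not handled in this sketch.)
    centroids = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        centroids.append(tuple(sum(c) / len(members)
                               for c in zip(*members)))
    return centroids

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centroids = update_centroids(points, labels, 2)
print(centroids)  # → [(1.0, 0.0), (10.0, 11.0)]
```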
54
00:02:23,820 --> 00:02:27,690
From here, we would repeat the last two steps.
55
00:02:27,690 --> 00:02:30,360
Let's recalculate the distances.
56
00:02:30,360 --> 00:02:31,320
All the green points
57
00:02:31,320 --> 00:02:33,600
are obviously closer to the green centroid,
58
00:02:33,600 --> 00:02:36,573
and the orange points are closer to the orange centroid.
59
00:02:37,470 --> 00:02:39,180
What about these two?
60
00:02:39,180 --> 00:02:41,783
Both of them are closer to the green centroid,
61
00:02:41,783 --> 00:02:45,663
so at this step, we will reassign them to the green cluster.
62
00:02:46,530 --> 00:02:50,400
Finally, we must recalculate the centroids.
63
00:02:50,400 --> 00:02:52,470
That's the new result.
64
00:02:52,470 --> 00:02:55,740
Now, all the green points are closest to the green centroid
65
00:02:55,740 --> 00:02:58,170
and all the orange ones to the orange.
66
00:02:58,170 --> 00:03:00,000
We can no longer reassign points,
67
00:03:00,000 --> 00:03:02,820
which completes the clustering process.
68
00:03:02,820 --> 00:03:05,880
This is the two-cluster solution.
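The whole loop, assign points to the nearest centroid, recalculate the centroids, and stop once no point changes cluster, can be put together in one sketch. Again, the function name `kmeans` and the sample points are assumptions, not code from the lecture.

```python
import math

def kmeans(points, seeds, max_iter=100):
    # Repeat the two steps until no point is reassigned:
    # 1) assign each point to its nearest centroid,
    # 2) move each centroid to the mean of its points.
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:      # no reassignments: done
            break
        labels = new_labels
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:               # skip empty clusters in this sketch
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]
labels, centroids = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```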
69
00:03:05,880 --> 00:03:10,320
All right, so that's the whole idea behind clustering.
70
00:03:10,320 --> 00:03:12,540
In order to solidify your understanding,
71
00:03:12,540 --> 00:03:14,493
we will redo the process.
72
00:03:15,390 --> 00:03:18,390
In the beginning, we said that with K-means clustering,
73
00:03:18,390 --> 00:03:20,490
we must specify the number of clusters
74
00:03:20,490 --> 00:03:23,280
prior to clustering, right?
75
00:03:23,280 --> 00:03:26,220
What if we want to obtain three clusters?
76
00:03:26,220 --> 00:03:29,250
The first step involves selecting the seeds.
77
00:03:29,250 --> 00:03:31,050
Let's have another seed.
78
00:03:31,050 --> 00:03:32,973
We'll use red for this one.
79
00:03:33,870 --> 00:03:36,540
Next, we must associate each of the points
80
00:03:36,540 --> 00:03:37,833
with the closest seed.
81
00:03:38,700 --> 00:03:42,033
Finally, we calculate the centroids of the colored points.
82
00:03:43,140 --> 00:03:46,830
We already know that K-means is an iterative process,
83
00:03:46,830 --> 00:03:48,600
so we go back to the step
84
00:03:48,600 --> 00:03:52,440
where we associate each of the points with the closest seed.
85
00:03:52,440 --> 00:03:56,550
All orange points are settled, so no movement there.
86
00:03:56,550 --> 00:03:58,320
What about these two points?
87
00:03:58,320 --> 00:04:00,180
Now they're closer to the red seed,
88
00:04:00,180 --> 00:04:02,670
so they will go into the red cluster.
89
00:04:02,670 --> 00:04:04,863
That's the only change in the whole graph.
90
00:04:05,730 --> 00:04:08,340
In the end, we recalculate the centroids
91
00:04:08,340 --> 00:04:09,390
and reach a situation
92
00:04:09,390 --> 00:04:11,430
where no more adjustments are necessary
93
00:04:11,430 --> 00:04:13,770
using the K-means algorithm.
94
00:04:13,770 --> 00:04:16,173
We have reached a three-cluster solution.
95
00:04:17,010 --> 00:04:18,469
This is the exact algorithm
96
00:04:18,469 --> 00:04:20,490
which was used to find the solution
97
00:04:20,490 --> 00:04:23,190
of the problem you saw at the beginning of the lesson.
98
00:04:24,270 --> 00:04:26,100
Here's a Python-generated graph
99
00:04:26,100 --> 00:04:28,530
with the three clusters colored.
100
00:04:28,530 --> 00:04:31,710
I'm sorry they're not the same, but you get the point.
101
00:04:31,710 --> 00:04:32,850
That's how we would usually
102
00:04:32,850 --> 00:04:35,610
represent the clusters graphically.
103
00:04:35,610 --> 00:04:36,630
Great.
104
00:04:36,630 --> 00:04:39,540
I think we have a good basis to start coding.
105
00:04:39,540 --> 00:04:40,940
See you in the next lecture.