All language subtitles for 001 K-Means Clustering_en

Instructor: Hey again. We finished the last lecture with this graph. It shows 1,000 points and their centroid. In cluster analysis, that's how a cluster would look in two-dimensional space. There are two dimensions, or two features, based on which we are performing clustering. For instance, the age and money spent from our earlier example.

Certainly it makes no sense to have only one cluster, so let me zoom out of this graph. Here's a nice picture of clusters. We can clearly see two clusters. I'll also indicate their centroids. If we want to identify three clusters, this is the result we obtain, and that's more or less how clustering works graphically.

Okay, how do we perform clustering in practice? There are different methods we can apply to identify clusters. The most popular one is K-means, so that's where we will start. Let's simplify this scatter to 15 points, so we can get a better grasp of what happens. Cool. Here's how K-means works.
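As a minimal sketch of the "centroid" the instructor keeps pointing at: it is just the per-feature mean of the points in a cluster. The age and money-spent values below are made up for illustration, not the lecture's actual data.

```python
import numpy as np

# Hypothetical two-feature data: [age, money spent] (values made up)
points = np.array([[25, 120],
                   [30,  80],
                   [35, 160],
                   [40,  40]], dtype=float)

# The centroid is simply the mean of each feature across the points
centroid = points.mean(axis=0)
print(centroid)  # [ 32.5 100. ]
```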
First, we must choose how many clusters we'd like to have. That's where this method gets its name from: K stands for the number of clusters we are trying to identify. I'll start with two clusters.

The next step is to specify the cluster seeds. A seed is basically a starting centroid. It is chosen at random or specified by the data scientist based on prior knowledge about the data. One of the clusters will be the green cluster; the other one, the orange cluster. These are the seeds.

The following step is to assign each point on the graph to a seed, which is done based on proximity. For instance, this point is closer to the green seed than to the orange one; therefore, it will belong to the green cluster. This point, on the other hand, is closer to the orange seed; therefore, it will be part of the orange cluster. In this way, we can color all points on the graph based on their Euclidean distance from the seeds. Great. The final step is to calculate the centroid of the green points and the orange points.
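The assignment step described here, coloring each point by whichever seed is nearest in Euclidean distance, can be sketched in a few lines of Python. The coordinates below are made up for illustration.

```python
import numpy as np

def assign_to_seeds(points, seeds):
    """Return, for each point, the index of its nearest seed (Euclidean distance)."""
    # Pairwise distances: shape (n_points, n_seeds) via broadcasting
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Made-up points: two near the "green" seed, two near the "orange" seed
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
seeds = np.array([[1.0, 1.5],   # green seed
                  [8.5, 8.0]])  # orange seed

labels = assign_to_seeds(points, seeds)
print(labels)  # [0 0 1 1] -- first two points go green, last two go orange
```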
The green seed will move closer to the green points to become their centroid, and the orange will do the same for the orange points. From here, we repeat the last two steps.

Let's recalculate the distances. All the green points are obviously closer to the green centroid, and the orange points are closer to the orange centroid. What about these two? Both of them are closer to the green centroid, so at this step, we reassign them to the green cluster. Finally, we must recalculate the centroids. That's the new result. Now, all the green points are closest to the green centroid and all the orange ones to the orange. We can no longer reassign points, which completes the clustering process. This is the two-cluster solution.

All right, so that's the whole idea behind clustering. In order to solidify your understanding, we will redo the process. In the beginning, we said that with K-means clustering, we must specify the number of clusters prior to clustering, right? What if we want to obtain three clusters? The first step involves selecting the seeds. Let's have another seed.
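The full loop just described — assign points to the nearest centroid, recompute the centroids, and stop once no point changes cluster — can be put together as a short sketch (the sample points and seeds are made up):

```python
import numpy as np

def kmeans(points, seeds, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points; stop when assignments stop changing."""
    centroids = seeds.astype(float).copy()
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed cluster, so the process is complete
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members):  # leave a centroid in place if its cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

pts = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
seeds = np.array([[0, 0], [5, 5]], dtype=float)
labels, centroids = kmeans(pts, seeds)
print(labels)  # the two tight groups end up in separate clusters
```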
We'll use red for this one. Next, we must associate each of the points with the closest seed. Finally, we calculate the centroids of the colored points.

We already know that K-means is an iterative process, so we go back to the step where we associate each of the points with the closest seed. All orange points are settled, so no movement there. What about these two points? Now they're closer to the red seed, so they will go into the red cluster. That's the only change in the whole graph. In the end, we recalculate the centroids and reach a situation where no more adjustments are necessary using the K-means algorithm. We have reached a three-cluster solution.

This is the exact algorithm that was used to find the solution to the problem you saw at the beginning of the lesson. Here's a Python-generated graph with the three clusters colored. I'm sorry they're not the same, but you get the point. That's how we would usually represent the clusters graphically. Great. I think we have a good basis to start coding. See you in the next lecture.
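A Python-generated plot like the one shown could be produced with scikit-learn's `KMeans` (assuming scikit-learn is installed). The three blob locations below are synthetic stand-ins, not the lecture's actual data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs standing in for the lecture's data (made-up locations)
data = np.vstack([rng.normal(loc, 1.0, size=(100, 2))
                  for loc in ([0, 0], [6, 1], [3, 6])])

# n_init restarts the algorithm from several random seeds and keeps the best run,
# which softens the dependence on where the initial seeds land
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# model.labels_ gives each point's cluster (for coloring a scatter plot) and
# model.cluster_centers_ gives the three centroids to mark on it
print(model.cluster_centers_.round(1))
```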
