008 Pros and Cons of K-Means Clustering (English transcript)

Instructor: It wouldn't be data science if there wasn't this very important topic: problems with, issues with, or limitations of X. Well, let's look at the pros and cons of K-means clustering.

The pros are already known to you, even if you don't realize it. It is simple to understand and fast to cluster. Moreover, there are many packages that offer it, so implementation is effortless. Finally, K-means clustering always yields a result. No matter the data, it will always spit out a solution, which is great.

Time for the cons. We will dig a bit into them, as they are very interesting to explore. Moreover, this lecture will solidify your understanding like no other.

The first con is that we need to pick K. As we already saw, the elbow method fixes that, but it is not extremely scientific per se.

Second, K-means is sensitive to initialization. That's a very interesting problem. Say that these are our points. If we randomly choose the centroids here and here, the obvious solution is one top cluster and one bottom cluster. However, clustering the points on the left in one cluster and those on the right in another is a more appropriate solution.

Now imagine the same situation, but with much more widely spread points. Guess what? Given the same initial seeds, we get the same clusters, because that's how K-means works: it takes the closest points to the seeds. So if your initial seeds are problematic, the whole solution is meaningless.

The remedy is simple. It is called K-means++. The idea is that a preliminary iterative algorithm is run prior to K-means to determine the most appropriate seeds for the clustering itself. If we go back to our code, we will see that sklearn employs K-means++ by default, so we are safe here. But if you are using a different package, remember that initialization matters.
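To make the first two cons concrete, here is a minimal sketch of the elbow method in scikit-learn, which the course relies on. The make_blobs toy data, the 1-10 range of K values, and the variable names are illustrative assumptions rather than the course's own code. KMeans really does use init='k-means++' by default, as mentioned above; it is spelled out here only for emphasis.

```python
# A minimal sketch of the elbow method with scikit-learn.
# The blobs dataset and the range of K values are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the course's country example
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Within-cluster sum of squares (inertia) for each candidate K
wcss = []
for k in range(1, 11):
    # init='k-means++' is the sklearn default; shown explicitly because
    # other packages may default to purely random seeds
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Look for the "elbow" where the curve stops dropping sharply
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS (inertia)')
plt.show()
```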
A third major problem is that K-means is sensitive to outliers. What does this mean? Well, if there is a single point that is too far away from the rest, it will always be placed in its own one-point cluster. Have we already experienced that? Well, of course we have. Australia ended up in its own one-country cluster in almost all the solutions we had for our country clusters example. It is so far away from the rest of the countries that it is destined to be in its own cluster. The remedy? Just get rid of outliers prior to clustering. Alternatively, if you do the clustering and spot one-point clusters, remove them and cluster again.

A fourth con: K-means produces spherical solutions. This means that on a 2D plane like the one we have seen, we would more often see clusters that look like circles rather than elliptic shapes. The reason is that we are using the Euclidean distance from the centroid. This is also why outliers are such a big issue for K-means.

Finally, we have standardization. Oh, good old standardization. Let's leave that for the next lesson, shall we? Thanks for watching.
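As a footnote to the outlier discussion, here is one way the second remedy could look in code: cluster, detect one-point clusters, drop those points, and cluster again. The planted outlier, the choice of K, and the singleton threshold are all illustrative assumptions, not steps taken in the course.

```python
# A sketch of the "remove one-point clusters and re-cluster" remedy.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)
# Plant an obvious outlier, playing the role of Australia
X = np.vstack([X, [[50.0, 50.0]]])

labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Find clusters that captured only a single point
values, counts = np.unique(labels, return_counts=True)
singletons = values[counts == 1]

# Drop those points and cluster the remaining data again
mask = ~np.isin(labels, singletons)
labels_clean = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[mask])
```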

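The spherical-solutions point can also be seen directly in the assignment step of K-means: every point joins the centroid that is nearest in Euclidean distance, so cluster boundaries sit equidistant between centroids and the resulting regions come out circular in 2D. Below is a bare-bones sketch of that step, with made-up points and seeds rather than anything from the course.

```python
# The K-means assignment step in isolation: each point joins the
# cluster whose centroid is closest in Euclidean distance, which is
# why the resulting regions are spherical (circular in 2D).
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])  # made-up data
centroids = np.array([[0.0, 0.5], [10.0, 10.0]])          # made-up seeds

# Pairwise Euclidean distances, shape (n_points, n_centroids)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Assign each point to its nearest centroid
assignments = distances.argmin(axis=1)
print(assignments)  # [0 0 1]
```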