1
00:00:00,630 --> 00:00:02,250
Instructor: It wouldn't be data science
2
00:00:02,250 --> 00:00:04,950
if there weren't this very important topic,
3
00:00:04,950 --> 00:00:09,420
problems with, issues with, or limitations of X.
4
00:00:09,420 --> 00:00:13,113
Well, let's look at the pros and cons of K-means clustering.
5
00:00:14,160 --> 00:00:15,780
The pros are already known to you
6
00:00:15,780 --> 00:00:17,790
even if you don't realize it.
7
00:00:17,790 --> 00:00:21,060
It is simple to understand and fast to cluster.
8
00:00:21,060 --> 00:00:23,760
Moreover, there are many packages that offer it,
9
00:00:23,760 --> 00:00:26,580
so implementation is effortless.
10
00:00:26,580 --> 00:00:29,820
Finally, K-means clustering always yields a result.
11
00:00:29,820 --> 00:00:32,670
No matter the data, it will always spit out a solution
12
00:00:32,670 --> 00:00:33,603
which is great.
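A minimal sketch of how little code this takes, assuming scikit-learn and some numeric 2-D data (the array here is just an example): whatever you feed in, the fit returns a label for every point.

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 2)            # any numeric data will do
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
print(kmeans.labels_)                     # a cluster label for every point
print(kmeans.cluster_centers_)            # the three centroids found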
13
00:00:34,800 --> 00:00:36,840
Time for the cons.
14
00:00:36,840 --> 00:00:38,190
We will dig a bit into them
15
00:00:38,190 --> 00:00:41,160
as they are very interesting to explore.
16
00:00:41,160 --> 00:00:44,700
Moreover, this lecture will solidify your understanding
17
00:00:44,700 --> 00:00:45,843
like no other.
18
00:00:46,710 --> 00:00:49,650
The first con is that we need to pick K.
19
00:00:49,650 --> 00:00:52,680
As we already saw, the elbow method fixes that,
20
00:00:52,680 --> 00:00:55,383
but it is not extremely scientific per se.
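As a rough sketch of the elbow method (assuming scikit-learn, matplotlib, and a feature matrix called data), we fit K-means for several values of K and look for the bend in the within-cluster sum of squares:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(data)
    wcss.append(kmeans.inertia_)          # within-cluster sum of squares

plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()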
21
00:00:56,911 --> 00:01:00,540
Second, K-means is sensitive to initialization.
22
00:01:00,540 --> 00:01:03,090
That's a very interesting problem.
23
00:01:03,090 --> 00:01:05,010
Say that these are our points.
24
00:01:05,010 --> 00:01:08,790
If we randomly choose the centroids here and here,
25
00:01:08,790 --> 00:01:11,190
the obvious solution is one top cluster
26
00:01:11,190 --> 00:01:13,170
and one bottom cluster.
27
00:01:13,170 --> 00:01:16,350
However, clustering the points on the left in one cluster
28
00:01:16,350 --> 00:01:17,820
and those on the right in another
29
00:01:17,820 --> 00:01:19,713
is a more appropriate solution.
30
00:01:20,970 --> 00:01:22,950
Now imagine the same situation,
31
00:01:22,950 --> 00:01:26,100
but with much more widely spread points.
32
00:01:26,100 --> 00:01:27,120
Guess what?
33
00:01:27,120 --> 00:01:28,830
Given the same initial seeds,
34
00:01:28,830 --> 00:01:33,000
we get the same clusters because that's how K-means works.
35
00:01:33,000 --> 00:01:35,940
It takes the closest points to the seeds.
36
00:01:35,940 --> 00:01:38,670
So if your initial seeds are problematic,
37
00:01:38,670 --> 00:01:41,550
the whole solution is meaningless.
38
00:01:41,550 --> 00:01:46,380
The remedy is simple. It is called K-means++.
39
00:01:46,380 --> 00:01:49,860
The idea is that a preliminary iterative algorithm is run
40
00:01:49,860 --> 00:01:53,310
prior to K-means to determine the most appropriate seeds
41
00:01:53,310 --> 00:01:55,380
for the clustering itself.
42
00:01:55,380 --> 00:01:56,880
If we go back to our code,
43
00:01:56,880 --> 00:02:01,880
we will see that sklearn employs K-means++ by default,
44
00:02:01,920 --> 00:02:03,750
so we are safe here,
45
00:02:03,750 --> 00:02:05,640
but if you are using a different package,
46
00:02:05,640 --> 00:02:08,013
remember that initialization matters.
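A small sketch of where that choice lives in scikit-learn (the feature matrix data is assumed): the default init='k-means++' spreads the seeds out, while init='random' with a single run is the risky situation described above.

from sklearn.cluster import KMeans

safe = KMeans(n_clusters=2, init='k-means++', n_init=10).fit(data)
risky = KMeans(n_clusters=2, init='random', n_init=1,
               random_state=0).fit(data)
print(safe.inertia_, risky.inertia_)      # the single random run can end up much worse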
47
00:02:10,500 --> 00:02:11,880
A third major problem
48
00:02:11,880 --> 00:02:14,910
is that K-means is sensitive to outliers.
49
00:02:14,910 --> 00:02:16,080
What does this mean?
50
00:02:16,080 --> 00:02:17,970
Well, if there is a single point
51
00:02:17,970 --> 00:02:20,040
that is too far away from the rest,
52
00:02:20,040 --> 00:02:23,730
it will always be placed in its own one-point cluster.
53
00:02:23,730 --> 00:02:25,770
Have we already experienced that?
54
00:02:25,770 --> 00:02:27,810
Well, of course we have.
55
00:02:27,810 --> 00:02:31,140
Australia ended up in its own one-point cluster in almost all the solutions
56
00:02:31,140 --> 00:02:33,900
we had for our country clusters example.
57
00:02:33,900 --> 00:02:36,330
It is so far away from the rest of the countries
58
00:02:36,330 --> 00:02:39,180
that it is destined to be in its own cluster.
59
00:02:39,180 --> 00:02:42,990
The remedy: just get rid of outliers prior to clustering.
60
00:02:42,990 --> 00:02:45,180
Alternatively, if you do the clustering
61
00:02:45,180 --> 00:02:49,023
and spot one-point clusters, remove them and cluster again.
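A sketch of that second remedy, assuming scikit-learn, NumPy, and a feature matrix data (the cluster counts are only illustrative): cluster once, drop any one-point clusters, then cluster the remaining observations again.

import numpy as np
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=4, random_state=0).fit_predict(data)
counts = np.bincount(labels)
keep = np.isin(labels, np.where(counts > 1)[0])   # mask out one-point clusters
cleaned = data[keep]
new_labels = KMeans(n_clusters=3, random_state=0).fit_predict(cleaned)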
62
00:02:51,390 --> 00:02:55,560
A fourth con: K-means produces spherical solutions.
63
00:02:55,560 --> 00:02:58,290
This means that on a 2D plane, like the ones we have seen,
64
00:02:58,290 --> 00:03:01,170
we would more often see clusters that look like circles
65
00:03:01,170 --> 00:03:03,330
rather than elliptical shapes.
66
00:03:03,330 --> 00:03:06,270
The reason for that is that we are using Euclidean distance
67
00:03:06,270 --> 00:03:07,710
from the centroid.
68
00:03:07,710 --> 00:03:11,643
This is also why outliers are such a big issue for K-means.
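A tiny NumPy-only sketch of the assignment rule behind those shapes (the point and centroids are made up): every observation goes to the centroid with the smallest Euclidean distance, which is why the resulting clusters tend to be round.

import numpy as np

point = np.array([1.0, 2.0])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
distances = np.linalg.norm(centroids - point, axis=1)
assigned = np.argmin(distances)           # index of the nearest centroid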
69
00:03:14,010 --> 00:03:16,770
Finally, we have standardization.
70
00:03:16,770 --> 00:03:19,530
Oh, good old standardization.
71
00:03:19,530 --> 00:03:22,350
Let's leave that for the next lesson, shall we?
72
00:03:22,350 --> 00:03:23,350
Thanks for watching.