002 A Simple Example of Clustering (English subtitles)

Narrator: Remember the first lecture of this section? We gave an example about six countries: USA, Canada, France, UK, Germany, and Australia. Well, guess what! This was not just for illustrative purposes. In fact, we are going to cluster these countries using K-Means in Python. Plus, we'll learn a couple of nice tricks along the way.

Cool. Let's import the relevant libraries. They are pandas, numpy, matplotlib.pyplot, and seaborn. As usual, I will set the style of all graphs to the seaborn one. In this course, we will rely on scikit-learn for the actual clustering. Let's import KMeans from sklearn.cluster. Note that both the K and the M in KMeans are capital.

Next, we will create a variable called data where we will load the CSV file, 3.01 Country clusters. Let's see what's inside! We've got Country, Latitude, Longitude, and Language. Let's see how we gathered that data. Country and Language are clear. What about the latitude and longitude values?
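The loading step can be sketched as follows. This is a minimal sketch, not the course notebook: the CSV is replaced by an in-memory stand-in so the example runs on its own, and only the USA coordinates follow the lecture. The other coordinates are rough illustrative values and should be treated as assumptions.

```python
import io

import pandas as pd

# The lecture also imports numpy, matplotlib.pyplot and seaborn, plus
# KMeans from sklearn.cluster; none of them is needed for this loading step.

# Stand-in for the course CSV file. Only the USA row follows the lecture;
# the remaining coordinates are rough, illustrative assumptions.
csv_text = """Country,Latitude,Longitude,Language
USA,44.97,-103.77,English
Canada,62.40,-96.80,English
France,46.75,2.40,French
UK,54.01,-2.53,English
Germany,51.15,10.40,German
Australia,-25.45,133.11,English
"""

# With the real file you would pass its path to pd.read_csv instead.
data = pd.read_csv(io.StringIO(csv_text))
print(data)
```

Printing the data frame shows the four columns the narrator lists: Country, Latitude, Longitude, and Language.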
These entries correspond to the geographic centers of the countries in our data set. That is one way to represent location. I'll quickly give an example. If you google "geographic center of US", you'll get a Wikipedia article indicating it to be some point in South Dakota with a latitude of 44 degrees and 58 minutes north and a longitude of 103 degrees and 46 minutes west. Then, we can convert them to decimal degrees using some online converter like the one provided by LatLong.net. It's important to know that the convention is such that north and east are positive while west and south are negative.

Okay. So, that's what we did. We got the decimal degrees of the geographic centers of the countries in the sample. That's not optimal, as the choice of South Dakota was biased by Alaska and Hawaii, but you'll see that that won't matter too much for the clustering.

Right. Let's quickly plot the data. If we want our data to resemble a map, we must set the axes to reflect the natural domain of latitude and longitude, that is, -90 to 90 degrees for latitude and -180 to 180 degrees for longitude. Done!
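The conversion to decimal degrees can also be done by hand instead of with an online converter. The helper below is a hypothetical function written for this illustration (it is not part of the lecture); it applies the sign convention described above to the South Dakota coordinates.

```python
def dms_to_decimal(degrees, minutes, hemisphere, seconds=0.0):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # North and east are positive; south and west are negative.
    return -value if hemisphere.upper() in ("S", "W") else value


# Geographic center of the US as quoted in the lecture: 44°58' N, 103°46' W.
latitude = dms_to_decimal(44, 58, "N")
longitude = dms_to_decimal(103, 46, "W")
print(round(latitude, 2), round(longitude, 2))  # 44.97 -103.77
```

One minute is a sixtieth of a degree, so 58 minutes adds about 0.97 degrees and 46 minutes adds about 0.77 degrees; the western longitude then takes a negative sign.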
If I put the actual map next to this one, you will quickly notice that this methodology, while simple, is not bad at all.

All right, let's do some clustering. As we did earlier, our inputs will be contained in a variable called x. We will start by clustering based on location. So, we want x to contain the latitude and the longitude. I'll use the pandas indexer iloc. We haven't mentioned it before, so you may not know it, but iloc slices a data frame by integer position. The first argument indicates the row indices we want to keep, while the second indicates the column indices. I want to keep all rows, so I'll put a colon as the first argument.

Okay. Remember that pandas indices start from zero. From the columns, I need latitude and longitude, or columns one and two. So, the appropriate argument is 1:3. This will slice columns one and two out of the data frame; the end of the range, three, is not included. Let's print x to see the result. Exactly as we wanted it.

Next, I'll declare a variable called kmeans. kmeans is equal to KMeans, with a capital K, a capital M, and lowercase e-a-n-s, followed by two in parentheses: KMeans(2).
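The slicing step described above can be sketched like this. The data frame is rebuilt inline so the example is self-contained; apart from the USA row, the coordinates are illustrative assumptions rather than the course data.

```python
import pandas as pd

# Illustrative stand-in for the course data (only the USA row follows the lecture).
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})

# ":" keeps every row; "1:3" keeps the columns at positions 1 and 2
# (position 3 is excluded, as with any Python slice).
x = data.iloc[:, 1:3]
print(x)
```

Because iloc works on positions rather than labels, the same slice would select different columns if the column order of the file changed.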
The right side is actually the KMeans class that we imported from sklearn. The value in parentheses is the number of clusters we want to produce. So, our variable kmeans is now an object which we will use for the clustering itself. Similar to what we've seen with regressions, the clustering itself happens using the fit method: kmeans.fit(x). That's all we need to write. This line of code will apply k-means clustering with two clusters to the input data from x. The output indicates that the clustering has been completed with the following parameters.

Usually, though, we don't just need to perform the clustering; we are interested in the clusters themselves. We can obtain the predicted clusters for each observation using the fit_predict method. Let's declare a new variable called identified_clusters, equal to kmeans.fit_predict with input x. I'll also print this variable. The result is an array containing the predicted clusters. There are two clusters, indicated by zero and one.
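A minimal sketch of the fitting step follows. The input values are illustrative assumptions (only the USA row follows the lecture), and n_init and random_state are arguments added here for reproducibility; the lecture only passes the number of clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative latitude/longitude inputs in the order
# USA, Canada, France, UK, Germany, Australia.
x = pd.DataFrame({
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})

# Two clusters, as in the lecture. n_init and random_state are extras
# added for this sketch so that repeated runs give the same result.
kmeans = KMeans(2, n_init=10, random_state=0)
kmeans.fit(x)

# fit_predict fits the model and returns one cluster label per observation.
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)
```

The label numbers themselves are arbitrary, so a different run or library version may swap which group is called zero and which is called one.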
You can clearly see that the first five observations are in the same cluster, zero, while the last one is in cluster one.

Okay. Let's create a data frame so we can see things more clearly. I'll call this data frame data_with_clusters and it will be equal to data. Then, I'll add an additional column to it called Cluster, equal to identified_clusters. As you can see, we have our table with the countries, latitude, longitude, language, but also cluster. It seems that the USA, Canada, France, UK, and Germany are in cluster zero, while Australia is alone in cluster one.

Cool! Finally, let's plot all this on a scatter plot. In order to resemble the map of the world, the x-axis will be the longitude, while the y-axis will be the latitude. But that's the same graph as before, isn't it? Let's use the first trick. In matplotlib, we can set the color to be determined by a variable. In our case, that will be Cluster. Let's write c=data_with_clusters['Cluster']. We have just indicated that we want to have as many colors for the points as there are clusters.
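The table and the colored scatter plot can be sketched as below. The coordinates are illustrative assumptions, the cluster labels are typed in as the lecture reports them rather than recomputed, and the copy() call and the Agg backend are choices made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the course data (only the USA row follows the lecture).
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})

# Cluster labels as reported in the lecture: five countries in 0, Australia in 1.
identified_clusters = [0, 0, 0, 0, 0, 1]

# Work on a copy so that the original data frame stays unchanged.
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = identified_clusters

fig, ax = plt.subplots()
# Longitude on the x-axis and latitude on the y-axis, colored by cluster.
ax.scatter(
    data_with_clusters["Longitude"],
    data_with_clusters["Latitude"],
    c=data_with_clusters["Cluster"],
    cmap="rainbow",
)
# Natural domain of the coordinates, so the plot resembles a world map.
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
```

In a notebook the figure is displayed automatically; in a script you would drop the Agg line and call plt.show().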
The default color map is not so pretty. So, I'll set the color map to rainbow: cmap='rainbow'. Okay. We can see the two clusters. One is purple and the other is red. And that's how we perform KMeans clustering!

What if we wanted to have three clusters? Well, we can go back to the line where we specified the desired number of clusters and change that to three. Let's run all cells. There are three clusters, as wanted: zero, one, and two. From the data frame, we can see that USA and Canada are in the same cluster, France, UK, and Germany are in another, and Australia is alone, once again. What about the visualization? There are three colors representing the three different clusters.

Great work! It seems that clustering is not that hard after all. In the next lesson, we will cluster the observations based on a categorical feature. Thanks for watching!
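Rerunning with three clusters only requires changing the argument passed to KMeans. The sketch below again uses illustrative coordinates (only the USA row follows the lecture), and n_init and random_state are additions for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative stand-in for the course data.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})
x = data.iloc[:, 1:3]

# Same steps as before, with the number of clusters changed from 2 to 3.
kmeans = KMeans(3, n_init=10, random_state=0)
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = kmeans.fit_predict(x)

# List the countries that ended up in each cluster.
for label, group in data_with_clusters.groupby("Cluster"):
    print(label, list(group["Country"]))
```

With the lecture's data the narrator reports a North American group, a European group, and Australia on its own; which label number each group receives is arbitrary.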
