All language subtitles for 002 A Simple Example of Clustering_en

Narrator: Remember the first lecture of this section? We gave an example about six countries: USA, Canada, France, UK, Germany, and Australia. Well, guess what! This was not just for illustrative purposes. In fact, we are going to cluster these countries using K-Means in Python. Plus, we'll learn a couple of nice tricks along the way. Cool.

Let's import the relevant libraries. They are pandas, numpy, matplotlib.pyplot, and seaborn. As usual, I will set the style of all graphs to the seaborn one. In this course, we will rely on scikit-learn for the actual clustering. Let's import KMeans from sklearn.cluster. Note that both the K and the M in KMeans are capital.

Next, we will create a variable called data where we will load the CSV file, 3.01 Country clusters. Let's see what's inside! We've got Country, Latitude, Longitude, and Language. Let's see how we gathered that data. Country and Language are clear. What about the latitude and longitude values?
These entries correspond to the geographic centers of the countries in our data set. That is one way to represent location. I'll quickly give an example. If you google "geographic center of US", you'll get a Wikipedia article indicating it to be some point in South Dakota with a latitude of 44 degrees and 58 minutes north and a longitude of 103 degrees and 46 minutes west. Then, we can convert them to decimal degrees using some online converter like the one provided by LatLong.net. It's important to know that the convention is such that north and east are positive while west and south are negative. So 44 degrees 58 minutes north becomes roughly 44.97, and 103 degrees 46 minutes west becomes roughly -103.77. Okay.

So, that's what we did. We got the decimal degrees of the geographic centers of the countries in the sample. That's not optimal, as the choice of South Dakota was biased by Alaska and Hawaii, but you'll see that that won't matter too much for the clustering. Right.

Let's quickly plot the data. If we want our data to resemble a map, we must set the axes to reflect the natural domain of latitude and longitude: -90 to 90 degrees for latitude and -180 to 180 degrees for longitude. Done!
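The conversion and the plot just described can be sketched roughly as follows. This is only a minimal sketch, not the lecture's notebook: dms_to_decimal is a hypothetical helper name, the data frame is typed in by hand instead of being read from the lecture's CSV file, and every coordinate except the US one quoted above is a rounded approximation used for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def dms_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to decimal degrees (hypothetical helper).

    North and east are positive; south and west are negative.
    """
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60)


# Geographic center of the US as quoted in the lecture.
us_lat = round(dms_to_decimal(44, 58, "N"), 2)
us_lon = round(dms_to_decimal(103, 46, "W"), 2)

# Assumption: hand-typed, approximate geographic centers, standing in
# for the contents of the lecture's CSV file.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [us_lat, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [us_lon, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})

# Longitude on the x-axis and latitude on the y-axis, with limits set to
# their natural domains so the scatter plot resembles a world map.
fig, ax = plt.subplots()
ax.scatter(data["Longitude"], data["Latitude"])
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
# In a script, call plt.show() to display the figure.
```

In a notebook the figure appears right under the cell. The fixed axis limits are what keep the six points in roughly the places you would expect on a map.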
If I put the actual map next to this one, you will quickly notice that this methodology, while simple, is not bad at all.

All right, let's do some clustering. As we did earlier, our inputs will be contained in a variable called x. We will start by clustering based on location. So, we want x to contain the latitude and the longitude. I'll use the pandas indexer iloc. We haven't mentioned it before, so you probably don't know it, but iloc slices a data frame by integer position. The first argument indicates the row indices we want to keep, while the second indicates the column indices. I want to keep all rows, so I'll put a colon as the first argument. Okay.

Remember that pandas indices start from zero. From the columns, I need latitude and longitude, or columns one and two. So, the appropriate argument is 1:3. The end of the range is excluded, so this will slice columns one and two out of the data frame. Let's print x to see the result. Exactly as we wanted it.

Next, I'll declare a variable called kmeans. kmeans is equal to KMeans, with a capital K and a capital M, followed by 2 in parentheses.
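The slicing and the declaration just described might look like this. It is a minimal sketch under one assumption: the data frame is typed in by hand with approximate coordinates, not loaded from the lecture's CSV file.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: hand-typed, approximate values standing in for the CSV file.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})

# Keep all rows (:) and the columns at positions 1 and 2.
# The slice 1:3 stops before position 3, so Language is left out.
x = data.iloc[:, 1:3]
print(x)

# The first argument of KMeans is the number of clusters.
kmeans = KMeans(2)
```

Writing KMeans(n_clusters=2) is equivalent and a little easier to read.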
The right side is actually the KMeans class that we imported from sklearn. The value in parentheses is the number of clusters we want to produce. So, our variable kmeans is now an object which we will use for the clustering itself. Similar to what we've seen with regressions, the clustering itself happens using the fit method: kmeans.fit(x). That's all we need to write. This line of code will apply k-means clustering with two clusters to the input data from x. The output indicates that the clustering has been completed with the following parameters.

Usually, though, we don't just need to perform the clustering; we are interested in the clusters themselves. We can obtain the predicted clusters for each observation using the fit_predict method. Let's declare a new variable called identified_clusters, equal to kmeans.fit_predict with input x. I'll also print this variable. The result is an array containing the predicted clusters. There are two clusters, indicated by zero and one.
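Fitting the model and reading off the cluster of each observation can be sketched as below. The hand-typed data frame with approximate coordinates is an assumption, and so are the n_init and random_state arguments, which the lecture does not mention; they are added here only to make repeated runs behave the same way.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: hand-typed, approximate values standing in for the CSV file.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})
x = data.iloc[:, 1:3]

# n_init and random_state are not from the lecture; they are assumptions
# added for reproducibility.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# fit performs the clustering and returns the fitted object.
kmeans.fit(x)

# fit_predict performs the clustering and returns one label per row.
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)
```

The label numbers themselves are arbitrary, so which group is called 0 and which is called 1 may differ from what the lecture shows. What matters is which observations share a label.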
You can clearly see that the first five observations are in the same cluster, zero, while the last one is in cluster one. Okay.

Let's create a data frame so we can see things more clearly. I'll call this data frame data_with_clusters, and it will be equal to data. Then, I'll add an additional column to it called Cluster, equal to identified_clusters. As you can see, we have our table with the countries, latitude, longitude, language, but also cluster. It seems that the USA, Canada, France, UK, and Germany are in cluster zero, while Australia is alone in cluster one. Cool!

Finally, let's plot all this on a scatter plot. In order to resemble the map of the world, the x-axis will be the longitude while the y-axis will be the latitude. But that's the same graph as before, isn't it? Let's use the first trick. In matplotlib, we can set the color to be determined by a variable. In our case, that will be Cluster. Let's write c=data_with_clusters['Cluster']. We have just indicated that we want to have as many colors for the points as there are clusters.
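The table with the extra Cluster column might be built like this. It is a minimal sketch with assumptions: the data frame is hand-typed with approximate coordinates, n_init and random_state are added for reproducibility, and copy() is used so that the new column does not also show up in the original data frame.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: hand-typed, approximate values standing in for the CSV file.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})
x = data.iloc[:, 1:3]

# n_init and random_state are assumptions, not part of the lecture.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
identified_clusters = kmeans.fit_predict(x)

# Assumption: work on a copy so that `data` itself is left untouched.
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = identified_clusters
print(data_with_clusters)
```

Without copy(), both names would refer to the same data frame, and adding the Cluster column would change data as well.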
The default color map is not so pretty. So, I'll set the color map to rainbow: cmap='rainbow'. Okay. We can see the two clusters. One is purple and the other is red. And that's how we perform KMeans clustering!

What if we wanted to have three clusters? Well, we can go back to the line where we specified the desired number of clusters and change that to three. Let's run all cells. There are three clusters, as wanted: zero, one, and two. From the data frame, we can see that USA and Canada are in the same cluster, France, UK, and Germany are in another, and Australia is alone once again. What about the visualization? There are three colors representing the three different clusters.

Great work! It seems that clustering is not that hard after all. In the next lesson, we will cluster the observations based on a categorical feature. Thanks for watching!
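Putting the pieces together for three clusters, the colored scatter plot can be sketched as follows. As before, this is an illustration under assumptions: the data frame is hand-typed with approximate coordinates, and n_init, random_state, and copy() are additions that the lecture does not mention.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumption: hand-typed, approximate values standing in for the CSV file.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})
x = data.iloc[:, 1:3]

# Same steps as with two clusters; only the number of clusters changes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = kmeans.fit_predict(x)

# c takes one value per point; cmap maps those values to colors.
fig, ax = plt.subplots()
ax.scatter(
    data_with_clusters["Longitude"],
    data_with_clusters["Latitude"],
    c=data_with_clusters["Cluster"],
    cmap="rainbow",
)
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
# In a script, call plt.show() to display the figure.
```

Changing n_clusters back to 2 and rerunning gives the two-color version of the same plot.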