1
00:00:00,990 --> 00:00:03,570
Narrator: Remember the first lecture of this section?
2
00:00:03,570 --> 00:00:05,860
We gave an example about six countries,
3
00:00:05,860 --> 00:00:10,860
USA, Canada, France, UK, Germany, and Australia.
4
00:00:11,730 --> 00:00:13,470
Well, guess what!
5
00:00:13,470 --> 00:00:16,440
This was not just for illustrative purposes.
6
00:00:16,440 --> 00:00:19,050
In fact, we are going to cluster these countries
7
00:00:19,050 --> 00:00:21,660
using K-Means in Python.
8
00:00:21,660 --> 00:00:25,374
Plus, we'll learn a couple of nice tricks along the way.
9
00:00:25,374 --> 00:00:26,730
Cool.
10
00:00:26,730 --> 00:00:29,640
Let's import the relevant libraries.
11
00:00:29,640 --> 00:00:34,640
They are pandas, numpy, matplotlib.pyplot, and seaborn.
12
00:00:36,570 --> 00:00:39,240
As usual, I will set the style of all graphs
13
00:00:39,240 --> 00:00:40,950
to the seaborn one.
14
00:00:40,950 --> 00:00:43,890
In this course, we will rely on scikit-learn
15
00:00:43,890 --> 00:00:45,900
for the actual clustering.
16
00:00:45,900 --> 00:00:50,430
Let's import KMeans from sklearn.cluster.
17
00:00:50,430 --> 00:00:54,725
Note that both the K and the M in KMeans are capital.
18
00:00:54,725 --> 00:00:58,620
Next, we will create a variable called data
19
00:00:58,620 --> 00:01:00,990
where we will load the csv file,
20
00:01:00,990 --> 00:01:03,960
3.01 Country clusters.
21
00:01:03,960 --> 00:01:05,313
Let's see what's inside!
22
00:01:06,180 --> 00:01:11,100
We've got Country, Latitude, Longitude, and Language.
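A minimal sketch of the loading step follows. The course file itself is not reproduced here, so an in-memory CSV stands in for it. Only the USA coordinates are taken from the lecture; every other value, including the Language entries, is an illustrative approximation and not the course data.

```python
import io

import pandas as pd

# Stand-in for the course file. Only the USA row follows the lecture
# (44°58' N, 103°46' W); all other values are illustrative assumptions.
csv_text = """Country,Latitude,Longitude,Language
USA,44.97,-103.77,English
Canada,62.40,-96.80,English
France,46.75,2.40,French
UK,54.01,-2.53,English
Germany,51.15,10.40,German
Australia,-25.45,133.11,English
"""

# With the real file, a path would be passed to pd.read_csv instead.
data = pd.read_csv(io.StringIO(csv_text))
print(data)
```

Reading from a string keeps the sketch self-contained; the call itself is the same as for a file on disk.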
23
00:01:11,100 --> 00:01:13,950
Let's see how we gathered that data.
24
00:01:13,950 --> 00:01:16,440
Country and Language are clear.
25
00:01:16,440 --> 00:01:19,230
What about the latitude and longitude values?
26
00:01:19,230 --> 00:01:21,870
These entries correspond to the geographic centers
27
00:01:21,870 --> 00:01:24,150
of the countries in our data set.
28
00:01:24,150 --> 00:01:26,833
That is one way to represent location.
29
00:01:26,833 --> 00:01:29,610
I'll quickly give an example.
30
00:01:29,610 --> 00:01:33,030
If you google geographic center of US,
31
00:01:33,030 --> 00:01:35,460
you'll get a Wikipedia article indicating it to be
32
00:01:35,460 --> 00:01:37,170
some point in South Dakota
33
00:01:37,170 --> 00:01:40,950
with a latitude of 44 degrees and 58 minutes north
34
00:01:40,950 --> 00:01:45,504
and a longitude of 103 degrees and 46 minutes west.
35
00:01:45,504 --> 00:01:48,480
Then, we can convert them to decimal degrees
36
00:01:48,480 --> 00:01:50,160
using some online converter
37
00:01:50,160 --> 00:01:52,413
like the one provided by LatLong.net.
38
00:01:53,340 --> 00:01:55,530
It's important to know that the convention is such
39
00:01:55,530 --> 00:01:57,600
that north and east are positive
40
00:01:57,600 --> 00:02:00,115
while west and south are negative.
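The conversion can also be done by hand instead of with an online converter. The helper below is a hypothetical sketch (the name `dms_to_decimal` is not from the lecture) that applies the sign convention just described to the South Dakota coordinates.

```python
def dms_to_decimal(degrees, minutes, hemisphere, seconds=0.0):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    North and east are positive; south and west are negative.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


# Geographic center of the US as quoted in the lecture.
latitude = dms_to_decimal(44, 58, "N")
longitude = dms_to_decimal(103, 46, "W")
print(round(latitude, 2), round(longitude, 2))  # 44.97 -103.77
```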
41
00:02:00,115 --> 00:02:01,023
Okay.
42
00:02:01,860 --> 00:02:03,360
So, that's what we did.
43
00:02:03,360 --> 00:02:05,880
We got the decimal degrees of the geographic centers
44
00:02:05,880 --> 00:02:08,070
of the countries in the sample.
45
00:02:08,070 --> 00:02:10,710
That's not optimal as the choice of South Dakota
46
00:02:10,710 --> 00:02:13,140
was biased by Alaska and Hawaii,
47
00:02:13,140 --> 00:02:15,270
but you'll see that that won't matter too much
48
00:02:15,270 --> 00:02:16,980
for the clustering.
49
00:02:16,980 --> 00:02:17,850
Right.
50
00:02:17,850 --> 00:02:20,370
Let's quickly plot the data.
51
00:02:20,370 --> 00:02:22,320
If we want our data to resemble a map,
52
00:02:22,320 --> 00:02:25,552
we must set the axis limits to reflect the natural domain
53
00:02:25,552 --> 00:02:27,783
of latitude and longitude.
54
00:02:28,620 --> 00:02:29,970
Done!
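One way to sketch this plot is shown below, with longitude on the horizontal axis and latitude on the vertical one, as on a conventional map. The coordinates are illustrative stand-ins for the course file (only the USA values follow the lecture), and the limits are the natural ranges of longitude and latitude.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in data; only the USA row follows the lecture.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})

fig, ax = plt.subplots()
ax.scatter(data["Longitude"], data["Latitude"])
ax.set_xlim(-180, 180)  # longitude spans -180 to 180 degrees
ax.set_ylim(-90, 90)    # latitude spans -90 to 90 degrees
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```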
55
00:02:29,970 --> 00:02:32,490
If I put the actual map next to this one,
56
00:02:32,490 --> 00:02:34,890
you will quickly notice that this methodology,
57
00:02:34,890 --> 00:02:37,598
while simple, is not bad at all.
58
00:02:37,598 --> 00:02:41,640
All right, let's do some clustering.
59
00:02:41,640 --> 00:02:44,340
As we did earlier, our inputs will be contained
60
00:02:44,340 --> 00:02:46,830
in a variable called X.
61
00:02:46,830 --> 00:02:49,890
We will start by clustering based on location.
62
00:02:49,890 --> 00:02:53,306
So, we want X to contain the latitude and the longitude.
63
00:02:53,306 --> 00:02:57,240
I'll use the pandas indexer, iloc.
64
00:02:57,240 --> 00:02:58,560
We haven't mentioned it before
65
00:02:58,560 --> 00:03:00,600
so it may be new to you,
66
00:03:00,600 --> 00:03:03,810
but iloc is an indexer that slices a data frame by position.
67
00:03:03,810 --> 00:03:06,120
The first argument indicates the row indices
68
00:03:06,120 --> 00:03:10,530
we want to keep while the second, the column indices.
69
00:03:10,530 --> 00:03:13,620
I want to keep all rows, so I'll put a colon
70
00:03:13,620 --> 00:03:14,823
as the first argument.
71
00:03:15,690 --> 00:03:16,950
Okay.
72
00:03:16,950 --> 00:03:20,220
Remember that pandas indices start from zero.
73
00:03:20,220 --> 00:03:23,250
From the columns, I need latitude and longitude
74
00:03:23,250 --> 00:03:25,620
or columns one and two.
75
00:03:25,620 --> 00:03:30,620
So, the appropriate argument is 1:3.
76
00:03:31,050 --> 00:03:33,930
This will slice columns one and two, counting from zero,
77
00:03:33,930 --> 00:03:35,043
out of the data frame.
78
00:03:36,030 --> 00:03:38,940
Let's print X to see the result.
79
00:03:38,940 --> 00:03:40,864
Exactly as we wanted it.
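The slicing step can be sketched as follows. The data frame is an illustrative stand-in for the course file (only the USA values follow the lecture). The point to note is that the end of a positional slice is excluded, so `1:3` keeps positions 1 and 2.

```python
import pandas as pd

# Illustrative stand-in data; only the USA row follows the lecture.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})

# ":" keeps every row; "1:3" keeps the columns at positions 1 and 2.
X = data.iloc[:, 1:3]
print(X)
```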
80
00:03:40,864 --> 00:03:45,063
Next, I'll declare a variable called kmeans.
81
00:03:45,900 --> 00:03:49,714
Kmeans is equal to capital K, capital M,
82
00:03:49,714 --> 00:03:53,883
and lowercase eans, brackets two.
83
00:03:55,020 --> 00:03:57,330
The right side is actually the KMeans class
84
00:03:57,330 --> 00:03:59,670
that we imported from sklearn.
85
00:03:59,670 --> 00:04:02,040
The value in brackets is the number of clusters
86
00:04:02,040 --> 00:04:03,930
we want to produce.
87
00:04:03,930 --> 00:04:06,900
So, our variable kmeans is now an object
88
00:04:06,900 --> 00:04:09,930
which we will use for the clustering itself.
89
00:04:09,930 --> 00:04:12,210
Similar to what we've seen with regressions,
90
00:04:12,210 --> 00:04:15,388
the clustering itself happens using the fit method.
91
00:04:15,388 --> 00:04:20,387
kmeans.fit of X.
92
00:04:20,490 --> 00:04:22,203
That's all we need to write.
93
00:04:23,040 --> 00:04:25,830
This line of code will apply k-means clustering
94
00:04:25,830 --> 00:04:29,670
with two clusters to the input data from X.
95
00:04:29,670 --> 00:04:31,440
The output indicates that the clustering
96
00:04:31,440 --> 00:04:33,890
has been completed with the following parameters.
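A minimal sketch of these two lines is given below, on illustrative stand-in coordinates (only the USA values follow the lecture). The `n_init` and `random_state` arguments are additions of this sketch, not part of the lecture; they make the run repeatable and try several starting points.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative stand-in inputs; only the first row follows the lecture.
X = pd.DataFrame({
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})

# The first argument is the number of clusters.
# n_init and random_state are assumptions added for reproducibility.
kmeans = KMeans(2, n_init=10, random_state=42)
kmeans.fit(X)

# After fitting, the object exposes the cluster centers it found.
print(kmeans.cluster_centers_)
```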
97
00:04:36,480 --> 00:04:37,980
Usually though, we don't need to just
98
00:04:37,980 --> 00:04:40,230
perform the clustering, but are interested
99
00:04:40,230 --> 00:04:42,420
in the clusters themselves.
100
00:04:42,420 --> 00:04:44,040
We can obtain the predicted clusters
101
00:04:44,040 --> 00:04:47,651
for each observation using the fit_predict method.
102
00:04:47,651 --> 00:04:51,000
Let's declare a new variable called
103
00:04:51,000 --> 00:04:55,760
identified_clusters equal to kmeans.fit_predict
104
00:04:57,450 --> 00:04:59,193
with input X.
105
00:05:00,090 --> 00:05:02,550
I'll also print this variable.
106
00:05:02,550 --> 00:05:06,630
The result is an array containing the predicted clusters.
107
00:05:06,630 --> 00:05:10,140
There are two clusters indicated by zero and one.
108
00:05:10,140 --> 00:05:13,020
You can clearly see that the first five observations
109
00:05:13,020 --> 00:05:15,420
are in the same cluster, zero,
110
00:05:15,420 --> 00:05:18,660
while the last one is in cluster one.
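The step can be sketched like this, again on illustrative stand-in coordinates (only the USA values follow the lecture) and with `n_init` and `random_state` added as assumptions for repeatability. The label numbers themselves are arbitrary: what matters is which observations share a label.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative stand-in inputs, in the order USA, Canada, France, UK,
# Germany, Australia; only the first row follows the lecture.
X = pd.DataFrame({
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})

kmeans = KMeans(2, n_init=10, random_state=42)

# fit_predict fits the model and returns one cluster label per row.
identified_clusters = kmeans.fit_predict(X)
print(identified_clusters)
```

On these stand-in values, the first five observations should share one label and the last one should get the other, though which group is called 0 and which 1 can differ from run to run if the seed changes.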
111
00:05:18,660 --> 00:05:19,493
Okay.
112
00:05:20,430 --> 00:05:23,703
Let's create a data frame so we can see things more clearly.
113
00:05:24,690 --> 00:05:27,870
I'll call this data frame data_with_clusters
114
00:05:27,870 --> 00:05:29,973
and it will be equal to data.
115
00:05:30,870 --> 00:05:33,270
Then, I'll add an additional column to it
116
00:05:33,270 --> 00:05:36,483
called Cluster equal to identified_clusters.
117
00:05:39,840 --> 00:05:41,850
As you can see, we have our table
118
00:05:41,850 --> 00:05:45,780
with the countries, latitude, longitude, language,
119
00:05:45,780 --> 00:05:47,550
but also cluster.
120
00:05:47,550 --> 00:05:50,160
It seems that the USA, Canada, France,
121
00:05:50,160 --> 00:05:53,490
UK, and Germany are in cluster zero,
122
00:05:53,490 --> 00:05:56,673
while Australia is alone in cluster one.
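A sketch of this step is shown below. The data frame is an illustrative stand-in (only the USA values follow the lecture), and the labels are hard-coded to the result reported in the lecture so that the example does not depend on a clustering run. Using `.copy()` is a choice made in this sketch: plain assignment would make both names refer to the same data frame, so the new column would also appear in `data`.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data; only the USA row follows the lecture.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})

# Labels as reported in the lecture, hard-coded for illustration.
identified_clusters = np.array([0, 0, 0, 0, 0, 1])

# Copy first so the original data frame stays unchanged.
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = identified_clusters
print(data_with_clusters)
```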
123
00:05:58,740 --> 00:06:00,150
Cool!
124
00:06:00,150 --> 00:06:03,870
Finally, let's plot all this on a scatter plot.
125
00:06:03,870 --> 00:06:06,060
In order to resemble a map of the world,
126
00:06:06,060 --> 00:06:08,250
the y-axis will be the latitude
127
00:06:08,250 --> 00:06:10,593
while the x-axis, longitude.
128
00:06:11,490 --> 00:06:14,910
But that's the same graph as before, isn't it?
129
00:06:14,910 --> 00:06:17,040
Let's use the first trick.
130
00:06:17,040 --> 00:06:20,370
In matplotlib, we can set the color to be determined
131
00:06:20,370 --> 00:06:21,870
by a variable.
132
00:06:21,870 --> 00:06:25,050
In our case, that will be cluster.
133
00:06:25,050 --> 00:06:29,373
Let's write c=data_with_clusters['Cluster'].
134
00:06:30,330 --> 00:06:32,941
We have just indicated that we wanna have
135
00:06:32,941 --> 00:06:35,610
as many colors for the points as there are clusters.
136
00:06:35,610 --> 00:06:38,100
The default color map is not so pretty.
137
00:06:38,100 --> 00:06:41,043
So, I'll set the color map to rainbow.
138
00:06:41,910 --> 00:06:45,060
cmap equals 'rainbow'.
139
00:06:45,060 --> 00:06:46,350
Okay.
140
00:06:46,350 --> 00:06:48,360
We can see the two clusters.
141
00:06:48,360 --> 00:06:51,720
One is purple and the other is red.
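The colored scatter plot can be sketched as follows, with longitude on the horizontal axis and latitude on the vertical one, as on a conventional map. The coordinates are illustrative stand-ins (only the USA values follow the lecture) and the Cluster column is hard-coded to the labels reported in the lecture.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in data with the cluster labels from the lecture.
data_with_clusters = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Cluster": [0, 0, 0, 0, 0, 1],
})

fig, ax = plt.subplots()
# c maps each point's cluster label to a color taken from the colormap.
points = ax.scatter(
    data_with_clusters["Longitude"],
    data_with_clusters["Latitude"],
    c=data_with_clusters["Cluster"],
    cmap="rainbow",
)
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
```

With two labels, the rainbow colormap places them at its two ends, which is why one cluster appears purple and the other red.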
142
00:06:51,720 --> 00:06:54,393
And that's how we perform KMeans clustering!
143
00:06:57,690 --> 00:07:00,690
What if we wanted to have three clusters?
144
00:07:00,690 --> 00:07:03,630
Well, we can go back to the line where we specified
145
00:07:03,630 --> 00:07:05,550
the desired number of clusters
146
00:07:05,550 --> 00:07:07,023
and change that to three.
147
00:07:07,890 --> 00:07:09,363
Let's run all cells.
148
00:07:10,440 --> 00:07:12,900
There are three clusters, as wanted.
149
00:07:12,900 --> 00:07:15,630
Zero, one, and two.
150
00:07:15,630 --> 00:07:18,690
From the data frame, we can see that USA and Canada
151
00:07:18,690 --> 00:07:20,103
are in the same cluster.
152
00:07:20,970 --> 00:07:23,610
France, UK, and Germany in another.
153
00:07:23,610 --> 00:07:26,850
And Australia is alone, once again.
154
00:07:26,850 --> 00:07:29,280
What about the visualization?
155
00:07:29,280 --> 00:07:31,500
There are three colors representing the
156
00:07:31,500 --> 00:07:33,093
three different clusters.
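Changing the number of clusters only touches one argument. The sketch below uses illustrative stand-in coordinates (only the USA values follow the lecture), with `n_init` and `random_state` added as assumptions for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative stand-in data; only the USA row follows the lecture.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})
X = data.iloc[:, 1:3]

# Same code as before, with three clusters instead of two.
kmeans = KMeans(3, n_init=10, random_state=42)
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = kmeans.fit_predict(X)
print(data_with_clusters)
```

On these stand-in values the grouping should match the one described in the lecture: North America, Europe, and Australia on its own. The label numbers assigned to each group are arbitrary.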
157
00:07:35,790 --> 00:07:36,633
Great work!
158
00:07:37,500 --> 00:07:41,250
It seems that clustering is not that hard after all.
159
00:07:41,250 --> 00:07:44,220
In the next lesson, we will cluster the observations
160
00:07:44,220 --> 00:07:46,740
based on a categorical feature.
161
00:07:46,740 --> 00:07:47,740
Thanks for watching!