1
00:00:00,330 --> 00:00:02,700
Instructor: Hey, let's continue the problem
2
00:00:02,700 --> 00:00:04,620
from the last lecture.
3
00:00:04,620 --> 00:00:07,230
As you can see, we had one other piece of information
4
00:00:07,230 --> 00:00:10,080
that we did not use, language.
5
00:00:10,080 --> 00:00:11,400
In order to make use of it,
6
00:00:11,400 --> 00:00:13,980
we must first encode it in some way.
7
00:00:13,980 --> 00:00:16,762
The simplest way to do that is by using numbers.
8
00:00:16,762 --> 00:00:20,130
I'll create a new variable called data_mapped
9
00:00:20,130 --> 00:00:23,617
equal to data.copy().
10
00:00:23,617 --> 00:00:28,028
Next, I'll map the languages using the usual method.
11
00:00:28,028 --> 00:00:33,028
data_mapped['Language'] equals data_mapped['Language'].map().
12
00:00:36,210 --> 00:00:38,640
And I'll set English to zero,
13
00:00:38,640 --> 00:00:41,343
French to one, and German to two.
14
00:00:42,270 --> 00:00:45,060
Note that this is not the optimal way to encode them
15
00:00:45,060 --> 00:00:46,860
but it will work for now.
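[Editor's sketch] The mapping step described above might look roughly like this. The data frame here is a hypothetical stand-in for the lecture's data: the column names and coordinate values are assumptions for illustration, and only the language codes (English 0, French 1, German 2) come from the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's data set.
# Column names and coordinates are assumed, not taken from the lecture.
data = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": ["English", "English", "French", "English", "German", "English"],
})

# Work on a copy so the original data frame keeps its text labels.
data_mapped = data.copy()

# Replace each language name with the integer code used in the lecture.
data_mapped["Language"] = data_mapped["Language"].map(
    {"English": 0, "French": 1, "German": 2}
)

print(data_mapped)
```

Because the mapping is applied to a copy, `data` still holds the original strings and can be reused later.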
16
00:00:46,860 --> 00:00:49,950
Here's the result, cool.
17
00:00:49,950 --> 00:00:51,960
Next, let's choose the features
18
00:00:51,960 --> 00:00:53,913
that we want to use for clustering.
19
00:00:54,840 --> 00:00:57,600
Did you know that we can use a single feature?
20
00:00:57,600 --> 00:00:59,970
Well, we certainly can.
21
00:00:59,970 --> 00:01:03,667
Let x be equal to data_mapped.iloc[:, 3:4].
22
00:01:10,890 --> 00:01:15,300
I am basically slicing all rows, but only the last column.
23
00:01:15,300 --> 00:01:18,210
What we are left with is this.
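[Editor's sketch] A small illustration of the slice just described, assuming the language column sits at position 3 of a hypothetical four-column data frame (the column names and values are assumptions). It also shows why the range `3:4` is used: it keeps the result two-dimensional.

```python
import pandas as pd

# Hypothetical data frame with the language already encoded as numbers.
data_mapped = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": [0, 0, 1, 0, 2, 0],
})

# All rows, columns from position 3 up to (not including) 4.
# The range form returns a DataFrame with a single column.
x = data_mapped.iloc[:, 3:4]

# For comparison: a plain integer position returns a one-dimensional Series.
x_series = data_mapped.iloc[:, 3]

print(type(x).__name__, x.shape)
print(type(x_series).__name__, x_series.shape)
```

Keeping `x` as a one-column data frame matters because clustering code generally expects a two-dimensional input of rows by features.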
24
00:01:18,210 --> 00:01:20,133
Now we can perform clustering.
25
00:01:21,060 --> 00:01:24,540
I have the same code ready, so I'll just use it.
26
00:01:24,540 --> 00:01:26,580
We are running k-means clustering
27
00:01:26,580 --> 00:01:28,440
with three clusters.
28
00:01:28,440 --> 00:01:31,770
Run, run, run, run and we are done.
29
00:01:31,770 --> 00:01:33,660
The plot is unequivocal.
30
00:01:33,660 --> 00:01:38,660
The three clusters are USA, Canada, UK and Australia
31
00:01:39,180 --> 00:01:42,180
in the first one, France in the second
32
00:01:42,180 --> 00:01:43,533
and Germany in the third.
33
00:01:44,400 --> 00:01:47,070
That's precisely what we expected, right?
34
00:01:47,070 --> 00:01:49,950
English, French and German.
35
00:01:49,950 --> 00:01:51,180
Great.
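[Editor's sketch] The lecture reuses code that is not shown in the transcript, so the following is only one plausible way to run the single-feature clustering, assuming scikit-learn's `KMeans`. The data frame, the `Cluster` column name, and the `n_init` and `random_state` settings are assumptions added for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data frame with the language already encoded as numbers.
data_mapped = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": [0, 0, 1, 0, 2, 0],
})

# Single feature: the encoded language, kept as a one-column DataFrame.
x = data_mapped.iloc[:, 3:4]

# Three clusters, as in the lecture. n_init and random_state are set
# here only to make the sketch reproducible.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Store the predicted cluster of each country in a new column
# ("Cluster" is an assumed name).
data_with_clusters = data_mapped.copy()
data_with_clusters["Cluster"] = kmeans.fit_predict(x)

print(data_with_clusters[["Country", "Language", "Cluster"]])
```

The cluster numbers themselves are arbitrary labels. What matters is that countries sharing a language code end up with the same label.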
36
00:01:51,180 --> 00:01:53,493
By the way, we are still using the longitude
37
00:01:53,493 --> 00:01:56,056
and latitude as axes of the plot.
38
00:01:56,056 --> 00:01:58,500
Unlike regression, when doing clustering
39
00:01:58,500 --> 00:02:00,930
you can plot the data as you wish.
40
00:02:00,930 --> 00:02:02,850
The cluster information is contained
41
00:02:02,850 --> 00:02:05,250
in the cluster column in the data frame
42
00:02:05,250 --> 00:02:07,473
and is the color of the points on the plot.
43
00:02:09,300 --> 00:02:10,830
Can we use both numerical
44
00:02:10,830 --> 00:02:13,500
and categorical data in clustering?
45
00:02:13,500 --> 00:02:17,139
Sure. Let's go back to our input data, x,
46
00:02:17,139 --> 00:02:21,416
and take the last three series instead of just one.
47
00:02:21,416 --> 00:02:24,240
Run, run, run, run.
48
00:02:24,240 --> 00:02:27,150
Okay, this time the three clusters turned out
49
00:02:27,150 --> 00:02:29,850
to be based simply on geographical location
50
00:02:29,850 --> 00:02:31,833
instead of language and location.
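[Editor's sketch] A rough version of the three-feature run described above, again assuming scikit-learn's `KMeans` and a hypothetical data frame whose last three columns are latitude, longitude, and the encoded language. The coordinates are illustrative values, not the lecture's data, so the exact grouping may differ from what the video shows.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data frame with the language already encoded as numbers.
data_mapped = pd.DataFrame({
    "Country": ["USA", "Canada", "France", "UK", "Germany", "Australia"],
    "Latitude": [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    "Longitude": [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    "Language": [0, 0, 1, 0, 2, 0],
})

# Last three columns: two numerical features plus the encoded category.
x = data_mapped.iloc[:, 1:4]

# Same settings as the single-feature run; only the input has changed.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

data_with_clusters = data_mapped.copy()
data_with_clusters["Cluster"] = kmeans.fit_predict(x)

print(data_with_clusters)
```

With these illustrative values, countries that are close on the map share a cluster even when their language codes differ. Changing `n_clusters` to 2 reruns the two-cluster case mentioned next.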
51
00:02:32,738 --> 00:02:36,273
Hmm, what if we use two clusters?
52
00:02:41,520 --> 00:02:44,220
We've seen that solution too, haven't we?
53
00:02:44,220 --> 00:02:45,930
We will have to work on figuring out
54
00:02:45,930 --> 00:02:48,660
what's going on in the following lesson.
55
00:02:48,660 --> 00:02:49,660
Thanks for watching.