1
00:00:00,630 --> 00:00:02,250
Instructor: It wouldn't be data science
2
00:00:02,250 --> 00:00:04,950
if there weren't this very important topic,
3
00:00:04,950 --> 00:00:09,420
problems with, issues with, or limitations of X.
4
00:00:09,420 --> 00:00:13,113
Well, let's look at the pros and cons of K-means clustering.
5
00:00:14,160 --> 00:00:15,780
The pros are already known to you
6
00:00:15,780 --> 00:00:17,790
even if you don't realize it.
7
00:00:17,790 --> 00:00:21,060
It is simple to understand and fast to cluster.
8
00:00:21,060 --> 00:00:23,760
Moreover, there are many packages that offer it,
9
00:00:23,760 --> 00:00:26,580
so implementation is effortless.
10
00:00:26,580 --> 00:00:29,820
Finally, K-means clustering always yields a result.
11
00:00:29,820 --> 00:00:32,670
No matter the data, it will always spit out a solution
12
00:00:32,670 --> 00:00:33,603
which is great.
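A minimal sketch of how little code this takes, assuming scikit-learn and some numeric 2-D data (the array here is just an example): whatever you feed in, the fit returns a label for every point.

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 2)            # any numeric data will do
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
print(kmeans.labels_)                     # a cluster label for every point
print(kmeans.cluster_centers_)            # the three centroids found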
13
00:00:34,800 --> 00:00:36,840
Time for the cons.
14
00:00:36,840 --> 00:00:38,190
We will dig a bit into them
15
00:00:38,190 --> 00:00:41,160
as they are very interesting to explore.
16
00:00:41,160 --> 00:00:44,700
Moreover, this lecture will solidify your understanding
17
00:00:44,700 --> 00:00:45,843
like no other.
18
00:00:46,710 --> 00:00:49,650
The first con is that we need to pick K.
19
00:00:49,650 --> 00:00:52,680
As we already saw, the elbow method fixes that,
20
00:00:52,680 --> 00:00:55,383
but it is not extremely scientific per se.
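As a rough sketch of the elbow method (assuming scikit-learn, matplotlib, and a feature matrix called data), we fit K-means for several values of K and look for the bend in the within-cluster sum of squares:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(data)
    wcss.append(kmeans.inertia_)          # within-cluster sum of squares

plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()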
21
00:00:56,911 --> 00:01:00,540
Second, K-means is sensitive to initialization.
22
00:01:00,540 --> 00:01:03,090
That's a very interesting problem.
23
00:01:03,090 --> 00:01:05,010
Say that these are our points.
24
00:01:05,010 --> 00:01:08,790
If we randomly choose the centroids here and here,
25
00:01:08,790 --> 00:01:11,190
the obvious solution is one top cluster
26
00:01:11,190 --> 00:01:13,170
and one bottom cluster.
27
00:01:13,170 --> 00:01:16,350
However, clustering the points on the left in one cluster
28
00:01:16,350 --> 00:01:17,820
and those on the right in another
29
00:01:17,820 --> 00:01:19,713
is a more appropriate solution.
30
00:01:20,970 --> 00:01:22,950
Now imagine the same situation,
31
00:01:22,950 --> 00:01:26,100
but with much more widely spread points.
32
00:01:26,100 --> 00:01:27,120
Guess what?
33
00:01:27,120 --> 00:01:28,830
Given the same initial seeds,
34
00:01:28,830 --> 00:01:33,000
we get the same clusters because that's how K-means works.
35
00:01:33,000 --> 00:01:35,940
It takes the closest points to the seeds.
36
00:01:35,940 --> 00:01:38,670
So if your initial seeds are problematic,
37
00:01:38,670 --> 00:01:41,550
the whole solution is meaningless.
38
00:01:41,550 --> 00:01:46,380
The remedy is simple. It is called K-means++.
39
00:01:46,380 --> 00:01:49,860
The idea is that a preliminary iterative algorithm is run
40
00:01:49,860 --> 00:01:53,310
prior to K-means to determine the most appropriate seeds
41
00:01:53,310 --> 00:01:55,380
for the clustering itself.
42
00:01:55,380 --> 00:01:56,880
If we go back to our code,
43
00:01:56,880 --> 00:02:01,880
we will see that sklearn employs K-means++ by default,
44
00:02:01,920 --> 00:02:03,750
so we are safe here,
45
00:02:03,750 --> 00:02:05,640
but if you are using a different package,
46
00:02:05,640 --> 00:02:08,013
remember that initialization matters.
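A small sketch of where that choice lives in scikit-learn (the feature matrix data is assumed): the default init='k-means++' spreads the seeds out, while init='random' with a single run is the risky situation described above.

from sklearn.cluster import KMeans

safe = KMeans(n_clusters=2, init='k-means++', n_init=10).fit(data)
risky = KMeans(n_clusters=2, init='random', n_init=1,
               random_state=0).fit(data)
print(safe.inertia_, risky.inertia_)      # the single random run can end up much worse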
47
00:02:10,500 --> 00:02:11,880
A third major problem
48
00:02:11,880 --> 00:02:14,910
is that K-means is sensitive to outliers.
49
00:02:14,910 --> 00:02:16,080
What does this mean?
50
00:02:16,080 --> 00:02:17,970
Well, if there is a single point
51
00:02:17,970 --> 00:02:20,040
that is too far away from the rest,
52
00:02:20,040 --> 00:02:23,730
it will always be placed in its own one-point cluster.
53
00:02:23,730 --> 00:02:25,770
Have we already experienced that?
54
00:02:25,770 --> 00:02:27,810
Well, of course we have.
55
00:02:27,810 --> 00:02:31,140
Australia ended up in its own one-point cluster in almost all the solutions
56
00:02:31,140 --> 00:02:33,900
we had for our country clusters example.
57
00:02:33,900 --> 00:02:36,330
It is so far away from the rest of the countries
58
00:02:36,330 --> 00:02:39,180
that it is destined to be in its own cluster.
59
00:02:39,180 --> 00:02:42,990
The remedy: just get rid of outliers prior to clustering.
60
00:02:42,990 --> 00:02:45,180
Alternatively, if you do the clustering
61
00:02:45,180 --> 00:02:49,023
and spot one-point clusters, remove them and cluster again.
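A sketch of that second remedy, assuming scikit-learn, NumPy, and a feature matrix data (the cluster counts are only illustrative): cluster once, drop any one-point clusters, then cluster the remaining observations again.

import numpy as np
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=4, random_state=0).fit_predict(data)
counts = np.bincount(labels)
keep = np.isin(labels, np.where(counts > 1)[0])   # mask out one-point clusters
cleaned = data[keep]
new_labels = KMeans(n_clusters=3, random_state=0).fit_predict(cleaned)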
62
00:02:51,390 --> 00:02:55,560
A fourth con: K-means produces spherical solutions.
63
00:02:55,560 --> 00:02:58,290
This means that on a 2D plane, like the ones we have seen,
64
00:02:58,290 --> 00:03:01,170
we would more often see clusters that look like circles
65
00:03:01,170 --> 00:03:03,330
rather than elliptical shapes.
66
00:03:03,330 --> 00:03:06,270
The reason for that is that we are using Euclidean distance
67
00:03:06,270 --> 00:03:07,710
from the centroid.
68
00:03:07,710 --> 00:03:11,643
This is also why outliers are such a big issue for K-means.
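A tiny NumPy-only sketch of the assignment rule behind those shapes (the point and centroids are made up): every observation goes to the centroid with the smallest Euclidean distance, which is why the resulting clusters tend to be round.

import numpy as np

point = np.array([1.0, 2.0])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
distances = np.linalg.norm(centroids - point, axis=1)
assigned = np.argmin(distances)           # index of the nearest centroid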
69
00:03:14,010 --> 00:03:16,770
Finally, we have standardization.
70
00:03:16,770 --> 00:03:19,530
Oh, good old standardization.
71
00:03:19,530 --> 00:03:22,350
Let's leave that for the next lesson, shall we?
72
00:03:22,350 --> 00:03:23,350
Thanks for watching.