1
00:00:00,270 --> 00:00:02,430
Tutor: Should we standardize?
2
00:00:02,430 --> 00:00:05,760
I avoided preparing this lecture for a couple of days.
3
00:00:05,760 --> 00:00:07,620
Today I was drinking some coffee
4
00:00:07,620 --> 00:00:09,667
and explaining to a colleague of mine,
5
00:00:09,667 --> 00:00:12,180
"I really wanna elaborate on standardization,
6
00:00:12,180 --> 00:00:15,090
but I don't think the students will be interested.
7
00:00:15,090 --> 00:00:18,390
Moreover, there is a dispute on the topic."
8
00:00:18,390 --> 00:00:20,647
My colleague then looked at me and said,
9
00:00:20,647 --> 00:00:22,710
"Then tell that to the students.
10
00:00:22,710 --> 00:00:24,480
Show them both sides."
11
00:00:24,480 --> 00:00:27,273
And that's how he closed the topic and got rid of me.
12
00:00:28,140 --> 00:00:31,500
So to standardize or not to standardize?
13
00:00:31,500 --> 00:00:33,840
That is the question.
14
00:00:33,840 --> 00:00:36,780
Let's explore a simple example.
15
00:00:36,780 --> 00:00:39,480
Here's a scatter plot with four apartments.
16
00:00:39,480 --> 00:00:41,550
The X axis shows the size,
17
00:00:41,550 --> 00:00:44,220
while the Y axis, the price.
18
00:00:44,220 --> 00:00:47,070
That's a very common regression relationship,
19
00:00:47,070 --> 00:00:48,450
but we are doing clustering here,
20
00:00:48,450 --> 00:00:50,400
so instead of causality,
21
00:00:50,400 --> 00:00:52,983
think about how we can group the four observations.
22
00:00:53,910 --> 00:00:58,910
A is a 500 square-foot apartment that is worth $50,000.
23
00:00:59,400 --> 00:01:04,400
B is a 500 square-foot apartment that is worth $100,000.
24
00:01:04,860 --> 00:01:09,210
C is a 1,200 square-foot apartment worth $50,000,
25
00:01:09,210 --> 00:01:10,770
and D has the same size,
26
00:01:10,770 --> 00:01:13,050
but is twice as expensive.
27
00:01:13,050 --> 00:01:14,850
If we were to create two clusters,
28
00:01:14,850 --> 00:01:16,500
just by looking at the plot,
29
00:01:16,500 --> 00:01:19,863
they are likely to be AB and CD, right?
30
00:01:21,210 --> 00:01:25,143
Now, what if we standardize the X axis, size that is?
31
00:01:26,790 --> 00:01:29,040
Without taking you through the calculations,
32
00:01:29,040 --> 00:01:30,990
that's the new situation.
33
00:01:30,990 --> 00:01:35,220
The X values of these points are either minus 1 or 1.
34
00:01:35,220 --> 00:01:37,260
How would we group them now?
35
00:01:37,260 --> 00:01:41,460
Well, AC and BD look reasonable, right?
36
00:01:41,460 --> 00:01:42,603
Yes, it does.
37
00:01:44,220 --> 00:01:48,153
Finally, let's also standardize the Y axis or price.
38
00:01:49,470 --> 00:01:53,283
Now, the Y values are only minus 1 or 1, too.
39
00:01:54,570 --> 00:01:57,150
What we see is a perfect square.
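The calculations skipped above can be sketched in a few lines. This is a minimal illustration, not the lecturer's own code: the `apartments` dictionary and the `standardize` helper are names made up for this example, and it assumes the population standard deviation, which is the choice that produces exactly minus 1 and 1 for these four points.

```python
from statistics import mean, pstdev

# The four apartments from the lecture: (size in square feet, price in dollars).
apartments = {
    "A": (500, 50_000),
    "B": (500, 100_000),
    "C": (1_200, 50_000),
    "D": (1_200, 100_000),
}


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population (not sample) standard deviation
    return [(v - mu) / sigma for v in values]


names = list(apartments)
sizes = standardize([apartments[n][0] for n in names])
prices = standardize([apartments[n][1] for n in names])

# Each apartment now sits on a corner of a square centred on the origin.
standardized = {n: (s, p) for n, s, p in zip(names, sizes, prices)}
for name, point in standardized.items():
    print(name, point)
```

With the sample standard deviation instead, the values would be smaller in magnitude, but the four points would still form a square.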
40
00:01:57,150 --> 00:01:58,380
We have no way of deciding
41
00:01:58,380 --> 00:02:03,360
if the clusters should be AB and CD, or AC and BD.
42
00:02:03,360 --> 00:02:05,610
So we went from one solution,
43
00:02:05,610 --> 00:02:07,410
through a totally different one
44
00:02:07,410 --> 00:02:10,080
to no solution whatsoever.
45
00:02:10,080 --> 00:02:11,940
Why did that happen?
46
00:02:11,940 --> 00:02:13,830
The ultimate aim of standardization
47
00:02:13,830 --> 00:02:16,140
is to reduce the weight of higher numbers,
48
00:02:16,140 --> 00:02:18,240
and increase that of lower ones.
49
00:02:18,240 --> 00:02:20,883
Now, let's see the first graph once again.
50
00:02:22,260 --> 00:02:24,690
If both axes had the same scale,
51
00:02:24,690 --> 00:02:28,680
so from 0 to 100,000, we would get something like this,
52
00:02:28,680 --> 00:02:31,260
but even more dramatic.
53
00:02:31,260 --> 00:02:33,840
A K-means algorithm would immediately cluster
54
00:02:33,840 --> 00:02:36,810
A with C and B with D,
55
00:02:36,810 --> 00:02:39,210
just because the scale of price was so different
56
00:02:39,210 --> 00:02:40,170
compared to size,
57
00:02:40,170 --> 00:02:42,210
in terms of mere numbers.
58
00:02:42,210 --> 00:02:43,803
So, scale matters.
59
00:02:45,510 --> 00:02:48,180
Finally, the last graph resulted in a square
60
00:02:48,180 --> 00:02:51,630
because there were only two values for each axis.
61
00:02:51,630 --> 00:02:54,120
Logically, every rectangle on a graph
62
00:02:54,120 --> 00:02:55,380
after being standardized
63
00:02:55,380 --> 00:02:57,270
turns into a square.
64
00:02:57,270 --> 00:02:59,490
So no matter how I chose the axes
65
00:02:59,490 --> 00:03:01,830
or how far off they were from each other,
66
00:03:01,830 --> 00:03:04,140
as long as they were in the shape of a rectangle,
67
00:03:04,140 --> 00:03:07,309
the standardized output would've been a square.
68
00:03:07,309 --> 00:03:08,142
With that said,
69
00:03:08,142 --> 00:03:10,020
by standardizing both axes,
70
00:03:10,020 --> 00:03:13,233
we remove the weight introduced by the high price values.
71
00:03:14,790 --> 00:03:17,040
To sum up, if we don't standardize,
72
00:03:17,040 --> 00:03:19,080
the range of the values will serve as weights
73
00:03:19,080 --> 00:03:20,670
for each variable.
74
00:03:20,670 --> 00:03:22,620
Price had much higher values,
75
00:03:22,620 --> 00:03:24,180
which would indicate to K-means
76
00:03:24,180 --> 00:03:26,400
that price is more important.
77
00:03:26,400 --> 00:03:31,260
This would lead to clusters based on price, AC and BD,
78
00:03:31,260 --> 00:03:34,470
the Economy Cluster and the Luxury Cluster.
79
00:03:34,470 --> 00:03:36,210
Note that the clustering would barely,
80
00:03:36,210 --> 00:03:38,640
if at all, care about size.
81
00:03:38,640 --> 00:03:40,560
So, if we don't standardize,
82
00:03:40,560 --> 00:03:43,743
we are not taking advantage of the size data whatsoever.
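One way to see this with numbers is to compare the quantity K-means tries to minimize, the within-cluster sum of squared distances, for the two candidate groupings. The sketch below is a hand-rolled illustration rather than a full K-means run: the `wcss` helper and the variable names are hypothetical, and the standardized coordinates are simply the minus 1 and 1 values quoted in the lecture.

```python
def wcss(points, grouping):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for cluster in grouping:
        members = [points[name] for name in cluster]
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Raw data: (size in square feet, price in dollars).
raw = {
    "A": (500, 50_000),
    "B": (500, 100_000),
    "C": (1_200, 50_000),
    "D": (1_200, 100_000),
}
# Standardized data, using the minus 1 / 1 values from the lecture.
scaled = {"A": (-1, -1), "B": (-1, 1), "C": (1, -1), "D": (1, 1)}

by_size = [("A", "B"), ("C", "D")]   # group apartments of equal size
by_price = [("A", "C"), ("B", "D")]  # group apartments of equal price

print("raw, by size:  ", wcss(raw, by_size))
print("raw, by price: ", wcss(raw, by_price))
print("scaled, by size: ", wcss(scaled, by_size))
print("scaled, by price:", wcss(scaled, by_price))
```

On the raw data the price-based grouping has a far lower cost, so size barely influences the result. On the standardized data the two groupings cost exactly the same, which is the tie described earlier.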
83
00:03:44,580 --> 00:03:47,460
Therefore, it is good practice to standardize the data
84
00:03:47,460 --> 00:03:50,433
before clustering, especially for beginners.
85
00:03:51,300 --> 00:03:52,770
The final note I'll leave you with
86
00:03:52,770 --> 00:03:55,110
is when you should not standardize.
87
00:03:55,110 --> 00:03:57,750
As standardization is trying to put all variables
88
00:03:57,750 --> 00:03:59,040
on equal footing,
89
00:03:59,040 --> 00:04:01,920
in some cases, we don't need to do that.
90
00:04:01,920 --> 00:04:03,120
If we know that one variable
91
00:04:03,120 --> 00:04:05,520
is inherently more important than another,
92
00:04:05,520 --> 00:04:08,490
then standardization shouldn't be used.
93
00:04:08,490 --> 00:04:11,700
Our price/size relationship could be one of those.
94
00:04:11,700 --> 00:04:13,410
Most people are affected by the price
95
00:04:13,410 --> 00:04:15,960
much more than the size, aren't they?
96
00:04:15,960 --> 00:04:17,339
If you can't afford the price,
97
00:04:17,339 --> 00:04:20,047
you won't care about the size, right?
98
00:04:20,047 --> 00:04:21,060
"How can you know that
99
00:04:21,060 --> 00:04:23,340
prior to clustering?" you may ask.
100
00:04:23,340 --> 00:04:25,410
Experience plays a big role,
101
00:04:25,410 --> 00:04:27,303
so practice when the time comes.
102
00:04:28,170 --> 00:04:31,500
We will discuss this a bit more in the next lecture.
103
00:04:31,500 --> 00:04:32,500
Thanks for watching.