1
00:00:00,780 --> 00:00:02,130
Instructor: The most common problem
2
00:00:02,130 --> 00:00:03,840
when working with numerical data
3
00:00:03,840 --> 00:00:05,610
is about the difference in magnitudes
4
00:00:05,610 --> 00:00:07,620
as we mentioned in the first lesson.
5
00:00:07,620 --> 00:00:10,680
An easy fix for this issue is standardization.
6
00:00:10,680 --> 00:00:13,050
Other names by which you may have heard this term
7
00:00:13,050 --> 00:00:16,050
are feature scaling and normalization.
8
00:00:16,050 --> 00:00:18,510
However, normalization could refer
9
00:00:18,510 --> 00:00:21,570
to a few additional concepts even within machine learning
10
00:00:21,570 --> 00:00:24,270
which is why we'll stick with the terms standardization
11
00:00:24,270 --> 00:00:25,443
and feature scaling.
12
00:00:26,940 --> 00:00:29,130
Standardization or feature scaling
13
00:00:29,130 --> 00:00:31,140
is the process of transforming the data
14
00:00:31,140 --> 00:00:33,423
we are working with into a standard scale.
15
00:00:34,530 --> 00:00:36,780
A very common way to approach this problem
16
00:00:36,780 --> 00:00:38,070
is by subtracting the mean
17
00:00:38,070 --> 00:00:40,650
and dividing by the standard deviation.
18
00:00:40,650 --> 00:00:41,640
In this way,
19
00:00:41,640 --> 00:00:43,470
regardless of the data set,
20
00:00:43,470 --> 00:00:46,620
we will always obtain a distribution with a mean of zero
21
00:00:46,620 --> 00:00:48,450
and a standard deviation of one,
22
00:00:48,450 --> 00:00:50,133
which could easily be proven.
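A short sketch of that proof, for reference. The symbols below are not from the lecture and are assumptions of this note: x_i are the n original observations, x-bar is their mean, s is their sample standard deviation (dividing by n - 1, and assumed to be non-zero), and z_i are the standardized values. The same argument goes through if both standard deviations divide by n instead.

```latex
\begin{align*}
z_i &= \frac{x_i - \bar{x}}{s}, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\[6pt]
\bar{z} &= \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s}
         = \frac{1}{s}\left(\frac{1}{n}\sum_{i=1}^{n} x_i - \bar{x}\right)
         = 0 \\[6pt]
s_z^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (z_i - \bar{z})^2
       = \frac{1}{s^2}\cdot\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
       = \frac{s^2}{s^2} = 1
\end{align*}
```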
23
00:00:51,360 --> 00:00:54,240
Let's show that with an FX example.
24
00:00:54,240 --> 00:00:57,090
Say our algorithm has two input variables,
25
00:00:57,090 --> 00:01:00,003
the Euro dollar exchange rate and the daily trading volume.
26
00:01:01,470 --> 00:01:04,440
We have three days' worth of observations.
27
00:01:04,440 --> 00:01:07,623
First day, 1.3 and 110,000,
28
00:01:08,850 --> 00:01:13,850
second day, 1.34 and 98,700,
29
00:01:13,920 --> 00:01:18,003
and the third day, 1.25 and 135,000.
30
00:01:19,260 --> 00:01:21,900
The first value shows the Euro dollar exchange rate,
31
00:01:21,900 --> 00:01:25,320
while the second one shows the daily trading volume.
32
00:01:25,320 --> 00:01:27,480
Let's standardize these figures.
33
00:01:27,480 --> 00:01:29,820
We standardize the Euro dollar exchange rates
34
00:01:29,820 --> 00:01:32,790
regarding the other Euro dollar exchange rates.
35
00:01:32,790 --> 00:01:37,740
So, we look at 1.3, 1.34 and 1.25.
36
00:01:37,740 --> 00:01:39,639
The mean is approximately 1.3,
37
00:01:39,639 --> 00:01:42,993
while the standard deviation is 0.045.
38
00:01:44,370 --> 00:01:47,040
Going through the above-mentioned transformation,
39
00:01:47,040 --> 00:01:52,040
these values become 0.07, 0.96 and -1.03 respectively.
40
00:01:56,010 --> 00:01:57,750
Standardizing trading volumes,
41
00:01:57,750 --> 00:02:02,750
we obtain -0.25, -0.85 and 1.1.
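The figures quoted for both variables can be reproduced with a few lines of Python. This is a minimal sketch, not the course's own code: the function name `standardize` and the variable names are made up for illustration, and it assumes the sample standard deviation (dividing by n - 1), which is the convention that matches the numbers in the lesson.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


# The three days of observations from the example.
exchange_rates = [1.30, 1.34, 1.25]
trading_volumes = [110_000, 98_700, 135_000]

scaled_rates = standardize(exchange_rates)
scaled_volumes = standardize(trading_volumes)

print([round(z, 2) for z in scaled_rates])    # [0.07, 0.96, -1.03]
print([round(z, 2) for z in scaled_volumes])  # [-0.25, -0.85, 1.1]
```

Each variable is standardized only against its own column, so the exchange rates never influence the scaled volumes and vice versa.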
42
00:02:04,410 --> 00:02:05,280
In this way,
43
00:02:05,280 --> 00:02:07,740
we have forced figures of very different scales
44
00:02:07,740 --> 00:02:09,090
to appear similar.
45
00:02:09,090 --> 00:02:11,400
That's why another name for standardization
46
00:02:11,400 --> 00:02:12,870
is feature scaling.
47
00:02:12,870 --> 00:02:15,420
This will ensure our linear combinations
48
00:02:15,420 --> 00:02:17,460
treat the two variables equally.
49
00:02:17,460 --> 00:02:20,673
Also, it is much easier to make sense of the data.
50
00:02:21,870 --> 00:02:23,760
The transformation of trading volumes
51
00:02:23,760 --> 00:02:25,380
allowed us to transform the volumes
52
00:02:25,380 --> 00:02:30,380
from 110,000, 98,700 and 135,000 to -0.25, -0.85 and 1.1.
53
00:02:35,280 --> 00:02:36,300
In this way,
54
00:02:36,300 --> 00:02:39,330
the third term is considerably higher than the average,
55
00:02:39,330 --> 00:02:42,060
while the first one is around the average.
56
00:02:42,060 --> 00:02:45,780
We can confidently say that 135,000 trades per day
57
00:02:45,780 --> 00:02:46,980
is a high figure,
58
00:02:46,980 --> 00:02:49,950
while 98,700 is low.
59
00:02:49,950 --> 00:02:51,930
Please disregard the simplification
60
00:02:51,930 --> 00:02:54,000
of having just three observations.
61
00:02:54,000 --> 00:02:55,653
That's just an example.
62
00:02:57,360 --> 00:02:58,920
Besides standardization,
63
00:02:58,920 --> 00:03:01,080
there are other popular methods, too.
64
00:03:01,080 --> 00:03:02,610
We will briefly introduce them
65
00:03:02,610 --> 00:03:04,653
without going into too much detail.
66
00:03:06,540 --> 00:03:08,700
Initially, we said that normalization
67
00:03:08,700 --> 00:03:10,740
refers to several concepts.
68
00:03:10,740 --> 00:03:13,230
One of them, which comes up in machine learning,
69
00:03:13,230 --> 00:03:15,690
often consists of converting each sample
70
00:03:15,690 --> 00:03:19,593
into a unit-length vector using the L1 or L2 norm.
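To make the contrast with standardization concrete, here is a small sketch of per-sample normalization. The function name `normalize` and the example vector are hypothetical, chosen only for illustration. Note that this rescales one sample (a row) at a time, whereas standardization works on one variable (a column) at a time.

```python
import math


def normalize(sample, norm="l2"):
    """Rescale one sample so that its L1 or L2 norm equals 1."""
    if norm == "l1":
        length = sum(abs(x) for x in sample)
    elif norm == "l2":
        length = math.sqrt(sum(x * x for x in sample))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        return list(sample)  # an all-zero sample cannot be rescaled
    return [x / length for x in sample]


sample = [3.0, 4.0]  # made-up sample with two features

l2_unit = normalize(sample, "l2")  # divides by sqrt(3^2 + 4^2) = 5
l1_unit = normalize(sample, "l1")  # divides by |3| + |4| = 7

print(l2_unit)
print(l1_unit)
```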
71
00:03:21,060 --> 00:03:23,670
Another pre-processing method is PCA
72
00:03:23,670 --> 00:03:26,460
standing for principal components analysis.
73
00:03:26,460 --> 00:03:28,830
It is a dimension reduction technique
74
00:03:28,830 --> 00:03:31,140
often used when working with several variables
75
00:03:31,140 --> 00:03:34,920
referring to the same bigger concept or latent variable.
76
00:03:34,920 --> 00:03:36,000
For instance,
77
00:03:36,000 --> 00:03:38,100
if we have data about one's religion,
78
00:03:38,100 --> 00:03:39,030
voting history,
79
00:03:39,030 --> 00:03:41,340
participation in different associations,
80
00:03:41,340 --> 00:03:42,360
and upbringing,
81
00:03:42,360 --> 00:03:43,710
we can combine these four
82
00:03:43,710 --> 00:03:46,830
to reflect his or her attitude towards immigration.
83
00:03:46,830 --> 00:03:49,350
This new variable will normally be standardized
84
00:03:49,350 --> 00:03:50,940
with a mean of zero
85
00:03:50,940 --> 00:03:52,803
and a standard deviation of one.
86
00:03:54,330 --> 00:03:56,910
Whitening is another technique frequently used
87
00:03:56,910 --> 00:03:58,440
for pre-processing.
88
00:03:58,440 --> 00:04:00,840
It is often performed after PCA
89
00:04:00,840 --> 00:04:03,150
and removes most of the underlying correlations
90
00:04:03,150 --> 00:04:04,680
between the variables.
91
00:04:04,680 --> 00:04:06,960
Whitening can be useful when, conceptually,
92
00:04:06,960 --> 00:04:08,730
the data should be uncorrelated,
93
00:04:08,730 --> 00:04:11,283
but that's not reflected in the observations.
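The idea of whitening after PCA can be sketched for the simplest case of two variables. Everything here is an assumption made for illustration: the data are made up, the function names `covariance_2d` and `pca_whiten_2d` are hypothetical, and the closed-form eigen-decomposition only works for a 2x2 covariance matrix. In practice a numerical library would normally handle this.

```python
import math
from statistics import mean


def covariance_2d(xs, ys):
    """Return (var_x, cov_xy, var_y) using the sample (n - 1) convention."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y


def pca_whiten_2d(xs, ys):
    """Rotate two variables onto their principal axes, then rescale each
    axis to unit variance, so the result is uncorrelated."""
    mx, my = mean(xs), mean(ys)
    var_x, cov_xy, var_y = covariance_2d(xs, ys)

    # Eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (var_x + var_y) / 2
    radius = math.hypot((var_x - var_y) / 2, cov_xy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if lam2 <= 0:
        raise ValueError("variables are perfectly correlated; cannot whiten")

    # Angle of the first principal axis (direction of largest variance).
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    whitened = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my          # center the data
        pc1 = dx * cos_t + dy * sin_t    # PCA step: rotate onto the axes
        pc2 = -dx * sin_t + dy * cos_t
        # Whitening step: divide each component by its standard deviation.
        whitened.append((pc1 / math.sqrt(lam1), pc2 / math.sqrt(lam2)))
    return whitened


# Made-up, positively correlated observations of two variables.
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]

whitened = pca_whiten_2d(xs, ys)
w1 = [p[0] for p in whitened]
w2 = [p[1] for p in whitened]

# After whitening, both variances are about 1 and the covariance about 0.
print(covariance_2d(w1, w2))
```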
94
00:04:12,660 --> 00:04:14,850
We can't cover all the strategies
95
00:04:14,850 --> 00:04:17,610
as each strategy is problem-specific.
96
00:04:17,610 --> 00:04:20,760
However, standardization is the most common one
97
00:04:20,760 --> 00:04:22,320
and is the one we will employ
98
00:04:22,320 --> 00:04:25,530
in the practical examples we will face in this course.
99
00:04:25,530 --> 00:04:26,640
In the next lesson,
100
00:04:26,640 --> 00:04:29,640
we will see how to deal with categorical data.
101
00:04:29,640 --> 00:04:30,933
Thanks for watching.