Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:01,130 --> 00:00:04,785
Let's look at how you can
implement feature scaling,
2
00:00:04,785 --> 00:00:06,360
to take features that take on
3
00:00:06,360 --> 00:00:08,130
very different
ranges of values and
4
00:00:08,130 --> 00:00:10,380
skill them to have
comparable ranges
5
00:00:10,380 --> 00:00:11,865
of values to each other.
6
00:00:11,865 --> 00:00:14,550
How do you actually
scale features?
7
00:00:14,550 --> 00:00:18,675
Well, if x_1 ranges
from 3-2,000,
8
00:00:18,675 --> 00:00:22,035
one way to get a scale
version of x_1 is to take
9
00:00:22,035 --> 00:00:26,760
each original x1_ value
and divide by 2,000,
10
00:00:26,760 --> 00:00:28,545
the maximum of the range.
11
00:00:28,545 --> 00:00:34,140
The scale x_1 will range
from 0.15 up to one.
12
00:00:34,140 --> 00:00:38,235
Similarly, since x_2
ranges from 0-5,
13
00:00:38,235 --> 00:00:41,400
you can calculate a
scale version of x_2 by
14
00:00:41,400 --> 00:00:44,915
taking each original x_2
and dividing by five,
15
00:00:44,915 --> 00:00:46,880
which is again the maximum.
16
00:00:46,880 --> 00:00:51,270
So the scale is x_2 will
now range from 0-1.
17
00:00:51,740 --> 00:00:56,070
If you plot the scale to
x_1 and x_2 on a graph,
18
00:00:56,070 --> 00:00:58,060
it might look like this.
19
00:00:58,060 --> 00:01:01,235
In addition to dividing
by the maximum,
20
00:01:01,235 --> 00:01:04,700
you can also do what's
called mean normalization.
21
00:01:04,700 --> 00:01:06,560
What this looks like is,
22
00:01:06,560 --> 00:01:09,170
you start with the original
features and then you
23
00:01:09,170 --> 00:01:10,880
re-scale them so that both
24
00:01:10,880 --> 00:01:13,105
of them are centered
around zero.
25
00:01:13,105 --> 00:01:16,630
Whereas before they only had
values greater than zero,
26
00:01:16,630 --> 00:01:20,060
now they have both negative
and positive values
27
00:01:20,060 --> 00:01:24,910
that may be usually between
negative one and plus one.
28
00:01:24,910 --> 00:01:28,575
To calculate the mean
normalization of x_1,
29
00:01:28,575 --> 00:01:30,080
first find the average,
30
00:01:30,080 --> 00:01:33,470
also called the mean of
x_1 on your training set,
31
00:01:33,470 --> 00:01:35,975
and let's call this mean Mu_1,
32
00:01:35,975 --> 00:01:39,425
with this being the
Greek alphabets Mu.
33
00:01:39,425 --> 00:01:43,220
For example, you may find that
the average of feature 1,
34
00:01:43,220 --> 00:01:46,400
Mu_1 is 600 square feet.
35
00:01:46,400 --> 00:01:48,485
Let's take each x_1,
36
00:01:48,485 --> 00:01:51,310
subtract the mean Mu_1,
37
00:01:51,310 --> 00:01:56,775
and then let's divide by the
difference 2,000 minus 300,
38
00:01:56,775 --> 00:02:01,440
where 2,000 is the maximum
and 300 the minimum,
39
00:02:01,440 --> 00:02:02,960
and if you do this,
40
00:02:02,960 --> 00:02:05,000
you get the normalized x_1 to
41
00:02:05,000 --> 00:02:10,570
range from negative 0.18-0.82.
42
00:02:10,570 --> 00:02:13,880
Similarly, to mean
normalized x_2,
43
00:02:13,880 --> 00:02:16,925
you can calculate the
average of feature 2.
44
00:02:16,925 --> 00:02:20,350
For instance, Mu_2 may be 2.3.
45
00:02:20,350 --> 00:02:22,980
Then you can take each x_2,
46
00:02:22,980 --> 00:02:27,960
subtract Mu_2 and
divide by 5 minus 0.
47
00:02:27,960 --> 00:02:32,280
Again, the max 5 minus
the mean, which is 0.
48
00:02:32,280 --> 00:02:35,849
The mean normalized
x_2 now ranges
49
00:02:35,849 --> 00:02:41,155
from negative 0.46-0 54.
50
00:02:41,155 --> 00:02:43,205
If you plot the training data
51
00:02:43,205 --> 00:02:45,830
using the mean
normalized x_1 and x_2,
52
00:02:45,830 --> 00:02:47,990
it might look like this.
53
00:02:47,990 --> 00:02:51,020
There's one last
common re-scaling
54
00:02:51,020 --> 00:02:54,010
method call Z-score
normalization.
55
00:02:54,010 --> 00:02:56,360
To implement Z-score
normalization,
56
00:02:56,360 --> 00:02:58,190
you need to calculate
something called
57
00:02:58,190 --> 00:03:00,530
the standard deviation
of each feature.
58
00:03:00,530 --> 00:03:02,945
If you don't know what the
standard deviation is,
59
00:03:02,945 --> 00:03:04,310
don't worry about it, you won't
60
00:03:04,310 --> 00:03:06,130
need to know it for this course.
61
00:03:06,130 --> 00:03:07,700
Or if you've heard of
62
00:03:07,700 --> 00:03:10,280
the normal distribution
or the bell-shaped curve,
63
00:03:10,280 --> 00:03:12,590
sometimes also called the
Gaussian distribution,
64
00:03:12,590 --> 00:03:14,900
this is what the
standard deviation
65
00:03:14,900 --> 00:03:17,495
for the normal
distribution looks like.
66
00:03:17,495 --> 00:03:18,980
But if you haven't
heard of this,
67
00:03:18,980 --> 00:03:20,785
you don't need to worry
about that either.
68
00:03:20,785 --> 00:03:23,990
But if you do know what is
the standard deviation,
69
00:03:23,990 --> 00:03:26,720
then to implement a
Z-score normalization,
70
00:03:26,720 --> 00:03:29,240
you first calculate the mean Mu,
71
00:03:29,240 --> 00:03:31,880
as well as the
standard deviation,
72
00:03:31,880 --> 00:03:33,590
which is often denoted by
73
00:03:33,590 --> 00:03:38,135
the lowercase Greek alphabet
Sigma of each feature.
74
00:03:38,135 --> 00:03:41,270
For instance, maybe
feature 1 has
75
00:03:41,270 --> 00:03:46,405
a standard deviation
of 450 and mean 600,
76
00:03:46,405 --> 00:03:49,740
then to Z-score normalize x_1,
77
00:03:49,740 --> 00:03:51,405
take each x_1,
78
00:03:51,405 --> 00:03:53,900
subtract Mu_1, and
79
00:03:53,900 --> 00:03:56,660
then divide by the
standard deviation,
80
00:03:56,660 --> 00:03:59,620
which I'm going to
denote as Sigma 1.
81
00:03:59,620 --> 00:04:03,555
What you may find is that
the Z-score normalized
82
00:04:03,555 --> 00:04:08,650
x_1 now ranges from
negative 0.67-3.1.
83
00:04:09,650 --> 00:04:12,290
Similarly, if you calculate the
84
00:04:12,290 --> 00:04:14,810
second features
standard deviation
85
00:04:14,810 --> 00:04:19,855
to be 1.4 and mean to be 2.3,
86
00:04:19,855 --> 00:04:25,560
then you can compute x_2 minus
Mu_2 divided by Sigma_2,
87
00:04:25,560 --> 00:04:26,940
and in this case,
88
00:04:26,940 --> 00:04:30,330
the Z-score normalized
by x_2 might now
89
00:04:30,330 --> 00:04:36,060
range from negative 1.6-1.9.
90
00:04:36,060 --> 00:04:37,790
If you plot the training data on
91
00:04:37,790 --> 00:04:40,220
the normalized x_1
and x_2 on a graph,
92
00:04:40,220 --> 00:04:42,570
it might look like this.
93
00:04:42,650 --> 00:04:44,860
As a rule of thumb,
94
00:04:44,860 --> 00:04:47,104
when performing feature scaling,
95
00:04:47,104 --> 00:04:48,860
you might want to
aim for getting
96
00:04:48,860 --> 00:04:51,620
the features to range
from maybe anywhere
97
00:04:51,620 --> 00:04:54,320
around negative one
to somewhere around
98
00:04:54,320 --> 00:04:57,530
plus one for each feature x.
99
00:04:57,530 --> 00:05:00,170
But these values,
negative one and
100
00:05:00,170 --> 00:05:02,930
plus one can be a
little bit loose.
101
00:05:02,930 --> 00:05:06,380
If the features range from
negative three to plus
102
00:05:06,380 --> 00:05:10,445
three or negative
0.3 to plus 0.3,
103
00:05:10,445 --> 00:05:12,440
all of these are
completely okay.
104
00:05:12,440 --> 00:05:14,630
If you have a feature x_1 that
105
00:05:14,630 --> 00:05:17,255
winds up being between
zero and three,
106
00:05:17,255 --> 00:05:18,785
that's not a problem.
107
00:05:18,785 --> 00:05:21,050
You can re-scale it if you want,
108
00:05:21,050 --> 00:05:22,700
but if you don't re-scale it,
109
00:05:22,700 --> 00:05:24,355
it should work okay too.
110
00:05:24,355 --> 00:05:27,785
Or if you have a
different feature, x_2,
111
00:05:27,785 --> 00:05:29,840
whose values are
between negative
112
00:05:29,840 --> 00:05:32,180
2 and plus 0.5, again,
113
00:05:32,180 --> 00:05:34,715
that's okay, no
harm re-scaling it,
114
00:05:34,715 --> 00:05:38,500
but it might be okay if you
leave it alone as well.
115
00:05:38,500 --> 00:05:41,630
But if another feature,
like x_3 here,
116
00:05:41,630 --> 00:05:45,680
ranges from negative
100 to plus 100,
117
00:05:45,680 --> 00:05:48,500
then this takes on a very
different range of values,
118
00:05:48,500 --> 00:05:51,760
say something from around
negative one to plus one.
119
00:05:51,760 --> 00:05:56,330
You're probably better off
re-scaling this feature x_3 so
120
00:05:56,330 --> 00:05:57,770
that it ranges from something
121
00:05:57,770 --> 00:06:01,135
closer to negative
one to plus one.
122
00:06:01,135 --> 00:06:04,140
Similarly, if you have a feature
123
00:06:04,140 --> 00:06:07,055
x_4 that takes on
really small values,
124
00:06:07,055 --> 00:06:11,990
say between negative
0.001 and plus 0.001,
125
00:06:11,990 --> 00:06:14,680
then these values are so small.
126
00:06:14,680 --> 00:06:18,205
That means you may want
to re-scale it as well.
127
00:06:18,205 --> 00:06:21,805
Finally, what if
your feature x_5,
128
00:06:21,805 --> 00:06:23,645
such as measurements of
129
00:06:23,645 --> 00:06:26,195
a hospital patients
by the temperature
130
00:06:26,195 --> 00:06:32,095
ranges from 98.6-105
degrees Fahrenheit?
131
00:06:32,095 --> 00:06:35,690
In this case, these
values are around 100,
132
00:06:35,690 --> 00:06:37,430
which is actually pretty large
133
00:06:37,430 --> 00:06:40,130
compared to other
scale features,
134
00:06:40,130 --> 00:06:41,660
and this will actually cause
135
00:06:41,660 --> 00:06:44,140
gradient descent to
run more slowly.
136
00:06:44,140 --> 00:06:47,960
In this case, feature
re-scaling will likely help.
137
00:06:47,960 --> 00:06:50,360
There's almost never any harm to
138
00:06:50,360 --> 00:06:52,700
carrying out feature re-scaling.
139
00:06:52,700 --> 00:06:56,245
When in doubt, I encourage
you to just carry it out.
140
00:06:56,245 --> 00:06:58,605
That's it for feature scaling.
141
00:06:58,605 --> 00:06:59,900
With this little technique,
142
00:06:59,900 --> 00:07:01,790
you'll often be able to get
143
00:07:01,790 --> 00:07:04,805
gradient descent to
run much faster.
144
00:07:04,805 --> 00:07:07,480
That's features scaling.
145
00:07:07,480 --> 00:07:10,144
With or without feature scaling,
146
00:07:10,144 --> 00:07:11,765
when you run gradient descent,
147
00:07:11,765 --> 00:07:13,610
how can you know,
how can you check
148
00:07:13,610 --> 00:07:15,830
if gradient descent
is really working?
149
00:07:15,830 --> 00:07:17,150
If it is finding you
150
00:07:17,150 --> 00:07:19,975
the global minimum or
something close to it.
151
00:07:19,975 --> 00:07:21,335
In the next video,
152
00:07:21,335 --> 00:07:23,675
let's take a look
at how to recognize
153
00:07:23,675 --> 00:07:26,225
if gradient descent
is converging,
154
00:07:26,225 --> 00:07:28,220
and then in the
video after that,
155
00:07:28,220 --> 00:07:30,710
this will lead to
discussion of how to choose
156
00:07:30,710 --> 00:07:34,440
a good learning rate
for gradient descent.11069
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.