Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:02,440 --> 00:00:04,040
So welcome back.
2
00:00:04,040 --> 00:00:08,540
Let's take a look at some techniques that
make great inter sense work much better.
3
00:00:08,540 --> 00:00:13,000
In this video you see a technique
called feature scaling that will enable
4
00:00:13,000 --> 00:00:15,940
gradient descent to run much faster.
5
00:00:15,940 --> 00:00:20,050
Let's start by taking a look at
the relationship between the size of
6
00:00:20,050 --> 00:00:23,653
a feature that is how big
are the numbers for that feature and
7
00:00:23,653 --> 00:00:26,840
the size of its associated parameter.
8
00:00:26,840 --> 00:00:31,728
As a concrete example, let's predict
the price of a house using two
9
00:00:31,728 --> 00:00:37,240
features x1 the size of the house and
x2 the number of bedrooms.
10
00:00:37,240 --> 00:00:42,670
Let's say that x1 typically ranges
from 300 to 2000 square feet.
11
00:00:42,670 --> 00:00:48,140
And x2 in the data set
ranges from 0 to 5 bedrooms.
12
00:00:48,140 --> 00:00:53,609
So for this example, x1 takes on
a relatively large range of values and
13
00:00:53,609 --> 00:00:58,140
x2 takes on a relatively
small range of values.
14
00:00:58,140 --> 00:01:03,485
Now let's take an example of a house
that has a size of 2000 square
15
00:01:03,485 --> 00:01:08,749
feet has five bedrooms and
a price of 500k or $500,000.
16
00:01:08,749 --> 00:01:13,464
For this one training example,
what do you think
17
00:01:13,464 --> 00:01:20,340
are reasonable values for
the size of parameters w1 and w2?
18
00:01:20,340 --> 00:01:23,640
Well, let's look at one
possible set of parameters.
19
00:01:23,640 --> 00:01:27,816
Say w1 is 50 and w2 is 0.1 and
20
00:01:27,816 --> 00:01:32,861
b is 50 for the purposes of discussion.
21
00:01:34,040 --> 00:01:39,173
So in this case the estimated
price in thousands
22
00:01:39,173 --> 00:01:46,240
of dollars is 100,000k
here plus 0.5 k plus 50 k.
23
00:01:46,240 --> 00:01:50,240
Which is slightly over
100 million dollars.
24
00:01:50,240 --> 00:01:57,540
So that's clearly very far from
the actual price of $500,000.
25
00:01:57,540 --> 00:02:03,540
And so this is not a very good set
of parameter choices for w1 and w2.
26
00:02:03,540 --> 00:02:07,506
Now let's take a look
at another possibility.
27
00:02:07,506 --> 00:02:11,039
Say w1 and w2 were the other way around.
28
00:02:11,039 --> 00:02:15,561
W1 is 0.1 and w2 is 50 and
b is still also 50.
29
00:02:15,561 --> 00:02:21,285
In this choice of w1 and w2,
w1 is relatively small and
30
00:02:21,285 --> 00:02:28,240
w2 is relatively large,
50 is much bigger than 0.1.
31
00:02:28,240 --> 00:02:33,014
So here the predicted price is 0.1 times
32
00:02:33,014 --> 00:02:38,240
2000 plus 50 times five plus 50.
33
00:02:38,240 --> 00:02:45,340
The first term becomes 200k, the second
term becomes 250k, and the plus 50.
34
00:02:45,340 --> 00:02:50,422
So this version of the model predicts
a price of $500,000 which is a much more
35
00:02:50,422 --> 00:02:55,451
reasonable estimate and happens to be the
same price as the true price of the house.
36
00:02:56,540 --> 00:03:01,675
So hopefully you might notice that when
a possible range of values of a feature
37
00:03:01,675 --> 00:03:06,819
is large, like the size and square feet
which goes all the way up to 2000.
38
00:03:06,819 --> 00:03:11,455
It's more likely that a good model will
learn to choose a relatively small
39
00:03:11,455 --> 00:03:14,540
parameter value, like 0.1.
40
00:03:14,540 --> 00:03:18,636
Likewise, when the possible
values of the feature are small,
41
00:03:18,636 --> 00:03:22,345
like the number of bedrooms,
then a reasonable value for
42
00:03:22,345 --> 00:03:26,340
its parameters will be
relatively large like 50.
43
00:03:26,340 --> 00:03:30,240
So how does this relate
to grading descent?
44
00:03:30,240 --> 00:03:35,027
Well, let's take a look at
the scatter plot of the features
45
00:03:35,027 --> 00:03:39,528
where the size square feet is
the horizontal axis x1 and
46
00:03:39,528 --> 00:03:44,640
the number of bedrooms exudes
is on the vertical axis.
47
00:03:44,640 --> 00:03:49,416
If you plot the training data, you notice
that the horizontal axis is on a much
48
00:03:49,416 --> 00:03:54,061
larger scale or much larger range of
values compared to the vertical axis.
49
00:03:55,340 --> 00:04:00,787
Next let's look at how the cost
function might look in a contour plot.
50
00:04:00,787 --> 00:04:06,856
You might see a contour plot where the
horizontal axis has a much narrower range,
51
00:04:06,856 --> 00:04:11,765
say between zero and one,
whereas the vertical axis takes on much
52
00:04:11,765 --> 00:04:15,840
larger values, say between 10 and 100.
53
00:04:15,840 --> 00:04:19,027
So the contours form ovals or ellipses and
54
00:04:19,027 --> 00:04:23,640
they're short on one side and
longer on the other.
55
00:04:23,640 --> 00:04:29,115
And this is because a very small change
to w1 can have a very large impact
56
00:04:29,115 --> 00:04:34,331
on the estimated price and
that's a very large impact on the cost J.
57
00:04:34,331 --> 00:04:38,763
Because w1 tends to be multiplied
by a very large number,
58
00:04:38,763 --> 00:04:41,540
the size and square feet.
59
00:04:41,540 --> 00:04:42,544
In contrast,
60
00:04:42,544 --> 00:04:47,908
it takes a much larger change in w2 in
order to change the predictions much.
61
00:04:47,908 --> 00:04:54,540
And thus small changes to w2, don't
change the cost function nearly as much.
62
00:04:54,540 --> 00:04:56,540
So where does this leave us?
63
00:04:56,540 --> 00:05:01,699
This is what might end up happening
if you were to run great in dissent,
64
00:05:01,699 --> 00:05:04,853
if you were to use your
training data as is.
65
00:05:04,853 --> 00:05:07,442
Because the contours are so tall and
66
00:05:07,442 --> 00:05:11,930
skinny gradient descent may end
up bouncing back and forth for
67
00:05:11,930 --> 00:05:17,340
a long time before it can finally
find its way to the global minimum.
68
00:05:17,340 --> 00:05:22,340
In situations like this, a useful
thing to do is to scale the features.
69
00:05:22,340 --> 00:05:28,242
This means performing some
transformation of your training data so
70
00:05:28,242 --> 00:05:34,990
that x1 say might now range from 0 to
1 and x2 might also range from 0 to 1.
71
00:05:34,990 --> 00:05:39,875
So the data points now look more like this
and you might notice that the scale of
72
00:05:39,875 --> 00:05:43,951
the plot on the bottom is now quite
different than the one on top.
73
00:05:45,040 --> 00:05:48,045
The key point is that the re scale x1 and
74
00:05:48,045 --> 00:05:54,040
x2 are both now taking comparable
ranges of values to each other.
75
00:05:54,040 --> 00:05:58,851
And if you run gradient descent on
a cost function to find on this,
76
00:05:58,851 --> 00:06:03,924
re scaled x1 and x2 using this
transformed data, then the contours
77
00:06:03,924 --> 00:06:08,980
will look more like this more like
circles and less tall and skinny.
78
00:06:08,980 --> 00:06:13,461
And gradient descent can find a much
more direct path to the global minimum.
79
00:06:14,640 --> 00:06:18,875
So to recap, when you have different
features that take on very different
80
00:06:18,875 --> 00:06:22,632
ranges of values, it can cause
gradient descent to run slowly but
81
00:06:22,632 --> 00:06:27,460
re scaling the different features so they
all take on comparable range of values.
82
00:06:27,460 --> 00:06:30,640
because speed, upgrade and
dissent significantly.
83
00:06:30,640 --> 00:06:31,970
How do you actually do this?
84
00:06:31,970 --> 00:06:34,060
Let's take a look at
that in the next video.7743
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.