When running gradient descent, how can you tell if it is converging? That is, whether it's helping you to find parameters close to the global minimum of the cost function. By learning to recognize what a well-running implementation of gradient descent looks like, we will also, in a later video, be better able to choose a good learning rate Alpha. Let's take a look.
As a reminder, here's the gradient descent rule. One of the key choices is the choice of the learning rate Alpha.
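For reference, the rule on the slide is the simultaneous update from the earlier videos; written out in LaTeX notation, it repeats until convergence:

    w := w - \alpha \frac{\partial}{\partial w} J(w,b)
    b := b - \alpha \frac{\partial}{\partial b} J(w,b)

with both partial derivatives evaluated at the current values of w and b before either parameter is overwritten.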
Here's something that I often do to make sure that gradient descent is working well. Recall that the job of gradient descent is to find parameters w and b that hopefully minimize the cost function J. What I'll often do is plot the cost function J, which is calculated on the training set, and I plot the value of J at each iteration of gradient descent. Remember that each iteration means after each simultaneous update of the parameters w and b.
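To make this concrete, here is a minimal sketch of recording the cost at every iteration for univariate linear regression; the names compute_cost, gradient_descent, and cost_history are illustrative for this example, not code from the course:

    import numpy as np

    def compute_cost(x, y, w, b):
        # Squared error cost J(w,b), averaged over the m training examples
        m = x.shape[0]
        return np.sum((w * x + b - y) ** 2) / (2 * m)

    def gradient_descent(x, y, alpha=0.01, num_iters=400):
        # Runs gradient descent and records J after every simultaneous update
        w, b = 0.0, 0.0
        m = x.shape[0]
        cost_history = []
        for _ in range(num_iters):
            err = w * x + b - y         # errors at the current w and b
            dj_dw = np.dot(err, x) / m  # partial derivative of J w.r.t. w
            dj_db = np.sum(err) / m     # partial derivative of J w.r.t. b
            w = w - alpha * dj_dw       # both derivatives were computed before
            b = b - alpha * dj_db       # either update, so it is simultaneous
            cost_history.append(compute_cost(x, y, w, b))
        return w, b, cost_history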
In this plot, the horizontal axis is the number of iterations of gradient descent that you've run so far. You may get a curve that looks like this. Notice that the horizontal axis is the number of iterations of gradient descent, and not a parameter like w or b. This differs from previous graphs you've seen, where the vertical axis was the cost J and the horizontal axis was a single parameter like w or b. This curve is also called a learning curve. Note that there are a few different types of learning curves used in machine learning, and you'll see some of these types later in this course as well.
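A hedged sketch of the plot itself, assuming the hypothetical cost_history list recorded above and using matplotlib:

    import matplotlib.pyplot as plt

    # One point per iteration: iteration number vs. cost J on the training set
    plt.plot(range(1, len(cost_history) + 1), cost_history)
    plt.xlabel("Iterations of gradient descent")
    plt.ylabel("Cost J (training set)")
    plt.title("Learning curve")
    plt.show()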
Concretely, if you look here at this point on the curve, this means that after you've run gradient descent for 100 iterations, meaning 100 simultaneous updates of the parameters, you have some learned values for w and b. If you compute the cost J(w,b) for those values of w and b, the ones you got after 100 iterations, you get this value for the cost J. That is this point on the vertical axis. This point here corresponds to the value of J for the parameters that you got after 200 iterations of gradient descent.
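(With the hypothetical cost_history list from the sketch above, for example, cost_history[99] would be the point plotted at iteration 100, and cost_history[199] the point at iteration 200.)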
Looking at this graph helps you to see how your cost J changes after each iteration of gradient descent. If gradient descent is working properly, then the cost J should decrease after every single iteration. If J ever increases after one iteration, that means either Alpha is chosen poorly, and it usually means Alpha is too large, or there could be a bug in the code.
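One simple sanity check along these lines, again assuming the hypothetical cost_history list from earlier, might be:

    # Flag the first iteration, if any, where the cost went up instead of down
    for i in range(1, len(cost_history)):
        if cost_history[i] > cost_history[i - 1]:
            print(f"Cost increased at iteration {i + 1}: "
                  "Alpha may be too large, or there may be a bug.")
            break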
Another useful thing that this plot can tell you is that if you look at this curve, by the time you reach maybe 300 iterations or so, the cost J is leveling off and is no longer decreasing much. By 400 iterations, it looks like the curve has flattened out. This means that gradient descent has more or less converged, because the curve is no longer decreasing. Looking at this learning curve, you can try to spot whether or not gradient descent is converging.
By the way, the number of iterations that gradient descent takes to converge can vary a lot between different applications. In one application, it may converge after just 30 iterations. For a different application, it could take 1,000 or 100,000 iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, which is why you can create a graph like this, a learning curve, and try to find out when you can stop training your particular model. Another way to decide when your model is done training is with an automatic convergence test.
Here is the Greek letter epsilon. Let's let epsilon be a variable representing a small number, such as 0.001, that is, 10^-3.
If the cost J decreases by less than this number epsilon on one iteration, then you're likely on this flattened part of the curve that you see on the left, and you can declare convergence. Remember, convergence means, hopefully, that you've found parameters w and b that are close to the minimum possible value of J. I usually find that choosing the right threshold epsilon is pretty difficult. I actually tend to look at graphs like this one on the left, rather than rely on automatic convergence tests.
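For concreteness, one way such a test might be sketched, using the same hypothetical cost_history list and an epsilon of 10^-3:

    def has_converged(cost_history, epsilon=1e-3):
        # Declare convergence once J decreases by less than epsilon
        # between two consecutive iterations
        if len(cost_history) < 2:
            return False
        return (cost_history[-2] - cost_history[-1]) < epsilon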
Looking at this sort of figure can also give you some advance warning if gradient descent is not working correctly.
You've now seen what the learning curve should look like when gradient descent is running well. Let's take these insights and, in the next video, take a look at how to choose an appropriate learning rate.