1
00:00:03,410 --> 00:00:06,435
Let's take a look at
how you can actually
2
00:00:06,435 --> 00:00:09,540
implement the gradient
descent algorithm.
3
00:00:09,540 --> 00:00:13,725
Let me write down the
gradient descent algorithm.
4
00:00:13,725 --> 00:00:16,450
Here it is. On each step,
5
00:00:16,450 --> 00:00:18,410
w, the parameter,
6
00:00:18,410 --> 00:00:22,550
is updated to the
old value of w minus
7
00:00:22,550 --> 00:00:27,305
Alpha times this term d/dw
8
00:00:27,305 --> 00:00:30,430
of the cost function J of w, b.
9
00:00:30,430 --> 00:00:32,970
What this expression
is saying is,
10
00:00:32,970 --> 00:00:36,860
update your parameter w by
taking the current value of
11
00:00:36,860 --> 00:00:40,805
w and adjusting it
a small amount,
12
00:00:40,805 --> 00:00:43,070
which is this expression
on the right,
13
00:00:43,070 --> 00:00:47,820
minus Alpha times
this term over here.
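As a rough Python sketch of the update just described (not from the video itself), where alpha is the learning rate and dJ_dw is a hypothetical helper that returns the partial derivative of the cost J with respect to w:

    # One gradient descent step for the parameter w (sketch).
    # dJ_dw is a hypothetical helper, not something defined in the video.
    w = w - alpha * dJ_dw(w, b)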
14
00:00:48,980 --> 00:00:51,440
If you feel like there's a lot
15
00:00:51,440 --> 00:00:53,270
going on in this equation,
16
00:00:53,270 --> 00:00:55,310
it's okay, don't worry about it.
17
00:00:55,310 --> 00:00:57,490
We'll unpack it together.
18
00:00:57,490 --> 00:01:01,140
First, this equals sign notation here.
19
00:01:01,140 --> 00:01:03,335
Now, since I said
we're assigning
20
00:01:03,335 --> 00:01:06,350
w a value using this equal sign,
21
00:01:06,350 --> 00:01:08,090
in this context,
22
00:01:08,090 --> 00:01:11,710
this equal sign is the
assignment operator.
23
00:01:11,710 --> 00:01:13,969
Specifically, in this context,
24
00:01:13,969 --> 00:01:17,810
if you write code
that says a equals c,
25
00:01:17,810 --> 00:01:21,964
it means take the value c and
store it in your computer,
26
00:01:21,964 --> 00:01:23,665
in the variable a.
27
00:01:23,665 --> 00:01:26,330
Or if you write a
equals a plus 1,
28
00:01:26,330 --> 00:01:29,900
it means set the value of
a to be equal to a plus 1,
29
00:01:29,900 --> 00:01:33,325
or increment the
value of a by one.
30
00:01:33,325 --> 00:01:36,710
The assignment
operator in coding is
31
00:01:36,710 --> 00:01:41,095
different than truth
assertions in mathematics.
32
00:01:41,095 --> 00:01:43,505
Where if I write a equals c,
33
00:01:43,505 --> 00:01:45,080
I'm asserting, that is,
34
00:01:45,080 --> 00:01:47,240
I'm claiming that the values
35
00:01:47,240 --> 00:01:50,095
of a and c are equal
to each other.
36
00:01:50,095 --> 00:01:53,615
Hopefully, I will never write
a truth assertion a equals
37
00:01:53,615 --> 00:01:57,655
a plus 1 because that just
can't possibly be true.
38
00:01:57,655 --> 00:02:01,205
In Python and in other
programming languages,
39
00:02:01,205 --> 00:02:05,590
truth assertions are sometimes
written as equals equals,
40
00:02:05,590 --> 00:02:06,890
so you may see code
41
00:02:06,890 --> 00:02:10,110
that says a equals
equals c if you're
42
00:02:10,110 --> 00:02:14,014
testing whether a is equal
to c. But in math notation,
43
00:02:14,014 --> 00:02:15,860
as we conventionally use it,
44
00:02:15,860 --> 00:02:17,240
like in these videos,
45
00:02:17,240 --> 00:02:19,610
the equal sign can be used for
46
00:02:19,610 --> 00:02:23,215
either assignments or
for truth assertions.
47
00:02:23,215 --> 00:02:25,100
I'll try to make sure it's clear
48
00:02:25,100 --> 00:02:26,780
when I write an equal sign,
49
00:02:26,780 --> 00:02:29,225
whether we're assigning
a value to a variable,
50
00:02:29,225 --> 00:02:31,700
or whether we're
asserting the truth of
51
00:02:31,700 --> 00:02:34,765
the equality of two values.
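A small Python illustration of the two meanings (the variable names here are just examples):

    a = c        # assignment: take the value of c and store it in the variable a
    a = a + 1    # assignment: increment the value of a by one
    a == c       # truth assertion (comparison): evaluates to True or False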
52
00:02:34,765 --> 00:02:37,220
Now, let's dive more deeply
53
00:02:37,220 --> 00:02:39,965
into what the symbols
in this equation mean.
54
00:02:39,965 --> 00:02:45,070
The symbol here is the
Greek letter Alpha.
55
00:02:45,070 --> 00:02:50,720
In this equation, Alpha is
also called the learning rate.
56
00:02:50,720 --> 00:02:54,050
The learning rate is usually
a small positive number
57
00:02:54,050 --> 00:02:58,650
between 0 and 1 and it
might be, say, 0.01.
58
00:02:58,660 --> 00:03:00,920
What Alpha does is,
59
00:03:00,920 --> 00:03:02,570
it basically controls how
60
00:03:02,570 --> 00:03:05,515
big of a step you take downhill.
61
00:03:05,515 --> 00:03:08,285
If Alpha is very large,
62
00:03:08,285 --> 00:03:09,800
then that corresponds to
63
00:03:09,800 --> 00:03:12,604
a very aggressive gradient
descent procedure
64
00:03:12,604 --> 00:03:15,550
where you're trying to
take huge steps downhill.
65
00:03:15,550 --> 00:03:17,705
If Alpha is very small,
66
00:03:17,705 --> 00:03:20,705
then you'd be taking small
baby steps downhill.
67
00:03:20,705 --> 00:03:23,480
We'll come back later to
dive more deeply into
68
00:03:23,480 --> 00:03:26,560
how to choose a good
learning rate Alpha.
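In code, setting the learning rate might look as simple as this (the value 0.01 is just the example mentioned above, not a recommendation):

    alpha = 0.01  # learning rate: a small positive number between 0 and 1
    # A larger alpha means bigger, more aggressive steps downhill;
    # a smaller alpha means small baby steps downhill.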
69
00:03:26,560 --> 00:03:29,435
Finally, this term here,
70
00:03:29,435 --> 00:03:33,155
that's the derivative term
of the cost function J.
71
00:03:33,155 --> 00:03:35,150
Let's not worry
about the details
72
00:03:35,150 --> 00:03:36,740
of this derivative right now.
73
00:03:36,740 --> 00:03:38,510
But later on, you'll get to
74
00:03:38,510 --> 00:03:40,750
see more about the
derivative term.
75
00:03:40,750 --> 00:03:42,380
But for now, you can think of
76
00:03:42,380 --> 00:03:44,390
this derivative term that I drew
77
00:03:44,390 --> 00:03:46,685
a magenta box around
as telling you
78
00:03:46,685 --> 00:03:49,660
in which direction you want
to take your baby step.
79
00:03:49,660 --> 00:03:53,135
In combination with the
learning rate Alpha,
80
00:03:53,135 --> 00:03:55,160
it also determines the size
81
00:03:55,160 --> 00:03:57,625
of the steps you want
to take downhill.
82
00:03:57,625 --> 00:04:00,410
Now, I do want to mention that
83
00:04:00,410 --> 00:04:02,825
derivatives come from calculus.
84
00:04:02,825 --> 00:04:04,580
Even if you aren't familiar with
85
00:04:04,580 --> 00:04:06,820
calculus, don't worry about it.
86
00:04:06,820 --> 00:04:09,095
Even without knowing
any calculus,
87
00:04:09,095 --> 00:04:10,760
you'd be able to figure
out all you need
88
00:04:10,760 --> 00:04:12,515
to know about this
derivative term
89
00:04:12,515 --> 00:04:16,470
in this video and the
next. One more thing.
90
00:04:16,470 --> 00:04:19,100
Remember your model
has two parameters,
91
00:04:19,100 --> 00:04:22,040
not just w, but also b.
92
00:04:22,040 --> 00:04:25,040
You also have an
assignment operation to
93
00:04:25,040 --> 00:04:28,595
update the parameter b
that looks very similar.
94
00:04:28,595 --> 00:04:33,680
b is assigned the
old value of b minus
95
00:04:33,680 --> 00:04:36,290
the learning rate Alpha times
96
00:04:36,290 --> 00:04:39,920
this slightly different
derivative term,
97
00:04:39,920 --> 00:04:42,895
d/db of J of w, b.
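The matching Python sketch for b, again assuming a hypothetical helper dJ_db for the partial derivative of J with respect to b:

    # One gradient descent step for the parameter b (sketch).
    b = b - alpha * dJ_db(w, b)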
98
00:04:42,895 --> 00:04:46,460
Remember in the graph
of the surface plot
99
00:04:46,460 --> 00:04:48,290
where you're taking baby steps
100
00:04:48,290 --> 00:04:50,315
until you get to the
bottom of the valley,
101
00:04:50,315 --> 00:04:53,210
well, for the gradient
descent algorithm,
102
00:04:53,210 --> 00:04:54,410
you're going to repeat
103
00:04:54,410 --> 00:04:57,950
these two update steps until
the algorithm converges.
104
00:04:57,950 --> 00:05:00,710
By converges, I mean that you
105
00:05:00,710 --> 00:05:03,470
reach the point at a
local minimum where
106
00:05:03,470 --> 00:05:05,990
the parameters w and b no longer
107
00:05:05,990 --> 00:05:09,830
change much with each
additional step that you take.
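A very rough sketch of that repeat-until-convergence loop, where dJ_dw and dJ_db are the hypothetical derivative helpers from above, and max_iters and tol are assumed stopping choices that the video does not specify:

    for _ in range(max_iters):
        # Compute both updates from the current values of w and b.
        new_w = w - alpha * dJ_dw(w, b)
        new_b = b - alpha * dJ_db(w, b)
        # Converged: the parameters no longer change much with each step.
        converged = abs(new_w - w) < tol and abs(new_b - b) < tol
        w, b = new_w, new_b
        if converged:
            break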
108
00:05:09,830 --> 00:05:13,054
Now, there's one
more subtle detail
109
00:05:13,054 --> 00:05:16,340
about how to correctly
implement gradient descent. For gradient descent,
110
00:05:16,340 --> 00:05:20,465
you're going to update
two parameters, w and b.
111
00:05:20,465 --> 00:05:25,655
This update takes place for
both parameters, w and b.
112
00:05:25,655 --> 00:05:30,905
One important detail is
that for gradient descent,
113
00:05:30,905 --> 00:05:35,615
you want to simultaneously
update w and b,
114
00:05:35,615 --> 00:05:37,205
meaning you want to update
115
00:05:37,205 --> 00:05:39,710
both parameters
at the same time.
116
00:05:39,710 --> 00:05:43,325
What I mean by that, is
that in this expression,
117
00:05:43,325 --> 00:05:48,290
you're going to update w
from the old w to a new w,
118
00:05:48,290 --> 00:05:50,840
and you're also updating b from
119
00:05:50,840 --> 00:05:54,815
its old value to
a new value of b.
120
00:05:54,815 --> 00:05:58,850
The way to implement this is
to compute the right side,
121
00:05:58,850 --> 00:06:02,570
computing this
thing for w and b,
122
00:06:02,570 --> 00:06:05,630
and simultaneously
at the same time,
123
00:06:05,630 --> 00:06:10,745
update w and b to
the new values.
124
00:06:10,745 --> 00:06:13,955
Let's take a look
at what this means.
125
00:06:13,955 --> 00:06:16,340
Here's the correct
way to implement
126
00:06:16,340 --> 00:06:19,790
gradient descent which does
a simultaneous update.
127
00:06:19,790 --> 00:06:24,620
This sets a variable temp_w
equal to that expression,
128
00:06:24,620 --> 00:06:27,470
which is w minus that term here.
129
00:06:27,470 --> 00:06:31,475
You also set another
variable temp_b to that,
130
00:06:31,475 --> 00:06:33,755
which is b minus that term.
131
00:06:33,755 --> 00:06:36,875
You compute both right-hand
sides, both updates,
132
00:06:36,875 --> 00:06:40,775
and store them into
variables temp_w and temp_b.
133
00:06:40,775 --> 00:06:46,655
Then you copy the value
of temp_w into w,
134
00:06:46,655 --> 00:06:51,215
and you also copy the
value of temp_b into b.
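Written as a Python sketch (again with the hypothetical helpers dJ_dw and dJ_db), the correct, simultaneous update looks like this:

    # Correct: compute both right-hand sides first, then update w and b together.
    temp_w = w - alpha * dJ_dw(w, b)
    temp_b = b - alpha * dJ_db(w, b)   # still uses the pre-update w
    w = temp_w
    b = temp_b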
135
00:06:51,215 --> 00:06:55,490
Now, one thing you may
notice is that this value of
136
00:06:55,490 --> 00:07:00,005
w is from before
w gets updated.
137
00:07:00,005 --> 00:07:03,695
Here, notice that
the pre-update w
138
00:07:03,695 --> 00:07:07,640
is what goes into the
derivative term over here.
139
00:07:07,640 --> 00:07:10,190
In contrast, here is
140
00:07:10,190 --> 00:07:12,080
an incorrect implementation of
141
00:07:12,080 --> 00:07:15,665
gradient descent that does
not do a simultaneous update.
142
00:07:15,665 --> 00:07:18,350
In this incorrect
implementation,
143
00:07:18,350 --> 00:07:20,315
we compute temp_w,
144
00:07:20,315 --> 00:07:23,505
same as before, so
far that's okay.
145
00:07:23,505 --> 00:07:26,725
Now here's where things
start to differ.
146
00:07:26,725 --> 00:07:29,800
We then update w
with the value in
147
00:07:29,800 --> 00:07:33,670
temp_w before calculating
the new value
148
00:07:33,670 --> 00:07:35,550
for the other parameter b.
149
00:07:35,550 --> 00:07:40,460
Next, we calculate temp_b
as b minus that term here,
150
00:07:40,460 --> 00:07:45,095
and finally, we update b
with the value in temp_b.
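And the incorrect, non-simultaneous version as a sketch, for contrast:

    # Incorrect: w is overwritten before temp_b is computed, so the
    # derivative term for b sees the already-updated w, not the original one.
    temp_w = w - alpha * dJ_dw(w, b)
    w = temp_w
    temp_b = b - alpha * dJ_db(w, b)
    b = temp_b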
151
00:07:45,095 --> 00:07:47,510
The difference between
the right-hand side and
152
00:07:47,510 --> 00:07:49,160
the left-hand side
implementations
153
00:07:49,160 --> 00:07:51,305
is that if you look over here,
154
00:07:51,305 --> 00:07:55,865
this w has already been
updated to this new value,
155
00:07:55,865 --> 00:07:58,940
and this is the updated
w that actually
156
00:07:58,940 --> 00:08:02,945
goes into the cost
function J of w, b.
157
00:08:02,945 --> 00:08:05,900
It means that this term here
on the right is not the
158
00:08:05,900 --> 00:08:10,655
same as this term over here
that you see on the left.
159
00:08:10,655 --> 00:08:15,080
That also means
this temp_b term on
160
00:08:15,080 --> 00:08:16,850
the right is not quite the
161
00:08:16,850 --> 00:08:20,015
same as the temp_b
term on the left,
162
00:08:20,015 --> 00:08:22,130
and thus this updated value for
163
00:08:22,130 --> 00:08:24,215
b on the right is not the same
164
00:08:24,215 --> 00:08:29,165
as this updated value for
variable b on the left.
165
00:08:29,165 --> 00:08:32,105
The way that gradient descent
is implemented in code,
166
00:08:32,105 --> 00:08:35,360
it actually turns out to be
more natural to implement
167
00:08:35,360 --> 00:08:39,680
it the correct way with
simultaneous updates.
168
00:08:39,680 --> 00:08:42,695
When you hear someone talk
about gradient descent,
169
00:08:42,695 --> 00:08:44,990
they always mean the
gradient descent where you
170
00:08:44,990 --> 00:08:48,335
perform a simultaneous
update of the parameters.
171
00:08:48,335 --> 00:08:50,720
If however, you were
172
00:08:50,720 --> 00:08:53,375
to implement
a non-simultaneous update,
173
00:08:53,375 --> 00:08:57,230
it turns out it will probably
work more or less anyway.
174
00:08:57,230 --> 00:08:59,000
But doing it this way isn't
175
00:08:59,000 --> 00:09:00,965
really the correct
way to implement it,
176
00:09:00,965 --> 00:09:02,750
it's actually some other algorithm
177
00:09:02,750 --> 00:09:04,385
with different properties.
178
00:09:04,385 --> 00:09:06,500
I would advise you
to just stick to
179
00:09:06,500 --> 00:09:08,990
the correct
simultaneous update and
180
00:09:08,990 --> 00:09:12,560
not use this incorrect
version on the right.
181
00:09:12,560 --> 00:09:14,780
That's gradient descent.
182
00:09:14,780 --> 00:09:16,070
In the next video,
183
00:09:16,070 --> 00:09:17,450
we'll go into details of
184
00:09:17,450 --> 00:09:20,240
the derivative term which
you saw in this video,
185
00:09:20,240 --> 00:09:22,895
but that we didn't really
talk about in detail.
186
00:09:22,895 --> 00:09:25,475
Derivatives are
part of calculus,
187
00:09:25,475 --> 00:09:27,125
and again, if you're
not familiar with
188
00:09:27,125 --> 00:09:29,115
calculus, don't worry about it.
189
00:09:29,115 --> 00:09:31,720
You don't need to know
calculus at all in order to
190
00:09:31,720 --> 00:09:34,540
complete this course or
this specialization,
191
00:09:34,540 --> 00:09:36,835
and you have all the
information you need
192
00:09:36,835 --> 00:09:39,550
in order to implement
gradient descent.
193
00:09:39,550 --> 00:09:41,440
Coming up in the next video,
194
00:09:41,440 --> 00:09:43,810
we'll go over
derivatives together,
195
00:09:43,810 --> 00:09:45,280
and you'll come away with
196
00:09:45,280 --> 00:09:47,740
the intuition and
knowledge you need to
197
00:09:47,740 --> 00:09:51,975
be able to implement and apply
gradient descent yourself.
198
00:09:51,975 --> 00:09:53,945
I think that'll be
an exciting thing
199
00:09:53,945 --> 00:09:55,580
for you to know
how to implement.
200
00:09:55,580 --> 00:09:59,760
Let's go on to the next
video to see how to do that.