1
00:00:01,130 --> 00:00:06,810
Hello and welcome back to the course on deep learning. Today we talk about stochastic gradient descent.
2
00:00:07,220 --> 00:00:14,450
Previously we learned about gradient descent and we found out that it is a very efficient method to
3
00:00:14,450 --> 00:00:19,590
solve our optimization problem where we're trying to minimize the cost function.
4
00:00:19,640 --> 00:00:29,030
It basically takes us from 10 to the power of 57 years down to solving the problem within minutes or hours
5
00:00:29,480 --> 00:00:30,940
or within a day or so.
6
00:00:31,100 --> 00:00:37,490
And it really helps speed things up because we can see which way is downhill and we can just go in that
7
00:00:37,490 --> 00:00:41,400
direction and take steps and get to the minimum faster.
8
00:00:41,600 --> 00:00:50,030
But the thing with gradient descent is that this method requires the cost function
9
00:00:50,030 --> 00:00:50,990
to be convex.
10
00:00:51,140 --> 00:00:57,710
And as you can see here, we've specifically chosen a convex cost function. Basically, convex means that
11
00:00:58,160 --> 00:01:05,510
the function looks similar to what we are seeing now, that it just kind of curves in one direction
12
00:01:05,510 --> 00:01:09,220
and that in essence has one global minimum.
13
00:01:09,380 --> 00:01:11,560
And that's the one that we're going to find.
14
00:01:11,630 --> 00:01:14,060
But what if our function is not convex?
15
00:01:14,060 --> 00:01:16,250
What if our cost function is not convex?
16
00:01:16,370 --> 00:01:17,810
What if it looks something like this?
17
00:01:18,020 --> 00:01:19,660
Well, first of all, how could that happen?
18
00:01:19,880 --> 00:01:27,950
Well, that could happen if, first of all, we choose a cost function which is not the squared difference
19
00:01:28,010 --> 00:01:33,850
between ŷ and y, or even if we do choose that cost function,
20
00:01:33,860 --> 00:01:39,650
then in a multi-dimensional space it can actually turn into something that is not convex.
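As a reminder, the squared-difference cost referred to here, summed over all the rows, is commonly written as follows (the one-half factor is a convention I am assuming from the earlier lectures, not something stated in this clip):

C = \frac{1}{2} \sum_{i} (\hat{y}_i - y_i)^2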
21
00:01:39,780 --> 00:01:45,410
So what would happen in this case if we just tried to apply our normal gradient descent method? Something
22
00:01:45,410 --> 00:01:46,390
like this could happen.
23
00:01:46,520 --> 00:01:51,230
We could find a local minimum of the cost function rather than the global one.
24
00:01:51,230 --> 00:01:57,730
So this one was the best one, but we found the wrong one, and therefore we don't have the correct weights.
25
00:01:57,740 --> 00:01:59,940
We don't have an optimized neural network.
26
00:02:00,230 --> 00:02:02,480
We have a subpar neural network.
27
00:02:02,610 --> 00:02:04,470
And so what do we do in this case?
28
00:02:04,670 --> 00:02:09,110
Well, the answer here is stochastic
29
00:02:09,110 --> 00:02:10,050
gradient descent.
30
00:02:10,070 --> 00:02:15,260
And it turns out that stochastic gradient descent doesn't require the cost function to be convex.
31
00:02:15,380 --> 00:02:20,120
So let's have a look at the two differences between the normal gradient descent that we talked about
32
00:02:20,150 --> 00:02:21,600
and stochastic gradient descent.
33
00:02:21,860 --> 00:02:27,920
So normal gradient descent is when we take all of our rows and plug them into our neural network. Once
34
00:02:27,920 --> 00:02:33,890
again here we've got the neural network copied over several times but the rows are being plugged into
35
00:02:33,890 --> 00:02:36,050
that same neural network every time.
36
00:02:36,050 --> 00:02:39,200
So there's only one neural network; this is just for visualization purposes.
37
00:02:39,350 --> 00:02:43,880
And then once we plug them in, we calculate our cost function based on the formula, and, looking
38
00:02:43,880 --> 00:02:49,400
at the chart at the bottom, we adjust the weights. This is called the gradient descent
39
00:02:49,400 --> 00:02:54,480
method, or, to use the more proper term, the batch gradient descent method.
40
00:02:54,470 --> 00:03:01,940
So we take the whole batch from our sample, we apply it, and then we run it again.
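To make this concrete, here is a minimal Python sketch of a batch gradient descent loop, using a simple linear model in place of the full neural network; the model, the learning rate, and the number of epochs are illustrative assumptions, not taken from the lecture.

import numpy as np

def batch_gradient_descent(X, y, w, learning_rate=0.01, epochs=100):
    # X holds every row of the data set, y the targets, w the weights being trained
    for epoch in range(epochs):
        y_hat = X @ w                            # run ALL rows through the model at once
        cost = 0.5 * np.sum((y_hat - y) ** 2)    # squared-difference cost over the whole batch
        grad = X.T @ (y_hat - y)                 # gradient of that cost with respect to w
        w = w - learning_rate * grad             # one weight update per full pass over the data
    return w

The defining feature is that the weights change only once per pass over the entire data set.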
41
00:03:01,940 --> 00:03:03,730
The stochastic gradient descent method is a bit different.
42
00:03:03,800 --> 00:03:10,880
Here we take the rows one by one: we take this row, we run our neural network, and then we adjust the
43
00:03:10,880 --> 00:03:12,020
weights.
44
00:03:12,020 --> 00:03:16,420
Then we move on to the second row: we take the second row, we run our neural network,
45
00:03:16,580 --> 00:03:21,640
we look at the cost function, and then we adjust the weights again. Then we take another row, row
46
00:03:21,640 --> 00:03:25,430
three: we run our neural network, we look at the cost function, we adjust the weights.
47
00:03:25,430 --> 00:03:32,660
So basically we're adjusting the weights after every single row, rather than running everything
48
00:03:32,660 --> 00:03:36,080
together and then adjusting the weights. Two different approaches.
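Here is the same sketch rewritten as stochastic gradient descent: the model and hyperparameters are the same illustrative assumptions, and the weights are now adjusted after every single row. The slides step through the rows in order; a shuffled order, as sketched here, is the more common variant and is my choice, not the lecture's.

import numpy as np

def stochastic_gradient_descent(X, y, w, learning_rate=0.01, epochs=100):
    n_rows = X.shape[0]
    for epoch in range(epochs):
        for i in np.random.permutation(n_rows):     # visit each row once, in a random order
            y_hat_i = X[i] @ w                      # run ONE row through the model
            error_i = y_hat_i - y[i]                # that row's contribution to the cost
            w = w - learning_rate * error_i * X[i]  # adjust the weights after this single row
    return w

So per epoch the weights are updated n_rows times instead of once, which is where the much higher fluctuations come from.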
49
00:03:36,230 --> 00:03:39,710
And now we're going to just compare the two side by side.
50
00:03:39,710 --> 00:03:42,920
So here they are; this is how to visually remember them.
51
00:03:42,920 --> 00:03:49,490
So you've got the batch gradient descent, where you are adjusting the weights after
52
00:03:49,490 --> 00:03:55,370
you've run all of the rows through your neural network, and then basically you adjust the weights and you run the
53
00:03:55,370 --> 00:04:00,500
whole thing again, iteration after iteration. In stochastic gradient descent you run one row at
54
00:04:00,500 --> 00:04:06,650
a time and you adjust the weights, adjust the weights, adjust the weights, and then you do everything again
55
00:04:06,770 --> 00:04:10,040
and again, and that is called stochastic gradient descent.
56
00:04:10,080 --> 00:04:16,580
And as we said, the main two differences are that the stochastic gradient descent method helps you
57
00:04:16,580 --> 00:04:27,470
avoid the problem where you find those local extrema, or local minima, rather than the overall
58
00:04:27,470 --> 00:04:28,620
global minimum.
59
00:04:29,030 --> 00:04:34,850
And the reason for that, in simple terms, is that the stochastic gradient descent method
60
00:04:35,150 --> 00:04:38,220
has much higher fluctuations because it can afford them.
61
00:04:38,210 --> 00:04:43,650
It's doing one iteration or one row at a time and therefore the fluctuations are much higher and it
62
00:04:43,650 --> 00:04:49,440
is much more likely to find the global minimum rather than just a local minimum.
63
00:04:49,460 --> 00:04:56,480
And the other thing about stochastic gradient descent, compared to batch gradient descent, is that it's faster.
64
00:04:56,480 --> 00:05:01,670
The first impression that you might have is that, because it's doing one row at a time, it is slower,
65
00:05:01,730 --> 00:05:09,050
but actually, in fact, it is faster, because it doesn't have to load up all the data into memory
66
00:05:09,080 --> 00:05:12,610
and run and wait until all of those rows are run altogether.
67
00:05:12,710 --> 00:05:16,780
You can just run them one by one, so it's a much lighter algorithm and much faster in that sense;
68
00:05:16,790 --> 00:05:24,020
so in that respect it has more advantages over the batch
69
00:05:24,110 --> 00:05:25,320
gradient descent method.
70
00:05:25,430 --> 00:05:31,310
The main advantage of, or the main kind of pro for, the batch gradient descent method is that it is a
71
00:05:31,310 --> 00:05:37,250
deterministic algorithm, as opposed to stochastic gradient descent being a stochastic algorithm, meaning
72
00:05:37,250 --> 00:05:44,570
it's random. And with the batch gradient descent method, as long as you have the same starting weights for
73
00:05:44,570 --> 00:05:45,430
your neural network,
74
00:05:45,500 --> 00:05:52,300
every time you run the batch gradient descent method you will get the same iterations, the same results
75
00:05:52,300 --> 00:05:57,960
for the way your weights are being updated. Whereas for the stochastic gradient descent
76
00:05:57,980 --> 00:05:58,300
method,
77
00:05:58,310 --> 00:06:04,550
you won't get that, because it is a stochastic method: you're picking your rows possibly at random and
78
00:06:04,570 --> 00:06:10,940
you are updating your neural network in a stochastic manner, and therefore every
79
00:06:10,940 --> 00:06:15,380
single time you run the stochastic gradient descent method, even if you have the same weights at the start, you're
80
00:06:15,380 --> 00:06:20,770
going to have a different process and different iterations to get there.
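In code terms, the stochasticity in the sketch above comes entirely from the random row order; the batch version has no such randomness. A small illustrative note (np.random.seed is standard NumPy; whether to fix the seed is a practical choice, not something discussed in this lecture):

import numpy as np

w0 = np.zeros(3)  # identical starting weights for every run
# batch_gradient_descent(X, y, w0)      -> the same iterations and the same result on every run
# stochastic_gradient_descent(X, y, w0) -> a different path each run, because the row order
#                                          (and hence each individual update) is random
np.random.seed(42)  # fixing the seed makes the random order repeatable, if you need reproducibility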
81
00:06:20,780 --> 00:06:28,100
So that's, in a nutshell, what stochastic gradient descent is. Also, there's a method in between the two
82
00:06:28,100 --> 00:06:34,520
called the mini-batch gradient descent method, where you combine the two, and basically, rather
83
00:06:34,520 --> 00:06:37,640
than running the whole batch or running one row at a time,
84
00:06:37,640 --> 00:06:44,150
you run batches of rows, maybe 5, 10, or 100, however many rows you decide to set. You run that number
85
00:06:44,150 --> 00:06:47,690
of rows at a time, then you update your weights, and so on.
86
00:06:47,900 --> 00:06:52,670
And that's called the mini-batch gradient descent method.
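Following the same pattern, here is a minimal sketch of mini-batch gradient descent; the batch size of 32 is an arbitrary illustrative choice (the lecture mentions 5, 10, or 100).

import numpy as np

def mini_batch_gradient_descent(X, y, w, learning_rate=0.01, epochs=100, batch_size=32):
    n_rows = X.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n_rows)        # shuffle the rows each epoch
        for start in range(0, n_rows, batch_size):
            rows = order[start:start + batch_size]   # the next mini-batch of rows
            y_hat = X[rows] @ w                      # run just these rows through the model
            grad = X[rows].T @ (y_hat - y[rows])     # gradient from this mini-batch only
            w = w - learning_rate * grad             # update the weights after each mini-batch
    return w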
87
00:06:52,670 --> 00:06:56,630
If you'd like to learn more about gradient descent, there's a great article which you can have a look at.
88
00:06:56,660 --> 00:07:04,940
It's called "A Neural Network in 13 Lines of Python (Part 2 - Gradient Descent)" by Andrew Trask, and the
89
00:07:04,940 --> 00:07:12,840
link is below. It's a good 2015 article, very well written, in very simple terms.
90
00:07:12,920 --> 00:07:21,860
It's got some interesting philosophical, or just interesting, thoughts on how to apply gradient descent, what,
91
00:07:22,340 --> 00:07:28,460
you know, the advantages and disadvantages are, and how to do things in certain situations, so you get
92
00:07:28,460 --> 00:07:30,730
some very cool tips, tricks, and hacks.
93
00:07:31,370 --> 00:07:33,620
A very easy read, so definitely check that out.
94
00:07:33,800 --> 00:07:37,010
And another one, a bit more of a heavy read,
95
00:07:37,010 --> 00:07:41,930
is for those of you who are into mathematics, who want to get to the bottom of the mathematics: why
96
00:07:41,930 --> 00:07:45,180
gradient descent is designed that specific way,
97
00:07:45,260 --> 00:07:49,200
what the formulas are that drive gradient descent, how it is calculated, and so on.
98
00:07:49,220 --> 00:07:51,610
Check out the article or actually the book.
99
00:07:51,620 --> 00:07:57,160
It's a free online book called "Neural Networks and Deep Learning" by Michael Nielsen, a 2015 book.
100
00:07:57,160 --> 00:08:02,190
It's basically all online; you can go ahead and check it out there.
101
00:08:02,450 --> 00:08:05,870
And there, again, it's a very soft introduction to the mathematics.
102
00:08:05,870 --> 00:08:12,260
But then the mathematics get pretty heavy as you go along, as you read through
103
00:08:12,530 --> 00:08:13,340
the article.
104
00:08:13,610 --> 00:08:20,240
But at the same time it gets you into that mood; I think it even has like a warm-up chapter where
105
00:08:20,240 --> 00:08:25,370
you first warm up on the math and then you jump in. So if you're interested in the math, then this is the article
106
00:08:25,370 --> 00:08:26,110
to go to.
107
00:08:26,540 --> 00:08:32,780
And there we go. So that's, in a nutshell, the difference between gradient descent and stochastic gradient descent,
108
00:08:32,810 --> 00:08:36,360
and how they work.
109
00:08:36,410 --> 00:08:39,830
And on that note, we're going to wrap up today's tutorial.
110
00:08:39,840 --> 00:08:42,000
I look forward to seeing you on the next one.
111
00:08:42,020 --> 00:08:44,090
And until then enjoy deep learning.