1
00:00:01,130 --> 00:00:06,810
Hello and welcome back to the course on deep learning. Today we talk about stochastic gradient descent.
2
00:00:07,220 --> 00:00:14,450
Previously we learned about gradient descent and we found out that it is a very efficient method to
3
00:00:14,450 --> 00:00:19,590
solve our optimization problem where we're trying to minimize the cost function.
4
00:00:19,640 --> 00:00:29,030
It basically takes us from 10 to the power of 57 years down to solving the problem within minutes or hours
5
00:00:29,480 --> 00:00:30,940
or within a day or so.
6
00:00:31,100 --> 00:00:37,490
And it really helps speed things up because we can see which way is downhill and we can just go in that
7
00:00:37,490 --> 00:00:41,400
direction and take steps and get to the minimum faster.
8
00:00:41,600 --> 00:00:50,030
But the thing with gradient descent is that this method requires the cost function
9
00:00:50,030 --> 00:00:50,990
to be convex.
10
00:00:51,140 --> 00:00:57,710
And as you can see here, we've specifically chosen a convex cost function. Basically, convex means that
11
00:00:58,160 --> 00:01:05,510
the function looks similar to what we are seeing now, that it just kind of curves in one direction
12
00:01:05,510 --> 00:01:09,220
and that in essence has one global minimum.
13
00:01:09,380 --> 00:01:11,560
And that's the one that we're going to find.
14
00:01:11,630 --> 00:01:14,060
But what if our function is not convex?
15
00:01:14,060 --> 00:01:16,250
What if our cost function is not convex?
16
00:01:16,370 --> 00:01:17,810
What if it looks something like this?
17
00:01:18,020 --> 00:01:19,660
Well, first of all, how could that happen?
18
00:01:19,880 --> 00:01:27,950
Well, that could happen if, first of all, we choose a cost function which is not the squared difference
19
00:01:28,010 --> 00:01:33,850
between ŷ and y, or even if we do choose that cost function,
20
00:01:33,860 --> 00:01:39,650
then in a multi-dimensional space it can actually turn into something that is not convex.
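As a reminder, the squared-difference cost referred to here, summed over all the rows, is commonly written as follows (the one-half factor is a convention I am assuming from the earlier lectures, not something stated in this clip):

C = \frac{1}{2} \sum_{i} (\hat{y}_i - y_i)^2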
21
00:01:39,780 --> 00:01:45,410
So what would happen in this case if we just tried to apply our normal gradient descent method? Something
22
00:01:45,410 --> 00:01:46,390
like this could happen.
23
00:01:46,520 --> 00:01:51,230
We could find a local minimum of the cost function rather than the global one.
24
00:01:51,230 --> 00:01:57,730
So this one was the best one, but we found the wrong one, and therefore we don't have the correct weights.
25
00:01:57,740 --> 00:01:59,940
We don't have an optimized neural network.
26
00:02:00,230 --> 00:02:02,480
We have a subpar neural network.
27
00:02:02,610 --> 00:02:04,470
And so what do we do in this case?
28
00:02:04,670 --> 00:02:09,110
Well, the answer here is stochastic
29
00:02:09,110 --> 00:02:10,050
gradient descent.
30
00:02:10,070 --> 00:02:15,260
And it turns out that stochastic gradient descent doesn't require the cost function to be convex.
31
00:02:15,380 --> 00:02:20,120
So let's have a look at the two differences between the normal gradient descent that we talked about
32
00:02:20,150 --> 00:02:21,600
and stochastic gradient descent.
33
00:02:21,860 --> 00:02:27,920
So normal gradient descent is when we take all of our rows and plug them into our neural network. Once
34
00:02:27,920 --> 00:02:33,890
again here we've got the neural network copied over several times but the rows are being plugged into
35
00:02:33,890 --> 00:02:36,050
that same neural network every time.
36
00:02:36,050 --> 00:02:39,200
So there's only one neural network; this is just for visualization purposes.
37
00:02:39,350 --> 00:02:43,880
And then once we plug them in, we calculate our cost function based on the formula, and, looking
38
00:02:43,880 --> 00:02:49,400
at the chart at the bottom, we adjust the weights. This is called the gradient descent
39
00:02:49,400 --> 00:02:54,480
method, or, to use the more proper term, the batch gradient descent method.
40
00:02:54,470 --> 00:03:01,940
So we take the whole batch from our sample, we apply it, and then we run it again.
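To make this concrete, here is a minimal Python sketch of a batch gradient descent loop, using a simple linear model in place of the full neural network; the model, the learning rate, and the number of epochs are illustrative assumptions, not taken from the lecture.

import numpy as np

def batch_gradient_descent(X, y, w, learning_rate=0.01, epochs=100):
    # X holds every row of the data set, y the targets, w the weights being trained
    for epoch in range(epochs):
        y_hat = X @ w                            # run ALL rows through the model at once
        cost = 0.5 * np.sum((y_hat - y) ** 2)    # squared-difference cost over the whole batch
        grad = X.T @ (y_hat - y)                 # gradient of that cost with respect to w
        w = w - learning_rate * grad             # one weight update per full pass over the data
    return w

The defining feature is that the weights change only once per pass over the entire data set.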
41
00:03:01,940 --> 00:03:03,730
The stochastic gradient descent method is a bit different.
42
00:03:03,800 --> 00:03:10,880
Here we take the rows one by one: we take this row, we run our neural network, and then we adjust the
43
00:03:10,880 --> 00:03:12,020
weights.
44
00:03:12,020 --> 00:03:16,420
Then we move on to the second row: we take the second row, we run our neural network,
45
00:03:16,580 --> 00:03:21,640
we look at the cost function, and then we adjust the weights again. Then we take another row, row
46
00:03:21,640 --> 00:03:25,430
three: we run our neural network, we look at the cost function, we adjust the weights.
47
00:03:25,430 --> 00:03:32,660
So basically we're adjusting the weights after every single row, rather than running everything
48
00:03:32,660 --> 00:03:36,080
together and then adjusting the weights. Two different approaches.
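Here is the same sketch rewritten as stochastic gradient descent: the model and hyperparameters are the same illustrative assumptions, and the weights are now adjusted after every single row. The slides step through the rows in order; a shuffled order, as sketched here, is the more common variant and is my choice, not the lecture's.

import numpy as np

def stochastic_gradient_descent(X, y, w, learning_rate=0.01, epochs=100):
    n_rows = X.shape[0]
    for epoch in range(epochs):
        for i in np.random.permutation(n_rows):     # visit each row once, in a random order
            y_hat_i = X[i] @ w                      # run ONE row through the model
            error_i = y_hat_i - y[i]                # that row's contribution to the cost
            w = w - learning_rate * error_i * X[i]  # adjust the weights after this single row
    return w

So per epoch the weights are updated n_rows times instead of once, which is where the much higher fluctuations come from.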
49
00:03:36,230 --> 00:03:39,710
And now we're going to just compare the two side by side.
50
00:03:39,710 --> 00:03:42,920
So here they are; this is how to visually remember them.
51
00:03:42,920 --> 00:03:49,490
So you've got the batch gradient descent, where you are adjusting the weights after
52
00:03:49,490 --> 00:03:55,370
you've run all of the rows through your neural network, and then basically you adjust the weights and you run the
53
00:03:55,370 --> 00:04:00,500
whole thing again, iteration after iteration. In stochastic gradient descent you run one row at
54
00:04:00,500 --> 00:04:06,650
a time and you adjust the weights, adjust the weights, adjust the weights, and then you do everything again
55
00:04:06,770 --> 00:04:10,040
and again, and that is called stochastic gradient descent.
56
00:04:10,080 --> 00:04:16,580
And as we said, the main two differences are that the stochastic gradient descent method helps you
57
00:04:16,580 --> 00:04:27,470
avoid the problem where you find those local extrema, or local minima, rather than the overall
58
00:04:27,470 --> 00:04:28,620
global minimum.
59
00:04:29,030 --> 00:04:34,850
And the reason for that, in simple terms, is that the stochastic gradient descent method
60
00:04:35,150 --> 00:04:38,220
has much higher fluctuations because it can afford them.
61
00:04:38,210 --> 00:04:43,650
It's doing one iteration or one row at a time and therefore the fluctuations are much higher and it
62
00:04:43,650 --> 00:04:49,440
is much more likely to find the global minimum rather than just a local minimum.
63
00:04:49,460 --> 00:04:56,480
And the other thing about stochastic gradient descent, compared to batch gradient descent, is that it's faster.
64
00:04:56,480 --> 00:05:01,670
The first impression that you might have is that, because it's doing one row at a time, it is slower,
65
00:05:01,730 --> 00:05:09,050
but actually, in fact, it is faster, because it doesn't have to load up all the data into memory
66
00:05:09,080 --> 00:05:12,610
and run and wait until all of those rows are run altogether.
67
00:05:12,710 --> 00:05:16,780
You can just run them one by one, so it's a much lighter algorithm and much faster in that sense;
68
00:05:16,790 --> 00:05:24,020
so in that respect it has more advantages over the batch
69
00:05:24,110 --> 00:05:25,320
gradient descent method.
70
00:05:25,430 --> 00:05:31,310
The main advantage of, or the main kind of pro for, the batch gradient descent method is that it is a
71
00:05:31,310 --> 00:05:37,250
deterministic algorithm, as opposed to stochastic gradient descent being a stochastic algorithm, meaning
72
00:05:37,250 --> 00:05:44,570
it's random. And with the batch gradient descent method, as long as you have the same starting weights for
73
00:05:44,570 --> 00:05:45,430
your neural network,
74
00:05:45,500 --> 00:05:52,300
every time you run the batch gradient descent method you will get the same iterations, the same results
75
00:05:52,300 --> 00:05:57,960
for the way your weights are being updated. Whereas for the stochastic gradient descent
76
00:05:57,980 --> 00:05:58,300
method,
77
00:05:58,310 --> 00:06:04,550
you won't get that, because it is a stochastic method: you're picking your rows possibly at random and
78
00:06:04,570 --> 00:06:10,940
you are updating your neural network in a stochastic manner, and therefore every
79
00:06:10,940 --> 00:06:15,380
single time you run the stochastic gradient descent method, even if you have the same weights at the start, you're
80
00:06:15,380 --> 00:06:20,770
going to have a different process and different iterations to get there.
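In code terms, the stochasticity in the sketch above comes entirely from the random row order; the batch version has no such randomness. A small illustrative note (np.random.seed is standard NumPy; whether to fix the seed is a practical choice, not something discussed in this lecture):

import numpy as np

w0 = np.zeros(3)  # identical starting weights for every run
# batch_gradient_descent(X, y, w0)      -> the same iterations and the same result on every run
# stochastic_gradient_descent(X, y, w0) -> a different path each run, because the row order
#                                          (and hence each individual update) is random
np.random.seed(42)  # fixing the seed makes the random order repeatable, if you need reproducibility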
81
00:06:20,780 --> 00:06:28,100
So that's, in a nutshell, what stochastic gradient descent is. Also, there's a method in between the two
82
00:06:28,100 --> 00:06:34,520
called the mini-batch gradient descent method, where you combine the two, and basically, rather
83
00:06:34,520 --> 00:06:37,640
than running the whole batch or running one row at a time,
84
00:06:37,640 --> 00:06:44,150
you run batches of rows, maybe 5, 10, or 100, however many rows you decide to set. You run that number
85
00:06:44,150 --> 00:06:47,690
of rows at a time, then you update your weights, and so on.
86
00:06:47,900 --> 00:06:52,670
And that's called the mini-batch gradient descent method.
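Following the same pattern, here is a minimal sketch of mini-batch gradient descent; the batch size of 32 is an arbitrary illustrative choice (the lecture mentions 5, 10, or 100).

import numpy as np

def mini_batch_gradient_descent(X, y, w, learning_rate=0.01, epochs=100, batch_size=32):
    n_rows = X.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n_rows)        # shuffle the rows each epoch
        for start in range(0, n_rows, batch_size):
            rows = order[start:start + batch_size]   # the next mini-batch of rows
            y_hat = X[rows] @ w                      # run just these rows through the model
            grad = X[rows].T @ (y_hat - y[rows])     # gradient from this mini-batch only
            w = w - learning_rate * grad             # update the weights after each mini-batch
    return w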
87
00:06:52,670 --> 00:06:56,630
If you'd like to learn more about gradient descent, there's a great article which you can have a look at.
88
00:06:56,660 --> 00:07:04,940
It's called "A Neural Network in 13 Lines of Python (Part 2 - Gradient Descent)" by Andrew Trask, and the
89
00:07:04,940 --> 00:07:12,840
link is below. It's a good 2015 article, very well written, in very simple terms.
90
00:07:12,920 --> 00:07:21,860
It's got some interesting philosophical, or just interesting, thoughts on how to apply gradient descent, what,
91
00:07:22,340 --> 00:07:28,460
you know, the advantages and disadvantages are, and how to do things in certain situations, so you get
92
00:07:28,460 --> 00:07:30,730
some very cool tips, tricks, and hacks.
93
00:07:31,370 --> 00:07:33,620
A very easy read, so definitely check that out.
94
00:07:33,800 --> 00:07:37,010
And another one, a bit more of a heavy read,
95
00:07:37,010 --> 00:07:41,930
is for those of you who are into mathematics, who want to get to the bottom of the mathematics: why
96
00:07:41,930 --> 00:07:45,180
gradient descent is designed that specific way,
97
00:07:45,260 --> 00:07:49,200
what the formulas are that drive gradient descent, how it is calculated, and so on.
98
00:07:49,220 --> 00:07:51,610
Check out the article or actually the book.
99
00:07:51,620 --> 00:07:57,160
It's a free online book called "Neural Networks and Deep Learning" by Michael Nielsen, a 2015 book.
100
00:07:57,160 --> 00:08:02,190
It's basically all online; you can go ahead and check it out there.
101
00:08:02,450 --> 00:08:05,870
And there, again, it's a very soft introduction to the mathematics.
102
00:08:05,870 --> 00:08:12,260
But then the mathematics get pretty heavy as you go along, as you read through
103
00:08:12,530 --> 00:08:13,340
the article.
104
00:08:13,610 --> 00:08:20,240
But at the same time it gets you into that mood; I think it even has like a warm-up chapter where
105
00:08:20,240 --> 00:08:25,370
you first warm up on the math and then you jump in. So if you're interested in the math, then this is the article
106
00:08:25,370 --> 00:08:26,110
to go to.
107
00:08:26,540 --> 00:08:32,780
And there we go. So that's, in a nutshell, the difference between gradient descent and stochastic gradient descent,
108
00:08:32,810 --> 00:08:36,360
and how they work.
109
00:08:36,410 --> 00:08:39,830
And on that note, we're going to wrap up today's tutorial.
110
00:08:39,840 --> 00:08:42,000
I look forward to seeing you on the next one.
111
00:08:42,020 --> 00:08:44,090
And until then enjoy deep learning.