When running gradient descent, how can you tell if it is converging? That is, whether it's helping you to find parameters close to the global minimum of the cost function. By learning to recognize what a well-running implementation of gradient descent looks like, we will also, in a later video, be better able to choose a good learning rate Alpha. Let's take a look.
As a reminder, here's the gradient descent rule. One of the key choices is the choice of the learning rate Alpha.
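For reference, the rule on the slide is the simultaneous update from the earlier videos; written out in LaTeX notation, it repeats until convergence:

    w := w - \alpha \frac{\partial}{\partial w} J(w,b)
    b := b - \alpha \frac{\partial}{\partial b} J(w,b)

with both partial derivatives evaluated at the current values of w and b before either parameter is overwritten.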
Here's something that I often do to make sure that gradient descent is working well. Recall that the job of gradient descent is to find parameters w and b that hopefully minimize the cost function J. What I'll often do is plot the cost function J, which is calculated on the training set, and I plot the value of J at each iteration of gradient descent. Remember that each iteration means after each simultaneous update of the parameters w and b.
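To make this concrete, here is a minimal sketch of recording the cost at every iteration for univariate linear regression; the names compute_cost, gradient_descent, and cost_history are illustrative for this example, not code from the course:

    import numpy as np

    def compute_cost(x, y, w, b):
        # Squared error cost J(w,b), averaged over the m training examples
        m = x.shape[0]
        return np.sum((w * x + b - y) ** 2) / (2 * m)

    def gradient_descent(x, y, alpha=0.01, num_iters=400):
        # Runs gradient descent and records J after every simultaneous update
        w, b = 0.0, 0.0
        m = x.shape[0]
        cost_history = []
        for _ in range(num_iters):
            err = w * x + b - y         # errors at the current w and b
            dj_dw = np.dot(err, x) / m  # partial derivative of J w.r.t. w
            dj_db = np.sum(err) / m     # partial derivative of J w.r.t. b
            w = w - alpha * dj_dw       # both derivatives were computed before
            b = b - alpha * dj_db       # either update, so it is simultaneous
            cost_history.append(compute_cost(x, y, w, b))
        return w, b, cost_history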
In this plot, the horizontal axis is the number of iterations of gradient descent that you've run so far. You may get a curve that looks like this. Notice that the horizontal axis is the number of iterations of gradient descent, and not a parameter like w or b. This differs from previous graphs you've seen, where the vertical axis was the cost J and the horizontal axis was a single parameter like w or b. This curve is also called a learning curve. Note that there are a few different types of learning curves used in machine learning, and you'll see some of these types later in this course as well.
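A hedged sketch of the plot itself, assuming the hypothetical cost_history list recorded above and using matplotlib:

    import matplotlib.pyplot as plt

    # One point per iteration: iteration number vs. cost J on the training set
    plt.plot(range(1, len(cost_history) + 1), cost_history)
    plt.xlabel("Iterations of gradient descent")
    plt.ylabel("Cost J (training set)")
    plt.title("Learning curve")
    plt.show()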
Concretely, if you look here at this point on the curve, this means that after you've run gradient descent for 100 iterations, meaning 100 simultaneous updates of the parameters, you have some learned values for w and b. If you compute the cost J(w,b) for those values of w and b, the ones you got after 100 iterations, you get this value for the cost J. That is this point on the vertical axis. This point here corresponds to the value of J for the parameters that you got after 200 iterations of gradient descent.
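(With the hypothetical cost_history list from the sketch above, for example, cost_history[99] would be the point plotted at iteration 100, and cost_history[199] the point at iteration 200.)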
Looking at this graph helps you to see how your cost J changes after each iteration of gradient descent. If gradient descent is working properly, then the cost J should decrease after every single iteration. If J ever increases after one iteration, that means either Alpha is chosen poorly, and it usually means Alpha is too large, or there could be a bug in the code.
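One simple sanity check along these lines, again assuming the hypothetical cost_history list from earlier, might be:

    # Flag the first iteration, if any, where the cost went up instead of down
    for i in range(1, len(cost_history)):
        if cost_history[i] > cost_history[i - 1]:
            print(f"Cost increased at iteration {i + 1}: "
                  "Alpha may be too large, or there may be a bug.")
            break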
Another useful thing that this plot can tell you is that if you look at this curve, by the time you reach maybe 300 iterations or so, the cost J is leveling off and is no longer decreasing much. By 400 iterations, it looks like the curve has flattened out. This means that gradient descent has more or less converged, because the curve is no longer decreasing. Looking at this learning curve, you can try to spot whether or not gradient descent is converging.
By the way, the number of iterations that gradient descent takes to converge can vary a lot between different applications. In one application, it may converge after just 30 iterations. For a different application, it could take 1,000 or 100,000 iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, which is why you can create a graph like this, a learning curve, and try to find out when you can stop training your particular model. Another way to decide when your model is done training is with an automatic convergence test.
Here is the Greek letter epsilon. Let's let epsilon be a variable representing a small number, such as 0.001, that is, 10^-3.
If the cost J decreases by less than this number epsilon on one iteration, then you're likely on this flattened part of the curve that you see on the left, and you can declare convergence. Remember, convergence means, hopefully, that you've found parameters w and b that are close to the minimum possible value of J. I usually find that choosing the right threshold epsilon is pretty difficult. I actually tend to look at graphs like this one on the left, rather than rely on automatic convergence tests.
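For concreteness, one way such a test might be sketched, using the same hypothetical cost_history list and an epsilon of 10^-3:

    def has_converged(cost_history, epsilon=1e-3):
        # Declare convergence once J decreases by less than epsilon
        # between two consecutive iterations
        if len(cost_history) < 2:
            return False
        return (cost_history[-2] - cost_history[-1]) < epsilon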
Looking at this sort of figure can also give you some advance warning if gradient descent is not working correctly.
You've now seen what the learning curve should look like when gradient descent is running well. Let's take these insights and, in the next video, take a look at how to choose an appropriate learning rate.