1
00:00:01,400 --> 00:00:04,080
Later in this specialization,
2
00:00:04,080 --> 00:00:05,880
we'll talk about debugging and
3
00:00:05,880 --> 00:00:07,350
diagnosing things that can go
4
00:00:07,350 --> 00:00:09,090
wrong with learning algorithms.
5
00:00:09,090 --> 00:00:11,730
You'll also learn about
specific tools to
6
00:00:11,730 --> 00:00:13,860
recognize when overfitting and
7
00:00:13,860 --> 00:00:16,005
underfitting may be occurring.
8
00:00:16,005 --> 00:00:19,320
But for now, when you think
overfitting has occurred,
9
00:00:19,320 --> 00:00:21,870
let's talk about what you
can do to address it.
10
00:00:21,870 --> 00:00:24,035
Let's say you fit a model
11
00:00:24,035 --> 00:00:27,110
and it has high
variance, that is, it's overfit.
12
00:00:27,110 --> 00:00:31,375
Here's our overfit house
price prediction model.
13
00:00:31,375 --> 00:00:34,880
One way to address
this problem is to
14
00:00:34,880 --> 00:00:38,680
collect more training
data, that's one option.
15
00:00:38,680 --> 00:00:40,835
If you're able to get more data,
16
00:00:40,835 --> 00:00:42,875
that is, more training examples
17
00:00:42,875 --> 00:00:45,595
on sizes and prices of houses,
18
00:00:45,595 --> 00:00:48,485
then with the larger
training set,
19
00:00:48,485 --> 00:00:50,705
the learning algorithm
will learn to
20
00:00:50,705 --> 00:00:53,770
fit a function that
is less wiggly.
21
00:00:53,770 --> 00:00:55,580
You can continue to fit
22
00:00:55,580 --> 00:00:57,200
a high order polynomial
23
00:00:57,200 --> 00:00:59,675
or some other function
with a lot of features,
24
00:00:59,675 --> 00:01:02,135
and if you have enough
training examples,
25
00:01:02,135 --> 00:01:04,135
it will still do okay.
26
00:01:04,135 --> 00:01:08,050
To summarize, the
number one tool you can
27
00:01:08,050 --> 00:01:11,700
use against overfitting is
to get more training data.
28
00:01:11,700 --> 00:01:14,970
Now, getting more data
isn't always an option.
29
00:01:14,970 --> 00:01:16,660
Maybe only so many houses have
30
00:01:16,660 --> 00:01:18,460
been sold in this location,
31
00:01:18,460 --> 00:01:21,280
so maybe there just isn't
more data to be added.
32
00:01:21,280 --> 00:01:22,945
But when the data is available,
33
00:01:22,945 --> 00:01:24,440
this can work really well.
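If you'd like to see this idea in code, here's a minimal sketch (not part of the lecture) that fits the same degree-4 polynomial to 5 and then to 200 made-up house-size and price examples; with more data, the high-order coefficients tend to shrink and the fitted curve gets less wiggly. The synthetic data and the true_price helper are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def true_price(size):
    # Hypothetical "true" relationship between size and price (illustration only).
    return 50 + 0.4 * size

def fit_poly(n_examples, degree=4):
    sizes = rng.uniform(500, 3500, n_examples)                  # square feet
    prices = true_price(sizes) + rng.normal(0, 40, n_examples)  # noisy prices
    # Scale the feature so the polynomial fit is numerically stable.
    return np.polyfit(sizes / 1000.0, prices, degree)

small_fit = fit_poly(n_examples=5)
large_fit = fit_poly(n_examples=200)

# With only 5 examples the high-degree coefficients are typically much larger
# (a wiggly curve); with 200 examples they shrink toward a smoother fit.
print("5 examples  :", np.round(small_fit, 1))
print("200 examples:", np.round(large_fit, 1))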
34
00:01:24,440 --> 00:01:26,800
A second option for addressing
35
00:01:26,800 --> 00:01:30,785
overfitting is to see if
you can use fewer features.
36
00:01:30,785 --> 00:01:33,145
In the previous video,
37
00:01:33,145 --> 00:01:36,490
our model's features
included the size x,
38
00:01:36,490 --> 00:01:39,580
as well as the size squared,
that is x squared,
39
00:01:39,580 --> 00:01:43,895
and x cubed and x^4 and so on.
40
00:01:43,895 --> 00:01:47,560
These were a lot of
polynomial features.
41
00:01:47,560 --> 00:01:51,190
In that case, one way to
reduce overfitting is to
42
00:01:51,190 --> 00:01:55,000
just not use so many of
these polynomial features.
43
00:01:55,000 --> 00:01:57,655
But now let's look at
a different example.
44
00:01:57,655 --> 00:02:00,280
Maybe you have a lot of
different features of
45
00:02:00,280 --> 00:02:02,845
a house with which to try
to predict its price,
46
00:02:02,845 --> 00:02:05,170
ranging from the size,
number of bedrooms,
47
00:02:05,170 --> 00:02:06,895
number of floors, the age,
48
00:02:06,895 --> 00:02:08,740
average income of
the neighborhood,
49
00:02:08,740 --> 00:02:10,090
and so on and so forth,
50
00:02:10,090 --> 00:02:12,910
total distance to the
nearest coffee shop.
51
00:02:12,910 --> 00:02:16,150
It turns out that if you
have a lot of features like
52
00:02:16,150 --> 00:02:19,420
these but don't have
enough training data,
53
00:02:19,420 --> 00:02:21,100
then your learning algorithm may
54
00:02:21,100 --> 00:02:23,695
also overfit to
your training set.
55
00:02:23,695 --> 00:02:26,875
Now, instead of using
all 100 features,
56
00:02:26,875 --> 00:02:30,160
we could pick just a
subset of the most useful ones,
57
00:02:30,160 --> 00:02:33,740
maybe size, bedrooms,
58
00:02:33,740 --> 00:02:35,900
and the age of the house.
59
00:02:35,900 --> 00:02:38,545
If you think those are the
most relevant features,
60
00:02:38,545 --> 00:02:41,605
then using just that
smaller subset of features,
61
00:02:41,605 --> 00:02:45,815
you may find that your model
no longer overfits as badly.
62
00:02:45,815 --> 00:02:48,860
Choosing the most appropriate
set of features to
63
00:02:48,860 --> 00:02:52,240
use is sometimes also
called feature selection.
64
00:02:52,240 --> 00:02:54,700
One way you could
do so is to use
65
00:02:54,700 --> 00:02:56,500
your intuition to
choose what you
66
00:02:56,500 --> 00:02:58,360
think is the best
set of features,
67
00:02:58,360 --> 00:03:01,070
what's most relevant for
predicting the price.
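To make this concrete, here's a minimal sketch (not part of the lecture) of that kind of manual feature selection: keep only the columns you believe matter most and fit ordinary least squares on them. The feature names and the tiny data matrix are hypothetical.

import numpy as np

# Hypothetical training matrix: one row per house, one column per feature.
feature_names = ["size_sqft", "bedrooms", "floors", "age_years",
                 "neighborhood_income", "dist_to_coffee_shop"]
X = np.array([
    [2100, 3, 2, 15, 72000, 0.8],
    [1600, 2, 1, 40, 65000, 2.1],
    [2500, 4, 2,  5, 90000, 0.3],
    [1200, 2, 1, 60, 50000, 3.5],
], dtype=float)
y = np.array([400.0, 280.0, 520.0, 190.0])   # prices in $1000s (made up)

# Keep just the subset we think matters most: size, bedrooms, age.
keep = [feature_names.index(n) for n in ("size_sqft", "bedrooms", "age_years")]
X_small = X[:, keep]

# Fit ordinary least squares on the reduced feature set.
X_design = np.hstack([X_small, np.ones((X_small.shape[0], 1))])  # add intercept
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("weights (size, bedrooms, age, intercept):", np.round(w, 3))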
68
00:03:01,070 --> 00:03:04,570
Now, one disadvantage
of feature selection
69
00:03:04,570 --> 00:03:08,095
is that by using only a
subset of the features,
70
00:03:08,095 --> 00:03:10,420
the algorithm is
throwing away some of
71
00:03:10,420 --> 00:03:12,875
the information that you
have about the houses.
72
00:03:12,875 --> 00:03:15,420
For example, maybe all
of these features,
73
00:03:15,420 --> 00:03:17,620
all 100 of them are actually
74
00:03:17,620 --> 00:03:20,125
useful for predicting
the price of a house.
75
00:03:20,125 --> 00:03:22,390
Maybe you don't want
to throw away some of
76
00:03:22,390 --> 00:03:25,820
the information by throwing
away some of the features.
77
00:03:25,820 --> 00:03:27,495
Later in Course 2,
78
00:03:27,495 --> 00:03:30,610
you'll also see some
algorithms for automatically
79
00:03:30,610 --> 00:03:32,620
choosing the most
appropriate set of
80
00:03:32,620 --> 00:03:35,150
features to use for
our prediction task.
81
00:03:35,150 --> 00:03:36,910
Now, this takes us to
82
00:03:36,910 --> 00:03:39,295
the third option for
reducing overfitting.
83
00:03:39,295 --> 00:03:42,610
This technique, which we'll
look at in even greater depth
84
00:03:42,610 --> 00:03:46,400
in the next video is
called regularization.
85
00:03:46,400 --> 00:03:50,274
If you look at an overfit model,
86
00:03:50,274 --> 00:03:53,725
here's a model using
polynomial features: x,
87
00:03:53,725 --> 00:03:55,570
x squared, x cubed, and so on.
88
00:03:55,570 --> 00:03:59,665
You find that the parameters
are often relatively large.
89
00:03:59,665 --> 00:04:01,730
Now if you were to
90
00:04:01,730 --> 00:04:04,100
eliminate some of
these features, say,
91
00:04:04,100 --> 00:04:07,100
if you were to eliminate
the feature x^4,
92
00:04:07,100 --> 00:04:12,220
that corresponds to setting
this parameter to 0.
93
00:04:12,220 --> 00:04:15,140
So setting a parameter to 0
94
00:04:15,140 --> 00:04:17,660
is equivalent to
eliminating a feature,
95
00:04:17,660 --> 00:04:20,515
which is what we saw
on the previous slide.
96
00:04:20,515 --> 00:04:22,940
It turns out that regularization
97
00:04:22,940 --> 00:04:25,700
is a way to more gently reduce
98
00:04:25,700 --> 00:04:28,310
the impact of some of
the features without
99
00:04:28,310 --> 00:04:31,825
doing something as harsh as
eliminating them outright.
100
00:04:31,825 --> 00:04:34,540
What regularization
does is encourage
101
00:04:34,540 --> 00:04:37,295
the learning algorithm
to shrink the values of
102
00:04:37,295 --> 00:04:39,470
the parameters
without necessarily
103
00:04:39,470 --> 00:04:43,505
demanding that the parameter
be set to exactly 0.
104
00:04:43,505 --> 00:04:45,920
It turns out that
even if you fit
105
00:04:45,920 --> 00:04:48,355
a higher order
polynomial like this,
106
00:04:48,355 --> 00:04:50,750
so long as you can get
the algorithm to use
107
00:04:50,750 --> 00:04:53,000
smaller parameter values: w1,
108
00:04:53,000 --> 00:04:55,175
w2, w3, w4,
109
00:04:55,175 --> 00:04:57,575
you end up with a curve
that fits
110
00:04:57,575 --> 00:05:00,275
the training data much better.
111
00:05:00,275 --> 00:05:02,210
So what regularization does,
112
00:05:02,210 --> 00:05:04,730
is it lets you keep
all of your features,
113
00:05:04,730 --> 00:05:07,190
but it just prevents
the features from
114
00:05:07,190 --> 00:05:09,920
having an overly large effect,
115
00:05:09,920 --> 00:05:13,720
which is what sometimes
can cause overfitting.
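As a rough preview, here's a minimal sketch (not part of the lecture) of one common form of regularization, an L2 or "ridge" penalty, solved in closed form with NumPy. All the polynomial features are kept; increasing the regularization strength lambda just shrinks the learned weights (the intercept column is left unpenalized, as discussed below). The synthetic data here is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data expanded into polynomial features x, x^2, x^3, x^4.
x = rng.uniform(0.0, 2.0, 12)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 12)
X = np.column_stack([x, x**2, x**3, x**4])

def fit_l2(X, y, lam):
    # Closed-form solution of the L2-regularized least-squares problem.
    # The last column of Xb is the intercept b, which is not regularized.
    m, n = X.shape
    Xb = np.hstack([X, np.ones((m, 1))])
    reg = lam * np.eye(n + 1)
    reg[n, n] = 0.0                       # do not penalize b
    theta = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
    return theta[:n], theta[n]            # (w1..wn, b)

for lam in (0.0, 1.0, 100.0):
    w, b = fit_l2(X, y, lam)
    print(f"lambda={lam:6.1f}  w={np.round(w, 3)}  b={b:.3f}")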
116
00:05:13,720 --> 00:05:15,995
By the way, by convention,
117
00:05:15,995 --> 00:05:20,960
we normally just reduce the
size of the wj parameters,
118
00:05:20,960 --> 00:05:23,125
that is, w1 through wn.
119
00:05:23,125 --> 00:05:25,970
It doesn't make a huge
difference whether you
120
00:05:25,970 --> 00:05:28,835
regularize the
parameter b as well,
121
00:05:28,835 --> 00:05:31,370
you could do so if you
want or not if you don't.
122
00:05:31,370 --> 00:05:33,650
I usually don't and it's just
123
00:05:33,650 --> 00:05:35,965
fine to regularize w1, w2,
124
00:05:35,965 --> 00:05:37,710
all the way to wn,
125
00:05:37,710 --> 00:05:41,265
but not really encourage
b to become smaller.
126
00:05:41,265 --> 00:05:43,775
In practice, it should make
very little difference
127
00:05:43,775 --> 00:05:47,035
whether you also
regularize b or not.
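To make that convention concrete, here's a minimal sketch (not part of the lecture) of what such a regularized squared-error cost might look like: the penalty term sums over w1 through wn only, and b is left unpenalized. The exact cost function is formulated in the next video, so treat this as an assumption-laden preview with made-up numbers.

import numpy as np

def regularized_cost(w, b, X, y, lam):
    """Squared-error cost plus an L2 penalty on w (b is not penalized)."""
    m = X.shape[0]
    predictions = X @ w + b                                 # model output for each example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    penalty = (lam / (2 * m)) * np.sum(w ** 2)              # sums over w1..wn, not b
    return squared_error + penalty

# Example with made-up numbers: a larger lambda raises the cost of large w.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 5.0])
w = np.array([10.0, -7.0])
b = 1.0
print(regularized_cost(w, b, X, y, lam=0.0))
print(regularized_cost(w, b, X, y, lam=10.0))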
128
00:05:47,035 --> 00:05:49,940
To recap, these are
129
00:05:49,940 --> 00:05:51,710
the three ways you saw in
130
00:05:51,710 --> 00:05:54,275
this video for
addressing overfitting.
131
00:05:54,275 --> 00:05:56,765
One, collect more data.
132
00:05:56,765 --> 00:05:58,955
If you can get more data,
133
00:05:58,955 --> 00:06:01,615
this can really help
reduce overfitting.
134
00:06:01,615 --> 00:06:03,800
Sometimes that's not possible.
135
00:06:03,800 --> 00:06:07,145
In that case, some of
the options are: two,
136
00:06:07,145 --> 00:06:11,735
try selecting and using only
a subset of the features.
137
00:06:11,735 --> 00:06:16,315
You'll learn more about
feature selection in Course 2.
138
00:06:16,315 --> 00:06:19,685
Three would be to
139
00:06:19,685 --> 00:06:23,210
reduce the size of the
parameters using regularization.
140
00:06:23,210 --> 00:06:26,470
This will be the subject
of the next video as well.
141
00:06:26,470 --> 00:06:29,675
Just for myself, I use
regularization all the time.
142
00:06:29,675 --> 00:06:31,580
So this is a very
useful technique
143
00:06:31,580 --> 00:06:33,320
for training
learning algorithms,
144
00:06:33,320 --> 00:06:35,705
including neural
networks specifically,
145
00:06:35,705 --> 00:06:38,515
which you'll see later in
this specialization as well.
146
00:06:38,515 --> 00:06:40,475
I hope you'll also check out
147
00:06:40,475 --> 00:06:43,820
the optional lab on overfitting.
148
00:06:43,820 --> 00:06:47,525
In the lab, you'll be able
to see different examples of
149
00:06:47,525 --> 00:06:50,060
overfitting and
adjust those examples
150
00:06:50,060 --> 00:06:52,660
by clicking on
options in the plots.
151
00:06:52,660 --> 00:06:54,360
You'll also be able to add
152
00:06:54,360 --> 00:06:56,060
your own data points
by clicking on
153
00:06:56,060 --> 00:07:00,835
the plot and see how that
changes the curve that is fit.
154
00:07:00,835 --> 00:07:04,610
You can also try examples
for both regression and
155
00:07:04,610 --> 00:07:07,070
classification, and you can
156
00:07:07,070 --> 00:07:10,160
change the degree of
the polynomial to be x,
157
00:07:10,160 --> 00:07:13,105
x squared, x cubed, and so on.
158
00:07:13,105 --> 00:07:15,980
The lab also lets you play with
159
00:07:15,980 --> 00:07:18,850
two different options for
addressing overfitting.
160
00:07:18,850 --> 00:07:21,470
You can add additional
training data to
161
00:07:21,470 --> 00:07:24,560
reduce overfitting and
you can also select which
162
00:07:24,560 --> 00:07:27,095
features to include
or to exclude
163
00:07:27,095 --> 00:07:30,790
as another way to try
to reduce overfitting.
164
00:07:30,790 --> 00:07:32,525
Please take a look at the lab,
165
00:07:32,525 --> 00:07:35,750
which I hope will help you
build your intuition about
166
00:07:35,750 --> 00:07:39,670
overfitting as well as some
methods for addressing it.
167
00:07:39,670 --> 00:07:42,620
In this video, you
also saw the idea of
168
00:07:42,620 --> 00:07:45,650
regularization at a
relatively high level.
169
00:07:45,650 --> 00:07:48,380
I realize that all
of these details on
170
00:07:48,380 --> 00:07:51,650
regularization may not fully
make sense to you yet.
171
00:07:51,650 --> 00:07:53,180
But in the next video,
172
00:07:53,180 --> 00:07:55,970
we'll start to formulate
exactly how to apply
173
00:07:55,970 --> 00:07:59,850
regularization and exactly
what regularization means.
174
00:07:59,850 --> 00:08:03,590
Then we'll start to figure out
how to make this work with
175
00:08:03,590 --> 00:08:05,510
our learning algorithms to make
176
00:08:05,510 --> 00:08:08,060
linear regression and
logistic regression,
177
00:08:08,060 --> 00:08:09,680
and in the future,
other algorithms
178
00:08:09,680 --> 00:08:11,750
as well, avoid overfitting.
179
00:08:11,750 --> 00:08:15,000
Let's take a look at
that in the next video.