subtitlecat.com

All language subtitles for 01_feature-scaling-part-1.en

Afrikaans

Akan

Albanian

Amharic

Arabic

Armenian

Azerbaijani

Basque

Belarusian

Bemba

Bengali

Bihari

Bosnian

Breton

Bulgarian

Cambodian

Catalan

Cebuano

Cherokee

Chichewa

Chinese (Simplified)

Chinese (Traditional)

Corsican

Croatian

Czech

Danish

Dutch

English

Esperanto

Estonian

Ewe

Faroese

Filipino

Finnish

French

Frisian

Galician

Georgian

German

Greek

Guarani

Gujarati

Haitian Creole

Hausa

Hawaiian

Hebrew

Hindi

Hmong

Hungarian

Icelandic

Igbo

Indonesian

Interlingua

Irish

Italian

Japanese

Javanese

Kannada

Kazakh

Kinyarwanda

Kirundi

Kongo

Korean

Krio (Sierra Leone)

Kurdish

Kurdish (Soranî)

Kyrgyz

Laothian

Latin

Latvian

Lingala

Lithuanian

Lozi

Luganda

Luo

Luxembourgish

Macedonian

Malagasy

Malay

Malayalam

Maltese

Maori

Marathi

Mauritian Creole

Moldavian

Mongolian

Myanmar (Burmese)

Montenegrin

Nepali

Nigerian Pidgin

Northern Sotho

Norwegian

Norwegian (Nynorsk)

Occitan

Oriya

Oromo

Pashto

Persian Download

Polish

Portuguese (Brazil)

Portuguese (Portugal)

Punjabi

Quechua

Romanian

Romansh

Runyakitara

Russian

Samoan

Scots Gaelic

Serbian

Serbo-Croatian

Sesotho

Setswana

Seychellois Creole

Shona

Sindhi

Sinhalese

Slovak

Slovenian

Somali

Spanish

Spanish (Latin American)

Sundanese

Swahili

Swedish

Tajik

Tamil

Tatar

Telugu

Thai

Tigrinya

Tonga

Tshiluba

Tumbuka

Turkish

Turkmen

Twi

Uighur

Ukrainian

Urdu

Uzbek

Vietnamese

Welsh

Wolof

Xhosa

Yiddish

Yoruba

Zulu

Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:02,440 --> 00:00:04,040 So welcome back. 2 00:00:04,040 --> 00:00:08,540 Let's take a look at some techniques that make great inter sense work much better. 3 00:00:08,540 --> 00:00:13,000 In this video you see a technique called feature scaling that will enable 4 00:00:13,000 --> 00:00:15,940 gradient descent to run much faster. 5 00:00:15,940 --> 00:00:20,050 Let's start by taking a look at the relationship between the size of 6 00:00:20,050 --> 00:00:23,653 a feature that is how big are the numbers for that feature and 7 00:00:23,653 --> 00:00:26,840 the size of its associated parameter. 8 00:00:26,840 --> 00:00:31,728 As a concrete example, let's predict the price of a house using two 9 00:00:31,728 --> 00:00:37,240 features x1 the size of the house and x2 the number of bedrooms. 10 00:00:37,240 --> 00:00:42,670 Let's say that x1 typically ranges from 300 to 2000 square feet. 11 00:00:42,670 --> 00:00:48,140 And x2 in the data set ranges from 0 to 5 bedrooms. 12 00:00:48,140 --> 00:00:53,609 So for this example, x1 takes on a relatively large range of values and 13 00:00:53,609 --> 00:00:58,140 x2 takes on a relatively small range of values. 14 00:00:58,140 --> 00:01:03,485 Now let's take an example of a house that has a size of 2000 square 15 00:01:03,485 --> 00:01:08,749 feet has five bedrooms and a price of 500k or $500,000. 16 00:01:08,749 --> 00:01:13,464 For this one training example, what do you think 17 00:01:13,464 --> 00:01:20,340 are reasonable values for the size of parameters w1 and w2? 18 00:01:20,340 --> 00:01:23,640 Well, let's look at one possible set of parameters. 19 00:01:23,640 --> 00:01:27,816 Say w1 is 50 and w2 is 0.1 and 20 00:01:27,816 --> 00:01:32,861 b is 50 for the purposes of discussion. 21 00:01:34,040 --> 00:01:39,173 So in this case the estimated price in thousands 22 00:01:39,173 --> 00:01:46,240 of dollars is 100,000k here plus 0.5 k plus 50 k. 23 00:01:46,240 --> 00:01:50,240 Which is slightly over 100 million dollars. 24 00:01:50,240 --> 00:01:57,540 So that's clearly very far from the actual price of $500,000. 25 00:01:57,540 --> 00:02:03,540 And so this is not a very good set of parameter choices for w1 and w2. 26 00:02:03,540 --> 00:02:07,506 Now let's take a look at another possibility. 27 00:02:07,506 --> 00:02:11,039 Say w1 and w2 were the other way around. 28 00:02:11,039 --> 00:02:15,561 W1 is 0.1 and w2 is 50 and b is still also 50. 29 00:02:15,561 --> 00:02:21,285 In this choice of w1 and w2, w1 is relatively small and 30 00:02:21,285 --> 00:02:28,240 w2 is relatively large, 50 is much bigger than 0.1. 31 00:02:28,240 --> 00:02:33,014 So here the predicted price is 0.1 times 32 00:02:33,014 --> 00:02:38,240 2000 plus 50 times five plus 50. 33 00:02:38,240 --> 00:02:45,340 The first term becomes 200k, the second term becomes 250k, and the plus 50. 34 00:02:45,340 --> 00:02:50,422 So this version of the model predicts a price of $500,000 which is a much more 35 00:02:50,422 --> 00:02:55,451 reasonable estimate and happens to be the same price as the true price of the house. 36 00:02:56,540 --> 00:03:01,675 So hopefully you might notice that when a possible range of values of a feature 37 00:03:01,675 --> 00:03:06,819 is large, like the size and square feet which goes all the way up to 2000. 38 00:03:06,819 --> 00:03:11,455 It's more likely that a good model will learn to choose a relatively small 39 00:03:11,455 --> 00:03:14,540 parameter value, like 0.1. 40 00:03:14,540 --> 00:03:18,636 Likewise, when the possible values of the feature are small, 41 00:03:18,636 --> 00:03:22,345 like the number of bedrooms, then a reasonable value for 42 00:03:22,345 --> 00:03:26,340 its parameters will be relatively large like 50. 43 00:03:26,340 --> 00:03:30,240 So how does this relate to grading descent? 44 00:03:30,240 --> 00:03:35,027 Well, let's take a look at the scatter plot of the features 45 00:03:35,027 --> 00:03:39,528 where the size square feet is the horizontal axis x1 and 46 00:03:39,528 --> 00:03:44,640 the number of bedrooms exudes is on the vertical axis. 47 00:03:44,640 --> 00:03:49,416 If you plot the training data, you notice that the horizontal axis is on a much 48 00:03:49,416 --> 00:03:54,061 larger scale or much larger range of values compared to the vertical axis. 49 00:03:55,340 --> 00:04:00,787 Next let's look at how the cost function might look in a contour plot. 50 00:04:00,787 --> 00:04:06,856 You might see a contour plot where the horizontal axis has a much narrower range, 51 00:04:06,856 --> 00:04:11,765 say between zero and one, whereas the vertical axis takes on much 52 00:04:11,765 --> 00:04:15,840 larger values, say between 10 and 100. 53 00:04:15,840 --> 00:04:19,027 So the contours form ovals or ellipses and 54 00:04:19,027 --> 00:04:23,640 they're short on one side and longer on the other. 55 00:04:23,640 --> 00:04:29,115 And this is because a very small change to w1 can have a very large impact 56 00:04:29,115 --> 00:04:34,331 on the estimated price and that's a very large impact on the cost J. 57 00:04:34,331 --> 00:04:38,763 Because w1 tends to be multiplied by a very large number, 58 00:04:38,763 --> 00:04:41,540 the size and square feet. 59 00:04:41,540 --> 00:04:42,544 In contrast, 60 00:04:42,544 --> 00:04:47,908 it takes a much larger change in w2 in order to change the predictions much. 61 00:04:47,908 --> 00:04:54,540 And thus small changes to w2, don't change the cost function nearly as much. 62 00:04:54,540 --> 00:04:56,540 So where does this leave us? 63 00:04:56,540 --> 00:05:01,699 This is what might end up happening if you were to run great in dissent, 64 00:05:01,699 --> 00:05:04,853 if you were to use your training data as is. 65 00:05:04,853 --> 00:05:07,442 Because the contours are so tall and 66 00:05:07,442 --> 00:05:11,930 skinny gradient descent may end up bouncing back and forth for 67 00:05:11,930 --> 00:05:17,340 a long time before it can finally find its way to the global minimum. 68 00:05:17,340 --> 00:05:22,340 In situations like this, a useful thing to do is to scale the features. 69 00:05:22,340 --> 00:05:28,242 This means performing some transformation of your training data so 70 00:05:28,242 --> 00:05:34,990 that x1 say might now range from 0 to 1 and x2 might also range from 0 to 1. 71 00:05:34,990 --> 00:05:39,875 So the data points now look more like this and you might notice that the scale of 72 00:05:39,875 --> 00:05:43,951 the plot on the bottom is now quite different than the one on top. 73 00:05:45,040 --> 00:05:48,045 The key point is that the re scale x1 and 74 00:05:48,045 --> 00:05:54,040 x2 are both now taking comparable ranges of values to each other. 75 00:05:54,040 --> 00:05:58,851 And if you run gradient descent on a cost function to find on this, 76 00:05:58,851 --> 00:06:03,924 re scaled x1 and x2 using this transformed data, then the contours 77 00:06:03,924 --> 00:06:08,980 will look more like this more like circles and less tall and skinny. 78 00:06:08,980 --> 00:06:13,461 And gradient descent can find a much more direct path to the global minimum. 79 00:06:14,640 --> 00:06:18,875 So to recap, when you have different features that take on very different 80 00:06:18,875 --> 00:06:22,632 ranges of values, it can cause gradient descent to run slowly but 81 00:06:22,632 --> 00:06:27,460 re scaling the different features so they all take on comparable range of values. 82 00:06:27,460 --> 00:06:30,640 because speed, upgrade and dissent significantly. 83 00:06:30,640 --> 00:06:31,970 How do you actually do this? 84 00:06:31,970 --> 00:06:34,060 Let's take a look at that in the next video.7743