All language subtitles for 01_feature-scaling-part-1.en

So welcome back. Let's take a look at some techniques that make gradient descent work much better. In this video you'll see a technique called feature scaling that will enable gradient descent to run much faster.

Let's start by taking a look at the relationship between the size of a feature, that is, how big the numbers for that feature are, and the size of its associated parameter. As a concrete example, let's predict the price of a house using two features: x1, the size of the house, and x2, the number of bedrooms. Let's say that x1 typically ranges from 300 to 2,000 square feet, and x2 in the data set ranges from 0 to 5 bedrooms. So for this example, x1 takes on a relatively large range of values and x2 takes on a relatively small range of values.

Now let's take an example of a house that has a size of 2,000 square feet, has five bedrooms, and a price of 500k, or $500,000. For this one training example, what do you think are reasonable values for the sizes of the parameters w1 and w2? Well, let's look at one possible set of parameters. Say w1 is 50 and w2 is 0.1 and b is 50, for the purposes of discussion.
So in this case the estimated price in thousands of dollars is 100,000k plus 0.5k plus 50k, which is slightly over 100 million dollars. That's clearly very far from the actual price of $500,000, and so this is not a very good set of parameter choices for w1 and w2.

Now let's take a look at another possibility. Say w1 and w2 were the other way around: w1 is 0.1 and w2 is 50, and b is still 50. In this choice, w1 is relatively small and w2 is relatively large; 50 is much bigger than 0.1. So here the predicted price is 0.1 times 2,000 plus 50 times 5 plus 50. The first term becomes 200k, the second term becomes 250k, plus the 50. So this version of the model predicts a price of $500,000, which is a much more reasonable estimate and happens to be the same as the true price of the house.

So hopefully you notice that when the possible range of values of a feature is large, like the size in square feet, which goes all the way up to 2,000, it's more likely that a good model will learn to choose a relatively small parameter value, like 0.1. Likewise, when the possible values of the feature are small, like the number of bedrooms, then a reasonable value for its parameter will be relatively large, like 50.
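The arithmetic in this example can be sketched in a few lines of Python. This is just the lecture's two parameter sets plugged into a linear model f = w1*x1 + w2*x2 + b, with prices in thousands of dollars:

```python
# Single training example: 2,000 sq ft, 5 bedrooms, true price 500 (thousands of $)
x1, x2, true_price = 2000, 5, 500

def predict(w1, w2, b):
    """Linear model: predicted price in thousands of dollars."""
    return w1 * x1 + w2 * x2 + b

# First guess: a large parameter on the large-range feature
bad = predict(w1=50, w2=0.1, b=50)    # 100,000 + 0.5 + 50 = 100,050.5k, about $100M
# Swapped: small parameter on the large-range feature, large on the small-range one
good = predict(w1=0.1, w2=50, b=50)   # 200 + 250 + 50 = 500k = $500,000

print(bad, good)
```

Notice that it's the product of parameter and feature range, not the parameter alone, that determines each term's contribution to the prediction.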
So how does this relate to gradient descent? Well, let's take a look at a scatter plot of the features, where the size in square feet, x1, is the horizontal axis and the number of bedrooms, x2, is on the vertical axis. If you plot the training data, you notice that the horizontal axis is on a much larger scale, or much larger range of values, compared to the vertical axis.

Next let's look at how the cost function might look in a contour plot. You might see a contour plot where the horizontal axis has a much narrower range, say between 0 and 1, whereas the vertical axis takes on much larger values, say between 10 and 100. So the contours form ovals or ellipses, and they're short on one side and longer on the other. And this is because a very small change to w1 can have a very large impact on the estimated price, and thus a very large impact on the cost J, because w1 tends to be multiplied by a very large number, the size in square feet. In contrast, it takes a much larger change in w2 in order to change the predictions much, and thus small changes to w2 don't change the cost function nearly as much.

So where does this leave us? This is what might end up happening if you were to run gradient descent using your training data as is.
Because the contours are so tall and skinny, gradient descent may end up bouncing back and forth for a long time before it can finally find its way to the global minimum. In situations like this, a useful thing to do is to scale the features. This means performing some transformation of your training data so that x1, say, might now range from 0 to 1, and x2 might also range from 0 to 1. So the data points now look more like this, and you might notice that the scale of the plot on the bottom is now quite different than the one on top. The key point is that the rescaled x1 and x2 now both take on comparable ranges of values to each other. And if you run gradient descent on a cost function defined on this rescaled x1 and x2, using this transformed data, then the contours will look more like circles, less tall and skinny, and gradient descent can find a much more direct path to the global minimum.

So to recap, when you have different features that take on very different ranges of values, it can cause gradient descent to run slowly, but rescaling the different features so they all take on comparable ranges of values can speed up gradient descent significantly. How do you actually do this? Let's take a look at that in the next video.
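As a preview of the how-to, one common way to map each feature into the 0-to-1 range described above is min-max scaling. A minimal sketch, using a small hypothetical training set of (size, bedrooms) pairs:

```python
# Hypothetical training set: (size in sq ft, bedrooms) for four houses
X = [(300, 1), (1000, 2), (1500, 3), (2000, 5)]

def min_max_scale(values):
    """Rescale a list of numbers so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sizes    = min_max_scale([x1 for x1, _ in X])   # each value now in [0, 1]
bedrooms = min_max_scale([x2 for _, x2 in X])   # each value now in [0, 1]
print(sizes)
print(bedrooms)
```

After this transformation both features span the same [0, 1] interval, so no single parameter dominates the cost and the contours become closer to circles.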
