Feature scaling, part 2 (English subtitles)

Let's look at how you can implement feature scaling, to take features that take on very different ranges of values and scale them to have comparable ranges of values to each other.

How do you actually scale features? Well, if x_1 ranges from 300 to 2,000, one way to get a scaled version of x_1 is to take each original x_1 value and divide by 2,000, the maximum of the range. The scaled x_1 will range from 0.15 up to one. Similarly, since x_2 ranges from 0 to 5, you can calculate a scaled version of x_2 by taking each original x_2 and dividing by five, which is again the maximum. So the scaled x_2 will now range from 0 to 1. If you plot the scaled x_1 and x_2 on a graph, it might look like this.

In addition to dividing by the maximum, you can also do what's called mean normalization. What this looks like is, you start with the original features and then you re-scale them so that both of them are centered around zero.
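
The divide-by-the-maximum scaling described above can be sketched in a few lines of Python. This is only an illustration: the sample values and the helper name `scale_by_max` are assumptions made for this example, and only the ranges (300 to 2,000 and 0 to 5) come from the lecture.

```python
# Hypothetical sample values; only the endpoints of each range are from the lecture.
x1 = [300, 850, 1200, 2000]
x2 = [0, 1, 3, 5]

def scale_by_max(values):
    """Divide every value by the largest value in the list."""
    largest = max(values)
    return [v / largest for v in values]

x1_scaled = scale_by_max(x1)
x2_scaled = scale_by_max(x2)

print(min(x1_scaled), max(x1_scaled))  # 0.15 1.0
print(min(x2_scaled), max(x2_scaled))  # 0.0 1.0
```
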
Whereas before they only had values greater than zero, now they have both negative and positive values, usually between negative one and plus one.

To calculate the mean normalization of x_1, first find the average, also called the mean, of x_1 on your training set, and let's call this mean Mu_1, with this being the Greek letter Mu. For example, you may find that the average of feature 1, Mu_1, is 600 square feet. Let's take each x_1, subtract the mean Mu_1, and then divide by the difference 2,000 minus 300, where 2,000 is the maximum and 300 the minimum. If you do this, you get the normalized x_1 to range from negative 0.18 to 0.82.

Similarly, to mean normalize x_2, you can calculate the average of feature 2. For instance, Mu_2 may be 2.3. Then you can take each x_2, subtract Mu_2, and divide by 5 minus 0. Again, that's the max, 5, minus the min, which is 0. The mean normalized x_2 now ranges from negative 0.46 to 0.54. If you plot the training data using the mean normalized x_1 and x_2, it might look like this.
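
Here is a minimal sketch of mean normalization as just described: subtract the mean, then divide by the maximum minus the minimum. The sample lists are made up for this example and were chosen only so that their mean, minimum, and maximum match the lecture's numbers (600, 300, 2,000 for x_1 and 2.3, 0, 5 for x_2). The function name `mean_normalize` is likewise an assumption.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by (max - min)."""
    mu = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mu) / spread for v in values]

# Hypothetical training values with mean 600, min 300, max 2000.
x1 = [300, 320, 350, 380, 400, 450, 2000]
# Hypothetical training values with mean 2.3, min 0, max 5.
x2 = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]

x1_norm = mean_normalize(x1)
x2_norm = mean_normalize(x2)

print(round(min(x1_norm), 2), round(max(x1_norm), 2))  # -0.18 0.82
print(round(min(x2_norm), 2), round(max(x2_norm), 2))  # -0.46 0.54
```

Because the mean is subtracted first, the normalized values always average out to zero, which is what "centered around zero" means here.
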
There's one last common re-scaling method, called Z-score normalization. To implement Z-score normalization, you need to calculate something called the standard deviation of each feature. If you don't know what the standard deviation is, don't worry about it, you won't need to know it for this course. Or if you've heard of the normal distribution or the bell-shaped curve, sometimes also called the Gaussian distribution, this is what the standard deviation for the normal distribution looks like. But if you haven't heard of this, you don't need to worry about that either.

But if you do know what the standard deviation is, then to implement Z-score normalization, you first calculate the mean Mu, as well as the standard deviation, which is often denoted by the lowercase Greek letter Sigma, of each feature. For instance, maybe feature 1 has a standard deviation of 450 and mean 600. Then to Z-score normalize x_1, take each x_1, subtract Mu_1, and then divide by the standard deviation, which I'm going to denote as Sigma_1.
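
The Z-score step can be sketched as follows. The function name `z_score_normalize` is made up for this example. Using the population standard deviation (`statistics.pstdev`) when no Sigma is supplied is also an assumption, since the lecture does not say which variant it uses. The call at the bottom plugs in the lecture's Mu_1 = 600 and Sigma_1 = 450 and applies them to the two ends of the x_1 range.

```python
import statistics

def z_score_normalize(values, mu=None, sigma=None):
    """Subtract the mean, then divide by the standard deviation."""
    if mu is None:
        mu = statistics.mean(values)
    if sigma is None:
        # Assumption: population standard deviation.
        sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Smallest and largest x_1 from the lecture, with the lecture's mean and sigma.
z = z_score_normalize([300, 2000], mu=600, sigma=450)
print([round(v, 2) for v in z])  # [-0.67, 3.11]
```
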
What you may find is that the Z-score normalized x_1 now ranges from negative 0.67 to 3.1. Similarly, if you calculate the second feature's standard deviation to be 1.4 and mean to be 2.3, then you can compute x_2 minus Mu_2, divided by Sigma_2, and in this case, the Z-score normalized x_2 might now range from negative 1.6 to 1.9. If you plot the training data on the normalized x_1 and x_2 on a graph, it might look like this.

As a rule of thumb, when performing feature scaling, you might want to aim for getting the features to range from maybe anywhere around negative one to somewhere around plus one for each feature x. But these values, negative one and plus one, can be a little bit loose. If the features range from negative three to plus three or negative 0.3 to plus 0.3, all of these are completely okay. If you have a feature x_1 that winds up being between zero and three, that's not a problem. You can re-scale it if you want, but if you don't re-scale it, it should work okay too.
Or if you have a different feature, x_2, whose values are between negative 2 and plus 0.5, again, that's okay. No harm re-scaling it, but it might be okay if you leave it alone as well.

But if another feature, like x_3 here, ranges from negative 100 to plus 100, then this takes on a very different range of values than, say, something from around negative one to plus one. You're probably better off re-scaling this feature x_3 so that it ranges from something closer to negative one to plus one. Similarly, if you have a feature x_4 that takes on really small values, say between negative 0.001 and plus 0.001, then these values are so small that you may want to re-scale it as well.

Finally, what if your feature x_5, such as measurements of a hospital patient's body temperature, ranges from 98.6 to 105 degrees Fahrenheit? In this case, these values are around 100, which is actually pretty large compared to other scaled features, and this will actually cause gradient descent to run more slowly. In this case, feature re-scaling will likely help.
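
One rough way to turn this rule of thumb into a quick check is sketched below. The cutoffs 0.1 and 10 and the helper name `needs_rescaling` are assumptions picked for this example, not numbers from the lecture. They are chosen only so that the check agrees with the examples above.

```python
def needs_rescaling(low, high, smallest_ok=0.1, largest_ok=10.0):
    """Flag a feature whose largest absolute value is far from about 1.

    The two cutoffs are illustrative assumptions, not fixed rules.
    """
    magnitude = max(abs(low), abs(high))
    return magnitude < smallest_ok or magnitude > largest_ok

# Feature ranges taken from the examples in the lecture.
ranges = {
    "x_1": (0, 3),
    "x_2": (-2, 0.5),
    "x_3": (-100, 100),
    "x_4": (-0.001, 0.001),
    "x_5": (98.6, 105),
}

for name, (low, high) in ranges.items():
    verdict = "rescale" if needs_rescaling(low, high) else "okay as is"
    print(name, verdict)
```

With these cutoffs, x_1 and x_2 are reported as okay, while x_3, x_4, and x_5 are flagged for re-scaling.
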
There's almost never any harm to carrying out feature re-scaling. When in doubt, I encourage you to just carry it out.

That's it for feature scaling. With this little technique, you'll often be able to get gradient descent to run much faster.

With or without feature scaling, when you run gradient descent, how can you know, how can you check if gradient descent is really working, if it is finding you the global minimum or something close to it? In the next video, let's take a look at how to recognize if gradient descent is converging, and then in the video after that, this will lead to a discussion of how to choose a good learning rate for gradient descent.