subtitlecat.com

All language subtitles for 003 Standardization_en

Instructor: The most common problem when working with numerical data is the difference in magnitudes, as we mentioned in the first lesson. An easy fix for this issue is standardization. Other names by which you may have heard this term are feature scaling and normalization. However, normalization could refer to a few additional concepts, even within machine learning, which is why we'll stick with the terms standardization and feature scaling.

Standardization, or feature scaling, is the process of transforming the data we are working with into a standard scale. A very common way to approach this problem is by subtracting the mean and dividing by the standard deviation. In this way, regardless of the dataset, we will always obtain a distribution with a mean of zero and a standard deviation of one, which can easily be proven.

Let's show that with an FX example. Say our algorithm has two input variables: the euro-dollar exchange rate and the daily trading volume. We have three days' worth of observations.
First day, 1.3 and 110,000; second day, 1.34 and 98,700; and third day, 1.25 and 135,000. The first value shows the euro-dollar exchange rate, while the second one shows the daily trading volume.

Let's standardize these figures. We standardize the euro-dollar exchange rates relative to the other euro-dollar exchange rates. So, we look at 1.3, 1.34 and 1.25. The mean is approximately 1.3, while the standard deviation is about 0.045. Going through the above-mentioned transformation, these values become 0.07, 0.96 and -1.03, respectively. Standardizing the trading volumes, we obtain -0.25, -0.85 and 1.1.

In this way, we have forced figures of very different scales to appear similar. That's why another name for standardization is feature scaling. This will ensure our linear combinations treat the two variables equally. Also, it is much easier to make sense of the data. The transformation allowed us to convert the trading volumes from 110,000, 98,700 and 135,000 to -0.25, -0.85 and 1.1.
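
The arithmetic in the FX example can be reproduced with a short Python sketch. The function name `standardize` and the variable names are illustrative and not part of the lesson. The sketch assumes the sample standard deviation (dividing by n - 1), which is the convention that matches the figures quoted in the lesson.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator),
    # which reproduces the numbers quoted in the lesson.
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


# The three days of observations from the lesson.
rates = [1.3, 1.34, 1.25]
volumes = [110_000, 98_700, 135_000]

z_rates = standardize(rates)
z_volumes = standardize(volumes)

print([round(z, 2) for z in z_rates])    # [0.07, 0.96, -1.03]
print([round(z, 2) for z in z_volumes])  # [-0.25, -0.85, 1.1]
```

Each standardized column has a mean of zero and a sample standard deviation of one, so the two features end up on a comparable scale.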
On this scale, the third value is considerably higher than the average, while the first one is around the average. We can confidently say that 135,000 trades per day is a high figure, while 98,700 is low. Please disregard the simplification of having just three observations. That's just an example.

Besides standardization, there are other popular methods, too. We will briefly introduce them without going into too much detail.

Initially, we said that normalization refers to several concepts. One of them, which comes up often in machine learning, consists of converting each sample into a unit-length vector using the L1 or L2 norm.

Another pre-processing method is PCA, which stands for principal components analysis. It is a dimension reduction technique often used when working with several variables referring to the same bigger concept or latent variable.
For instance, if we have data about one's religion, voting history, participation in different associations, and upbringing, we can combine these four to reflect his or her attitude towards immigration. This new variable will normally be standardized, with a mean of zero and a standard deviation of one.

Whitening is another technique frequently used for pre-processing. It is often performed after PCA and removes most of the underlying correlations between data points. Whitening can be useful when, conceptually, the data should be uncorrelated, but that's not reflected in the observations.

We can't cover all the strategies, as each strategy is problem-specific. However, standardization is the most common one and is the one we will employ in the practical examples we will face in this course. In the next lesson, we will see how to deal with categorical data. Thanks for watching.
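
As a rough companion to the unit-length normalization mentioned in this lesson, the sketch below rescales a single sample so that its L1 or L2 norm equals one. The function name `to_unit_length` and the sample values are hypothetical and chosen only for illustration. Leaving an all-zero sample unchanged is an assumption of this sketch, not something the lesson specifies.

```python
import math


def to_unit_length(sample, norm="l2"):
    """Scale one sample (a list of numbers) so that its chosen norm equals 1."""
    if norm == "l1":
        length = sum(abs(x) for x in sample)            # L1 norm: sum of absolute values
    elif norm == "l2":
        length = math.sqrt(sum(x * x for x in sample))  # L2 norm: Euclidean length
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        # Assumption: an all-zero sample has no direction, so return it unchanged.
        return list(sample)
    return [x / length for x in sample]


sample = [3.0, 4.0]  # hypothetical sample, not taken from the lesson
print(to_unit_length(sample, "l2"))  # [0.6, 0.8]
print(to_unit_length(sample, "l1"))  # absolute values now sum to 1
```

Unlike standardization, which rescales each feature across all observations, this operation works on one sample at a time.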