All language subtitles for 009 To Standardize or not to Standardize_en

Tutor: Should we standardize? I avoided preparing this lecture for a couple of days. Today I was drinking some coffee and explaining to a colleague of mine, "I really wanna elaborate on standardization, but I don't think the students will be interested. Moreover, there is a dispute on the topic." My colleague then looked at me and said, "Then tell that to the students. Show them both sides." And that's how he closed the topic and got rid of me.

So to standardize or not to standardize? That is the question.

Let's explore a simple example. Here's a scatter plot with four apartments. The X axis shows the size, while the Y axis, the price. That's a very common regression relationship, but we are doing clustering here, so instead of causality, think about how we can group the four observations.

A is a 500 square-foot apartment that is worth $50,000. B is a 500 square-foot apartment that is worth $100,000. C is a 1,200 square-foot apartment worth $50,000, and D has the same size, but is twice as expensive.
If we were to create two clusters, just by looking at the plot, they are likely to be AB and CD, right?

Now, what if we standardize the X axis, size that is? Without taking you through the calculations, that's the new situation. The X values of these points are either minus 1 or 1. How would we group them now? Well, AC and BD look reasonable, right? Yes, they do.

Finally, let's also standardize the Y axis, or price. Now, the Y values are only minus 1 or 1, too. What we see is a perfect square. We have no way of deciding if the clusters should be AB and CD, or AC and BD. So we went from one solution, through a totally different one, to no solution whatsoever.

Why did that happen? The ultimate aim of standardization is to reduce the weight of higher numbers, and increase that of lower ones. Now, let's see the first graph once again. If both axes had the same scale, so from 0 to 100,000, we would get something like this, but even more dramatic.
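The calculations skipped above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the `apartments` dictionary and the `standardize` helper are names introduced here, and the sketch assumes the population standard deviation, which is the choice that gives the minus 1 and 1 values mentioned in the lecture.

```python
from statistics import mean, pstdev

# The four apartments from the lecture: (size in sq ft, price in USD).
apartments = {
    "A": (500, 50_000),
    "B": (500, 100_000),
    "C": (1200, 50_000),
    "D": (1200, 100_000),
}

def standardize(points):
    """Return z-scores per column, using the population standard deviation.

    Hypothetical helper written for this example.
    """
    names = list(points)
    columns = list(zip(*points.values()))  # [(sizes...), (prices...)]
    scaled_columns = []
    for column in columns:
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append([(value - mu) / sigma for value in column])
    # Transpose back to one (size_z, price_z) tuple per apartment.
    return dict(zip(names, zip(*scaled_columns)))

standardized = standardize(apartments)
for name, (size_z, price_z) in standardized.items():
    print(name, size_z, price_z)
```

With only two distinct values per axis, each half of the points sits exactly one standard deviation below the mean and the other half one above, which is why every coordinate lands on minus 1 or 1.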
A K-means algorithm would immediately cluster A with C and B with D, just because the scale of price was so different compared to size, in terms of mere numbers. So, scale matters.

Finally, the last graph resulted in a square because there were only two values for each axis. Logically, every rectangle on a graph after being standardized turns into a square. So no matter how I chose the axes or how far off they were from each other, as long as they were in the shape of a rectangle, the standardized output would've been a square. With that said, by standardizing both axes, we remove the weight introduced by the high price values.

To sum up, if we don't standardize, the range of the values will serve as weights for each variable. Price had much higher values, which would indicate to K-means that price is more important. This would lead to clusters based on price, AC and BD, the Economy Cluster and the Luxury Cluster. Note that the clustering would barely, if at all, care about size.
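To check the claims above numerically, the sketch below compares the two candidate groupings by their within-cluster sum of squares, which is the quantity K-means tries to minimize. It scores the two groupings directly instead of running the iterative K-means algorithm, so it is an illustration of the objective and not a K-means implementation. The names `apartments`, `standardize`, and `wcss` are assumptions made for this example.

```python
from statistics import mean, pstdev

# The four apartments from the lecture: (size in sq ft, price in USD).
apartments = {
    "A": (500, 50_000),
    "B": (500, 100_000),
    "C": (1200, 50_000),
    "D": (1200, 100_000),
}

def standardize(points):
    """Z-score each column with the population standard deviation (assumed)."""
    columns = []
    for column in zip(*points.values()):
        mu, sigma = mean(column), pstdev(column)
        columns.append([(value - mu) / sigma for value in column])
    return dict(zip(points, zip(*columns)))

def wcss(points, grouping):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for cluster in grouping:
        members = [points[name] for name in cluster]
        centroid = [mean(axis) for axis in zip(*members)]
        total += sum(
            (value - center) ** 2
            for member in members
            for value, center in zip(member, centroid)
        )
    return total

by_size = ("AB", "CD")   # apartments grouped by equal size
by_price = ("AC", "BD")  # apartments grouped by equal price

raw = {g: wcss(apartments, g) for g in (by_size, by_price)}
scaled = {g: wcss(standardize(apartments), g) for g in (by_size, by_price)}

print("raw data:         ", raw)
print("standardized data:", scaled)
```

On the raw data the price-based grouping has by far the lower score, so price decides the clusters. After standardizing both axes the two groupings score the same, which is the "perfect square" situation where neither split is preferred.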
So, if we don't standardize, we are not taking advantage of the size data whatsoever. Therefore, it is a good practice to standardize the data before clustering, especially for beginners.

The final note I'll leave you with is when you should not standardize. As standardization is trying to put all variables on equal footing, in some cases, we don't need to do that. If we know that one variable is inherently more important than another, then standardization shouldn't be used. Our price/size relationship could be one of those. Most people are affected by the price much more than the size, aren't they? If you can't afford the price, you won't care about the size, right?

"How can you know that prior to clustering?" you may ask. Experience plays a big role, so practice when the time comes.

We will discuss this a bit more in the next lecture. Thanks for watching.
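Returning to the point that unstandardized clustering barely uses the size data: one way to see it is to measure how much of the squared Euclidean distance between two apartments comes from price. The sketch below does this for apartments A and D, before and after standardization. The helper names `standardize` and `price_share` are introduced here for illustration, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

# The four apartments from the lecture: (size in sq ft, price in USD).
apartments = {
    "A": (500, 50_000),
    "B": (500, 100_000),
    "C": (1200, 50_000),
    "D": (1200, 100_000),
}

def standardize(points):
    """Z-score each column with the population standard deviation (assumed)."""
    columns = []
    for column in zip(*points.values()):
        mu, sigma = mean(column), pstdev(column)
        columns.append([(value - mu) / sigma for value in column])
    return dict(zip(points, zip(*columns)))

def price_share(p, q):
    """Fraction of the squared distance between p and q contributed by price."""
    size_term = (p[0] - q[0]) ** 2
    price_term = (p[1] - q[1]) ** 2
    return price_term / (size_term + price_term)

raw_share = price_share(apartments["A"], apartments["D"])
scaled_points = standardize(apartments)
scaled_share = price_share(scaled_points["A"], scaled_points["D"])

print(f"price share of distance, raw data:          {raw_share:.4%}")
print(f"price share of distance, standardized data: {scaled_share:.4%}")
```

On the raw numbers almost all of the distance comes from price, so size has next to no say in the grouping. After standardization the two variables contribute equally.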
