009 To Standardize or not to Standardize

Tutor: Should we standardize?

I avoided preparing this lecture for a couple of days. Today I was drinking some coffee and explaining to a colleague of mine, "I really wanna elaborate on standardization, but I don't think the students will be interested. Moreover, there is a dispute on the topic." My colleague then looked at me and said, "Then tell that to the students. Show them both sides." And that's how he closed the topic and got rid of me.

So, to standardize or not to standardize? That is the question.

Let's explore a simple example. Here's a scatter plot with four apartments. The X axis shows the size, while the Y axis shows the price. That's a very common regression relationship, but we are doing clustering here, so instead of causality, think about how we can group the four observations.

A is a 500-square-foot apartment that is worth $50,000. B is a 500-square-foot apartment that is worth $100,000. C is a 1,200-square-foot apartment worth $50,000, and D has the same size as C but is twice as expensive.
If we were to create two clusters just by looking at the plot, they are likely to be AB and CD, right?

Now, what if we standardize the X axis, that is, size? Without taking you through the calculations, that's the new situation. The X values of these points are either -1 or 1. How would we group them now? Well, AC and BD look reasonable, right? Yes, they do.

Finally, let's also standardize the Y axis, or price. Now the Y values are only -1 or 1, too. What we see is a perfect square. We have no way of deciding if the clusters should be AB and CD, or AC and BD. So we went from one solution, through a totally different one, to no solution whatsoever.

Why did that happen? The ultimate aim of standardization is to reduce the weight of higher numbers and increase that of lower ones. Now, let's see the first graph once again. If both axes had the same scale, say from 0 to 100,000, we would get something like this, but even more dramatic.
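The standardization of the four apartments can be sketched in Python. This is a minimal sketch, assuming the usual z-score definition (subtract each column's mean, then divide by its population standard deviation); the lecture skips the calculations, so the names `X`, `Z`, and `standardize` are illustrative rather than taken from the course.

```python
import numpy as np

# Apartments A, B, C, D from the lecture: (size in sq ft, price in USD).
labels = ["A", "B", "C", "D"]
X = np.array([
    [500.0, 50_000.0],     # A
    [500.0, 100_000.0],    # B
    [1200.0, 50_000.0],    # C
    [1200.0, 100_000.0],   # D
])

def standardize(data):
    """Z-score each column (assumed definition: population std, ddof=0)."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

Z = standardize(X)

# Print the standardized (size, price) pair for each apartment.
for name, row in zip(labels, Z):
    print(name, row)
```

Under this assumption, every standardized coordinate comes out as either -1 or 1, which is the square described above.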
A K-means algorithm would immediately cluster A with C and B with D, just because the scale of price was so different compared to size in terms of mere numbers. So, scale matters.

Finally, the last graph resulted in a square because there were only two values for each axis. Logically, every rectangle on a graph turns into a square after being standardized. So no matter how I chose the axes or how far off they were from each other, as long as the points were in the shape of a rectangle, the standardized output would've been a square. With that said, by standardizing both axes, we remove the weight introduced by the high price values.

To sum up, if we don't standardize, the range of the values will serve as weights for each variable. Price had much higher values, which would indicate to K-means that price is more important. This would lead to clusters based on price, AC and BD: the Economy Cluster and the Luxury Cluster. Note that the clustering would barely, if at all, care about size.
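To see why the raw price values dominate, one can compare the quantity K-means minimizes, the within-cluster sum of squared distances to each centroid, for the two candidate groupings. The sketch below does not run a K-means library; it only evaluates that objective by hand for AB/CD and AC/BD, on raw and on standardized data. The helper `within_cluster_ss` and the variable names are hypothetical, and the z-score with population standard deviation is an assumption.

```python
import numpy as np

# Apartments from the lecture: (size in sq ft, price in USD).
X = np.array([
    [500.0, 50_000.0],     # A
    [500.0, 100_000.0],    # B
    [1200.0, 50_000.0],    # C
    [1200.0, 100_000.0],   # D
])
IDX = {"A": 0, "B": 1, "C": 2, "D": 3}

def within_cluster_ss(data, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        pts = data[[IDX[m] for m in members]]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

same_size = ["AB", "CD"]    # clusters that share a size
same_price = ["AC", "BD"]   # clusters that share a price

# Raw data: price differences are numerically huge, so grouping by price wins.
raw_same_size = within_cluster_ss(X, same_size)
raw_same_price = within_cluster_ss(X, same_price)

# Standardized data (assumed z-score): the two groupings tie.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_same_size = within_cluster_ss(Z, same_size)
std_same_price = within_cluster_ss(Z, same_price)

print("raw:         ", raw_same_size, raw_same_price)
print("standardized:", std_same_size, std_same_price)
```

On the raw data the AC/BD grouping has a far smaller objective, so size is effectively ignored; after standardizing both axes the two groupings score the same, which matches the "no solution" square.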
So, if we don't standardize, we are not taking advantage of the size data whatsoever. Therefore, it is good practice to standardize the data before clustering, especially for beginners.

The final note I'll leave you with is when you should not standardize. Standardization tries to put all variables on equal footing, and in some cases we don't need to do that. If we know that one variable is inherently more important than another, then standardization shouldn't be used. Our price/size relationship could be one of those. Most people are affected by the price much more than the size, aren't they? If you can't afford the price, you won't care about the size, right?

"How can you know that prior to clustering?" you may ask. Experience plays a big role, so practice when the time comes. We will discuss this a bit more in the next lecture.

Thanks for watching.