1
00:00:11,060 --> 00:00:18,140
So in this lecture, we are going to discuss error metrics commonly used in time series analysis. Now,
2
00:00:18,140 --> 00:00:21,140
because time series forecasting is essentially regression,
3
00:00:21,470 --> 00:00:25,880
you'll find that if you've ever studied regression, these error metrics are the same as what you've
4
00:00:25,880 --> 00:00:26,930
encountered before.
5
00:00:27,440 --> 00:00:33,260
So one metric that shows up very often in statistics, machine learning, deep learning, engineering
6
00:00:33,260 --> 00:00:35,840
and so forth is the sum of squared errors.
7
00:00:36,710 --> 00:00:43,210
Suppose that we have N predictions, so we have i going from one up to N. For the sum of squared errors,
8
00:00:43,220 --> 00:00:49,310
we simply take the difference between each target y_i and prediction y-hat_i, square that difference, and add
9
00:00:49,310 --> 00:00:50,830
all the squared differences together.
10
00:00:51,260 --> 00:00:51,990
Pretty simple.
11
00:00:52,910 --> 00:00:58,850
The reason why we want to square these differences is because sometimes the prediction may be less than
12
00:00:58,850 --> 00:01:02,780
the target, but other times the target may be less than the prediction.
13
00:01:03,410 --> 00:01:08,420
Since we don't want them to cancel, squaring them ensures that the error is always non-negative.
14
00:01:09,320 --> 00:01:14,840
One bonus to using the squared error is that it coincides with maximizing the Gaussian likelihood.
15
00:01:15,380 --> 00:01:20,590
That is, it's the correct error metric to minimize when your errors are normally distributed.
16
00:01:20,960 --> 00:01:25,670
Since this is a pretty common assumption, the squared error, or the variants of it that we're about
17
00:01:25,670 --> 00:01:28,100
to discuss make a lot of sense to use.
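To make the formula concrete, here is a minimal sketch in Python with NumPy; the arrays y and y_hat are just made-up placeholders for the targets and the predictions, not anything from the course code.

import numpy as np

y = np.array([3.0, 5.0, 2.5])      # targets y_i
y_hat = np.array([2.8, 5.4, 2.0])  # predictions y-hat_i

# Sum of squared errors: sum over i of (y_i - y-hat_i)^2
sse = np.sum((y - y_hat) ** 2)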
18
00:01:32,730 --> 00:01:37,830
Now, one downside to the sum of squared errors is that it depends on the number of data points you
19
00:01:37,830 --> 00:01:41,640
have. Suppose that you have N equals one hundred predictions.
20
00:01:42,000 --> 00:01:47,160
You can imagine that if you have N equals one thousand predictions, this error will be a lot bigger,
21
00:01:47,310 --> 00:01:51,170
simply due to the fact that you had to make ten times more predictions.
22
00:01:51,630 --> 00:01:57,120
So it's not easy to compare, say, two different data sets with a different number of samples using
23
00:01:57,120 --> 00:01:58,470
the sum of squared errors.
24
00:01:59,070 --> 00:02:03,270
However, there is an easy fix for this, which is to use the mean squared error.
25
00:02:03,960 --> 00:02:04,920
It's very simple.
26
00:02:05,160 --> 00:02:08,940
Just divide the sum of squared errors by the number of samples, N.
27
00:02:09,540 --> 00:02:14,130
By doing this, you make the error metric invariant to the number of samples.
28
00:02:14,910 --> 00:02:20,100
One advantage of this is that it serves to represent the sample mean of the squared errors.
29
00:02:20,550 --> 00:02:24,750
That is, it's an estimate of the expected value of the square error.
30
00:02:25,380 --> 00:02:29,580
For many algorithms, their objective is to minimize this expected value.
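Continuing the same illustrative sketch (NumPy imported and the arrays y and y_hat defined as in the earlier snippet), the mean squared error simply divides the SSE by the number of samples N:

# Mean squared error: SSE divided by the number of samples N
n = len(y)
mse = np.sum((y - y_hat) ** 2) / n
# equivalently, in one call:
mse = np.mean((y - y_hat) ** 2)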
31
00:02:34,270 --> 00:02:36,890
So we can build on the squared error a little more.
32
00:02:37,360 --> 00:02:42,610
One downside to both the sum of squared errors and the mean squared error is that they don't have intuitive
33
00:02:42,610 --> 00:02:43,300
units.
34
00:02:44,200 --> 00:02:50,020
Imagine you're forecasting temperature in Kelvins; using the squared error will give you Kelvin
35
00:02:50,200 --> 00:02:50,850
squared.
36
00:02:51,280 --> 00:02:56,620
I don't know about you, but I have no intuition about the meaning of a squared Kelvin or a squared
37
00:02:56,620 --> 00:02:57,490
temperature unit.
38
00:02:58,210 --> 00:03:04,420
So one way to express this error metric in units that make sense is to take the square root of the mean
39
00:03:04,420 --> 00:03:05,070
squared error.
40
00:03:05,650 --> 00:03:10,110
We call this the root mean squared error, or RMSE, for obvious reasons.
41
00:03:10,810 --> 00:03:15,700
The advantage of this error metric is that it's on the same scale as the original data.
42
00:03:16,390 --> 00:03:22,750
So if you're predicting the price of a house, it makes more sense to say my RMSE is one hundred dollars
43
00:03:23,050 --> 00:03:26,510
instead of my MSE is 10,000 square dollars.
44
00:03:27,820 --> 00:03:32,950
Note that this can still be a bit unintuitive when you're comparing numbers. To see this,
45
00:03:32,950 --> 00:03:36,960
consider what happens when you take the square root of a number bigger than one.
46
00:03:37,510 --> 00:03:39,720
In this case, the value will get smaller.
47
00:03:40,270 --> 00:03:44,610
But if we take the square root of a number less than one, the value actually gets bigger.
48
00:03:44,980 --> 00:03:47,340
So it's kind of a strange function in that sense.
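Again as a sketch, using the same illustrative arrays y and y_hat, the root mean squared error is just the square root of the MSE, which puts the error back in the units of the data:

# Root mean squared error: square root of the mean squared error,
# so its units match the data (dollars, not square dollars)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))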
49
00:03:51,940 --> 00:03:57,730
Now, you might wonder, why should we work with squared errors at all? If we want positive values, why
50
00:03:57,730 --> 00:03:59,570
not simply take the absolute value?
51
00:04:00,100 --> 00:04:06,460
In fact, this is entirely possible. If we take the average absolute difference between our targets and
52
00:04:06,460 --> 00:04:09,200
our predictions, we get the mean absolute error.
53
00:04:09,610 --> 00:04:11,430
So this should be pretty intuitive.
54
00:04:11,770 --> 00:04:14,740
You can see right away some advantages of this error metric.
55
00:04:15,190 --> 00:04:20,090
Clearly, it's immediately on the same scale as our data, so there's no need to take a square root.
56
00:04:21,130 --> 00:04:27,670
It also happens to have a probabilistic interpretation. Specifically, whereas the squared error coincides
57
00:04:27,670 --> 00:04:34,150
with optimizing a Gaussian likelihood, the absolute error coincides with optimizing a Laplace-distributed likelihood.
58
00:04:34,840 --> 00:04:37,390
Now the details are outside the scope of this course.
59
00:04:37,720 --> 00:04:43,420
But essentially, if you optimize this loss function, your model will be less influenced by outliers,
60
00:04:43,600 --> 00:04:46,750
which could be a good thing in practice.
61
00:04:46,750 --> 00:04:51,640
Something I find quite interesting is that people will train their model by using the squared error,
62
00:04:51,820 --> 00:04:54,420
but then report the absolute error as a metric.
63
00:04:54,910 --> 00:04:59,650
Some of the libraries we will use in this course won't give you a choice, but it's my opinion that
64
00:04:59,830 --> 00:05:04,870
if you're going to pick some error to minimize, then it makes more sense to also report that error
65
00:05:04,870 --> 00:05:05,440
metric.
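A minimal sketch of the mean absolute error with the same illustrative arrays y and y_hat:

# Mean absolute error: average of |y_i - y-hat_i|,
# already on the same scale as the data
mae = np.mean(np.abs(y - y_hat))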
66
00:05:10,160 --> 00:05:12,590
So let's see how we can take things a little bit further.
67
00:05:13,580 --> 00:05:19,190
One downside to both the mean squared error and the mean absolute error is that they depend on the scale
68
00:05:19,190 --> 00:05:19,850
of the data.
69
00:05:20,630 --> 00:05:25,970
For example, if you're trying to predict house prices, which are on the scale of hundreds of
70
00:05:25,970 --> 00:05:30,660
thousands to millions of dollars, your error will be proportionally large.
71
00:05:31,250 --> 00:05:34,430
On the other hand, suppose you're trying to predict daily stock returns.
72
00:05:34,670 --> 00:05:38,440
These are very minuscule on the order of fractions of a percent.
73
00:05:38,990 --> 00:05:42,500
So it's not straightforward to compare which of these tasks is easier.
74
00:05:42,980 --> 00:05:48,170
Your error for stock returns might be a percent of a percent, but your error for house prices might
75
00:05:48,170 --> 00:05:49,660
be in the thousands of dollars.
76
00:05:50,150 --> 00:05:55,040
But this doesn't imply that predicting house prices is harder than predicting stock returns.
77
00:05:55,640 --> 00:06:00,410
Note that this is unlike tasks like classification, where you're either correct or incorrect.
78
00:06:00,710 --> 00:06:04,720
If you're correct 80 percent of the time, then your accuracy is 80 percent.
79
00:06:05,300 --> 00:06:07,300
And this is the case no matter the data set.
80
00:06:08,180 --> 00:06:12,380
So it seems like it would be pretty useful to have a scale invariant metric.
81
00:06:17,070 --> 00:06:20,530
One common metric that is scale invariant is the R-squared.
82
00:06:21,120 --> 00:06:25,950
Note that the R-squared is not like an error, in that we want it to be bigger, not smaller.
83
00:06:26,640 --> 00:06:32,370
One simple way to express the R-squared is by taking the ratio between the sum of squared errors, called
84
00:06:32,370 --> 00:06:38,340
the SSE, and the total sum of squares, called the SST, and then we subtract that ratio from one.
85
00:06:39,300 --> 00:06:45,690
So to explain this further, the SST is essentially what we would get if our prediction was the mean and
86
00:06:45,690 --> 00:06:48,270
we took the sum of squared errors of that prediction.
87
00:06:49,140 --> 00:06:54,960
Another way to think of this is that if we divide both the top and bottom by N, we get the mean squared
88
00:06:54,960 --> 00:06:57,930
error divided by the sample variance of the targets.
89
00:06:58,980 --> 00:07:02,070
So this is one way to think of how good your model is.
90
00:07:02,430 --> 00:07:07,770
If your model has perfect predictions, then your MSE will be zero and your R-squared will be one.
91
00:07:09,180 --> 00:07:14,400
If your model is terrible and you can only predict the average of the targets, then your MSE will
92
00:07:14,400 --> 00:07:18,450
be equal to the sample variance and you'll have one minus one, which is zero.
93
00:07:19,620 --> 00:07:26,370
So just to drill that in: an R-squared of one is a model with perfect predictions, and an R-squared of zero
94
00:07:26,370 --> 00:07:29,880
is a model that does no better than simply predicting the mean.
95
00:07:31,470 --> 00:07:36,480
Clearly, this is invariant to the scale of the data, which was our original motivation.
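Here is a sketch of the R-squared computed directly from the definition above, again with the illustrative arrays y and y_hat (scikit-learn's r2_score function computes the same quantity):

# R-squared: 1 - SSE / SST,
# where SST uses the mean of the targets as the prediction
sse = np.sum((y - y_hat) ** 2)        # sum of squared errors
sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
r_squared = 1 - sse / sst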
96
00:07:41,090 --> 00:07:44,960
One thing you should note is that it's possible for the R-squared to be negative.
97
00:07:45,440 --> 00:07:50,610
Imagine, for example, your predictions are worse than simply predicting the mean of the targets.
98
00:07:51,170 --> 00:07:56,290
In this case, the differences between y and y-hat will be bigger than those between y and y-bar.
99
00:07:56,960 --> 00:08:02,030
And so the numerator will be bigger than the denominator, and the ratio will be bigger than one.
100
00:08:02,780 --> 00:08:07,040
One minus a number bigger than one will be negative, giving you a negative R squared.
101
00:08:08,240 --> 00:08:11,680
In fact, the R squared is unbounded in the negative direction.
102
00:08:12,170 --> 00:08:16,640
So this is unlike classification accuracy, which must be between zero and one.
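To make that concrete, here is a tiny made-up example (NumPy imported as before) where the predictions are worse than just predicting the mean, so the R-squared comes out negative:

y = np.array([1.0, 2.0, 3.0])      # targets, whose mean is 2.0
y_hat = np.array([4.0, 0.0, 6.0])  # deliberately bad predictions
sse = np.sum((y - y_hat) ** 2)     # 22.0
sst = np.sum((y - y.mean()) ** 2)  # 2.0
r_squared = 1 - sse / sst          # 1 - 11 = -10, well below zero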
103
00:08:21,350 --> 00:08:26,420
It's worth noting that with scikit-learn, which is probably the most popular machine learning library,
104
00:08:26,750 --> 00:08:30,350
the score function computes the R-squared by default for regression.
105
00:08:31,040 --> 00:08:33,990
For classification, you get the classification accuracy.
106
00:08:34,400 --> 00:08:38,060
So just something to keep in mind for later when we use scikit-learn.
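For reference, a minimal scikit-learn sketch; the toy data and the choice of LinearRegression are just placeholders. The regressor's score method returns the R-squared, and r2_score computes the same thing from the predictions directly.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.arange(10, dtype=float).reshape(-1, 1)  # toy inputs
y = 2.0 * X.ravel() + np.random.randn(10)      # toy targets

model = LinearRegression().fit(X, y)
print(model.score(X, y))                # R-squared by default for regressors
print(r2_score(y, model.predict(X)))    # same value, computed explicitly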
107
00:08:42,860 --> 00:08:48,470
OK, so we're still not quite done, since the field of time series analysis for some reason likes to
108
00:08:48,470 --> 00:08:49,790
have lots of metrics.
109
00:08:50,630 --> 00:08:53,570
So we're on the topic of scale invariant metrics.
110
00:08:54,020 --> 00:09:00,740
One obvious way to think of how accurate your model is, is with a percentage. If a house is one million
111
00:09:00,740 --> 00:09:01,250
dollars,
112
00:09:01,250 --> 00:09:03,350
but my prediction is one million
113
00:09:03,350 --> 00:09:04,430
one thousand dollars,
114
00:09:04,580 --> 00:09:08,420
I don't mind because that's only a zero point one percent difference.
115
00:09:09,440 --> 00:09:14,030
On the other hand, if I'm predicting the price of something that costs one thousand dollars and I'm
116
00:09:14,030 --> 00:09:18,280
off by one thousand dollars, that's a huge error because I'm off by 100 percent.
117
00:09:19,460 --> 00:09:23,990
So the mean absolute percentage error, or MAPE, expresses this idea.
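A quick sketch of the MAPE with the same illustrative arrays y and y_hat; note the division by the targets y, which is what causes trouble when a target is zero, as discussed shortly:

# Mean absolute percentage error: average of |y_i - y-hat_i| / |y_i|,
# usually reported as a percentage
mape = np.mean(np.abs((y - y_hat) / y)) * 100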
118
00:09:29,010 --> 00:09:35,130
Now, one downside to the MAPE is that it's not symmetric. As an example, if your target is 10 and your
119
00:09:35,130 --> 00:09:36,150
prediction is 11.
120
00:09:36,630 --> 00:09:40,780
This leads to a different value than when your prediction is 10 and your target is 11.
121
00:09:41,400 --> 00:09:46,890
Of course, some smart person has thought of this already and come up with the symmetric MAPE, or SMAPE.
122
00:09:47,320 --> 00:09:52,530
As you can see, this just takes the average of y and y-hat in the denominator so that the result
123
00:09:52,530 --> 00:09:53,400
is symmetric.
124
00:09:54,000 --> 00:09:57,640
The reason I mention this one is that it shows up in a paper we're going to look at.
125
00:09:57,840 --> 00:09:59,400
So it's nice to cover now.
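A sketch of the symmetric version; exact definitions vary a little between papers, but a common form puts the average of |y| and |y-hat| in the denominator, as described above:

# Symmetric MAPE: denominator is the average of |y_i| and |y-hat_i|
smape = np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2)) * 100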
126
00:10:04,050 --> 00:10:08,680
So despite the MAPE and the SMAPE being somewhat popular, there is one problem.
127
00:10:09,300 --> 00:10:11,770
What happens when the denominator is zero?
128
00:10:12,360 --> 00:10:15,040
The result is that the error explodes to infinity.
129
00:10:15,570 --> 00:10:20,910
Of course, this makes no sense, since the error should not explode to infinity simply because the
130
00:10:20,910 --> 00:10:22,680
data takes on certain values.
131
00:10:23,150 --> 00:10:28,320
It should ideally only explode to infinity if your target and your prediction are very far apart.
132
00:10:29,250 --> 00:10:32,700
Nonetheless, these are popular metrics, so they are worth knowing.
133
00:10:37,430 --> 00:10:42,110
So you must be wondering, what is the point of having so many metrics to choose from?
134
00:10:42,710 --> 00:10:48,110
Well, the goal is to give you exposure to this field so that when you're reading papers or communicating
135
00:10:48,110 --> 00:10:51,100
with other professionals, you share a common language.
136
00:10:51,590 --> 00:10:56,240
And again, a common theme of this course is that there are many options for you to try.
137
00:10:56,630 --> 00:11:01,910
This leads to a combinatorial explosion of options to choose from. In this course,
138
00:11:01,910 --> 00:11:07,760
we will probably never use all of these metrics in the same example; if we tried every technique every
139
00:11:07,760 --> 00:11:09,770
time, this course would never end.
140
00:11:10,220 --> 00:11:15,200
So again, the purpose of this is to make you aware of these tools so that you can apply them in your
141
00:11:15,200 --> 00:11:17,400
work if you think that they would be useful.