8. Forecasting Metrics

So in this lecture, we are going to discuss error metrics commonly used in time series analysis. Now, because time series forecasting is essentially regression, you'll find that if you've ever studied regression, these error metrics are the same as what you've encountered before.

One metric that shows up very often in statistics, machine learning, deep learning, engineering and so forth is the sum of squared errors. Suppose that we have n predictions, so we have i going from 1 up to n. For the sum of squared errors, we simply take the difference between each target y_i and its prediction y-hat_i, square that difference, and add all the squared differences together. Pretty simple.

The reason why we want to square these differences is because sometimes the prediction may be less than the target, but other times the target may be less than the prediction. Since we don't want them to cancel, squaring them ensures that the error is always non-negative.

One bonus to using the squared error is that it coincides with maximizing the Gaussian likelihood. That is, it's the correct error metric to minimize when your errors are normally distributed. Since this is a pretty common assumption, the squared error, or the variants of it that we're about to discuss, make a lot of sense to use.

Now, one downside to the sum of squared errors is that it depends on the number of data points you have. Suppose that you have n = 100 predictions. You can imagine that if you have n = 1,000 predictions, this error will be a lot bigger, simply due to the fact that you had to make ten times more predictions. So it's not easy to compare, say, two different data sets with a different number of samples using the sum of squared errors.

However, there is an easy fix for this, which is to use the mean squared error. It's very simple: just divide the sum of squared errors by the number of samples, n. By doing this, you make the error metric invariant to the number of samples. One advantage of this is that it serves to represent the sample mean of the squared errors. That is, it's an estimate of the expected value of the squared error. For many algorithms, their objective is to minimize this expected value.
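To make the distinction concrete, here is a minimal NumPy sketch of both metrics. The array names y_true and y_pred are just placeholders for illustration, not anything from the course code.

```python
import numpy as np

def sum_squared_error(y_true, y_pred):
    # SSE: sum of squared differences between targets and predictions
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sum(residuals ** 2)

def mean_squared_error(y_true, y_pred):
    # MSE: SSE divided by the number of samples n, so it no longer
    # grows just because you made more predictions
    return sum_squared_error(y_true, y_pred) / len(y_true)

# Same per-point error, different number of predictions
y_true_small = np.zeros(100)
y_true_large = np.zeros(1000)
y_pred_small = y_true_small + 0.5
y_pred_large = y_true_large + 0.5

print(sum_squared_error(y_true_small, y_pred_small))   # 25.0
print(sum_squared_error(y_true_large, y_pred_large))   # 250.0 -- 10x bigger
print(mean_squared_error(y_true_small, y_pred_small))  # 0.25
print(mean_squared_error(y_true_large, y_pred_large))  # 0.25 -- invariant to n
```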
So we can build on the squared error a little more. One downside to both the sum of squared errors and the mean squared error is that they don't have intuitive units. Imagine you're forecasting temperature in Kelvins: using the squared error will give you Kelvins squared. I don't know about you, but I have no intuition about the meaning of a squared Kelvin or a squared temperature unit.

So one way to express this error metric in units that make sense is to take the square root of the mean squared error. We call this the root mean squared error, or RMSE, for obvious reasons. The advantage of this error metric is that it's on the same scale as the original data. So if you're predicting the price of a house, it makes more sense to say my RMSE is one hundred dollars instead of my MSE is 10,000 squared dollars.

Note that this can still be a bit unintuitive when you're comparing numbers. To see this, consider what happens when you take the square root of a number bigger than one. In this case, the value will get smaller. But if we take the square root of a number less than one, the value actually gets bigger. So it's kind of a strange function in that sense.

Now, you might wonder, why should we work with squared errors at all? If we want positive values, why not simply take the absolute value? In fact, this is entirely possible. If we take the average absolute difference between our targets and our predictions, we get the mean absolute error. So this should be pretty intuitive, and you can see right away some advantages of this error metric. Clearly, it's immediately on the same scale as our data, so there's no need to take a square root.

It also happens to have a probabilistic interpretation. Specifically, whereas the squared error coincides with optimizing a Gaussian likelihood, the absolute error coincides with optimizing a Laplace-distributed likelihood. Now, the details are outside the scope of this course, but essentially, if you optimize this loss function, your model will be less influenced by outliers, which could be a good thing in practice.

Something I find quite interesting is that people will train their model by using the squared error, but then report the absolute error as a metric. Some of the libraries we will use in this course won't give you a choice, but it's my opinion that if you're going to pick some error to minimize, then it makes more sense to also report that error metric.
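Here is a small sketch of the RMSE and the MAE in the same style. The house-price numbers are hypothetical and only meant to show how a single large miss affects each metric differently.

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    # RMSE: square root of the MSE, so it is back on the scale of the data
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sqrt(np.mean(residuals ** 2))

def mean_absolute_error(y_true, y_pred):
    # MAE: average absolute difference; already on the scale of the data
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = np.array([200_000.0, 350_000.0, 500_000.0])   # e.g. house prices
y_pred = np.array([210_000.0, 340_000.0, 650_000.0])   # one large (outlier) miss

print(root_mean_squared_error(y_true, y_pred))  # ~87,000 -- dominated by the outlier
print(mean_absolute_error(y_true, y_pred))      # ~56,667 -- less influenced by it
```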
So let's see how we can take things a little bit further. One downside to both the mean squared error and the mean absolute error is that they depend on the scale of the data. For example, if you're trying to predict house prices, which are on the scale of hundreds of thousands to millions of dollars, your error will be proportionally large. On the other hand, if you're trying to predict daily stock returns, these are very minuscule, on the order of fractions of a percent. So it's not straightforward to compare which of these tasks is easier. Your error for stock returns might be a fraction of a percent, but your error for house prices might be in the thousands of dollars. But this doesn't imply that predicting house prices is harder than predicting stock returns.

Note that this is unlike tasks like classification, where you're either correct or incorrect. If you're correct 80 percent of the time, then your accuracy is 80 percent, and this is the case no matter the data set. So it seems like it would be pretty useful to have a scale-invariant metric.

One common metric that is scale-invariant is the R-squared. Note that the R-squared is not like an error, in that we want it to be bigger, not smaller. One simple way to express the R-squared is by taking the ratio between the sum of squared errors, called the SSE, and the total sum of squares, called the SST, and then subtracting that ratio from one.

To explain this further, the SST is essentially what we would get if our prediction was the mean and we took the sum of squared errors of that prediction. Another way to think of this is that if we divide both the top and bottom by n, we get the mean squared error divided by the sample variance of the targets.

So this is one way to think of how good your model is. If your model has perfect predictions, then your MSE will be zero and your R-squared will be one. If your model is terrible and can only predict the average of the targets, then your MSE will be equal to the sample variance and you'll have one minus one, which is zero.

So just to drill that in: an R-squared of one is a model with perfect predictions, and an R-squared of zero is a model that does no better than simply predicting the mean. Clearly, this is invariant to the scale of the data, which was our original motivation.
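Here is a minimal sketch of the R-squared computed exactly as one minus SSE over SST, again with made-up arrays; the last line previews the negative case discussed next.

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)          # errors of our model
    sst = np.sum((y_true - y_true.mean()) ** 2)   # errors of "just predict the mean"
    return 1.0 - sse / sst

y = np.array([1.0, 2.0, 3.0, 4.0])

print(r_squared(y, y))                                # 1.0 -- perfect predictions
print(r_squared(y, np.full_like(y, y.mean())))        # 0.0 -- no better than the mean
print(r_squared(y, np.array([4.0, 3.0, 2.0, 1.0])))   # -3.0 -- worse than the mean
```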
One thing you should note is that it's possible for the R-squared to be negative. Imagine, for example, that your predictions are worse than simply predicting the mean of the targets. In this case, the differences between y and y-hat will be bigger than those between y and y-bar, and so the numerator will be bigger than the denominator and the whole ratio will be bigger than one. One minus a number bigger than one will be negative, giving you a negative R-squared. In fact, the R-squared is unbounded in the negative direction. So this is unlike classification accuracy, which must be between zero and one.

It's worth noting that with scikit-learn, which is probably the most popular machine learning library, the score function computes the R-squared by default for regression. For classification, you get the classification accuracy. So that's just something to keep in mind for later when we use scikit-learn.

OK, so we're still not quite done, since the field of time series analysis for some reason likes to have lots of metrics. While we're on the topic of scale-invariant metrics, one obvious way to think of how accurate your model is, is with a percentage. If a house is one million dollars but my prediction is one million one thousand dollars, I don't mind, because that's only a 0.1 percent difference. On the other hand, if I'm predicting the price of something that costs one thousand dollars and I'm off by one thousand dollars, that's a huge error, because I'm off by 100 percent. So the mean absolute percentage error, or MAPE, expresses this idea.

Now, one downside to the MAPE is that it's not symmetric. As an example, if your target is 10 and your prediction is 11, this leads to a different value than when your prediction is 10 and your target is 11. Of course, some smart person has thought of this already and come up with the symmetric MAPE, or sMAPE. This just takes the average of y and y-hat in the denominator so that the result is symmetric. The reason I mention this one is that it shows up in a paper we're going to look at, so it's nice to cover now.

So despite the MAPE and the sMAPE being somewhat popular, there is one problem: what happens when the denominator is zero? The result is that the error explodes to infinity. Of course, this makes no sense, since the error should not explode to infinity simply because the data takes on certain values. It should ideally only explode to infinity if your target and your prediction are very far apart.
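Here is a minimal sketch of the MAPE and the sMAPE as just described, with hypothetical values; note that both divide by zero when the denominator vanishes, which is exactly the problem mentioned above.

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error: average of |y - y_hat| / |y|
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def smape(y_true, y_pred):
    # Symmetric MAPE: denominator is the average of |y| and |y_hat|
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return np.mean(np.abs(y_true - y_pred) / denom)

# MAPE is not symmetric: swapping target and prediction changes the value
print(mape([10.0], [11.0]))   # 0.100
print(mape([11.0], [10.0]))   # 0.0909...

# sMAPE gives the same value either way
print(smape([10.0], [11.0]))  # 0.0952...
print(smape([11.0], [10.0]))  # 0.0952...
```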
Nonetheless, these are popular metrics, so they are worth knowing.

So you must be wondering, what is the point of having so many metrics to choose from? Well, the goal is to give you exposure to this field so that when you're reading papers or communicating with other professionals, you share a common language. And again, a common theme of this course is that there are many options for you to try, which leads to a combinatorial explosion of options to choose from. In this course, we will probably never use all of these metrics in the same example; if we tried every technique every time, this course would never end. So again, the purpose of this is to make you aware of these tools so that you can apply them in your work if you think that they would be useful.
