1
00:00:11,060 --> 00:00:18,140
So in this lecture, we are going to discuss error metrics commonly used in time series analysis. Now,
2
00:00:18,140 --> 00:00:21,140
because time series forecasting is essentially regression,
3
00:00:21,470 --> 00:00:25,880
you'll find that if you've ever studied regression, these error metrics are the same as what you've
4
00:00:25,880 --> 00:00:26,930
encountered before.
5
00:00:27,440 --> 00:00:33,260
So one metric that shows up very often in statistics, machine learning, deep learning, engineering
6
00:00:33,260 --> 00:00:35,840
and so forth is the sum of squared errors.
7
00:00:36,710 --> 00:00:43,210
Suppose that we have N predictions, so we have i going from one up to N. For the sum of squared errors,
8
00:00:43,220 --> 00:00:49,310
we simply take the difference between each target y_i and prediction y-hat_i, square that difference, and add
9
00:00:49,310 --> 00:00:50,830
all the squared differences together.
10
00:00:51,260 --> 00:00:51,990
Pretty simple.
11
00:00:52,910 --> 00:00:58,850
The reason why we want to square these differences is because sometimes the prediction may be less than
12
00:00:58,850 --> 00:01:02,780
the target, but other times the target may be less than the prediction.
13
00:01:03,410 --> 00:01:08,420
Since we don't want them to cancel, squaring them ensures that the error is always non-negative.
14
00:01:09,320 --> 00:01:14,840
One bonus to using the squared error is that it coincides with maximizing the Gaussian likelihood.
15
00:01:15,380 --> 00:01:20,590
That is, it's the correct error metric to minimize when your errors are normally distributed.
16
00:01:20,960 --> 00:01:25,670
Since this is a pretty common assumption, the squared error, or the variants of it that we're about
17
00:01:25,670 --> 00:01:28,100
to discuss make a lot of sense to use.
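To make the formula concrete, here is a minimal sketch in Python with NumPy; the arrays y and y_hat are just made-up placeholders for the targets and the predictions, not anything from the course code.

import numpy as np

y = np.array([3.0, 5.0, 2.5])      # targets y_i
y_hat = np.array([2.8, 5.4, 2.0])  # predictions y-hat_i

# Sum of squared errors: sum over i of (y_i - y-hat_i)^2
sse = np.sum((y - y_hat) ** 2)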
18
00:01:32,730 --> 00:01:37,830
Now, one downside to the sum of squared errors is that it depends on the number of data points you
19
00:01:37,830 --> 00:01:41,640
have. Suppose that you have N equals one hundred predictions.
20
00:01:42,000 --> 00:01:47,160
You can imagine that if you have N equals one thousand predictions, this error will be a lot bigger,
21
00:01:47,310 --> 00:01:51,170
simply due to the fact that you had to make ten times more predictions.
22
00:01:51,630 --> 00:01:57,120
So it's not easy to compare, say, two different data sets with a different number of samples using
23
00:01:57,120 --> 00:01:58,470
the sum of squared errors.
24
00:01:59,070 --> 00:02:03,270
However, there is an easy fix for this, which is to use the mean squared error.
25
00:02:03,960 --> 00:02:04,920
It's very simple.
26
00:02:05,160 --> 00:02:08,940
Just divide the sum of squared errors by the number of samples, N.
27
00:02:09,540 --> 00:02:14,130
By doing this, you make the error metric invariant to the number of samples.
28
00:02:14,910 --> 00:02:20,100
One advantage of this is that it serves to represent the sample mean of the squared errors.
29
00:02:20,550 --> 00:02:24,750
That is, it's an estimate of the expected value of the square error.
30
00:02:25,380 --> 00:02:29,580
For many algorithms, their objective is to minimize this expected value.
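Continuing the same illustrative sketch (NumPy imported and the arrays y and y_hat defined as in the earlier snippet), the mean squared error simply divides the SSE by the number of samples N:

# Mean squared error: SSE divided by the number of samples N
n = len(y)
mse = np.sum((y - y_hat) ** 2) / n
# equivalently, in one call:
mse = np.mean((y - y_hat) ** 2)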
31
00:02:34,270 --> 00:02:36,890
So we can build on the squared error a little more.
32
00:02:37,360 --> 00:02:42,610
One downside to both the sum of squared errors and the mean squared error is that they don't have intuitive
33
00:02:42,610 --> 00:02:43,300
units.
34
00:02:44,200 --> 00:02:50,020
Imagine you're forecasting temperature in Kelvins; using the squared error will give you Kelvin
35
00:02:50,200 --> 00:02:50,850
squared.
36
00:02:51,280 --> 00:02:56,620
I don't know about you, but I have no intuition about the meaning of a squared Kelvin or a squared
37
00:02:56,620 --> 00:02:57,490
temperature unit.
38
00:02:58,210 --> 00:03:04,420
So one way to express this error metric in units that make sense is to take the square root of the mean
39
00:03:04,420 --> 00:03:05,070
squared error.
40
00:03:05,650 --> 00:03:10,110
We call this the root mean squared error, or RMSE, for obvious reasons.
41
00:03:10,810 --> 00:03:15,700
The advantage of this error metric is that it's on the same scale as the original data.
42
00:03:16,390 --> 00:03:22,750
So if you're predicting the price of a house, it makes more sense to say my RMSE is one hundred dollars
43
00:03:23,050 --> 00:03:26,510
instead of my MSE is 10,000 square dollars.
44
00:03:27,820 --> 00:03:32,950
Note that this can still be a bit unintuitive when you're comparing numbers. To see this,
45
00:03:32,950 --> 00:03:36,960
consider what happens when you take the square root of a number bigger than one.
46
00:03:37,510 --> 00:03:39,720
In this case, the value will get smaller.
47
00:03:40,270 --> 00:03:44,610
But if we take the square root of a number less than one, the value actually gets bigger.
48
00:03:44,980 --> 00:03:47,340
So it's kind of a strange function in that sense.
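Again as a sketch, using the same illustrative arrays y and y_hat, the root mean squared error is just the square root of the MSE, which puts the error back in the units of the data:

# Root mean squared error: square root of the mean squared error,
# so its units match the data (dollars, not square dollars)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))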
49
00:03:51,940 --> 00:03:57,730
Now, you might wonder, why should we work with squared errors at all? If we want positive values, why
50
00:03:57,730 --> 00:03:59,570
not simply take the absolute value?
51
00:04:00,100 --> 00:04:06,460
In fact, this is entirely possible. If we take the average absolute difference between our targets and
52
00:04:06,460 --> 00:04:09,200
our predictions, we get the mean absolute error.
53
00:04:09,610 --> 00:04:11,430
So this should be pretty intuitive.
54
00:04:11,770 --> 00:04:14,740
You can see right away some advantages of this error metric.
55
00:04:15,190 --> 00:04:20,090
Clearly, it's immediately on the same scale as our data, so there's no need to take a square root.
56
00:04:21,130 --> 00:04:27,670
It also happens to have a probabilistic interpretation. Specifically, whereas the squared error coincides
57
00:04:27,670 --> 00:04:34,150
with optimizing a Gaussian likelihood, the absolute error coincides with optimizing a Laplace-distributed likelihood.
58
00:04:34,840 --> 00:04:37,390
Now the details are outside the scope of this course.
59
00:04:37,720 --> 00:04:43,420
But essentially, if you optimize this loss function, your model will be less influenced by outliers,
60
00:04:43,600 --> 00:04:46,750
which could be a good thing in practice.
61
00:04:46,750 --> 00:04:51,640
Something I find quite interesting is that people will train their model by using the squared error,
62
00:04:51,820 --> 00:04:54,420
but then report the absolute error as a metric.
63
00:04:54,910 --> 00:04:59,650
Some of the libraries we will use in this course won't give you a choice, but it's my opinion that
64
00:04:59,830 --> 00:05:04,870
if you're going to pick some error to minimize, then it makes more sense to also report that error
65
00:05:04,870 --> 00:05:05,440
metric.
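A minimal sketch of the mean absolute error with the same illustrative arrays y and y_hat:

# Mean absolute error: average of |y_i - y-hat_i|,
# already on the same scale as the data
mae = np.mean(np.abs(y - y_hat))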
66
00:05:10,160 --> 00:05:12,590
So let's see how we can take things a little bit further.
67
00:05:13,580 --> 00:05:19,190
One downside to both the mean squared error and the mean absolute error is that they depend on the scale
68
00:05:19,190 --> 00:05:19,850
of the data.
69
00:05:20,630 --> 00:05:25,970
For example, if you're trying to predict house prices, which are on the scale of hundreds of
70
00:05:25,970 --> 00:05:30,660
thousands to millions of dollars, your error will be proportionally large.
71
00:05:31,250 --> 00:05:34,430
On the other hand, suppose you're trying to predict daily stock returns.
72
00:05:34,670 --> 00:05:38,440
These are very minuscule on the order of fractions of a percent.
73
00:05:38,990 --> 00:05:42,500
So it's not straightforward to compare which of these tasks is easier.
74
00:05:42,980 --> 00:05:48,170
Your error for stock returns might be a percent of a percent, but your error for house prices might
75
00:05:48,170 --> 00:05:49,660
be in the thousands of dollars.
76
00:05:50,150 --> 00:05:55,040
But this doesn't imply that predicting house prices is harder than predicting stock returns.
77
00:05:55,640 --> 00:06:00,410
Note that this is unlike tasks like classification, where you're either correct or incorrect.
78
00:06:00,710 --> 00:06:04,720
If you're correct 80 percent of the time, then your accuracy is 80 percent.
79
00:06:05,300 --> 00:06:07,300
And this is the case no matter the data set.
80
00:06:08,180 --> 00:06:12,380
So it seems like it would be pretty useful to have a scale invariant metric.
81
00:06:17,070 --> 00:06:20,530
One common metric that is scale invariant is the R-squared.
82
00:06:21,120 --> 00:06:25,950
Note that the R-squared is not like an error, in that we want it to be bigger, not smaller.
83
00:06:26,640 --> 00:06:32,370
One simple way to express the R-squared is by taking the ratio between the sum of squared errors, called
84
00:06:32,370 --> 00:06:38,340
the SSE, and the total sum of squares, called the SST, and then we subtract that ratio from one.
85
00:06:39,300 --> 00:06:45,690
So to explain this further, the SST is essentially what we would get if our prediction was the mean and
86
00:06:45,690 --> 00:06:48,270
we took the sum of squared errors of that prediction.
87
00:06:49,140 --> 00:06:54,960
Another way to think of this is that if we divide both the top and bottom by N, we get the mean squared
88
00:06:54,960 --> 00:06:57,930
error divided by the sample variance of the targets.
89
00:06:58,980 --> 00:07:02,070
So this is one way to think of how good your model is.
90
00:07:02,430 --> 00:07:07,770
If your model has perfect predictions, then your MSE will be zero and your R-squared will be one.
91
00:07:09,180 --> 00:07:14,400
If your model is terrible and you can only predict the average of the targets, then your MSE will
92
00:07:14,400 --> 00:07:18,450
be equal to the sample variance and you'll have one minus one, which is zero.
93
00:07:19,620 --> 00:07:26,370
So just to drill that in: an R-squared of one is a model with perfect predictions, and an R-squared of zero
94
00:07:26,370 --> 00:07:29,880
is a model that does no better than simply predicting the mean.
95
00:07:31,470 --> 00:07:36,480
Clearly, this is invariant to the scale of the data, which was our original motivation.
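Here is a sketch of the R-squared computed directly from the definition above, again with the illustrative arrays y and y_hat (scikit-learn's r2_score function computes the same quantity):

# R-squared: 1 - SSE / SST,
# where SST uses the mean of the targets as the prediction
sse = np.sum((y - y_hat) ** 2)        # sum of squared errors
sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
r_squared = 1 - sse / sst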
96
00:07:41,090 --> 00:07:44,960
One thing you should note is that it's possible for the R-squared to be negative.
97
00:07:45,440 --> 00:07:50,610
Imagine, for example, your predictions are worse than simply predicting the mean of the targets.
98
00:07:51,170 --> 00:07:56,290
In this case, the differences between y and y-hat will be bigger than those between y and y-bar.
99
00:07:56,960 --> 00:08:02,030
And so the numerator will be bigger than the denominator, and the ratio will be bigger than one.
100
00:08:02,780 --> 00:08:07,040
One minus a number bigger than one will be negative, giving you a negative R squared.
101
00:08:08,240 --> 00:08:11,680
In fact, the R squared is unbounded in the negative direction.
102
00:08:12,170 --> 00:08:16,640
So this is unlike classification accuracy, which must be between zero and one.
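To make that concrete, here is a tiny made-up example (NumPy imported as before) where the predictions are worse than just predicting the mean, so the R-squared comes out negative:

y = np.array([1.0, 2.0, 3.0])      # targets, whose mean is 2.0
y_hat = np.array([4.0, 0.0, 6.0])  # deliberately bad predictions
sse = np.sum((y - y_hat) ** 2)     # 22.0
sst = np.sum((y - y.mean()) ** 2)  # 2.0
r_squared = 1 - sse / sst          # 1 - 11 = -10, well below zero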
103
00:08:21,350 --> 00:08:26,420
It's worth noting that with scikit-learn, which is probably the most popular machine learning library,
104
00:08:26,750 --> 00:08:30,350
the score function computes the R-squared by default for regression.
105
00:08:31,040 --> 00:08:33,990
For classification, you get the classification accuracy.
106
00:08:34,400 --> 00:08:38,060
So just something to keep in mind for later when we use scikit-learn.
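For reference, a minimal scikit-learn sketch; the toy data and the choice of LinearRegression are just placeholders. The regressor's score method returns the R-squared, and r2_score computes the same thing from the predictions directly.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.arange(10, dtype=float).reshape(-1, 1)  # toy inputs
y = 2.0 * X.ravel() + np.random.randn(10)      # toy targets

model = LinearRegression().fit(X, y)
print(model.score(X, y))                # R-squared by default for regressors
print(r2_score(y, model.predict(X)))    # same value, computed explicitly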
107
00:08:42,860 --> 00:08:48,470
OK, so we're still not quite done, since the field of time series analysis for some reason likes to
108
00:08:48,470 --> 00:08:49,790
have lots of metrics.
109
00:08:50,630 --> 00:08:53,570
So we're on the topic of scale invariant metrics.
110
00:08:54,020 --> 00:09:00,740
One obvious way to think of how accurate your model is, is with a percentage. If a house is one million
111
00:09:00,740 --> 00:09:01,250
dollars,
112
00:09:01,250 --> 00:09:03,350
but my prediction is one million
113
00:09:03,350 --> 00:09:04,430
one thousand dollars,
114
00:09:04,580 --> 00:09:08,420
I don't mind because that's only a zero point one percent difference.
115
00:09:09,440 --> 00:09:14,030
On the other hand, if I'm predicting the price of something that costs one thousand dollars and I'm
116
00:09:14,030 --> 00:09:18,280
off by one thousand dollars, that's a huge error because I'm off by 100 percent.
117
00:09:19,460 --> 00:09:23,990
So the mean absolute percentage error, or MAPE, expresses this idea.
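A quick sketch of the MAPE with the same illustrative arrays y and y_hat; note the division by the targets y, which is what causes trouble when a target is zero, as discussed shortly:

# Mean absolute percentage error: average of |y_i - y-hat_i| / |y_i|,
# usually reported as a percentage
mape = np.mean(np.abs((y - y_hat) / y)) * 100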
118
00:09:29,010 --> 00:09:35,130
Now, one downside to the MAPE is that it's not symmetric. As an example, if your target is 10 and your
119
00:09:35,130 --> 00:09:36,150
prediction is 11.
120
00:09:36,630 --> 00:09:40,780
This leads to a different value than when your prediction is 10 and your target is 11.
121
00:09:41,400 --> 00:09:46,890
Of course, some smart person has thought of this already and come up with the symmetric MAPE, or SMAPE.
122
00:09:47,320 --> 00:09:52,530
As you can see, this just takes the average of y and y-hat in the denominator so that the result
123
00:09:52,530 --> 00:09:53,400
is symmetric.
124
00:09:54,000 --> 00:09:57,640
The reason I mention this one is that it shows up in a paper we're going to look at.
125
00:09:57,840 --> 00:09:59,400
So it's nice to cover now.
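A sketch of the symmetric version; exact definitions vary a little between papers, but a common form puts the average of |y| and |y-hat| in the denominator, as described above:

# Symmetric MAPE: denominator is the average of |y_i| and |y-hat_i|
smape = np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2)) * 100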
126
00:10:04,050 --> 00:10:08,680
So despite the MAPE and the SMAPE being somewhat popular, there is one problem.
127
00:10:09,300 --> 00:10:11,770
What happens when the denominator is zero?
128
00:10:12,360 --> 00:10:15,040
The result is that the error explodes to infinity.
129
00:10:15,570 --> 00:10:20,910
Of course, this makes no sense, since the error should not explode to infinity simply because the
130
00:10:20,910 --> 00:10:22,680
data takes on certain values.
131
00:10:23,150 --> 00:10:28,320
It should ideally only explode to infinity if your target and your prediction are very far apart.
132
00:10:29,250 --> 00:10:32,700
Nonetheless, these are popular metrics, so they are worth knowing.
133
00:10:37,430 --> 00:10:42,110
So you must be wondering, what is the point of having so many metrics to choose from?
134
00:10:42,710 --> 00:10:48,110
Well, the goal is to give you exposure to this field so that when you're reading papers or communicating
135
00:10:48,110 --> 00:10:51,100
with other professionals, you share a common language.
136
00:10:51,590 --> 00:10:56,240
And again, a common theme of this course is that there are many options for you to try.
137
00:10:56,630 --> 00:11:01,910
This leads to a combinatorial explosion of options to choose from. In this course,
138
00:11:01,910 --> 00:11:07,760
we will probably never use all of these metrics in the same example; if we tried every technique every
139
00:11:07,760 --> 00:11:09,770
time, this course would never end.
140
00:11:10,220 --> 00:11:15,200
So again, the purpose of this is to make you aware of these tools so that you can apply them in your
141
00:11:15,200 --> 00:11:17,400
work if you think that they would be useful.