Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:05,290 --> 00:00:06,520
Welcome back everyone.
2
00:00:06,580 --> 00:00:10,930
In this lecture we're going to be talking about cost functions which is going to allow us to measure
3
00:00:11,170 --> 00:00:15,090
how far off we are in the output predictions of our neural network.
4
00:00:15,190 --> 00:00:19,990
And then we'll talk about gradient descent which is going to help us minimize that cost or minimize
5
00:00:19,990 --> 00:00:20,740
that error.
6
00:00:20,890 --> 00:00:24,960
So often cost functions you also hear them label as lost functions or error functions.
7
00:00:25,060 --> 00:00:27,690
So we don't talk about all that really interesting topic here.
8
00:00:27,880 --> 00:00:31,630
And it really fundamentally is gonna help us understand better how the neural network will actually
9
00:00:31,630 --> 00:00:37,870
learn so we already understand that neural networks they taken inputs in that first layer then they
10
00:00:37,870 --> 00:00:43,150
multiply them by weights and then add biases to them and then maybe that gets passed on through an activation
11
00:00:43,150 --> 00:00:48,640
function like the sigmoid function or the rectified linear unit and then that ends up going into another
12
00:00:48,640 --> 00:00:53,020
layer and then you add another set of weights and biases and then so on and so on all the way to that
13
00:00:53,020 --> 00:00:59,870
last output layer so that last output layer we can maybe call that or refer to it as y hat.
14
00:00:59,960 --> 00:01:03,970
That's essentially the model's estimation of what it predicts the label to be.
15
00:01:04,040 --> 00:01:08,600
And so we have two main questions after the network creates that prediction.
16
00:01:08,630 --> 00:01:11,840
How do we actually evaluate it against the true label.
17
00:01:11,840 --> 00:01:16,410
And then after the evaluation how can we update the networks weights and biases.
18
00:01:16,430 --> 00:01:23,000
So really we're going to focus on this lecture is how do we evaluate how far off our prediction is for
19
00:01:23,390 --> 00:01:25,130
updating the networks weights and biases.
20
00:01:25,130 --> 00:01:28,830
We'll have to learn about back propagation which is coming up in a future lecture.
21
00:01:28,850 --> 00:01:34,880
So right now let's focus on that first question so what we need to do is we need to take the estimated
22
00:01:34,970 --> 00:01:40,760
outputs of the network and then compare them to the real values of the label and keep in mind what I'm
23
00:01:40,760 --> 00:01:46,490
referring to right now is happening during the training or fitting portion of the supervised learning
24
00:01:46,490 --> 00:01:47,530
process.
25
00:01:47,540 --> 00:01:53,150
So right now what we're doing is we're using just the training data set that way we can go back and
26
00:01:53,210 --> 00:01:55,970
update our weights and biases during the test set.
27
00:01:55,970 --> 00:01:57,700
We're not really updating weights and biases.
28
00:01:57,710 --> 00:02:01,910
Instead we're just again evaluating overall on the entire dataset.
29
00:02:01,940 --> 00:02:03,710
How does our neural network perform.
30
00:02:03,740 --> 00:02:08,840
Right now we're doing these small evaluations on these training batches in order to actually then go
31
00:02:08,840 --> 00:02:11,190
back and update weights and biases in our network.
32
00:02:11,210 --> 00:02:14,860
So keep that in mind OK.
33
00:02:14,990 --> 00:02:21,380
So in order to actually compare our neural networks output to the true value we're gonna be using what's
34
00:02:21,380 --> 00:02:23,150
known as a cost function.
35
00:02:23,150 --> 00:02:26,630
And this is also often referred to as a lost function or error function.
36
00:02:26,630 --> 00:02:31,400
Essentially it's just something that measures how far off you are from the true value based off your
37
00:02:31,400 --> 00:02:32,570
prediction.
38
00:02:32,600 --> 00:02:39,170
And one important caveat is it should be an average that we can output a single value then you can keep
39
00:02:39,170 --> 00:02:43,400
track of that loss or cost during training to monitor network performance.
40
00:02:43,550 --> 00:02:51,140
So hopefully during each epoch of training your loss or cost goes down down down until you kind of like
41
00:02:51,140 --> 00:03:00,000
converge to some minimum cost value so we're going to here is I want to introduce a couple of variables.
42
00:03:00,000 --> 00:03:05,940
We're gonna be using Y to represent the true value and then we'll be using a to represent the neurons
43
00:03:05,970 --> 00:03:06,600
prediction.
44
00:03:07,290 --> 00:03:14,160
So in terms of weights and biases what we have here is recall we set Z to be a representation of weights
45
00:03:14,160 --> 00:03:20,760
times x plus B and then we pass that Z into an activation function such as the sigmoid function so that
46
00:03:20,760 --> 00:03:22,110
Z passed into the sigmoid.
47
00:03:22,110 --> 00:03:23,530
Is that equal to a.
48
00:03:23,550 --> 00:03:29,400
So all I'm trying to say here is that a represents kind of that final output of a neuron which takes
49
00:03:29,400 --> 00:03:31,490
into account what the activation function was.
50
00:03:31,500 --> 00:03:37,350
And that also then takes into account Z which in turn takes to account w which are the weights and the
51
00:03:37,350 --> 00:03:38,050
biases.
52
00:03:38,070 --> 00:03:41,160
So a just keep that in mind holds a lot of information.
53
00:03:41,310 --> 00:03:47,690
It holds information about the activation function the weights and the biases so probably the most common
54
00:03:47,690 --> 00:03:51,190
cost function you'll see is known as the quadratic cost function.
55
00:03:51,290 --> 00:03:54,680
And if you've done any machine learning this probably looks really familiar to you because it looks
56
00:03:54,680 --> 00:03:58,430
like root mean square error which is kind of essentially what it is.
57
00:03:58,550 --> 00:04:06,080
Here is just not notated for multidimensional data so all we're doing here is that the other day we're
58
00:04:06,080 --> 00:04:10,610
calculating the difference between the real values here we label it y of x.
59
00:04:10,610 --> 00:04:16,340
So if you add some X input we specify Y is the true function so that would be the true value.
60
00:04:16,490 --> 00:04:18,760
And then we're subtracting our predicted values.
61
00:04:18,770 --> 00:04:21,170
So here we have a level of X.
62
00:04:21,350 --> 00:04:28,740
And keep in mind that a of L that notation just signifies that that is the activation function.
63
00:04:28,760 --> 00:04:35,450
Output of the L layer where L is your last layer which means the layer before that in the network is
64
00:04:35,480 --> 00:04:36,470
L minus 1.
65
00:04:36,620 --> 00:04:39,950
Then the layer before that is L minus 2 and so on.
66
00:04:39,950 --> 00:04:44,840
Later on we'll see why it's more convenient to kind of Mark L as your last layer and then work backwards
67
00:04:44,840 --> 00:04:46,820
from there instead of starting from the beginning.
68
00:04:46,940 --> 00:04:55,650
So again AFL of X that's essentially your predicted output so keep in mind the notation again shown
69
00:04:55,650 --> 00:04:58,070
here kind of corresponds to vector inputs and outputs.
70
00:04:58,080 --> 00:05:02,400
Since we're really dealing with a batch of training points and predictions as we go along but the main
71
00:05:02,400 --> 00:05:07,830
idea is you have C as the cost function you're doing some sort of averaging so that's we have one over
72
00:05:08,070 --> 00:05:11,730
two times and so n is the number of points there.
73
00:05:11,790 --> 00:05:17,740
Then you're taking the sum of all those differences and squaring them so really come question is why
74
00:05:17,730 --> 00:05:20,720
are we actually squaring this into this two useful things for us.
75
00:05:20,740 --> 00:05:25,980
1 It keeps everything positive because well you have a positive air or negative error.
76
00:05:26,020 --> 00:05:30,130
If you square it it becomes positive which is good because we want some sort of absolute measurement
77
00:05:30,250 --> 00:05:31,350
of error.
78
00:05:31,450 --> 00:05:35,980
If we weren't squaring this and we had stuff that was negative and positive when you average it out
79
00:05:36,070 --> 00:05:42,010
that could hover around zero which is actually not a true indication of the absolute value or absolute
80
00:05:42,010 --> 00:05:43,480
units of how far off you are.
81
00:05:43,870 --> 00:05:49,120
So squaring it make sure that everything's positive the other and much more important thing it does
82
00:05:49,270 --> 00:05:51,620
is it's gonna punish really large errors.
83
00:05:51,730 --> 00:05:57,250
So sometimes you have some data points that you're gonna be really off on and if you actually square
84
00:05:57,250 --> 00:06:02,610
that error then it exponentially grows as terms of your cost.
85
00:06:02,770 --> 00:06:08,550
So maybe you're off by ten dollars in whatever unit you're trying to measure but your cost is going
86
00:06:08,550 --> 00:06:10,420
to report that and unit squared.
87
00:06:10,440 --> 00:06:13,910
So it's going to say you're off by one hundred instead of just ten.
88
00:06:14,080 --> 00:06:19,570
You're going to really punish your network for being really off on certain points which is good because
89
00:06:19,570 --> 00:06:25,630
you don't want your network to suddenly be able to not predict well on even just a few points where
90
00:06:25,900 --> 00:06:27,370
it gives you a huge error.
91
00:06:27,370 --> 00:06:32,380
You'd rather suffer a little bit on all the other points and not be totally off on those few kind of
92
00:06:32,380 --> 00:06:33,320
edge cases.
93
00:06:33,340 --> 00:06:39,970
So that really helps punish large errors now in general we can think of the cost function as a function
94
00:06:40,060 --> 00:06:47,340
of four main things so the cost function is going to be a function of W which is our neural networks
95
00:06:47,340 --> 00:06:54,060
weights B which is all the biases in our neural network SFR which is the input of a single training
96
00:06:54,060 --> 00:06:58,230
sample and each of our which is the desired output of that training sample.
97
00:06:58,230 --> 00:07:03,300
And that makes a lot of sense because the cost is dependent on what the current weights and biases are.
98
00:07:03,300 --> 00:07:06,950
It's also dependent on what you passed in as the actual training example.
99
00:07:06,950 --> 00:07:12,850
And then it's also dependent on what you're comparing it to which is e of our so notice how that information
100
00:07:12,850 --> 00:07:16,500
was actually all encoded in that simplified notation that we had.
101
00:07:16,510 --> 00:07:21,790
So I showed you this as the cost function and you may be wondering didn't you just say that the cost
102
00:07:21,790 --> 00:07:26,890
function is a function of W and B was W and B in the quadratic function here.
103
00:07:27,190 --> 00:07:32,680
Well it's actually encoded within a of X because remember a of X holds information about weights and
104
00:07:32,680 --> 00:07:39,940
biases because of X is Z passed into the activation function where z then contains information about
105
00:07:39,940 --> 00:07:48,600
W and B Ok so this means that if we have a huge network we can expect the actual cost function to be
106
00:07:48,600 --> 00:07:55,010
really quite complex with a huge vector or tensor of weights and another huge tensor of biases.
107
00:07:56,550 --> 00:08:01,860
So for example if we were just to take a small network and start labeling every way and every bias and
108
00:08:01,860 --> 00:08:07,260
every output as kind of the output of the activation function which is a you can see all the parameters
109
00:08:07,260 --> 00:08:12,090
labeled here you get really complicated really fast and this is a really small network.
110
00:08:12,090 --> 00:08:16,880
There is only kind of four layers here two hidden layers and those hidden layers aren't even that big.
111
00:08:16,900 --> 00:08:20,880
We can see here if we start noticing everything it can become quite complex.
112
00:08:20,880 --> 00:08:28,580
You already have quite a large matrix of weights and biases so how do we actually calculate this.
113
00:08:28,580 --> 00:08:35,520
How do we calculate that cost function and then figure out how to minimize it so in a real case this
114
00:08:35,520 --> 00:08:40,080
means that we have some cost function C dependent on lots of weights that cost function is going to
115
00:08:40,080 --> 00:08:42,950
dependent on the weight of that first input than weight.
116
00:08:42,960 --> 00:08:48,210
Second one weight on third one all the way to W M and we do is when you figure out which particular
117
00:08:48,210 --> 00:08:54,360
weights lead us to the lowest cost because we want to go back here and figure out for all these weights
118
00:08:54,570 --> 00:08:57,590
how do we change them to minimize my cost function.
119
00:08:57,660 --> 00:09:05,420
At the very end so for simplicity and just thinking about this let's imagine that we're dealing with
120
00:09:05,420 --> 00:09:10,400
a really simple network that only has a single weight essentially just one year on.
121
00:09:10,400 --> 00:09:16,400
So what we want to do is we want to minimize our loss or cost essentially our overall error which again
122
00:09:16,430 --> 00:09:22,700
that means we need to figure out what value of W do we use that's going to result in the minimum of
123
00:09:22,790 --> 00:09:31,940
C of value so here is our plot it out really simple cost function where it's a really simple network
124
00:09:32,000 --> 00:09:38,840
it only contains one weight well what we want to do is we want to figure out what value of W minimizes
125
00:09:38,840 --> 00:09:44,910
this cost function and while this is a really simple example you can probably just tell that in order
126
00:09:44,910 --> 00:09:49,690
to minimize the cost function here you can see that the minimum probably fall somewhere where that arrow
127
00:09:49,690 --> 00:09:55,850
is so that's the weight that is going to minimize that cost function.
128
00:09:55,850 --> 00:10:01,580
Which means that's probably the weight we want in the actual neuron or that input to the neuron.
129
00:10:01,580 --> 00:10:06,060
Because that reduces the cost to its minimum.
130
00:10:06,130 --> 00:10:10,640
Now students of calculus know what we could do is we could just take the derivative of this cost function
131
00:10:10,730 --> 00:10:12,840
and then solve for zero.
132
00:10:13,070 --> 00:10:18,650
But recall our real cost function is gonna be super complex and it's not going to be one dimensional
133
00:10:18,680 --> 00:10:20,720
two dimensional and three dimensional.
134
00:10:20,720 --> 00:10:24,670
If you take a look back at that network it's gonna be as many dimensions as there are W..
135
00:10:24,740 --> 00:10:31,060
And that's not even something I can actually plot so again it's going to be n dimensional which means
136
00:10:31,330 --> 00:10:36,810
taking that derivative and setting it equal to zero is actually not going to be you're not gonna be
137
00:10:36,810 --> 00:10:43,160
able to calculate that without spending kind of a thousand years of computational time so our networks
138
00:10:43,250 --> 00:10:46,730
especially when we build out really large networks are gonna have thousands of weights to them hundreds
139
00:10:46,730 --> 00:10:47,480
of weights.
140
00:10:47,480 --> 00:10:49,910
We're not gonna build take that derivative.
141
00:10:49,910 --> 00:10:55,850
So instead what we do is a stochastic process to what we can do is we can use gradient descent to solve
142
00:10:55,850 --> 00:11:02,100
this sort of problem so let's again go back to this kind of simplified version of our network we just
143
00:11:02,100 --> 00:11:05,970
have one wait and see how gradient descent would work on this simple example.
144
00:11:05,970 --> 00:11:10,630
And then we can easily expand it to more complex examples.
145
00:11:10,750 --> 00:11:17,440
So what we do is we just start off at one point on this cost function and then again what we're searching
146
00:11:17,440 --> 00:11:26,300
for here is that w value that minimizes this cost function so what we do is we calculate the slope at
147
00:11:26,300 --> 00:11:34,030
one point and then we move in the downward direction of the slope and you keep repeating this process
148
00:11:35,320 --> 00:11:40,760
until eventually you're going to converge to zero indicating a minimum.
149
00:11:40,830 --> 00:11:45,330
So what we could have done is keep in mind we could've changed our step size to find the next point
150
00:11:46,780 --> 00:11:54,010
so here we kind of took equal step sizes and if you take smaller step sizes it takes longer to find
151
00:11:54,010 --> 00:11:59,430
the minimum if you take larger steps sizes you'll go faster.
152
00:11:59,430 --> 00:12:02,430
But what happens is you risk overshooting the minimum.
153
00:12:02,430 --> 00:12:08,850
So if you go too large of a step size you may actually miss that minimum weight or a minimum weight
154
00:12:08,850 --> 00:12:13,170
tensor and then kind of overshoot and you don't end up converging.
155
00:12:13,170 --> 00:12:16,340
So that's step size is known as the learning rate.
156
00:12:16,380 --> 00:12:19,980
So if you ever see in your own networks that they're editing the learning rate what they're really doing
157
00:12:19,980 --> 00:12:25,560
here is they're editing how fast they're going to try to find that minimum weight value and it works
158
00:12:25,560 --> 00:12:30,930
the same for biases you're finding those minimum weights really the values of the weights and biases
159
00:12:30,960 --> 00:12:33,300
that minimize that cost function.
160
00:12:33,300 --> 00:12:39,050
Now in those previous examples something I should know is that the learning rate was constant.
161
00:12:39,090 --> 00:12:41,670
That is to say each step size was equal.
162
00:12:41,670 --> 00:12:47,670
So regardless of which one we're actually looking at such as the smaller step sizes or the larger step
163
00:12:47,670 --> 00:12:52,570
sizes the actual step is equal for all of these.
164
00:12:52,600 --> 00:12:58,810
Now we can be actually be a little clever and adapt our step size as we go along.
165
00:12:58,850 --> 00:13:04,580
You can imagine that since you're starting off kind of an around the position in this and dimensional
166
00:13:04,580 --> 00:13:10,880
space of possible weights and biases if you start with larger steps well you can do is you can then
167
00:13:11,150 --> 00:13:16,990
go smaller and smaller in your step size as that gradient or that slope gets closer to zero.
168
00:13:17,090 --> 00:13:23,240
And this is known as Adaptive gradient descent depending on that gradient you get back you're going
169
00:13:23,240 --> 00:13:32,150
to adapt your step size so in 2015 Kingman and Bob published a paper called Adam a method first stochastic
170
00:13:32,180 --> 00:13:33,490
optimization.
171
00:13:33,500 --> 00:13:37,570
And Adam is a much more efficient way of searching for these minimums.
172
00:13:37,640 --> 00:13:43,130
So you're going to see us actually kind of sometimes state Adam as our optimizer during the code.
173
00:13:43,160 --> 00:13:49,160
So keep that in mind if you ever see Adam all we're really referring to is this optimized way of performing
174
00:13:49,220 --> 00:13:52,780
this gradient descent where we have kind of this adaptive step side.
175
00:13:52,810 --> 00:13:56,700
So we kind of start off large and then depending on where we are maybe go smaller and smaller.
176
00:13:56,810 --> 00:13:58,810
So you kind of get the best of both worlds.
177
00:13:58,880 --> 00:14:03,500
You get to use the larger step sizes and kind of speed up finding that minimum.
178
00:14:03,500 --> 00:14:07,490
But then as you get closer and closer to it and you don't want to overshoot you can go a smaller step
179
00:14:07,490 --> 00:14:13,280
sizes so you can actually then compare Adam versus other gradient descent algorithms.
180
00:14:13,280 --> 00:14:19,190
And here it's showing you the training cost versus iterations over the entire dataset and you can see
181
00:14:19,190 --> 00:14:23,540
Adam here is outperforming these other adaptive gradient descent algorithms.
182
00:14:23,570 --> 00:14:29,030
So all the ones listed here that are not Adam they're also actually adaptive gradient descents however
183
00:14:29,120 --> 00:14:31,640
Adam performs better than all these.
184
00:14:31,640 --> 00:14:35,450
So we're going to be doing is we'll be using Adam since it's really quite common to use it for neural
185
00:14:35,450 --> 00:14:36,290
networks.
186
00:14:36,290 --> 00:14:41,120
So if you see that all we're doing here is we're kind of saying OK optimize the way we find this minimum
187
00:14:42,940 --> 00:14:48,950
now realistically we're showing you that illustration of gradient descent on just a single W..
188
00:14:48,950 --> 00:14:54,380
So that was kind of a one dimensional W. or really doing is we're calculating gradient descent in an
189
00:14:54,440 --> 00:14:56,710
end dimensional space for all our weights.
190
00:14:56,810 --> 00:15:02,660
Here you can see calculating gradient descent in a two dimensional plane including the weights and the
191
00:15:02,660 --> 00:15:03,680
bias.
192
00:15:03,680 --> 00:15:09,380
Realistically I'm not even gonna be able to illustrate the end dimensional space because it's gonna
193
00:15:09,410 --> 00:15:13,210
be end dimensions of tens or hundreds of weights and biases.
194
00:15:13,250 --> 00:15:15,730
So there's really no way we can illustrate that for you.
195
00:15:15,730 --> 00:15:20,710
Which is why we simplify it down to kind of a single way you can understand gradient descent is doing.
196
00:15:20,900 --> 00:15:24,980
And then what we're actually doing or what the computer will be doing for us is doing the same sort
197
00:15:24,980 --> 00:15:31,830
of calculations that on an n dimensional space that we can't really realistically illustrate for you.
198
00:15:31,880 --> 00:15:36,920
So when dealing with these n dimensional vectors otherwise known as sensors the notation changes from
199
00:15:36,920 --> 00:15:38,390
derivative to gradient.
200
00:15:38,420 --> 00:15:42,720
So that's why you've heard me say a gradient a couple of times instead of the term derivative.
201
00:15:42,770 --> 00:15:47,690
So when we're dealing with n dimensions instead of saying the derivative that actually the correct term
202
00:15:47,690 --> 00:15:48,860
becomes gradient.
203
00:15:48,980 --> 00:15:53,920
And so that means we calculate the gradient of the cost function with respect to all these weights.
204
00:15:53,930 --> 00:15:59,540
So if you see that upside that upside down triangle that's essentially kind of the way you notate gradient
205
00:15:59,660 --> 00:16:01,790
instead of just saying derivative.
206
00:16:01,790 --> 00:16:08,030
Now before we finish up and wrap up this lecture on loss or cost functions and gradient descent on a
207
00:16:08,030 --> 00:16:13,820
quickly I mentioned that for classification problems instead of using the quadratic cost function we
208
00:16:13,820 --> 00:16:17,260
end up often using is the cross entropy loss function.
209
00:16:17,420 --> 00:16:22,520
And what's nice about this cross entropy loss function is that basically what it does it assumes that
210
00:16:22,520 --> 00:16:26,470
your model predicts a probability distribution for each class.
211
00:16:26,780 --> 00:16:32,330
So maybe it has a distribution for class one class two plus three and so on.
212
00:16:32,510 --> 00:16:38,360
And the way this formula actually works out is for a binary classification that is only two classes.
213
00:16:38,450 --> 00:16:45,380
You have the top formula resulting and then those logs actually represent natural logs and then for
214
00:16:45,380 --> 00:16:49,500
any number of classes greater than two you are using the formula below.
215
00:16:49,650 --> 00:16:53,420
And the worry too much about this formula because essentially more coding it out we're just going to
216
00:16:53,420 --> 00:16:55,320
specify use cross entropy.
217
00:16:55,400 --> 00:17:00,620
So when we're performing classification especially multi class classification that's greater than just
218
00:17:00,620 --> 00:17:01,950
binary classification.
219
00:17:02,060 --> 00:17:05,720
We'll be calling upon cross entropy to be our cost function.
220
00:17:06,230 --> 00:17:06,490
Okay.
221
00:17:06,500 --> 00:17:10,320
So just keep that in mind if you ever see us coding and we specify cross entropy.
222
00:17:10,430 --> 00:17:14,990
These are the actual formulas we're specifying the computer to use in order to figure out a probability
223
00:17:14,990 --> 00:17:21,170
distribution for each of those classes so as a quick review we talked about cost functions.
224
00:17:21,180 --> 00:17:22,740
We talked about gradient descent.
225
00:17:22,740 --> 00:17:27,000
We talked about the fact that there's different optimizer is like the atom optimizer and then we also
226
00:17:27,000 --> 00:17:33,590
talked about quadratic cost and cross entropy so far we understand the networks take an input affect
227
00:17:33,590 --> 00:17:38,050
that input of weights and biases and activation functions to produce an estimated output.
228
00:17:38,060 --> 00:17:42,020
Then we learned how to evaluate that output against the true labels.
229
00:17:42,020 --> 00:17:47,150
The last thing you need to do in our theory discussions is the following question.
230
00:17:47,150 --> 00:17:51,470
Once we actually get that cost or loss value we understand how far off we are.
231
00:17:51,470 --> 00:17:56,100
We still haven't really talked about how we go back and adjust all those weights and biases.
232
00:17:56,170 --> 00:18:01,040
Still kind of this magical thing and that magical thing and just the second will no longer be so magical.
233
00:18:01,050 --> 00:18:04,600
So it'll be mathematical and it's called Back propagation.
234
00:18:04,630 --> 00:18:09,650
You essentially propagate backwards through your network and then update all those weights and biases.
235
00:18:09,740 --> 00:18:12,290
So that's exactly we're going to cover in the next lecture.
236
00:18:12,290 --> 00:18:12,770
I'll see you there.
27668
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.