1
00:00:00,940 --> 00:00:04,150
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,150 --> 00:00:09,070
All right, so I hope you're enjoying the tutorials so far. We're nearly done with the intuition; you'll soon,
3
00:00:09,070 --> 00:00:13,390
very soon, get to the practical side of things. We've just got a few little things that we need to cover.
4
00:00:13,510 --> 00:00:20,320
All right, so previously we talked about how we add neural networks into this whole equation of Q-learning
5
00:00:20,350 --> 00:00:25,360
and take Q-learning to the next step and turn it into deep Q-learning.
6
00:00:25,690 --> 00:00:33,130
And today we're going to add an extra important feature, which we will be coding in the practical side
7
00:00:33,130 --> 00:00:39,100
of things, so Hadelin and I decided that it's important for us to cover it off in the intuition side
8
00:00:39,100 --> 00:00:42,430
of things, so that you're more prepared for it when it comes up in the coding side of things.
9
00:00:42,430 --> 00:00:47,950
So as we discussed, we've got the network there; there's two parts that happen.
10
00:00:47,950 --> 00:00:53,110
First of all, there's the learning, so the network actually learns with every new state.
11
00:00:53,270 --> 00:00:58,870
It slowly updates its weights to get better and better and better at dealing with this environment.
12
00:00:58,870 --> 00:01:06,910
And then there is the acting in the state; so after the Q values have been calculated in the state, then
13
00:01:06,970 --> 00:01:08,220
one is selected.
14
00:01:08,230 --> 00:01:14,800
So today we're still going to talk about the learning part. We're going to look at an interesting
15
00:01:14,800 --> 00:01:20,050
feature that's going to help. We didn't come up with this feature ourselves, but we'll talk about
16
00:01:20,080 --> 00:01:29,690
a feature that is very important for deep Q-learning, and that feature is called experience replay.
17
00:01:29,710 --> 00:01:30,030
All right.
18
00:01:30,040 --> 00:01:34,570
So here is our network so we've just copied it over here.
19
00:01:34,570 --> 00:01:39,000
We've got the loss that is calculated at the bottom and is backpropagated through the network.
20
00:01:39,100 --> 00:01:44,770
And let's have a look at an example of what happens to understand the problem that we're dealing with
21
00:01:44,770 --> 00:01:45,670
a bit better.
22
00:01:45,670 --> 00:01:49,120
So here is an example actually from this course.
23
00:01:49,120 --> 00:01:54,820
This is a screenshot exactly from this course; this is what you will be programming.
24
00:01:54,820 --> 00:02:02,170
This is a self-driving car that is driving along this road, and it has to learn
25
00:02:02,170 --> 00:02:03,780
how to navigate this road.
26
00:02:03,820 --> 00:02:09,290
And so, as we discussed previously, what is this state?
27
00:02:09,320 --> 00:02:15,850
And of course the state is not going to be just x1 and x2. Hadelin will describe in a lot more detail what
28
00:02:15,850 --> 00:02:23,650
the state is; it is going to be a couple of parameters which relate to the angle of the car and some
29
00:02:23,650 --> 00:02:26,490
parameters related to what the sensors are reading and so on.
30
00:02:26,490 --> 00:02:29,820
So there's going to be more parameters than that to describe the state.
31
00:02:29,830 --> 00:02:34,120
But nevertheless, it's going to be a vector of values that goes through a neural network, and then on
32
00:02:34,120 --> 00:02:36,520
the output you're going to have some Q values.
33
00:02:36,520 --> 00:02:39,850
Again there'll be a difference depending on the environment.
34
00:02:39,850 --> 00:02:44,380
There can be a different number of possible actions.
35
00:02:44,460 --> 00:02:49,660
But for simplicity's sake we're just going to leave it at four, just for us to be able to understand better
36
00:02:49,660 --> 00:02:50,830
what's going on here.
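To make this concrete, here is a minimal sketch of such a network, assuming PyTorch as used in the practical part of this course; the class name, the layer sizes, and the input size of five are illustrative assumptions, not the course code.

    import torch
    import torch.nn as nn

    class SimpleQNetwork(nn.Module):
        def __init__(self, state_size=5, action_size=4):
            super().__init__()
            # the state is just a vector of values (orientation, sensor readings, ...)
            self.fc1 = nn.Linear(state_size, 30)
            # one output per possible action: the Q value of taking it in this state
            self.fc2 = nn.Linear(30, action_size)

        def forward(self, state):
            hidden = torch.relu(self.fc1(state))
            return self.fc2(hidden)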
37
00:02:50,830 --> 00:02:55,710
So in this case, the question is: what are
38
00:02:55,730 --> 00:03:03,510
the inputs into this neural network, or more specifically, how often do we trigger this neural network?
39
00:03:03,520 --> 00:03:05,080
How often does this neural network run?
40
00:03:05,110 --> 00:03:11,410
Well, every time the car ends up in a new state. So the car makes a move, it ends up in a new state, and
41
00:03:11,530 --> 00:03:12,650
then everything goes:
42
00:03:12,670 --> 00:03:17,410
all that data, all that information about the state goes through the network and gives us the calculated
43
00:03:17,650 --> 00:03:18,200
errors.
44
00:03:18,280 --> 00:03:22,960
This error is calculated based on what we discussed in previous tutorials.
45
00:03:22,990 --> 00:03:26,080
This is backpropagated through, and the weights are updated.
46
00:03:26,080 --> 00:03:32,570
Then the car selects which action it's going to take, makes that move, and ends up in a new state.
47
00:03:32,590 --> 00:03:34,390
Everything starts over again.
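As a rough sketch of that per-state loop (the names here are illustrative, and PyTorch is assumed; how the target Q value and the action selection are computed was covered in the previous tutorials):

    def on_new_state(network, optimizer, state, target_q, select_action):
        q_values = network(state)                   # forward pass on the new state
        loss = ((q_values - target_q) ** 2).mean()  # error against the Q-learning target
        optimizer.zero_grad()
        loss.backward()                             # backpropagate the error
        optimizer.step()                            # update the weights
        return select_action(q_values)              # act, which leads to the next state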
48
00:03:34,450 --> 00:03:39,880
And so basically this happens every time the car is in a new state. Well, have a look at this example.
49
00:03:39,880 --> 00:03:46,240
I specifically took this screenshot because it very well illustrates the problem that is
50
00:03:46,240 --> 00:03:51,430
addressed through experience replay. And experience replay is not just something that we use in this course
51
00:03:51,430 --> 00:03:52,730
or in this specific problem.
52
00:03:52,810 --> 00:03:57,190
It is something that you will see used throughout,
53
00:03:57,340 --> 00:04:04,480
over and over again, in artificial intelligence algorithms, because it is so powerful and
54
00:04:04,480 --> 00:04:05,140
it's so important.
55
00:04:05,140 --> 00:04:11,440
So look at this car. This car, in this problem or in this environment, its goal is to go from here
56
00:04:11,440 --> 00:04:12,440
to here and back.
57
00:04:12,440 --> 00:04:17,540
Its goal is to navigate its way from here to here without crossing these walls, which are made of sand.
58
00:04:17,790 --> 00:04:24,430
And so the car started over here, it went down, and its reward is based on, you know, how close it
59
00:04:24,430 --> 00:04:25,120
is to its goal.
60
00:04:25,120 --> 00:04:29,890
So the car went from here, it went down, and kept going like this, like this, like this, like this, along
61
00:04:29,890 --> 00:04:31,490
this wall, along this wall.
62
00:04:31,570 --> 00:04:34,990
And what is it going to do next? Is it going to turn, or is it going to keep going?
63
00:04:34,990 --> 00:04:37,450
Well, what we want it to do is keep going here.
64
00:04:37,690 --> 00:04:39,490
But let's think about it for a second.
65
00:04:39,580 --> 00:04:44,240
Once it got to this wall, every single time, it moves forward, it moves forward.
66
00:04:44,260 --> 00:04:48,570
It moves forward, it moves forward, moves forward, moves forward, and so on, it moves forward.
67
00:04:48,580 --> 00:04:53,320
So there might be, depending on the structure of the environment, like a hundred moves here, or
68
00:04:53,320 --> 00:04:54,710
50 moves here.
69
00:04:54,990 --> 00:04:59,100
It just keeps moving forward, forward, forward, forward, and nothing changes.
70
00:04:59,160 --> 00:05:03,310
Nothing really changes; it gets further away from the start and closer to its goal.
71
00:05:03,310 --> 00:05:04,060
That's lovely.
72
00:05:04,210 --> 00:05:09,990
But in terms of the surrounding environment, not many things are changing; it's still that same wall.
73
00:05:10,090 --> 00:05:15,460
If you are sitting in the car, you've probably seen the situation when you're driving and whatever you're
74
00:05:15,460 --> 00:05:21,220
seeing, like the environment, is so monotonous that you're just seeing kind of the same thing just
75
00:05:21,220 --> 00:05:21,840
passing by.
76
00:05:21,840 --> 00:05:26,680
Like, imagine you're driving through a desert and you're just seeing the same thing; it's the same
77
00:05:26,680 --> 00:05:29,100
sand, it's the same sand, nothing is happening.
78
00:05:29,100 --> 00:05:30,340
Nothing is changing.
79
00:05:30,550 --> 00:05:36,820
And so, every single time, we're putting that state, that new state, into here.
80
00:05:37,000 --> 00:05:42,010
Yes, of course, something might be changing: as you're driving the car, your GPS is showing you're
81
00:05:42,010 --> 00:05:43,530
closer to your destination.
82
00:05:43,540 --> 00:05:49,300
So one of these inputs is changing, but a lot of these other inputs, the sensors for instance which are
83
00:05:49,300 --> 00:05:55,850
on the car, they're not changing. And therefore, as you're driving along and you put this state, put the inputs,
84
00:05:55,850 --> 00:06:02,380
into your network, here, here, here, here, here, and here, all the time the inputs are pretty
85
00:06:02,380 --> 00:06:03,220
much the same.
86
00:06:03,250 --> 00:06:11,140
And so you keep inputting the same inputs, the same values in the vector, or very similar vectors, into your
87
00:06:11,140 --> 00:06:14,240
network, because there is no variety.
88
00:06:14,320 --> 00:06:16,840
The car will learn one thing very well:
89
00:06:16,870 --> 00:06:22,420
it'll learn very well how to drive along this wall, which is on its right. And so that's how
90
00:06:22,420 --> 00:06:27,970
the network will update, and it will slowly start getting rewarded for driving so well;
91
00:06:27,970 --> 00:06:28,570
it will be like,
92
00:06:28,580 --> 00:06:33,980
"OK, so from here I'll be learning, oh, I'm doing so good, I'm doing better, I'm doing even better."
93
00:06:34,050 --> 00:06:34,420
It'll,
94
00:06:34,480 --> 00:06:41,920
it will have this false perception that it's actually doing very well, even though it only learns
95
00:06:41,920 --> 00:06:47,560
how to drive along this wall. Its neural network will become very adapted to driving along this wall,
96
00:06:47,560 --> 00:06:51,100
and then all of a sudden there's this curve and the car doesn't know what to do.
97
00:06:51,310 --> 00:06:55,240
And it completely doesn't fit in with this neural network.
98
00:06:55,420 --> 00:07:01,870
And even if it does, let's hypothetically say it somehow passes this spot, and then it ends up on this
99
00:07:01,870 --> 00:07:02,250
wall.
100
00:07:02,260 --> 00:07:05,320
The same thing is going to happen: it's going to drive from here, here, here.
101
00:07:05,320 --> 00:07:10,870
OK, now the neural network is restructuring itself to adapt to this wall, and then, bam, this thing happens.
102
00:07:10,900 --> 00:07:15,880
And then even if somehow it gets past that, it will drive past this thing, and then the same thing happens along
103
00:07:15,880 --> 00:07:16,260
these lines.
104
00:07:16,260 --> 00:07:23,590
So basically this is a very vivid example of the problem that we have, which is that, because of the
105
00:07:23,590 --> 00:07:29,770
way we're using the neural network, updating it every single state, we get lots of consecutive states.
106
00:07:29,770 --> 00:07:36,490
They don't even have to be the same, but in environments it's normal that consecutive states
107
00:07:36,880 --> 00:07:44,950
are somehow correlated or somehow interdependent, and we don't want that interdependency to bias
108
00:07:44,980 --> 00:07:45,550
our network.
109
00:07:45,550 --> 00:07:52,600
We don't want the car to just learn how to drive along, like, a straight line, or along a curved line, or
110
00:07:54,100 --> 00:08:01,750
anything that you can think of in life where an agent would be navigating an environment
111
00:08:01,780 --> 00:08:10,570
where we can think of correlated or interdependent states that come one after another, and that can really mess
112
00:08:10,630 --> 00:08:12,130
up your neural network.
113
00:08:12,190 --> 00:08:15,270
That is, if you're just going to let the agent learn from that.
114
00:08:15,430 --> 00:08:17,600
And that's where experience replay comes in.
115
00:08:17,620 --> 00:08:24,850
What happens in experience replay is: these experiences, so these states that it's in, one, two, three,
116
00:08:24,850 --> 00:08:31,040
however many, 50 states here, they don't get put through the network right away.
117
00:08:31,350 --> 00:08:35,980
They are actually saved into the memory of the agent.
118
00:08:36,160 --> 00:08:41,440
And so, for instance, it saves all these, and saves all these, and at some point, once it reaches a certain
119
00:08:41,590 --> 00:08:44,940
threshold, which you'll be able to code, and Hadelin will show you how to do that.
120
00:08:45,100 --> 00:08:51,310
Once it reaches a certain threshold, then the agent decides for itself: OK, it's time to learn.
121
00:08:51,310 --> 00:08:57,580
I have this batch of experiences, and I'm now going to learn from it. And so it randomly
122
00:08:57,580 --> 00:09:04,120
selects from a uniform distribution, and uniformity is important here, because that's something
123
00:09:04,240 --> 00:09:06,460
we'll talk about on the next slide.
124
00:09:06,820 --> 00:09:08,140
Well, we will mention that.
125
00:09:08,140 --> 00:09:12,400
But it takes a uniformly distributed sample.
126
00:09:12,460 --> 00:09:15,660
So basically all experiences are considered to be equal.
127
00:09:15,670 --> 00:09:23,410
It takes a uniformly distributed sample from that batch of experiences that it has and then it goes
128
00:09:23,410 --> 00:09:28,060
through them and learns from them. So it doesn't take all the experiences; it just takes uniformly distributed
129
00:09:28,060 --> 00:09:33,130
samples. It might take a couple from here, a couple from here, a couple from here. And each experience
130
00:09:33,130 --> 00:09:39,940
is characterized by the state it was in, the action that it took, the state it ended up in, and the reward
131
00:09:40,000 --> 00:09:47,110
it achieved through that action in that specific state. So there are four elements in each experience: state
132
00:09:47,110 --> 00:09:53,470
one, action, state two, and reward. And so it takes all those experiences and then it passes them through
133
00:09:53,470 --> 00:09:54,660
the network and it learns.
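A minimal sketch of such a memory, in plain Python; the class and method names are illustrative, not the course code. Each stored experience is the four-tuple just described, and sampling is uniform:

    import random

    class ReplayMemory:
        def __init__(self, capacity=100):
            self.capacity = capacity          # size of the rolling window
            self.memory = []

        def push(self, state, action, next_state, reward):
            self.memory.append((state, action, next_state, reward))
            if len(self.memory) > self.capacity:
                del self.memory[0]            # oldest experience gets kicked out

        def sample(self, batch_size):
            # uniform: every stored experience is equally likely to be picked
            return random.sample(self.memory, batch_size)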
134
00:09:54,660 --> 00:10:05,160
And that way it breaks the pattern of that bias which comes from the sequential nature of the experiences,
135
00:10:05,160 --> 00:10:08,110
which you would get if you were to put them through the network one after the other.
136
00:10:08,340 --> 00:10:11,930
So that's the main focus of experience replay.
137
00:10:11,930 --> 00:10:17,730
That's the problem it addresses. And another benefit of experience replay is that sometimes
138
00:10:17,730 --> 00:10:22,400
in an environment like this you might have very valuable rare experiences.
139
00:10:22,410 --> 00:10:28,340
So for instance, I don't know, let's say, let's look at this corner; this is a right corner.
140
00:10:28,440 --> 00:10:28,730
Right.
141
00:10:28,740 --> 00:10:30,880
And a very sharp one; it is sharp.
142
00:10:30,900 --> 00:10:35,640
So it'll be coming from here assuming it's going to be hugging this corner.
143
00:10:35,640 --> 00:10:40,500
So how many sharp right corners do we have in this whole environment? We're going to have one right corner
144
00:10:40,500 --> 00:10:43,410
here and one right corner here.
145
00:10:43,680 --> 00:10:46,240
Right, so when it's coming this way, that's a right corner.
146
00:10:46,380 --> 00:10:48,630
And then when it's going back it's a sharp right corner here.
147
00:10:48,620 --> 00:10:53,070
And this one's not sharp this way, only sharp the other way, so there's only one opportunity in the whole environment
148
00:10:53,640 --> 00:10:56,770
to learn from a sharp right corner.
149
00:10:56,970 --> 00:11:03,050
And that's a very important experience because it might get really good at driving along straight lines
150
00:11:03,060 --> 00:11:06,990
and get really good at doing, like, soft corners like that and like that.
151
00:11:07,170 --> 00:11:14,070
But then it'll keep messing up this sharp right corner, simply because it doesn't have
152
00:11:14,070 --> 00:11:18,070
that much opportunity to learn from it. And so therefore it will learn everything else very quickly, but
153
00:11:18,070 --> 00:11:20,180
it'll take a long time to learn the right corner.
154
00:11:20,180 --> 00:11:26,010
It's a very simplified example, a very simplified explanation, but it illustrates the concept that
155
00:11:26,280 --> 00:11:30,140
sometimes there are rare experiences which can be valuable.
156
00:11:30,270 --> 00:11:35,880
And if you're just doing a simple neural network where you're putting in your values here and you know
157
00:11:35,880 --> 00:11:40,950
they're going through and you know like even if you forget about that problem of the sequential nature
158
00:11:40,950 --> 00:11:45,690
of experiences and how they can be interdependent and correlated; let's even forget about that
159
00:11:45,680 --> 00:11:46,640
for a second.
160
00:11:46,800 --> 00:11:52,110
What happens is, once you put an experience in, it goes through the network, the network gets updated, and then you instantly
161
00:11:52,120 --> 00:11:53,370
forget about that experience.
162
00:11:53,370 --> 00:11:54,380
You move on to the next one.
163
00:11:54,420 --> 00:11:56,180
That's just how the neural network works.
164
00:11:56,220 --> 00:11:59,710
Then you move on to the next state, the next step, the next step, the next experience, the next experience, that
165
00:11:59,780 --> 00:12:01,170
experience and so on.
166
00:12:01,170 --> 00:12:06,180
So this right corner, as soon as it goes through the network, is gone, and you don't have any memory of that
167
00:12:06,510 --> 00:12:07,450
valuable experience.
168
00:12:07,560 --> 00:12:14,220
Whereas with experience replay, because you're putting these experiences into batches, you can organize
169
00:12:14,220 --> 00:12:19,920
your batch as a rolling window. So for instance you could have, like, a hundred experiences
170
00:12:19,920 --> 00:12:25,920
in your batch. So when it's coming back from here, as soon as it has recorded this experience
171
00:12:25,920 --> 00:12:27,380
in its batch.
172
00:12:27,390 --> 00:12:34,260
Then, like, at some point it takes a uniformly distributed sample from its batch of experiences, and then
173
00:12:34,260 --> 00:12:37,980
there's a rolling window so it forgets these experiences but then it keeps these experiences.
174
00:12:37,980 --> 00:12:44,160
And then again it learns: once it's here, it learns from this batch, and then once it's here, it forgets
175
00:12:44,280 --> 00:12:45,410
all the way up to here.
176
00:12:45,420 --> 00:12:50,550
But then it has a batch of experiences like that, so it can therefore now learn from these experiences.
177
00:12:50,730 --> 00:12:58,380
And that way what you are getting is that this right hand corner might come up several times in its
178
00:12:58,380 --> 00:13:03,480
learning process, because it was in the batch when the batch was like this, around there, then it was
179
00:13:03,480 --> 00:13:08,760
in the batch here, and about here, so it came up in several batches, because the batch might be updated
180
00:13:08,790 --> 00:13:11,430
as a rolling window of experience.
181
00:13:11,430 --> 00:13:15,630
So the older experiences get kicked out, the newer experiences are added, and then again older experiences
182
00:13:15,630 --> 00:13:16,290
get kicked out.
183
00:13:16,440 --> 00:13:23,040
So an experience stays in the batch for quite some time, and the car, or agent, can learn from that
184
00:13:23,040 --> 00:13:24,100
experience several times.
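Continuing the sketch above, the rolling window is what keeps a rare experience, such as the sharp right corner, available for several learning rounds (the threshold of 100 and the batch size of 32 below are illustrative assumptions):

    memory = ReplayMemory(capacity=100)
    # ... the car drives, pushing one experience per move into memory ...
    if len(memory.memory) >= 100:      # threshold reached: time to learn
        batch = memory.sample(32)      # the corner experience may be in this sample
    # a few moves later the window has shifted, but the corner experience is
    # still stored, so a later call to sample(32) can pick it again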
185
00:13:24,210 --> 00:13:27,430
So that's another advantage of experience replay.
186
00:13:27,570 --> 00:13:33,480
And of course the final advantage is experience replay gives you an opportunity to learn from more experiences
187
00:13:34,220 --> 00:13:39,290
than if you're just learning from one at a time, because you have that batch, and it's a
188
00:13:39,300 --> 00:13:46,710
rolling window, and therefore even if your environment is limited in experiences, your experience replay
189
00:13:46,710 --> 00:13:49,260
approach can help you learn faster.
190
00:13:49,410 --> 00:13:55,230
And instead of just redoing them many, many, many times, you can learn faster, because you don't have
191
00:13:55,230 --> 00:13:55,710
to redo it.
192
00:13:55,710 --> 00:13:57,440
You have those experiences saved.
193
00:13:57,810 --> 00:13:59,850
So those are the main advantages of experience replay.
194
00:13:59,910 --> 00:14:01,760
Let's recap on that.
195
00:14:01,840 --> 00:14:09,280
We're breaking that pattern of interdependence and correlation of sequential experiences; we save rare
196
00:14:09,280 --> 00:14:15,640
experiences which might be important, and therefore we can learn from them more often; and we can learn in
197
00:14:16,090 --> 00:14:21,260
environments, we can learn faster in environments where
198
00:14:21,520 --> 00:14:27,310
we have a shortage of experiences, which don't have that many experiences that the agent goes through,
199
00:14:27,310 --> 00:14:29,180
and still we are able to learn there.
200
00:14:29,380 --> 00:14:32,470
So that is what experience replay is all about.
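Putting the two sketches together, here is one hedged example of a learning step that uses the replay memory instead of the latest state alone; the function name and the discount factor gamma are assumptions for illustration, not the course code:

    def learn_from_replay(network, optimizer, memory, batch_size=32, gamma=0.9):
        for state, action, next_state, reward in memory.sample(batch_size):
            q_values = network(state)
            # Q-learning target: reward plus discounted best Q value of the next state
            target = reward + gamma * network(next_state).max().detach()
            loss = (q_values[action] - target) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()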
201
00:14:32,470 --> 00:14:34,530
If you'd like to read a bit more on this,
202
00:14:34,630 --> 00:14:41,290
there's an interesting article published by DeepMind in 2016; it's called Prioritized Experience Replay,
203
00:14:41,560 --> 00:14:44,380
and it talks about why,
204
00:14:44,410 --> 00:14:50,860
why are we using a uniform distribution to select our experiences from the experience batch? Why don't
205
00:14:50,860 --> 00:14:55,870
we find a better way to select our experiences and prioritize some of the experiences which we feel
206
00:14:55,870 --> 00:14:57,160
are important?
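As a hedged sketch of that idea, and not the paper's exact method: give each stored experience a priority, for example based on how surprising its error was, and sample in proportion to it instead of uniformly. The helper below is purely illustrative:

    import random

    def prioritized_sample(experiences, priorities, batch_size):
        # higher priority means more likely to be replayed; sampled with replacement
        return random.choices(experiences, weights=priorities, k=batch_size)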
207
00:14:57,220 --> 00:15:03,880
It's quite an interesting thing, and in this case you will be able to not only
208
00:15:03,880 --> 00:15:11,800
reinforce your knowledge on experience replay but you will actually be able to move with the cutting
209
00:15:11,800 --> 00:15:12,660
edge of technology.
210
00:15:12,660 --> 00:15:15,120
So this is from 2016 and published by DeepMind.
211
00:15:15,120 --> 00:15:21,580
It's a very recent, very powerful paper, so you'll be able to actually explore the limits, or explore even
212
00:15:21,580 --> 00:15:24,530
further this algorithm and take it to the next level.
213
00:15:24,550 --> 00:15:31,270
So I'll leave it up to you to find out why and how we can change the uniform sampling to a different approach
214
00:15:31,270 --> 00:15:33,810
to experience replay from this paper if you'd like.
215
00:15:33,940 --> 00:15:35,270
And I hope you enjoyed this
216
00:15:35,270 --> 00:15:41,020
tutorial. Now we know what experience replay is, and we can confidently use it in our practical tutorials,
217
00:15:41,440 --> 00:15:42,860
and I look forward to seeing you next time.
218
00:15:42,940 --> 00:15:44,550
Until then enjoy AI.