1
00:00:00,980 --> 00:00:04,960
Hello and welcome back to the course on artificial intelligence.
2
00:00:05,000 --> 00:00:12,140
Previously we had quite a strenuous and long tutorial on Markov decision processes, and hopefully you
3
00:00:12,200 --> 00:00:13,710
got along well with that.
4
00:00:13,760 --> 00:00:19,010
And hopefully I could explain things in an approachable and engaging way.
5
00:00:19,130 --> 00:00:22,750
And today we're going to talk about policies versus plans.
6
00:00:22,760 --> 00:00:27,910
This will be a quick and fun tutorial, because now we're entering a new world. We're entering into
7
00:00:27,910 --> 00:00:34,310
a world of stochastic search, non-deterministic search, where you're not just getting through the maze but
8
00:00:34,310 --> 00:00:38,990
also accounting for random factors that might just hit you in the head when you're going through this
9
00:00:38,990 --> 00:00:41,080
maze and you need to be prepared for it.
10
00:00:41,080 --> 00:00:42,070
That's the world
11
00:00:42,080 --> 00:00:48,640
our agent is living in, and it's more fun, but it's also more dangerous; it's less predictable.
12
00:00:48,650 --> 00:00:50,880
So how is our agent going to behave?
13
00:00:50,960 --> 00:00:52,280
Let's have a look.
14
00:00:52,280 --> 00:00:58,190
Here's our Markov decision process framework, which is once again our favorite Bellman equation.
15
00:00:58,250 --> 00:01:02,010
However, it's the more advanced version of the Bellman equation that we are working with.
16
00:01:02,010 --> 00:01:04,760
So from now on we're just going to call this the Bellman equation.
17
00:01:04,760 --> 00:01:10,970
And here we've got our maximum across all actions, so the value of a state, any state s, is the maximum
18
00:01:10,970 --> 00:01:14,020
across all actions an agent could possibly perform in that state.
19
00:01:14,120 --> 00:01:21,230
And the maximum is taken of the reward that the agent will get by performing action a in state s, plus
20
00:01:21,230 --> 00:01:26,590
a discount factor multiplied by the expected value of the new state it will be in.
21
00:01:26,830 --> 00:01:31,850
And the expected value is taken here because the agent doesn't know exactly what state it'll end up in.
22
00:01:31,880 --> 00:01:40,390
There are some random effects that are present in the environment that might alter the state, and it
23
00:01:40,800 --> 00:01:42,630
might not end up in the desired state.
24
00:01:42,640 --> 00:01:44,200
It might end up in a different state.
25
00:01:44,210 --> 00:01:47,760
That's why we're taking the expected value over here somewhere here.
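For reference, the equation described in words above can be written out as follows. This is a sketch of the standard form; the symbols are assumptions, since the slide itself is not part of this transcript: V is the value of a state, R(s, a) the reward for taking action a in state s, \gamma the discount factor, and P(s, a, s') the probability of landing in state s' after taking action a in state s.

```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Big)
```

The sum over s' is the expected value of the next state, which is what replaces the single known next state of the deterministic case.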
26
00:01:47,990 --> 00:01:53,750
So let's have a look at this in our example of the maze.
27
00:01:53,750 --> 00:02:00,220
So this is what we had previously. Previously we were dealing with deterministic search.
28
00:02:00,230 --> 00:02:01,960
So we knew that.
29
00:02:01,970 --> 00:02:05,550
All right so if I'm here I definitely need to go here if I'm here.
30
00:02:05,570 --> 00:02:09,030
I definitely need to go here; if I'm here I definitely need to go here; and if I'm here, I go here.
31
00:02:09,140 --> 00:02:11,360
So it was all pretty straightforward.
32
00:02:11,480 --> 00:02:14,680
Once you have this map, and remember, we called it a plan.
33
00:02:14,690 --> 00:02:18,050
Once you have the plan it's pretty straightforward to do.
34
00:02:18,050 --> 00:02:18,990
There we go.
35
00:02:18,990 --> 00:02:20,490
So that's the plan with arrows.
36
00:02:20,580 --> 00:02:25,000
And from here it was very straightforward: these are the routes that the agent will take. Whenever
37
00:02:25,010 --> 00:02:26,210
you start on this blue line.
38
00:02:26,210 --> 00:02:28,210
That's exactly the way you'd go.
39
00:02:28,680 --> 00:02:31,120
However now we don't have a plan anymore.
40
00:02:31,120 --> 00:02:38,060
We can't have a plan because, you know, whatever we plan might not happen; it's not under our control. A plan
41
00:02:38,060 --> 00:02:40,940
is when you know exactly what you need to do next.
42
00:02:40,940 --> 00:02:41,820
You know the steps.
43
00:02:41,840 --> 00:02:46,640
So you have a starting point, you have a goal, and you know every single step, so you can plan
44
00:02:46,640 --> 00:02:50,500
them out, like: I'll do this one, I'll do this one, I'll do this. Like in life, a plan.
45
00:02:50,630 --> 00:02:54,870
But at the same time there's now so much randomness going on.
46
00:02:54,890 --> 00:03:00,080
You can't have a plan, because what if you get here and then you click to go right and it actually takes
47
00:03:00,080 --> 00:03:00,560
you down.
48
00:03:00,680 --> 00:03:02,100
So that's not part of your plan.
49
00:03:02,390 --> 00:03:04,120
So that's why it's not called a plan anymore.
50
00:03:04,220 --> 00:03:09,080
And here we're going to calculate the values, or actually we're just going to look at the calculated values
51
00:03:09,410 --> 00:03:11,990
for this same problem.
52
00:03:12,080 --> 00:03:16,700
But given that we now have this randomness inside.
53
00:03:16,700 --> 00:03:18,380
So these are the new values.
54
00:03:18,800 --> 00:03:22,840
And so why are these values different? Let's just compare to what we had previously.
55
00:03:22,850 --> 00:03:24,710
This is what we had previously.
56
00:03:24,710 --> 00:03:25,650
These are the new ones.
57
00:03:25,660 --> 00:03:29,750
So once again we had previously because he won 3.9 percent.
58
00:03:29,770 --> 00:03:31,590
He was really 366.
59
00:03:31,790 --> 00:03:36,750
And this is what we have now a a a less than once in force and 1 6 3.
60
00:03:36,800 --> 00:03:43,850
And by the way, these are not exactly the correct values, they're off the top of my head, but if we were to run
61
00:03:43,850 --> 00:03:49,220
an agent, the values would be something similar to this, and the values could change depending
62
00:03:49,220 --> 00:03:54,650
on the gamma that you would choose, 0.9 or another value, but nevertheless, for argument's sake, these
63
00:03:54,650 --> 00:04:00,560
are the values that we're dealing with now, and they're approximate; they convey the whole notion in the
64
00:04:00,560 --> 00:04:02,270
correct way. So let's have a look at them.
65
00:04:02,270 --> 00:04:03,240
Why have they changed?
66
00:04:03,410 --> 00:04:07,480
Well, with this one here, the value was one.
67
00:04:07,490 --> 00:04:10,520
Why is it all of a sudden 0.26? Why is it less than one?
68
00:04:10,560 --> 00:04:11,730
You just go from here to here.
69
00:04:11,930 --> 00:04:18,620
Well, actually we can't, because from here if we go right, which is our intention,
70
00:04:18,640 --> 00:04:22,340
we actually have a 10 percent chance we'd end up here.
71
00:04:22,340 --> 00:04:25,130
So we'd hit the wall and we'd be back in this state.
72
00:04:25,130 --> 00:04:30,740
And remember we have a gamma, so the value would be discounted. Or there's a 10 percent chance
73
00:04:30,740 --> 00:04:32,150
we'd end up here in this state.
74
00:04:32,150 --> 00:04:37,670
So it's not a 100 percent probability that I would get here, so therefore this value can no longer be one; it's
75
00:04:37,670 --> 00:04:41,310
something less and it's 0.26.
76
00:04:41,570 --> 00:04:43,770
So that's an example of why it's like this.
77
00:04:43,770 --> 00:04:49,130
And you could get the exact value if you calculated the Bellman equation, the full Bellman equation that
78
00:04:49,130 --> 00:04:49,850
we have now.
79
00:04:49,850 --> 00:04:53,540
The only problem is that there's going to be some recursion because you would need to know the value
80
00:04:53,540 --> 00:04:57,440
for this, and then you'd need to know the value for this. That is quite complex, and that's why we're not
81
00:04:57,440 --> 00:04:59,180
doing the calculations manually here.
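The recursion mentioned above is usually resolved by repeatedly sweeping over all states until the values stop changing (value iteration). The following is a minimal sketch under stated assumptions, not the course's own code: the 4x3 grid layout, the wall position, the +1 and -1 cells, the reward being paid on entering those cells, gamma = 0.9, and all function names are illustrative choices.

```python
# Illustrative 4x3 grid; the layout, rewards, and gamma are assumptions.
COLS, ROWS = 4, 3
WALL = {(1, 1)}
TERMINAL_REWARD = {(3, 2): 1.0, (3, 1): -1.0}  # goal and fire pit
GAMMA = 0.9
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# With probability 0.1 each, the agent slips to either side of its intended move.
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}


def step(state, direction):
    """Move one cell; bumping into a wall or the border leaves the agent in place."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < COLS and 0 <= nxt[1] < ROWS
    return nxt if inside and nxt not in WALL else state


def q_value(V, state, action):
    """Expected reward plus discounted value of the next state for one action."""
    outcomes = ((action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1))
    total = 0.0
    for direction, prob in outcomes:
        nxt = step(state, direction)
        reward = TERMINAL_REWARD.get(nxt, 0.0)  # paid on entering goal or pit
        total += prob * (reward + GAMMA * V[nxt])
    return total


def value_iteration(sweeps=200):
    """Apply the Bellman update to every state, many times over."""
    states = [(x, y) for x in range(COLS) for y in range(ROWS) if (x, y) not in WALL]
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: 0.0 if s in TERMINAL_REWARD else max(q_value(V, s, a) for a in MOVES)
             for s in states}
    return V


V = value_iteration()
for y in reversed(range(ROWS)):
    print(" ".join(f"{V.get((x, y), 0.0):5.2f}" for x in range(COLS)))
```

Each sweep uses the previous sweep's values on the right-hand side of the equation, so no state has to be solved before its neighbours. The printed numbers depend on the assumptions above and need not match the values on the slide.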
82
00:04:59,240 --> 00:05:06,000
That's why the AI can do them as it's going through all this; it's nothing too complex
83
00:05:06,000 --> 00:05:06,510
for an AI.
84
00:05:06,540 --> 00:05:08,520
It can calculate these things.
85
00:05:08,520 --> 00:05:10,090
So that's our value here.
86
00:05:10,110 --> 00:05:11,520
But this is a different one.
87
00:05:11,520 --> 00:05:16,830
So here it used to be 0.9 just because of the discounting factor, remember, from here to here. Again, now from
88
00:05:16,830 --> 00:05:23,070
here we can't just jump from here to here, simply because even if we go like this we might end
89
00:05:23,070 --> 00:05:24,680
up back here.
90
00:05:24,700 --> 00:05:28,440
Right, there's a 20 percent chance that we'll still stay in this square because we'll hit a wall.
91
00:05:28,710 --> 00:05:29,730
And again and so on.
92
00:05:29,730 --> 00:05:32,700
So the value of being here is 0.71.
93
00:05:32,850 --> 00:05:35,370
Again this and the discounting factor.
94
00:05:35,370 --> 00:05:39,970
You know, this might look odd to you, that even with the discounting factor this is too high.
95
00:05:40,050 --> 00:05:44,440
Maybe the discounting factor in this example is not 0.9; maybe it's 0.99 or something
96
00:05:44,500 --> 00:05:46,310
like that, so don't worry about that.
97
00:05:46,350 --> 00:05:48,480
Just kind of focus on this:
98
00:05:48,480 --> 00:05:53,210
The values have indeed changed; the values are now less,
99
00:05:53,460 --> 00:05:58,700
mostly because it's not a hundred percent probability to get to the state that you want to get to,
100
00:05:59,100 --> 00:06:00,180
and what you will find.
101
00:06:00,210 --> 00:06:06,660
An interesting one is here: here it used to be 0.9, and it actually has dropped substantially.
102
00:06:06,660 --> 00:06:07,110
Why is that?
103
00:06:07,110 --> 00:06:12,120
Well because if you go from here up which is our intention there's a 10 percent chance of hitting a
104
00:06:12,120 --> 00:06:18,700
wall, but there's also a 10 percent chance of actually ending up in the fire pit and getting a minus one reward,
105
00:06:18,700 --> 00:06:22,820
and basically for the agent that means that's the end of the game.
106
00:06:23,160 --> 00:06:25,640
And so this is a very bad state to be in.
107
00:06:25,680 --> 00:06:29,910
So all of a sudden, remember we had 0.9 here and 0.9 here, and so they were equivalent.
108
00:06:29,910 --> 00:06:34,900
It doesn't matter, here or here, they were pretty much equal in terms of the value of being in each of these states.
109
00:06:34,980 --> 00:06:43,440
But now all of a sudden, bam, this state is like nearly twice as good as this one, simply because here,
110
00:06:43,590 --> 00:06:46,980
if you go straight, if you go right, where you want to go,
111
00:06:47,050 --> 00:06:51,270
the consequence of the randomness occurring is that you just stay here.
112
00:06:51,290 --> 00:06:55,070
Here, one of the consequences, with a 10 percent chance, is that you end up in the pit.
113
00:06:55,110 --> 00:07:02,160
So as you can see, this is no longer such a good state, simply because of that fluctuation
114
00:07:02,160 --> 00:07:03,460
that could happen.
115
00:07:03,570 --> 00:07:09,150
As you can see this one is also very bad because it's as bad as this one in terms of you know it's only
116
00:07:09,150 --> 00:07:12,660
10 percent chance of ending up in the pit and 10 percent chance of ending up in the wall.
117
00:07:12,660 --> 00:07:18,480
But at the same time there's a discounting factor So first of all the discounting factor and also after
118
00:07:18,480 --> 00:07:20,390
this one you'd have to go here.
119
00:07:20,700 --> 00:07:23,900
And even if you hypothetically went here you could end up in the pit again.
120
00:07:23,910 --> 00:07:28,710
So that chance would also be taken into account, because remember, this value is derived from this value
121
00:07:28,710 --> 00:07:31,760
and this value is derived from this value.
122
00:07:31,820 --> 00:07:32,350
Right.
123
00:07:32,400 --> 00:07:37,560
And therefore it's small. But in reality, actually, what I said there was wrong.
124
00:07:37,560 --> 00:07:39,640
This value is not derived from that.
125
00:07:39,810 --> 00:07:46,800
So if you just have a look now you will notice that this value over here is actually greater than this
126
00:07:46,800 --> 00:07:47,300
one.
127
00:07:47,610 --> 00:07:54,780
You will notice that for the agent it's better to go all this way than this way and it makes sense right.
128
00:07:54,780 --> 00:07:58,580
Because this way it doesn't lose; there's no chance of getting in the pit.
129
00:07:58,590 --> 00:08:03,450
Yes, it's a bit longer and therefore the discounting factor has a bigger effect.
130
00:08:03,510 --> 00:08:07,470
But at the same time simply because there's a chance of getting in the pit here if it goes straight
131
00:08:07,530 --> 00:08:09,140
there's a chance of jumping in.
132
00:08:09,160 --> 00:08:15,120
So it would rather take its time and just go around, because that way there's a much lesser chance
133
00:08:15,120 --> 00:08:16,530
of it getting in. But there still is.
134
00:08:16,530 --> 00:08:19,590
So from here it goes there, from here it goes there.
135
00:08:19,590 --> 00:08:23,590
It could potentially get into the pit, because it could end up there and then it could end up in the pit.
136
00:08:23,730 --> 00:08:27,430
But nevertheless it's a lesser chance so it will just go on like that.
137
00:08:27,430 --> 00:08:32,430
So it's very interesting to see how they've all changed. Remember, previously from here you'd go like that.
138
00:08:32,430 --> 00:08:34,790
From here you'd go like that and from here you'd go like that.
139
00:08:35,010 --> 00:08:36,870
And now all of a sudden you can see it's changed.
140
00:08:36,870 --> 00:08:41,000
Let's draw the arrows and see what it looks like now, and voilà.
141
00:08:41,010 --> 00:08:43,760
You see an even more random thing, right?
142
00:08:43,770 --> 00:08:45,260
So yes this is true.
143
00:08:45,270 --> 00:08:46,500
But look at what happened here.
144
00:08:46,500 --> 00:08:47,610
Look at this one.
145
00:08:47,690 --> 00:08:48,970
Look at this one.
146
00:08:49,050 --> 00:08:50,490
Were you expecting that?
147
00:08:50,520 --> 00:08:54,570
When I saw this one for the first time, I was definitely very impressed.
148
00:08:54,570 --> 00:08:59,800
I was super surprised and I was not expecting this at all.
149
00:08:59,970 --> 00:09:04,860
And this is an example of, you know, when AI can outsmart a human.
150
00:09:05,120 --> 00:09:10,680
It's something you couldn't even predict, but the AI, through reinforcement learning, remember
151
00:09:10,680 --> 00:09:14,400
that example of the robot dogs, can actually sometimes work better than
152
00:09:14,400 --> 00:09:21,330
pre-programmed robot dogs at playing soccer, simply because it comes up with these ideas that
153
00:09:21,390 --> 00:09:22,350
even we can't see.
154
00:09:22,440 --> 00:09:27,330
And this is a great example; you probably weren't expecting that either, that the agent, instead of going
155
00:09:27,330 --> 00:09:29,690
up, is like: why would I?
156
00:09:29,850 --> 00:09:33,120
Like if I go up then there's a 10 percent chance I'll jump into the pit.
157
00:09:33,120 --> 00:09:35,130
But what does it achieve by going into the wall?
158
00:09:35,280 --> 00:09:38,330
Well, 80 percent of the time it'll bump back and stay in this state.
159
00:09:38,490 --> 00:09:42,360
But 10 percent of the time it'll go here and 10 percent of the time it'll go here.
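The 80/10/10 split just described can be checked with a few lines of code. This is a minimal sketch under assumptions: the 4x3 grid, the wall and pit positions, and the names `step` and `transition_probs` are illustrative and not taken from the course material.

```python
# Illustrative 4x3 grid, cells are (column, row) with row 0 at the bottom.
# The layout is an assumption made for this sketch.
COLS, ROWS = 4, 3
WALL = {(1, 1)}
PIT = (3, 1)

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# With probability 0.1 each, the agent slips to either side of its intended move.
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}


def step(state, direction):
    """Move one cell; bumping into a wall or the border leaves the agent in place."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < COLS and 0 <= nxt[1] < ROWS
    return nxt if inside and nxt not in WALL else state


def transition_probs(state, action):
    """Return {next_state: probability} for an intended action."""
    probs = {}
    outcomes = ((action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1))
    for direction, prob in outcomes:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + prob
    return probs


cell = (2, 1)  # the square directly to the left of the pit
print("up:  ", transition_probs(cell, "up"))
print("left:", transition_probs(cell, "left"))
```

Under these assumptions, intending to go up from the square beside the pit leaves a 0.1 probability of landing in the pit, while walking into the wall on the left never does: the agent stays put with probability 0.8 and slides up or down with probability 0.1 each.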
160
00:09:42,360 --> 00:09:49,130
So all of a sudden you can see that now, in this new approach of jumping into the wall,
161
00:09:49,170 --> 00:09:53,350
there is a zero percent chance it will go into the fire pit from this spot.
162
00:09:53,370 --> 00:09:57,690
And it really doesn't want to go into the fire pit, so it just bumps into the wall a couple
163
00:09:57,690 --> 00:10:03,050
of times and then it will go right or left at some point, because that randomness is going to happen.
164
00:10:03,080 --> 00:10:09,680
And so it learned that through experimentation it learned that OK when I go forward the results are
165
00:10:09,680 --> 00:10:11,440
not as good as when I go into the wall.
166
00:10:11,510 --> 00:10:13,540
And if you think about it it's like this.
167
00:10:13,580 --> 00:10:18,350
If you think about it, this is a fire pit, and this square is like
168
00:10:18,350 --> 00:10:21,630
a very tiny ledge and then this is like a mountain like a cliff.
169
00:10:21,650 --> 00:10:27,830
And this robot is just hugging the cliff and waiting until it gets pushed right
170
00:10:27,830 --> 00:10:32,640
or left, because, well, as a human you'd probably do the same: you wouldn't be standing facing that way;
171
00:10:32,750 --> 00:10:34,970
you'd be hugging the cliff right.
172
00:10:35,000 --> 00:10:35,860
Or something like that.
173
00:10:35,940 --> 00:10:39,740
And hopefully, you know, we never end up in situations like that.
174
00:10:39,770 --> 00:10:43,670
But just visually, if you think about it, it's something like that.
175
00:10:43,760 --> 00:10:46,450
And so that's pretty intense right.
176
00:10:46,460 --> 00:10:51,860
So the AI came up with this idea, and same here: instead of going left and risking getting in the fire pit,
177
00:10:51,860 --> 00:10:56,270
I'm just going to bounce off the wall, like, you know, hug the wall, try to jump into the wall, and
178
00:10:56,300 --> 00:11:01,430
at some point I know that, you know, there's a 10 percent chance every time I do
179
00:11:01,430 --> 00:11:04,910
that I'll go here and something will happen and I'll end up here and I'll be safe and then I'll just
180
00:11:04,910 --> 00:11:06,680
keep going like that.
181
00:11:06,830 --> 00:11:13,240
So it's a very, very interesting approach that it took here, and you can see the routes are like this: so
182
00:11:13,250 --> 00:11:17,500
from here it might go right and then it'll go right to the exit, or here it'll go left like that.
183
00:11:17,690 --> 00:11:22,230
And here at some point it will go left and go like that again.
184
00:11:22,310 --> 00:11:23,170
This is important.
185
00:11:23,180 --> 00:11:27,610
It's a policy, not a plan, so even when it jumps from here it will go here,
186
00:11:27,650 --> 00:11:30,400
maybe. And then from here it might actually go straight.
187
00:11:30,410 --> 00:11:34,520
It might actually go back to the right and then from here and I going to let me get that right.
188
00:11:34,550 --> 00:11:38,260
So there are lots of different options for it, because it might not follow exactly this arrow; it might go the
189
00:11:38,270 --> 00:11:38,730
other way.
190
00:11:38,960 --> 00:11:42,500
These are just the desired routes that it's designed for itself.
191
00:11:42,590 --> 00:11:44,690
But the way it'll work out actually could be different.
192
00:11:44,690 --> 00:11:46,130
It depends on the real world.
193
00:11:46,340 --> 00:11:46,940
So there we go.
194
00:11:46,950 --> 00:11:50,090
That's the world of artificial intelligence.
195
00:11:50,090 --> 00:11:56,780
That's what a policy versus a plan is, and hopefully you're slowly getting excited by what the
196
00:11:57,000 --> 00:12:01,220
AI can do especially given what we saw over here.
197
00:12:01,340 --> 00:12:07,430
These are some very virtuoso decisions that the AI is coming up with.
198
00:12:07,610 --> 00:12:12,500
And as you can see when you're applying AI, even from this small example you can see that even when you
199
00:12:12,500 --> 00:12:18,950
apply it in the real world, maybe it'll come up with ideas and decisions that sometimes even humans can't come
200
00:12:18,950 --> 00:12:19,240
up with.
201
00:12:19,250 --> 00:12:25,460
And that's exactly what happened in those games where Google's AlphaGo was playing
202
00:12:25,520 --> 00:12:32,320
versus Lee Sedol, the world champion of Go.
203
00:12:32,390 --> 00:12:37,000
And they were playing in Korea back in 2016; I think it was March 2016.
204
00:12:37,000 --> 00:12:42,370
It came up with some moves that humans had never played in 3000 years or humans were not used to playing.
205
00:12:42,380 --> 00:12:45,510
And this is exactly an example of that.
206
00:12:45,740 --> 00:12:50,290
So once again, I hope you're getting excited and pumped about this course and about what we can integrate.
207
00:12:50,330 --> 00:12:51,840
And I look forward to
208
00:12:51,840 --> 00:12:52,720
seeing you next time.
209
00:12:52,730 --> 00:12:54,410
Until then, enjoy
210
00:12:54,410 --> 00:12:54,640
AI.