1
00:00:01,090 --> 00:00:04,270
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,290 --> 00:00:07,260
Today we're talking about the living penalty.
3
00:00:07,600 --> 00:00:13,540
All right, so here we've got our Bellman equation, and as we've been going through this course we've been
4
00:00:13,540 --> 00:00:20,030
slowly making it more and more complex. So far we've already added these probabilities in here.
5
00:00:20,200 --> 00:00:22,930
And also we've added the discounting factor.
6
00:00:22,930 --> 00:00:28,440
Now we're going to look in more detail at this side of the equation, where we have the reward.
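For reference, one standard textbook way of writing the Bellman equation being described here, with the transition probabilities and the discount factor already in place, is the following; the R(s, a) term is the reward piece that the rest of this lecture focuses on (this notation is a common convention, not copied from the slide):

    V(s) = \max_a \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Big)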
7
00:00:28,480 --> 00:00:34,660
Remember previously when we talked about how reinforcement learning works we said we have an agent and
8
00:00:34,660 --> 00:00:41,290
it performs actions in the environment, and in exchange, or as a result of that, it gets a new state,
9
00:00:41,320 --> 00:00:45,600
which it is now in, and a reward for that action.
10
00:00:45,610 --> 00:00:52,210
Well, so far in our example we've only been getting rewards at the very end: either the agent gets to the finish
11
00:00:52,210 --> 00:00:58,640
line, or he ends up in the fire pit, and he gets a plus one or a minus one reward.
12
00:00:58,960 --> 00:01:05,770
But that is a very simplistic approach to reinforcement learning and in more realistic scenarios you
13
00:01:05,800 --> 00:01:11,050
will likely have rewards throughout the journey, not just at
14
00:01:11,050 --> 00:01:11,380
the very end.
15
00:01:11,380 --> 00:01:20,680
For instance, if it's an AI playing a game, and if for example it's shooting somebody in Doom, it
16
00:01:20,680 --> 00:01:26,320
might get points for killing that enemy. Or, in a different game, it might get points
17
00:01:26,470 --> 00:01:32,260
if it overtakes another car or something like that, just because of the rules of the game, not because
18
00:01:32,260 --> 00:01:39,400
of its way of analyzing the game, but because the game is actually structured in a way that it's reinforcing:
19
00:01:39,400 --> 00:01:43,230
it's giving points for doing certain actions even before the game is over.
20
00:01:43,540 --> 00:01:49,570
So scenarios like that are very common, not just in games but also in real life, and that's why we're
21
00:01:49,570 --> 00:01:55,120
going to introduce something similar into our example, a simplified version of that, but nevertheless
22
00:01:55,330 --> 00:02:01,180
a reward that is continuously given to the agent throughout the game, not just at the end. And the way
23
00:02:01,180 --> 00:02:04,450
we're going to do it is by looking at the other tiles.
24
00:02:04,450 --> 00:02:10,060
So right now we only have a reward of plus one at the final tile and a reward of minus one at the other final
25
00:02:10,060 --> 00:02:11,530
tile, the fire pit.
26
00:02:11,800 --> 00:02:14,310
But now we're going to add rewards in every single tile.
27
00:02:14,430 --> 00:02:17,770
We'll add a very small reward: it will be minus 0.04.
28
00:02:17,770 --> 00:02:23,440
And as you can see it's negative, so every time the agent moves he'll get a negative reward, and that's
29
00:02:23,440 --> 00:02:28,300
what's called a living penalty because no matter where he goes he will always get this negative reward
30
00:02:28,450 --> 00:02:31,000
except for these final tiles because that's the end of the game.
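As a minimal sketch of what that reward scheme looks like in code (the grid coordinates and names here are illustrative assumptions, not the course's actual implementation):

    # Reward received when the agent *enters* a tile.
    GOAL = (4, 3)        # finish tile: +1.0 and the episode ends
    FIRE_PIT = (4, 2)    # fire pit tile: -1.0 and the episode ends
    LIVING_PENALTY = -0.04

    def reward(tile):
        if tile == GOAL:
            return 1.0
        if tile == FIRE_PIT:
            return -1.0
        return LIVING_PENALTY   # every other tile costs a little, so wandering adds up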
31
00:02:31,300 --> 00:02:35,120
And so you can see the reward even on this tile is minus 0.04.
32
00:02:35,170 --> 00:02:37,960
But that doesn't mean that he starts with that reward.
33
00:02:37,960 --> 00:02:39,470
He only gets this reward when he enters a tile.
34
00:02:39,760 --> 00:02:44,860
And this is important to remember. So whenever he performs
35
00:02:44,860 --> 00:02:51,110
an action and goes here, then he will get this reward of minus 0.04, and then if he comes back to this tile
36
00:02:51,130 --> 00:02:53,650
he'll get another minus 0.04 reward.
37
00:02:53,770 --> 00:03:00,370
And so the longer he walks around, the more he accumulates this negative reward, and therefore there's an incentive
38
00:03:00,370 --> 00:03:03,870
for him to finish the game as quickly as possible.
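In other words, the return collected over an episode is just the (discounted) sum of the per-step rewards, so every extra move adds another minus 0.04; with no discounting, ten aimless moves already cost about 10 x 0.04 = 0.4 before any terminal reward (the ten-move figure is purely illustrative):

    G = R(s_1) + \gamma R(s_2) + \gamma^2 R(s_3) + \dots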
39
00:03:03,890 --> 00:03:10,390
And so now let's have a look at how our policy, or how the agent's policy, is going to change depending
40
00:03:10,420 --> 00:03:14,150
on what value we set for this reward.
41
00:03:14,410 --> 00:03:18,730
So here are four environments, and in each one we're going to explore a different reward value.
42
00:03:18,770 --> 00:03:21,070
We're not going to do the calculations.
43
00:03:21,130 --> 00:03:25,690
We're just going to present the results, and you will see that intuitively they make total sense.
44
00:03:25,690 --> 00:03:31,820
So here the reward for any step, or for getting into any state,
45
00:03:32,050 --> 00:03:32,830
is equal to zero,
46
00:03:32,830 --> 00:03:36,890
just as what we've seen before. Here the reward is going to be minus 0.04,
47
00:03:36,910 --> 00:03:43,150
which is what we just did just now. And here the level of the living penalty
48
00:03:43,150 --> 00:03:47,690
will be minus 0.5, so much higher; as you can see, it's more than 10 times greater.
49
00:03:47,800 --> 00:03:50,170
And here the living penalty will be minus 2.0.
50
00:03:50,170 --> 00:03:59,050
So that's even worse, even lower, than the reward that the agent
51
00:03:59,050 --> 00:04:00,700
gets for ending up in the fire pit.
52
00:04:00,700 --> 00:04:07,660
So let's have a look at how the actions or the optimal policy for passing this environment will change
53
00:04:07,660 --> 00:04:09,160
depending on this reward.
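If you want to experiment with this yourself, here is a minimal value-iteration sketch, assuming the usual 4x3 gridworld (wall at (2,2), finish at (4,3), fire pit at (4,2)) and the 80/10/10 move noise used in this course; the layout, the discount factor of 0.9 and all the names are my own illustrative assumptions, so the exact arrows you get may differ slightly from the slides:

    ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                     'left': ('up', 'down'), 'right': ('up', 'down')}
    WALL, GOAL, PIT = (2, 2), (4, 3), (4, 2)
    STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]

    def next_state(state, action):
        """Deterministic effect of a move; bumping into the wall or the edge stays put."""
        dx, dy = ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        return state if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) else nxt

    def q_value(V, state, action, living_reward, gamma):
        """Expected value of one action: 80% intended move, 10% each perpendicular slip."""
        total = 0.0
        for prob, a in [(0.8, action)] + [(0.1, p) for p in PERPENDICULAR[action]]:
            nxt = next_state(state, a)
            r = 1.0 if nxt == GOAL else -1.0 if nxt == PIT else living_reward
            total += prob * (r + gamma * (0.0 if nxt in (GOAL, PIT) else V[nxt]))
        return total

    def optimal_policy(living_reward, gamma=0.9, sweeps=100):
        V = {s: 0.0 for s in STATES}
        for _ in range(sweeps):
            V = {s: 0.0 if s in (GOAL, PIT)
                 else max(q_value(V, s, a, living_reward, gamma) for a in ACTIONS)
                 for s in STATES}
        return {s: max(ACTIONS, key=lambda a: q_value(V, s, a, living_reward, gamma))
                for s in STATES if s not in (GOAL, PIT)}

    for lr in (0.0, -0.04, -0.5, -2.0):    # the four living rewards from the lecture
        print(lr, optimal_policy(lr))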
54
00:04:09,170 --> 00:04:11,560
So this is our original policy.
55
00:04:11,920 --> 00:04:18,280
And as you can remember, we had these two very interesting, and even a little bit weird, decisions by
56
00:04:18,280 --> 00:04:23,950
the agent, which totally make sense if he can live for as long as he likes.
57
00:04:23,950 --> 00:04:29,530
If he can just travel around for as long as he wants without being penalized for staying alive very
58
00:04:29,530 --> 00:04:30,430
long.
59
00:04:30,670 --> 00:04:37,630
Why wouldn't he just go into the corner here, into the wall, and just keep doing that until
60
00:04:37,870 --> 00:04:38,470
it so happens
61
00:04:38,470 --> 00:04:41,300
that he randomly goes this way, and then he will walk around.
62
00:04:41,500 --> 00:04:46,120
And same thing here it's much safer for him to jump into the wall hoping that one of these will come
63
00:04:46,120 --> 00:04:51,970
up eventually and then he'll go to the finish line anyway because by choosing these two actions he doesn't
64
00:04:51,970 --> 00:04:53,680
risk getting into the fire pit.
65
00:04:53,690 --> 00:04:59,950
Now let's see what happens if we add a negative reward for just being alive, for making a step.
66
00:05:00,270 --> 00:05:04,960
Right here you can see that instantly these two changed.
67
00:05:04,970 --> 00:05:07,940
Now the agent doesn't want to jump into the wall.
68
00:05:07,940 --> 00:05:13,490
He is more likely to risk getting into the fire pit, having a 10 percent chance of jumping in here, but he
69
00:05:13,490 --> 00:05:19,400
will go forward, because, watch here (and if he was going to be doing it here as well), every time
70
00:05:19,850 --> 00:05:24,620
he jumps into the wall, he performs an action and he ends up in this state with an 80 percent
71
00:05:24,620 --> 00:05:24,990
chance.
72
00:05:25,010 --> 00:05:31,180
And that means an 80 percent chance he'll get a minus 0.04 reward, meaning that a lot of the time he's
73
00:05:31,190 --> 00:05:34,940
going to be accumulating this negative reward.
74
00:05:34,940 --> 00:05:41,600
Same thing here if he jumps into the wall waiting for that moment when he will actually be randomly
75
00:05:41,600 --> 00:05:42,780
moved to the right.
76
00:05:42,980 --> 00:05:49,340
If he keeps doing that he will accumulate this negative reward, and if you perform
77
00:05:49,340 --> 00:05:55,670
the calculations, you'll see that the expected value of that approach, jumping into the
78
00:05:55,670 --> 00:06:02,840
wall, is worse than taking the risk of going forward and actually ending up in the fire pit.
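A rough back-of-the-envelope version of that calculation (my own illustrative numbers, assuming the 80/10/10 move noise and that both sideways slips actually move the agent off the wall): a wall bump only "escapes" on a 10% + 10% = 20% slip, so it takes about 1 / 0.2 = 5 attempts on average, and those attempts alone accumulate roughly

    5 \times 0.04 = 0.2

of living penalty before the agent even starts heading toward the finish line, which is what tips the expected value in favour of the riskier direct move.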
79
00:06:02,840 --> 00:06:10,230
So he changes his decisions in these two blocks: here, to instead move forward, and here, to move to the left, even
80
00:06:10,230 --> 00:06:15,320
though there's a risk of the fire pit, simply because now the longer he's alive, the more he will
81
00:06:15,320 --> 00:06:18,830
accumulate this living penalty.
82
00:06:18,830 --> 00:06:23,720
In the next environment, we're increasing the living penalty to an even greater number, minus 0.5, and let's see what
83
00:06:23,720 --> 00:06:24,590
changes here.
84
00:06:24,860 --> 00:06:27,220
So now you can see that, compared to this environment,
85
00:06:27,260 --> 00:06:31,740
the only thing that changed here is that this arrow is pointing to the right.
86
00:06:32,060 --> 00:06:38,360
And what that means is that now this is no longer a good option for the agent. Or actually, also this arrow,
87
00:06:38,360 --> 00:06:42,340
which was pointing to the left, is now pointing upwards.
88
00:06:42,350 --> 00:06:48,740
So now it's no longer a good idea for the agent to go around from here or go around all the way because
89
00:06:49,100 --> 00:06:53,330
if he goes around all the way, yes, he's safe, there's no chance of getting into the
90
00:06:53,340 --> 00:06:54,030
fire pit.
91
00:06:54,320 --> 00:06:57,640
Or rather, there's less chance of that happening.
92
00:06:57,710 --> 00:07:03,140
But at the same time he will accumulate quite a substantial negative reward as he walks around.
93
00:07:03,140 --> 00:07:05,540
So it's just that the path is too long.
94
00:07:05,540 --> 00:07:12,350
So that forces him, whether he's here or here, to take the shorter route to get here, even though he has
95
00:07:12,350 --> 00:07:17,330
a much higher risk of getting into the fire pit, because as soon as he ends up in this square there's a
96
00:07:17,330 --> 00:07:19,350
10 percent chance of getting to the fire.
97
00:07:20,120 --> 00:07:21,760
According to his calculations,
98
00:07:21,800 --> 00:07:27,980
the expected value of this approach is just better than the expected value of going around, simply
99
00:07:27,980 --> 00:07:30,480
because we've increased this living penalty.
100
00:07:30,710 --> 00:07:37,130
And finally we're getting to the example with the living penalty of minus two point zero.
101
00:07:37,130 --> 00:07:43,010
So here, now that you've seen how the policy has changed as we increase
102
00:07:43,010 --> 00:07:44,430
the living penalty,
103
00:07:44,450 --> 00:07:49,850
I encourage you to pause the video and think for yourself what will happen in this scenario.
104
00:07:49,850 --> 00:07:57,070
What do you think the optimal policy will be, given that the living penalty is so high? So pause the
105
00:07:57,090 --> 00:07:58,280
video if you'd like to.
106
00:07:58,490 --> 00:08:04,880
And now I'm going to jump into showing you the solution. So in this case, if you increase the penalty
107
00:08:04,880 --> 00:08:13,460
to minus 2.0, it's so high (remember that the penalty here is only minus 1.0), so high that the agent
108
00:08:13,680 --> 00:08:18,540
just wants to get out of the game in any way possible even if it's just by jumping into the fire pit.
109
00:08:18,560 --> 00:08:19,200
He will do it.
110
00:08:19,220 --> 00:08:25,460
He will be like: every time I make a step, every time I end up in a new state, or every time
111
00:08:25,460 --> 00:08:30,020
I take an action, I end up getting a minus 2.0 reward.
112
00:08:30,020 --> 00:08:36,280
So what's the point of trying to get to the finish line if from here it will take me two extra steps?
113
00:08:36,350 --> 00:08:41,060
I'm just going to go here and then straight into the fire pit, because that way my negative reward is
114
00:08:41,060 --> 00:08:49,190
not going to be as bad as in the case of just making those additional steps.
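To make that concrete with the lecture's own numbers and the "two extra steps" just mentioned (so the assumption is that the fire pit is one move away and the finish line three moves away): walking to the finish line collects two living penalties plus the terminal reward, while stepping straight into the fire pit only collects its own minus one,

    2 \times (-2.0) + 1.0 = -3.0 \quad\text{vs.}\quad -1.0

so ending the episode immediately really does have the better total under a penalty this large.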
115
00:08:49,190 --> 00:08:56,770
So you can see that when we add this living reward, and depending on the value of the living reward we're
116
00:08:56,780 --> 00:08:59,270
adding, the results are going to be different.
117
00:08:59,270 --> 00:09:06,290
And the agent is going to select different policies, and that's basically how the reward value
118
00:09:06,440 --> 00:09:12,020
is incorporated by the Bellman equation, even when it's not just at the finish line or at the
119
00:09:12,020 --> 00:09:13,790
end of the game but even throughout the game.
120
00:09:13,790 --> 00:09:19,250
And once again, it doesn't have to be in every single state. Depending on the environment
121
00:09:19,250 --> 00:09:20,180
itself,
122
00:09:20,180 --> 00:09:26,540
it might be given to the agent at certain specific states, not at every state. But in our simplistic example
123
00:09:26,540 --> 00:09:29,880
we're just using rewards at every given state.
124
00:09:30,050 --> 00:09:34,470
That's just to illustrate this concept. So I hope you enjoyed today's tutorial.
125
00:09:34,580 --> 00:09:40,550
And as you can see, we've already made our Bellman equation quite sophisticated, and now it can be applied
126
00:09:40,550 --> 00:09:44,340
to many different scenarios, and I can't wait to see you in the next tutorial.
127
00:09:44,360 --> 00:09:46,200
And until then, enjoy AI.