1
00:00:01,090 --> 00:00:04,270
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,290 --> 00:00:07,260
Today we're talking about the living penalty.
3
00:00:07,600 --> 00:00:13,540
All right, so here we've got our Bellman equation, and as we've been going through this course we've been
4
00:00:13,540 --> 00:00:20,030
slowly making it more and more complex. So far we've already added these probabilities in here.
5
00:00:20,200 --> 00:00:22,930
And also we've added the discounting factor.
6
00:00:22,930 --> 00:00:28,440
Now we're going to look in more detail at this side of the equation, where we have the reward.
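For reference, one standard textbook way of writing the Bellman equation being described here, with the transition probabilities and the discount factor already in place, is the following; the R(s, a) term is the reward piece that the rest of this lecture focuses on (this notation is a common convention, not copied from the slide):

    V(s) = \max_a \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Big)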
7
00:00:28,480 --> 00:00:34,660
Remember previously when we talked about how reinforcement learning works we said we have an agent and
8
00:00:34,660 --> 00:00:41,290
it performs actions in the environment, and in exchange, or as a result of that, it gets a new state,
9
00:00:41,320 --> 00:00:45,600
which it is now in, and a reward for that action.
10
00:00:45,610 --> 00:00:52,210
Well, so far in our example we've only been getting rewards at the very end: either the agent gets to the finish
11
00:00:52,210 --> 00:00:58,640
line, or he ends up in the fire pit, and he gets a plus one or a minus one reward.
12
00:00:58,960 --> 00:01:05,770
But that is a very simplistic approach to reinforcement learning and in more realistic scenarios you
13
00:01:05,800 --> 00:01:11,050
will likely have rewards throughout the journey, not just at
14
00:01:11,050 --> 00:01:11,380
the very end.
15
00:01:11,380 --> 00:01:20,680
For instance, if it's an AI playing a game, and if for example it's shooting somebody in Doom, it
16
00:01:20,680 --> 00:01:26,320
might get points for killing that enemy. Or, in a different game, it might get points
17
00:01:26,470 --> 00:01:32,260
if it overtakes another car or something like that, just because of the rules of the game, not because
18
00:01:32,260 --> 00:01:39,400
of its way of analyzing the game, but because the game is actually structured in a way that it's reinforcing:
19
00:01:39,400 --> 00:01:43,230
it's giving points for doing certain actions even before the game is over.
20
00:01:43,540 --> 00:01:49,570
So scenarios like that are very common, not just in games but also in real life, and that's why we're
21
00:01:49,570 --> 00:01:55,120
going to introduce something similar into our example, a simplified version of that, but nevertheless
22
00:01:55,330 --> 00:02:01,180
a reward that is continuously given to the agent throughout the game, not just at the end. And the way
23
00:02:01,180 --> 00:02:04,450
we're going to do it is by looking at the other tiles.
24
00:02:04,450 --> 00:02:10,060
So right now we only have a reward of plus one at the final tile and a reward of minus one at the other final
25
00:02:10,060 --> 00:02:11,530
tile, the fire pit.
26
00:02:11,800 --> 00:02:14,310
But now we're going to add rewards in every single tile.
27
00:02:14,430 --> 00:02:17,770
We'll add a very small reward: it will be minus 0.04.
28
00:02:17,770 --> 00:02:23,440
And as you can see it's negative, so every time the agent moves he'll get a negative reward, and that's
29
00:02:23,440 --> 00:02:28,300
what's called a living penalty because no matter where he goes he will always get this negative reward
30
00:02:28,450 --> 00:02:31,000
except for these final tiles because that's the end of the game.
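As a minimal sketch of what that reward scheme looks like in code (the grid coordinates and names here are illustrative assumptions, not the course's actual implementation):

    # Reward received when the agent *enters* a tile.
    GOAL = (4, 3)        # finish tile: +1.0 and the episode ends
    FIRE_PIT = (4, 2)    # fire pit tile: -1.0 and the episode ends
    LIVING_PENALTY = -0.04

    def reward(tile):
        if tile == GOAL:
            return 1.0
        if tile == FIRE_PIT:
            return -1.0
        return LIVING_PENALTY   # every other tile costs a little, so wandering adds up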
31
00:02:31,300 --> 00:02:35,120
And so you can see the reward even on this tile is minus 0.04.
32
00:02:35,170 --> 00:02:37,960
But that doesn't mean that he starts with that reward.
33
00:02:37,960 --> 00:02:39,470
He only gets this reward when he enters a tile.
34
00:02:39,760 --> 00:02:44,860
And this is important to remember. So whenever he performs
35
00:02:44,860 --> 00:02:51,110
an action and goes here, then he will get this reward of minus 0.04, and then if he comes back to this tile
36
00:02:51,130 --> 00:02:53,650
he'll get another minus 0.04 reward.
37
00:02:53,770 --> 00:03:00,370
And so the longer he walks around, the more he accumulates this negative reward, and therefore there's an incentive
38
00:03:00,370 --> 00:03:03,870
for him to finish the game as quickly as possible.
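In other words, the return collected over an episode is just the (discounted) sum of the per-step rewards, so every extra move adds another minus 0.04; with no discounting, ten aimless moves already cost about 10 x 0.04 = 0.4 before any terminal reward (the ten-move figure is purely illustrative):

    G = R(s_1) + \gamma R(s_2) + \gamma^2 R(s_3) + \dots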
39
00:03:03,890 --> 00:03:10,390
And so now let's have a look at how our policy, or how the agent's policy, is going to change depending
40
00:03:10,420 --> 00:03:14,150
on what value we set for this reward.
41
00:03:14,410 --> 00:03:18,730
So here are four environments, and in each one we're going to explore a different reward value.
42
00:03:18,770 --> 00:03:21,070
We're not going to do the calculations.
43
00:03:21,130 --> 00:03:25,690
We're just going to present the results, and you will see that intuitively they make total sense.
44
00:03:25,690 --> 00:03:31,820
So here the reward for any step, or for getting into any state,
45
00:03:32,050 --> 00:03:32,830
is equal to zero,
46
00:03:32,830 --> 00:03:36,890
just as what we've seen before. Here the reward is going to be minus 0.04,
47
00:03:36,910 --> 00:03:43,150
which is what we just did just now. And here the level of the living penalty
48
00:03:43,150 --> 00:03:47,690
will be minus 0.5, so much higher; as you can see, it's more than 10 times greater.
49
00:03:47,800 --> 00:03:50,170
And here the living penalty will be minus 2.0.
50
00:03:50,170 --> 00:03:59,050
So that's even worse, even lower, than the reward that the agent
51
00:03:59,050 --> 00:04:00,700
gets for ending up in the fire pit.
52
00:04:00,700 --> 00:04:07,660
So let's have a look at how the actions or the optimal policy for passing this environment will change
53
00:04:07,660 --> 00:04:09,160
depending on this reward.
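If you want to experiment with this yourself, here is a minimal value-iteration sketch, assuming the usual 4x3 gridworld (wall at (2,2), finish at (4,3), fire pit at (4,2)) and the 80/10/10 move noise used in this course; the layout, the discount factor of 0.9 and all the names are my own illustrative assumptions, so the exact arrows you get may differ slightly from the slides:

    ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                     'left': ('up', 'down'), 'right': ('up', 'down')}
    WALL, GOAL, PIT = (2, 2), (4, 3), (4, 2)
    STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]

    def next_state(state, action):
        """Deterministic effect of a move; bumping into the wall or the edge stays put."""
        dx, dy = ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        return state if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) else nxt

    def q_value(V, state, action, living_reward, gamma):
        """Expected value of one action: 80% intended move, 10% each perpendicular slip."""
        total = 0.0
        for prob, a in [(0.8, action)] + [(0.1, p) for p in PERPENDICULAR[action]]:
            nxt = next_state(state, a)
            r = 1.0 if nxt == GOAL else -1.0 if nxt == PIT else living_reward
            total += prob * (r + gamma * (0.0 if nxt in (GOAL, PIT) else V[nxt]))
        return total

    def optimal_policy(living_reward, gamma=0.9, sweeps=100):
        V = {s: 0.0 for s in STATES}
        for _ in range(sweeps):
            V = {s: 0.0 if s in (GOAL, PIT)
                 else max(q_value(V, s, a, living_reward, gamma) for a in ACTIONS)
                 for s in STATES}
        return {s: max(ACTIONS, key=lambda a: q_value(V, s, a, living_reward, gamma))
                for s in STATES if s not in (GOAL, PIT)}

    for lr in (0.0, -0.04, -0.5, -2.0):    # the four living rewards from the lecture
        print(lr, optimal_policy(lr))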
54
00:04:09,170 --> 00:04:11,560
So this is our original policy.
55
00:04:11,920 --> 00:04:18,280
And as you can remember, we had these two very interesting, and even a little bit weird, decisions by
56
00:04:18,280 --> 00:04:23,950
the agent, which totally make sense if he can live for as long as he likes.
57
00:04:23,950 --> 00:04:29,530
If he can just travel around for as long as he wants without being penalized for staying alive very
58
00:04:29,530 --> 00:04:30,430
long.
59
00:04:30,670 --> 00:04:37,630
Why wouldn't he just go into the corner here, into the wall, and just keep doing that until
60
00:04:37,870 --> 00:04:38,470
it so happens
61
00:04:38,470 --> 00:04:41,300
that he randomly goes this way, and then he will walk around.
62
00:04:41,500 --> 00:04:46,120
And same thing here it's much safer for him to jump into the wall hoping that one of these will come
63
00:04:46,120 --> 00:04:51,970
up eventually and then he'll go to the finish line anyway because by choosing these two actions he doesn't
64
00:04:51,970 --> 00:04:53,680
risk getting into the fire pit.
65
00:04:53,690 --> 00:04:59,950
Now let's see what happens if we add a negative reward for just being alive, for making a step.
66
00:05:00,270 --> 00:05:04,960
Right here you can see that instantly these two changed.
67
00:05:04,970 --> 00:05:07,940
Now the agent doesn't want to jump into the wall.
68
00:05:07,940 --> 00:05:13,490
He is more likely to risk getting into the fire pit, having a 10 percent chance of jumping in here, but he
69
00:05:13,490 --> 00:05:19,400
will go forward, because, watch here (and if he was going to be doing it here as well), every time
70
00:05:19,850 --> 00:05:24,620
he jumps into the wall, he performs an action and he ends up in this state with an 80 percent
71
00:05:24,620 --> 00:05:24,990
chance.
72
00:05:25,010 --> 00:05:31,180
And that means an 80 percent chance he'll get a minus 0.04 reward, meaning that a lot of the time he's
73
00:05:31,190 --> 00:05:34,940
going to be accumulating this negative reward.
74
00:05:34,940 --> 00:05:41,600
Same thing here if he jumps into the wall waiting for that moment when he will actually be randomly
75
00:05:41,600 --> 00:05:42,780
moved to the right.
76
00:05:42,980 --> 00:05:49,340
If he keeps doing that he will accumulate this negative reward, and if you perform
77
00:05:49,340 --> 00:05:55,670
the calculations, you'll see that the expected value of that approach, jumping into the
78
00:05:55,670 --> 00:06:02,840
wall, is worse than taking the risk of going forward and actually ending up in the fire pit.
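A rough back-of-the-envelope version of that calculation (my own illustrative numbers, assuming the 80/10/10 move noise and that both sideways slips actually move the agent off the wall): a wall bump only "escapes" on a 10% + 10% = 20% slip, so it takes about 1 / 0.2 = 5 attempts on average, and those attempts alone accumulate roughly

    5 \times 0.04 = 0.2

of living penalty before the agent even starts heading toward the finish line, which is what tips the expected value in favour of the riskier direct move.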
79
00:06:02,840 --> 00:06:10,230
So he changes his decisions in these two blocks: here, to instead move forward, and here, to move to the left, even
80
00:06:10,230 --> 00:06:15,320
though there's a risk of the fire pit, simply because now the longer he's alive, the more he will
81
00:06:15,320 --> 00:06:18,830
accumulate this living penalty.
82
00:06:18,830 --> 00:06:23,720
In the next environment, we're increasing the living penalty to an even greater number, minus 0.5, and let's see what
83
00:06:23,720 --> 00:06:24,590
changes here.
84
00:06:24,860 --> 00:06:27,220
So now you can see that, compared to this environment,
85
00:06:27,260 --> 00:06:31,740
the only thing that changed here is that this arrow is pointing to the right.
86
00:06:32,060 --> 00:06:38,360
And what that means is that now this is no longer a good option for the agent. Or actually, also this arrow,
87
00:06:38,360 --> 00:06:42,340
which was pointing to the left, is now pointing upwards.
88
00:06:42,350 --> 00:06:48,740
So now it's no longer a good idea for the agent to go around from here or go around all the way because
89
00:06:49,100 --> 00:06:53,330
if he goes around all the way, yes, he's safe, there's no chance of getting into the
90
00:06:53,340 --> 00:06:54,030
fire pit.
91
00:06:54,320 --> 00:06:57,640
Or rather, there's less chance of that happening.
92
00:06:57,710 --> 00:07:03,140
But at the same time he will accumulate quite a substantial negative reward as he walks around.
93
00:07:03,140 --> 00:07:05,540
So it's just that the path is too long.
94
00:07:05,540 --> 00:07:12,350
So that forces him, whether he's here or here, to take the shorter route to get here, even though he has
95
00:07:12,350 --> 00:07:17,330
a much higher risk of getting into the fire pit, because as soon as he ends up in this square there's a
96
00:07:17,330 --> 00:07:19,350
10 percent chance of getting to the fire.
97
00:07:20,120 --> 00:07:21,760
According to his calculations,
98
00:07:21,800 --> 00:07:27,980
the expected value of this approach is just better than the expected value of going around, simply
99
00:07:27,980 --> 00:07:30,480
because we've increased this living penalty.
100
00:07:30,710 --> 00:07:37,130
And finally we're getting to the example with the living penalty of minus two point zero.
101
00:07:37,130 --> 00:07:43,010
So here, now that you've seen how the policy has changed as we increase
102
00:07:43,010 --> 00:07:44,430
the living penalty,
103
00:07:44,450 --> 00:07:49,850
I encourage you to pause the video and think for yourself what will happen in this scenario.
104
00:07:49,850 --> 00:07:57,070
What do you think the optimal policy will be, given that the living penalty is so high? So pause the
105
00:07:57,090 --> 00:07:58,280
video if you'd like to.
106
00:07:58,490 --> 00:08:04,880
And now I'm going to jump into showing you the solution. So in this case, if you increase the penalty
107
00:08:04,880 --> 00:08:13,460
to minus 2.0, it's so high (remember that the penalty here is only minus 1.0), so high that the agent
108
00:08:13,680 --> 00:08:18,540
just wants to get out of the game in any way possible even if it's just by jumping into the fire pit.
109
00:08:18,560 --> 00:08:19,200
He will do it.
110
00:08:19,220 --> 00:08:25,460
He will be like: every time I make a step, every time I end up in a new state, or every time
111
00:08:25,460 --> 00:08:30,020
I take an action, I end up getting a minus 2.0 reward.
112
00:08:30,020 --> 00:08:36,280
So what's the point of trying to get to the finish line if from here it will take me two extra steps?
113
00:08:36,350 --> 00:08:41,060
I'm just going to go here and then straight into the fire pit, because that way my negative reward is
114
00:08:41,060 --> 00:08:49,190
not going to be as bad as in the case of just making those additional steps.
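To make that concrete with the lecture's own numbers and the "two extra steps" just mentioned (so the assumption is that the fire pit is one move away and the finish line three moves away): walking to the finish line collects two living penalties plus the terminal reward, while stepping straight into the fire pit only collects its own minus one,

    2 \times (-2.0) + 1.0 = -3.0 \quad\text{vs.}\quad -1.0

so ending the episode immediately really does have the better total under a penalty this large.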
115
00:08:49,190 --> 00:08:56,770
So you can see that when we add this living reward, and depending on the value of the living reward we're
116
00:08:56,780 --> 00:08:59,270
adding, the results are going to be different.
117
00:08:59,270 --> 00:09:06,290
And the agent is going to select different policies, and that's basically how the reward value
118
00:09:06,440 --> 00:09:12,020
is incorporated by the Bellman equation, even when it's not just at the finish line or at the
119
00:09:12,020 --> 00:09:13,790
end of the game but even throughout the game.
120
00:09:13,790 --> 00:09:19,250
And once again, it doesn't have to be in every single state. Depending on the environment
121
00:09:19,250 --> 00:09:20,180
itself,
122
00:09:20,180 --> 00:09:26,540
it might be given to the agent at certain specific states, not at every state. But in our simplistic example
123
00:09:26,540 --> 00:09:29,880
we're just using rewards at every given state.
124
00:09:30,050 --> 00:09:34,470
That's just to illustrate this concept. So I hope you enjoyed today's tutorial.
125
00:09:34,580 --> 00:09:40,550
And as you can see, we've already made our Bellman equation quite sophisticated, and now it can be applied
126
00:09:40,550 --> 00:09:44,340
to many different scenarios, and I can't wait to see you in the next tutorial.
127
00:09:44,360 --> 00:09:46,200
And until then, enjoy AI.