1
00:00:00,940 --> 00:00:04,150
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,150 --> 00:00:09,070
All right, so I hope you're enjoying the tutorials so far. We're nearly done with the intuition; you'll soon,
3
00:00:09,070 --> 00:00:13,390
very soon, get to the practical side of things. We've just got a few little things that we need to cover.
4
00:00:13,510 --> 00:00:20,320
All right, so previously we talked about how we add neural networks into this whole equation of Q-learning
5
00:00:20,350 --> 00:00:25,360
and take Q-learning to the next step and turn it into deep Q-learning.
6
00:00:25,690 --> 00:00:33,130
And today we're going to add an extra important feature, which we will be coding in the practical side
7
00:00:33,130 --> 00:00:39,100
of things, so Hadelin and I decided that it's important for us to cover it off in the intuition side
8
00:00:39,100 --> 00:00:42,430
of things, so that you're more prepared for it when it comes up in the coding side of things.
9
00:00:42,430 --> 00:00:47,950
So as we discussed, we've got the network there; there's two parts that happen.
10
00:00:47,950 --> 00:00:53,110
First of all, there's the learning, so the network actually learns with every new state.
11
00:00:53,270 --> 00:00:58,870
It slowly updates its weights to get better and better and better at dealing with this environment.
12
00:00:58,870 --> 00:01:06,910
And then there is the acting in the state; so after the Q values have been calculated in the state, then
13
00:01:06,970 --> 00:01:08,220
one is selected.
14
00:01:08,230 --> 00:01:14,800
So today we're still going to talk about the learning part. We're going to look at an interesting
15
00:01:14,800 --> 00:01:20,050
feature that's going to help. We didn't come up with this feature ourselves, but we'll talk about
16
00:01:20,080 --> 00:01:29,690
a feature that is very important for deep Q-learning, and that feature is called experience replay.
17
00:01:29,710 --> 00:01:30,030
All right.
18
00:01:30,040 --> 00:01:34,570
So here is our network so we've just copied it over here.
19
00:01:34,570 --> 00:01:39,000
We've got the loss that is calculated at the bottom and is backpropagated through the network.
20
00:01:39,100 --> 00:01:44,770
And let's have a look at an example of what happens to understand the problem that we're dealing with
21
00:01:44,770 --> 00:01:45,670
a bit better.
22
00:01:45,670 --> 00:01:49,120
So here is an example actually from this course.
23
00:01:49,120 --> 00:01:54,820
This is a screenshot exactly from this course; this is what you will be programming.
24
00:01:54,820 --> 00:02:02,170
This is a self-driving car that is driving along this road, and it has to learn
25
00:02:02,170 --> 00:02:03,780
how to navigate this road.
26
00:02:03,820 --> 00:02:09,290
And so, as we discussed previously, what is this state?
27
00:02:09,320 --> 00:02:15,850
And of course the state is not going to be just x1 and x2. Hadelin will describe in a lot more detail what
28
00:02:15,850 --> 00:02:23,650
the state is; it is going to be a couple of parameters which relate to the angle of the car and some
29
00:02:23,650 --> 00:02:26,490
parameters related to what the sensors are reading and so on.
30
00:02:26,490 --> 00:02:29,820
So there's going to be more parameters than that to describe the state.
31
00:02:29,830 --> 00:02:34,120
But nevertheless, it's going to be a vector of values that goes through a neural network, and then on
32
00:02:34,120 --> 00:02:36,520
the output you're going to have some Q values.
33
00:02:36,520 --> 00:02:39,850
Again there'll be a difference depending on the environment.
34
00:02:39,850 --> 00:02:44,380
There can be a different number of possible actions.
35
00:02:44,460 --> 00:02:49,660
But for simplicity's sake we're just going to leave it at four, just for us to be able to understand better
36
00:02:49,660 --> 00:02:50,830
what's going on here.
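To make this concrete, here is a minimal sketch of such a network, assuming PyTorch as used in the practical part of this course; the class name, the layer sizes, and the input size of five are illustrative assumptions, not the course code.

    import torch
    import torch.nn as nn

    class SimpleQNetwork(nn.Module):
        def __init__(self, state_size=5, action_size=4):
            super().__init__()
            # the state is just a vector of values (orientation, sensor readings, ...)
            self.fc1 = nn.Linear(state_size, 30)
            # one output per possible action: the Q value of taking it in this state
            self.fc2 = nn.Linear(30, action_size)

        def forward(self, state):
            hidden = torch.relu(self.fc1(state))
            return self.fc2(hidden)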
37
00:02:50,830 --> 00:02:55,710
So in this case, the question is: what are
38
00:02:55,730 --> 00:03:03,510
the inputs into this neural network, or more specifically, how often do we trigger this neural network?
39
00:03:03,520 --> 00:03:05,080
How often does this neural network run?
40
00:03:05,110 --> 00:03:11,410
Well, every time the car ends up in a new state. So the car makes a move, it ends up in a new state, and
41
00:03:11,530 --> 00:03:12,650
then everything goes:
42
00:03:12,670 --> 00:03:17,410
all that data, all that information about the state goes through the network and gives us the calculated
43
00:03:17,650 --> 00:03:18,200
errors.
44
00:03:18,280 --> 00:03:22,960
This error is calculated based on what we discussed in previous tutorials.
45
00:03:22,990 --> 00:03:26,080
This is backpropagated through, and the weights are updated.
46
00:03:26,080 --> 00:03:32,570
Then the car selects which action it's going to take, makes that move, and ends up in a new state.
47
00:03:32,590 --> 00:03:34,390
Everything starts over again.
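As a rough sketch of that per-state loop (the names here are illustrative, and PyTorch is assumed; how the target Q value and the action selection are computed was covered in the previous tutorials):

    def on_new_state(network, optimizer, state, target_q, select_action):
        q_values = network(state)                   # forward pass on the new state
        loss = ((q_values - target_q) ** 2).mean()  # error against the Q-learning target
        optimizer.zero_grad()
        loss.backward()                             # backpropagate the error
        optimizer.step()                            # update the weights
        return select_action(q_values)              # act, which leads to the next state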
48
00:03:34,450 --> 00:03:39,880
And so basically this happens every time the car is in a new state. Well, have a look at this example.
49
00:03:39,880 --> 00:03:46,240
I specifically took this screenshot because it very well illustrates the problem that is
50
00:03:46,240 --> 00:03:51,430
addressed through experience replay. And experience replay is not just something that we use in this course
51
00:03:51,430 --> 00:03:52,730
or in this specific problem.
52
00:03:52,810 --> 00:03:57,190
It is something that you will see used throughout,
53
00:03:57,340 --> 00:04:04,480
over and over again, in artificial intelligence algorithms, because it is so powerful and
54
00:04:04,480 --> 00:04:05,140
it's so important.
55
00:04:05,140 --> 00:04:11,440
So look at this car. This car, in this problem or in this environment, its goal is to go from here
56
00:04:11,440 --> 00:04:12,440
to here and back.
57
00:04:12,440 --> 00:04:17,540
Its goal is to navigate its way from here to here without crossing these walls, which are made of sand.
58
00:04:17,790 --> 00:04:24,430
And so the car started over here, it went down, and its reward is based on, you know, how close it
59
00:04:24,430 --> 00:04:25,120
is to its goal.
60
00:04:25,120 --> 00:04:29,890
So the car went from here, it went down, and kept going like this, like this, like this, like this, along
61
00:04:29,890 --> 00:04:31,490
this wall, along this wall.
62
00:04:31,570 --> 00:04:34,990
And what is it going to do next? Is it going to turn, or is it going to keep going?
63
00:04:34,990 --> 00:04:37,450
Well, what we want it to do is keep going here.
64
00:04:37,690 --> 00:04:39,490
But let's think about it for a second.
65
00:04:39,580 --> 00:04:44,240
Once it got to this wall, every single time, it moves forward, it moves forward.
66
00:04:44,260 --> 00:04:48,570
It moves forward, it moves forward, moves forward, moves forward, and so on, it moves forward.
67
00:04:48,580 --> 00:04:53,320
So there might be, depending on the structure of the environment, like a hundred moves here, or
68
00:04:53,320 --> 00:04:54,710
50 moves here.
69
00:04:54,990 --> 00:04:59,100
It just keeps moving forward, forward, forward, forward, and nothing changes.
70
00:04:59,160 --> 00:05:03,310
Nothing really changes; it gets further away from the start and closer to its goal.
71
00:05:03,310 --> 00:05:04,060
That's lovely.
72
00:05:04,210 --> 00:05:09,990
But in terms of the surrounding environment, not many things are changing; it's still that same wall.
73
00:05:10,090 --> 00:05:15,460
If you are sitting in the car, you've probably seen the situation when you're driving and whatever you're
74
00:05:15,460 --> 00:05:21,220
seeing, like the environment, is so monotonous that you're just seeing kind of the same thing just
75
00:05:21,220 --> 00:05:21,840
passing by.
76
00:05:21,840 --> 00:05:26,680
Like, imagine you're driving through a desert and you're just seeing the same thing; it's the same
77
00:05:26,680 --> 00:05:29,100
sand, it's the same sand, nothing is happening.
78
00:05:29,100 --> 00:05:30,340
Nothing is changing.
79
00:05:30,550 --> 00:05:36,820
And so, every single time, we're putting that state, that new state, into here.
80
00:05:37,000 --> 00:05:42,010
Yes, of course, something might be changing: as you're driving the car, your GPS is showing you're
81
00:05:42,010 --> 00:05:43,530
closer to your destination.
82
00:05:43,540 --> 00:05:49,300
So one of these inputs is changing, but a lot of these other inputs, the sensors for instance which are
83
00:05:49,300 --> 00:05:55,850
on the car, they're not changing. And therefore, as you're driving along and you put this state, put the inputs,
84
00:05:55,850 --> 00:06:02,380
into your network, here, here, here, here, here, and here, all the time the inputs are pretty
85
00:06:02,380 --> 00:06:03,220
much the same.
86
00:06:03,250 --> 00:06:11,140
And so you keep inputting the same inputs, the same values in the vector, or very similar vectors, into your
87
00:06:11,140 --> 00:06:14,240
network, because there is no variety.
88
00:06:14,320 --> 00:06:16,840
The car will learn one thing very well:
89
00:06:16,870 --> 00:06:22,420
it'll learn very well how to drive along this wall, which is on its right. And so that's how
90
00:06:22,420 --> 00:06:27,970
the network will update, and it will slowly start getting rewarded for driving so well;
91
00:06:27,970 --> 00:06:28,570
it will be like,
92
00:06:28,580 --> 00:06:33,980
"OK, so from here I'll be learning, oh, I'm doing so good, I'm doing better, I'm doing even better."
93
00:06:34,050 --> 00:06:34,420
It'll,
94
00:06:34,480 --> 00:06:41,920
it will have this false perception that it's actually doing very well, even though it only learns
95
00:06:41,920 --> 00:06:47,560
how to drive along this wall. Its neural network will become very adapted to driving along this wall,
96
00:06:47,560 --> 00:06:51,100
and then all of a sudden there's this curve and the car doesn't know what to do.
97
00:06:51,310 --> 00:06:55,240
And it completely doesn't fit in with this neural network.
98
00:06:55,420 --> 00:07:01,870
And even if it does, let's hypothetically say it somehow passes this spot, and then it ends up on this
99
00:07:01,870 --> 00:07:02,250
wall.
100
00:07:02,260 --> 00:07:05,320
The same thing is going to happen: it's going to drive from here, here, here.
101
00:07:05,320 --> 00:07:10,870
OK, now the neural network is restructuring itself to adapt to this wall, and then, bam, this thing happens.
102
00:07:10,900 --> 00:07:15,880
And then even if somehow it gets past that, it will drive past this thing, and then the same thing happens along
103
00:07:15,880 --> 00:07:16,260
these lines.
104
00:07:16,260 --> 00:07:23,590
So basically this is a very vivid example of the problem that we have, which is that, because of the
105
00:07:23,590 --> 00:07:29,770
way we're using the neural network, updating it every single state, we get lots of consecutive states.
106
00:07:29,770 --> 00:07:36,490
They don't even have to be the same, but in environments it's normal that consecutive states
107
00:07:36,880 --> 00:07:44,950
are somehow correlated or somehow interdependent, and we don't want that interdependency to bias
108
00:07:44,980 --> 00:07:45,550
our network.
109
00:07:45,550 --> 00:07:52,600
We don't want the car to just learn how to drive along, like, a straight line, or along a curved line, or
110
00:07:54,100 --> 00:08:01,750
anything that you can think of in life where an agent would be navigating an environment
111
00:08:01,780 --> 00:08:10,570
where we can think of correlated or interdependent states that come one after another, and that can really mess
112
00:08:10,630 --> 00:08:12,130
up your neural network.
113
00:08:12,190 --> 00:08:15,270
That is, if you're just going to let the agent learn from that.
114
00:08:15,430 --> 00:08:17,600
And that's where experience replay comes in.
115
00:08:17,620 --> 00:08:24,850
What happens in experience replay is: these experiences, so these states that it's in, one, two, three,
116
00:08:24,850 --> 00:08:31,040
however many, 50 states here, they don't get put through the network right away.
117
00:08:31,350 --> 00:08:35,980
They are actually saved into the memory of the agent.
118
00:08:36,160 --> 00:08:41,440
And so, for instance, it saves all these, and saves all these, and at some point, once it reaches a certain
119
00:08:41,590 --> 00:08:44,940
threshold, which you'll be able to code, and Hadelin will show you how to do that.
120
00:08:45,100 --> 00:08:51,310
Once it reaches a certain threshold, then the agent decides for itself: OK, it's time to learn.
121
00:08:51,310 --> 00:08:57,580
I have this batch of experiences, and I'm now going to learn from it. And so it randomly
122
00:08:57,580 --> 00:09:04,120
selects from a uniform distribution, and uniformity is important here, because that's something
123
00:09:04,240 --> 00:09:06,460
we'll talk about on the next slide.
124
00:09:06,820 --> 00:09:08,140
Well, we will mention that.
125
00:09:08,140 --> 00:09:12,400
But it takes a uniformly distributed sample.
126
00:09:12,460 --> 00:09:15,660
So basically all experiences are considered to be equal.
127
00:09:15,670 --> 00:09:23,410
It takes a uniformly distributed sample from that batch of experiences that it has and then it goes
128
00:09:23,410 --> 00:09:28,060
through them and learns from them. So it doesn't take all the experiences; it just takes uniformly distributed
129
00:09:28,060 --> 00:09:33,130
samples. It might take a couple from here, a couple from here, a couple from here. And each experience
130
00:09:33,130 --> 00:09:39,940
is characterized by the state it was in, the action that it took, the state it ended up in, and the reward
131
00:09:40,000 --> 00:09:47,110
it achieved through that action in that specific state. So there are four elements in each experience: state
132
00:09:47,110 --> 00:09:53,470
one, action, state two, and reward. And so it takes all those experiences and then it passes them through
133
00:09:53,470 --> 00:09:54,660
the network and it learns.
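A minimal sketch of such a memory, in plain Python; the class and method names are illustrative, not the course code. Each stored experience is the four-tuple just described, and sampling is uniform:

    import random

    class ReplayMemory:
        def __init__(self, capacity=100):
            self.capacity = capacity          # size of the rolling window
            self.memory = []

        def push(self, state, action, next_state, reward):
            self.memory.append((state, action, next_state, reward))
            if len(self.memory) > self.capacity:
                del self.memory[0]            # oldest experience gets kicked out

        def sample(self, batch_size):
            # uniform: every stored experience is equally likely to be picked
            return random.sample(self.memory, batch_size)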
134
00:09:54,660 --> 00:10:05,160
And that way it breaks the pattern of that bias which comes from the sequential nature of the experiences,
135
00:10:05,160 --> 00:10:08,110
which you would get if you were to put them through the network one after the other.
136
00:10:08,340 --> 00:10:11,930
So that's the main focus of experience replay.
137
00:10:11,930 --> 00:10:17,730
That's the problem it addresses. And another benefit of experience replay is that sometimes
138
00:10:17,730 --> 00:10:22,400
in an environment like this you might have very valuable rare experiences.
139
00:10:22,410 --> 00:10:28,340
So for instance, I don't know, let's say, let's look at this corner; this is a right corner.
140
00:10:28,440 --> 00:10:28,730
Right.
141
00:10:28,740 --> 00:10:30,880
And a very sharp one; it is sharp.
142
00:10:30,900 --> 00:10:35,640
So it'll be coming from here assuming it's going to be hugging this corner.
143
00:10:35,640 --> 00:10:40,500
So how many sharp right corners do we have in this whole environment? We're going to have one right corner
144
00:10:40,500 --> 00:10:43,410
here and one right corner here.
145
00:10:43,680 --> 00:10:46,240
Right, so when it's coming this way, that's a right corner.
146
00:10:46,380 --> 00:10:48,630
And then when it's going back it's a sharp right corner here.
147
00:10:48,620 --> 00:10:53,070
And this one's not sharp this way, only sharp the other way, so there's only one opportunity in the whole environment
148
00:10:53,640 --> 00:10:56,770
to learn from a sharp right corner.
149
00:10:56,970 --> 00:11:03,050
And that's a very important experience because it might get really good at driving along straight lines
150
00:11:03,060 --> 00:11:06,990
and get really good at doing, like, soft corners like that and like that.
151
00:11:07,170 --> 00:11:14,070
But then it'll keep messing up this sharp right corner, simply because it doesn't have
152
00:11:14,070 --> 00:11:18,070
that much opportunity to learn from it. And so therefore it will learn everything else very quickly, but
153
00:11:18,070 --> 00:11:20,180
it'll take a long time to learn the right corner.
154
00:11:20,180 --> 00:11:26,010
It's a very simplified example, a very simplified explanation, but it illustrates the concept that
155
00:11:26,280 --> 00:11:30,140
sometimes there are rare experiences which can be valuable.
156
00:11:30,270 --> 00:11:35,880
And if you're just doing a simple neural network where you're putting in your values here and you know
157
00:11:35,880 --> 00:11:40,950
they're going through and you know like even if you forget about that problem of the sequential nature
158
00:11:40,950 --> 00:11:45,690
of experiences and how they can be interdependent and correlated; let's even forget about that
159
00:11:45,680 --> 00:11:46,640
for a second.
160
00:11:46,800 --> 00:11:52,110
What happens is, once you put an experience in, it goes through the network, the network gets updated, and then you instantly
161
00:11:52,120 --> 00:11:53,370
forget about that experience.
162
00:11:53,370 --> 00:11:54,380
You move on to the next one.
163
00:11:54,420 --> 00:11:56,180
That's just how the neural network works.
164
00:11:56,220 --> 00:11:59,710
Then you move on to the next state, the next step, the next step, the next experience, the next experience, that
165
00:11:59,780 --> 00:12:01,170
experience and so on.
166
00:12:01,170 --> 00:12:06,180
So this right corner, as soon as it goes through the network, is gone, and you don't have any memory of that
167
00:12:06,510 --> 00:12:07,450
valuable experience.
168
00:12:07,560 --> 00:12:14,220
Whereas with experience replay, because you're putting these experiences into batches, you can organize
169
00:12:14,220 --> 00:12:19,920
your batch as a rolling window. So for instance you could have, like, a hundred experiences
170
00:12:19,920 --> 00:12:25,920
in your batch. So when it's coming back from here, as soon as it has recorded this experience
171
00:12:25,920 --> 00:12:27,380
in its batch.
172
00:12:27,390 --> 00:12:34,260
Then, like, at some point it takes a uniformly distributed sample from its batch of experiences, and then
173
00:12:34,260 --> 00:12:37,980
there's a rolling window so it forgets these experiences but then it keeps these experiences.
174
00:12:37,980 --> 00:12:44,160
And then again it learns: once it's here, it learns from this batch, and then once it's here, it forgets
175
00:12:44,280 --> 00:12:45,410
all the way up to here.
176
00:12:45,420 --> 00:12:50,550
But then it has a batch of experiences like that, so it can therefore now learn from these experiences.
177
00:12:50,730 --> 00:12:58,380
And that way what you are getting is that this right hand corner might come up several times in its
178
00:12:58,380 --> 00:13:03,480
learning process, because it was in the batch when the batch was like this, around there, then it was
179
00:13:03,480 --> 00:13:08,760
in the batch here, and about here, so it came up in several batches, because the batch might be updated
180
00:13:08,790 --> 00:13:11,430
as a rolling window of experience.
181
00:13:11,430 --> 00:13:15,630
So the older experiences get kicked out, the newer experiences are added, and then again older experiences
182
00:13:15,630 --> 00:13:16,290
get kicked out.
183
00:13:16,440 --> 00:13:23,040
So an experience stays in the batch for quite some time, and the car, or agent, can learn from that
184
00:13:23,040 --> 00:13:24,100
experience several times.
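Continuing the sketch above, the rolling window is what keeps a rare experience, such as the sharp right corner, available for several learning rounds (the threshold of 100 and the batch size of 32 below are illustrative assumptions):

    memory = ReplayMemory(capacity=100)
    # ... the car drives, pushing one experience per move into memory ...
    if len(memory.memory) >= 100:      # threshold reached: time to learn
        batch = memory.sample(32)      # the corner experience may be in this sample
    # a few moves later the window has shifted, but the corner experience is
    # still stored, so a later call to sample(32) can pick it again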
185
00:13:24,210 --> 00:13:27,430
So that's another advantage of experience replay.
186
00:13:27,570 --> 00:13:33,480
And of course the final advantage is experience replay gives you an opportunity to learn from more experiences
187
00:13:34,220 --> 00:13:39,290
than if you're just learning from one at a time, because you have that batch, and it's a
188
00:13:39,300 --> 00:13:46,710
rolling window, and therefore even if your environment is limited in experiences, your experience replay
189
00:13:46,710 --> 00:13:49,260
approach can help you learn faster.
190
00:13:49,410 --> 00:13:55,230
And instead of just redoing them many, many, many times, you can learn faster, because you don't have
191
00:13:55,230 --> 00:13:55,710
to redo it.
192
00:13:55,710 --> 00:13:57,440
You have those experiences saved.
193
00:13:57,810 --> 00:13:59,850
So those are the main advantages of experience replay.
194
00:13:59,910 --> 00:14:01,760
Let's recap on that.
195
00:14:01,840 --> 00:14:09,280
We're breaking that pattern of interdependence and correlation of sequential experiences; we save rare
196
00:14:09,280 --> 00:14:15,640
experiences which might be important, and therefore we can learn from them more often; and we can learn in
197
00:14:16,090 --> 00:14:21,260
environments, we can learn faster in environments where
198
00:14:21,520 --> 00:14:27,310
we have a shortage of experiences, which don't have that many experiences that the agent goes through,
199
00:14:27,310 --> 00:14:29,180
and still we are able to learn there.
200
00:14:29,380 --> 00:14:32,470
So that is what experience replay is all about.
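Putting the two sketches together, here is one hedged example of a learning step that uses the replay memory instead of the latest state alone; the function name and the discount factor gamma are assumptions for illustration, not the course code:

    def learn_from_replay(network, optimizer, memory, batch_size=32, gamma=0.9):
        for state, action, next_state, reward in memory.sample(batch_size):
            q_values = network(state)
            # Q-learning target: reward plus discounted best Q value of the next state
            target = reward + gamma * network(next_state).max().detach()
            loss = (q_values[action] - target) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()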
201
00:14:32,470 --> 00:14:34,530
If you'd like to read a bit more on this,
202
00:14:34,630 --> 00:14:41,290
there's an interesting article published by DeepMind in 2016; it's called Prioritized Experience Replay,
203
00:14:41,560 --> 00:14:44,380
and it talks about why,
204
00:14:44,410 --> 00:14:50,860
why are we using a uniform distribution to select our experiences from the experience batch? Why don't
205
00:14:50,860 --> 00:14:55,870
we find a better way to select our experiences and prioritize some of the experiences which we feel
206
00:14:55,870 --> 00:14:57,160
are important?
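As a hedged sketch of that idea, and not the paper's exact method: give each stored experience a priority, for example based on how surprising its error was, and sample in proportion to it instead of uniformly. The helper below is purely illustrative:

    import random

    def prioritized_sample(experiences, priorities, batch_size):
        # higher priority means more likely to be replayed; sampled with replacement
        return random.choices(experiences, weights=priorities, k=batch_size)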
207
00:14:57,220 --> 00:15:03,880
It's quite an interesting thing, and in this case you will be able to not only
208
00:15:03,880 --> 00:15:11,800
reinforce your knowledge on experience replay but you will actually be able to move with the cutting
209
00:15:11,800 --> 00:15:12,660
edge of technology.
210
00:15:12,660 --> 00:15:15,120
So this is from 2016 and published by DeepMind.
211
00:15:15,120 --> 00:15:21,580
It's a very recent, very powerful paper, so you'll be able to actually explore the limits, or explore even
212
00:15:21,580 --> 00:15:24,530
further this algorithm and take it to the next level.
213
00:15:24,550 --> 00:15:31,270
So I'll leave it up to you to find out why and how we can change the uniform sampling to a different approach
214
00:15:31,270 --> 00:15:33,810
to experience replay from this paper if you'd like.
215
00:15:33,940 --> 00:15:35,270
And I hope you enjoyed this
216
00:15:35,270 --> 00:15:41,020
tutorial. Now we know what experience replay is, and we can confidently use it in our practical tutorials,
217
00:15:41,440 --> 00:15:42,860
and I look forward to seeing you next time.
218
00:15:42,940 --> 00:15:44,550
Until then enjoy AI.