007 What is reinforcement learning (English subtitle transcript)

Hello and welcome back to the course on artificial intelligence. I hope you're excited about today's tutorial, because we are taking our very first step into the world of AI: today we're talking about reinforcement learning. It's a very important topic, because it will underpin everything else that's going to happen in this course.

So let's get started. Here we've got a little maze, and this maze is our representation of an environment. That's what we're going to be dealing with in this course: certain environments in which our artificial intelligence is going to be performing, taking actions, and looking to win. And here we've got an agent. The agent is our artificial intelligence, the mind that's going to be navigating these environments and learning from the feedback the environment gives it in order to perform certain actions.

The way it works is that the agent performs certain actions in the environment, and as a result the state it is in will change. It might be further or closer, more to the left or more to the right, or different in whatever other parameters describe its state. So the state changes because of the action the agent takes, and the agent also gets a reward based on that action. Every time it takes an action, the state changes and it gets a reward. Bear in mind that sometimes an action won't change the state, or there won't be a reward for taking that action, which is in effect a reward of zero. Nevertheless, the agent keeps doing the same thing: taking actions, changing the state, getting rewards, taking actions, changing the state, getting rewards. Through that process it learns: it explores the environment and comes to understand which actions lead to good rewards and favorable states, and which actions lead to bad rewards and unfavorable states.

This is a very simplistic representation of a very general problem. If you think about it, environments don't actually have to be mazes. It's not just about getting out of a maze or finding a treasure in a maze.
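To make that action-state-reward loop concrete, here is a minimal sketch in Python, the language this course's PyTorch implementations use. The `MazeEnv` class and `choose_action` function are hypothetical illustrations of mine, not code from the course; they just show the cycle of acting, the state changing, and a reward arriving, including the case where an action changes nothing and earns nothing.

```python
# Minimal sketch of the agent-environment loop (hypothetical, not course code).
import random

class MazeEnv:
    """Toy 1-D corridor: the agent starts at cell 0 and wins at cell 4."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is +1 (move right) or -1 (move left)
        new_state = min(max(self.state + action, 0), 4)
        reward = 1.0 if new_state == 4 else 0.0  # reward only at the goal
        # Note: pushing left at cell 0 changes nothing, so an action
        # may leave the state unchanged and earn no reward.
        self.state = new_state
        done = new_state == 4
        return new_state, reward, done

def choose_action(state):
    return random.choice([-1, +1])  # a purely exploring agent, for now

env = MazeEnv()
state, done = env.state, False
while not done:
    action = choose_action(state)            # the agent acts...
    state, reward, done = env.step(action)   # ...the state changes, a reward arrives
```

Everything the agent will ever learn from is the stream of (state, action, reward, next state) experiences flowing through a loop like this.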
An environment can be pretty much anything in life. Imagine waking up in the morning and cooking an omelette. In order to make that omelette you need to go through certain steps: get the salt, get the eggs, get the frying pan, switch the fire on, and so on. It sounds like a routine, mundane thing, but it has only become routine because you've done it so many times. In reality it's an environment in which you're performing certain actions: you're putting the fire on, putting the frying pan on the fire, putting the eggs into the frying pan, putting some salt on the eggs, turning it over, and so on. As you can see, there are certain actions which are taken in certain states, and those actions lead to certain other states and sometimes to rewards. For instance, if you put the fire on, take the action of waiting, wait far too long, and only then put the eggs into the frying pan, the reward is going to be very negative: it's all going to burn. On the other hand, you can do all the correct actions at the correct times, and it's very important to understand that actions should be taken at the correct points in time. Putting the salt in the frying pan before you put the eggs in might not be the best idea; you might want to take that action of putting the salt in after the eggs are in there, that is, in a different state. That's important to remember. And if you take all the correct actions, in the correct order, in the correct states, your final reward could be that you get an omelette which you can eat. It's a very basic activity in your life, but if you think about it, it is actually an environment, and you are the agent going through that environment and performing a task. You don't really need to learn anything, because you already know it pretty well; but at the same time you could learn how to make a better omelette, and especially if it's your first omelette, you're probably going to mess it up. You will learn from that, because you will understand which actions lead to which states and rewards.
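As a toy illustration of that point about ordering (my own example, not from the course), here is how a reward function can make the very same action good in one state and bad in another. The state flags such as `eggs_in_pan` are invented for this sketch.

```python
# Hypothetical sketch: the same action earns different rewards in
# different states, which is why the ORDER of actions matters.

def reward(state, action):
    if action == "add_salt":
        # Salting an empty pan is a mistake; salting the eggs is fine.
        return 1.0 if "eggs_in_pan" in state else -1.0
    if action == "add_eggs":
        # Eggs added after waiting too long will burn.
        return -1.0 if "waited_too_long" in state else 1.0
    return 0.0

print(reward({"eggs_in_pan"}, "add_salt"))  # 1.0: right action, right state
print(reward(set(), "add_salt"))            # -1.0: same action, wrong state
```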
And it's the same for anything else in life. Even trading on the stock market, buying and selling and getting feedback from the market in the sense of positive or negative returns: that's also an environment, and you're participating in that environment as an agent. Driving a car is also an environment, where you can turn the steering wheel, accelerate, brake, and so on, and you're getting feedback from the environment. One of those feedbacks is a policeman giving you a speeding fine if you're going above the allowed speed limit on that highway, and from that you learn that this is not something you should do, because it leads to a negative reward. So rewards don't have to come just at the very end of the process; they can come throughout the journey, throughout the process.

Those are a couple of examples. In terms of AI, the simplest way to think of reinforcement learning is like training a dog. When you train a dog, you give it certain commands, and if it obeys those commands you give it a treat, like a biscuit or something. If it doesn't obey those commands, you tell it that it's a bad dog, or you just don't give it a treat. Through that process it learns what it needs to do, what action it needs to take in certain states, where the states are the commands you're giving it; and based on that it gets certain rewards. Of course, in the world of AI it's not that complex. You don't have to hand out treats, you don't have to carry a bag of biscuits with you every time; you just give the agent a plus one or a minus one. It's a huge advantage that in the world of AI we've created these agents ourselves, so the rewards we're giving them don't actually exist: they're just a plus one and a minus one, or a one and a zero, or something like that. It's all nonexistent, imaginary stuff, but at the same time it leads to great results. We can create these amazing artificial intelligences just by providing rewards that don't really exist. A plus one or a minus one doesn't cost anything, but at the same time it really delivers results. So it's very similar to the real world, for example with dogs, except that here the rewards are digital, just numbers.

With that in mind, we can talk about robot dogs. I love this example. This is just an illustrative picture, and not necessarily that exact robot dog was trained through reinforcement learning; some of the robot dogs, especially the older ones, would simply have an algorithm in there.
And this is actually a good example of the difference between preprogrammed agents and reinforcement learning agents. You could have a robot dog which is preprogrammed with how to walk. The algorithm behind the dog, the software, will say: OK, in order to walk, you move your front left leg forward, then your back right leg forward, then your front right leg forward, then your back left leg forward, and you repeat those actions. That definition of walking is a function inside the dog, and it might also have functions for how to sit, how to stand, and things like that.

Whereas with a robot dog that is trained through reinforcement learning, what happens is that you don't preprogram it. This is the key concept to everything here: you don't have any algorithm hard-coded into the dog. Instead you have a reinforcement learning algorithm, which we'll be discussing later on, and it is told: OK, the goal is to get from where you are now, not knowing anything, to the end of the room, for example, and here are the actions you can take. You can move your front right foot, your front left foot, your back right foot, your back left foot; here are all the degrees of freedom, you can move like this or like that, so it's like a list of actions you can take. And your rewards are: every time you take a step forward you get a plus one, and every time you fall over you get a minus one. That's all there is to it. Then they just leave the dog and let it figure it out on its own.

So the dog tries to stand up, it falls, and then it realizes: OK, I shouldn't take the action that led to me falling, because every time I fall I get a minus one, which is not good for me. So it takes the other action, the one that helped it stand up, and then it just experiments, experiments, experiments, tries things randomly, and figures out that it can make a step forward by moving its front right foot. It gets a plus one and realizes: oh, I should do more of that. So it learns that it should do more of this and less of that, and through this learning process it very quickly understands how it can walk.
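As a hedged preview of how "more of this, less of that" can be written down, here is a tabular Q-learning sketch. Q-learning isn't named in this tutorial, and this is not any real robot dog's training code (real robot training uses much richer methods); the state names, action names, and constants are all invented for illustration.

```python
# A sketch of learning from +1/-1 rewards via tabular Q-learning
# (an assumed technique, shown here only as a preview).
import random
from collections import defaultdict

ACTIONS = ["front_left", "front_right", "back_left", "back_right"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration

Q = defaultdict(float)  # Q[(state, action)]: learned value of acting that way

def update(state, action, reward, next_state):
    # Nudge the value of (state, action) toward: reward now, plus the
    # discounted best value obtainable from the next state.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def act(state):
    # Mostly exploit what has earned plus ones; sometimes explore randomly.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# One experience: stepping the front right foot forward earned a +1,
# so that action becomes more likely in the "standing" state.
update("standing", "front_right", +1.0, "stepped_forward")
```

The update pulls an action's value toward the reward it just produced plus the best value available afterwards, so actions that earn plus ones get chosen more often and actions that end in falls get avoided.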
And those dogs that figure it out on their own can actually sometimes walk better than dogs that are preprogrammed, because with preprogrammed behaviour we look at real-life dogs or use our own imagination of how to do it, whereas a reinforcement learning dog can optimize things on its own, and in AI that can sometimes get even better results. That's also how they can train these same robot dogs to play soccer. You can't train a normal dog to play soccer, because the whole approach is simply different; it's not something a normal dog has been trained to do, or has ever done in the process of its evolution. Whereas reinforcement learning robot dogs can quite easily understand how to play soccer, as long as you tell them what the rewards are, what the goals are, and what the possible actions are that they can take.

So that is how reinforcement learning works in general; that's a quick overview of reinforcement learning. I hope that got you very excited about what's going to come next, because it's a completely different world compared to preprogrammed, hard-coded solutions where you have the if-else conditions. This is very different, and we're going to be talking more about that.

In the meantime, we've got some additional reading for you. If you'd like some supporting materials, here's a great article series which you can look into. It's called Simple Reinforcement Learning with Tensorflow, by Arthur Juliani (2016); it's got ten parts, and you'll find the full clickable link in the course resources. You can follow along with this course and also get additional information from those articles. But bear in mind that the series uses TensorFlow, whereas in this course we are using PyTorch, so the implementations are different; at the same time, you might pick up a few things here and there that supplement the learning we're going to be doing in this course. So it's a great series to follow, and even if you're not sure about following all of it, check out that first part and see if you'd like to read a bit more.
Then, specific to this tutorial about reinforcement learning, there's Richard Sutton's Reinforcement Learning: An Introduction. It's from 1998, so quite old, but at the same time you can learn a lot about reinforcement learning from it: some of the examples, like that omelette example, other examples of where reinforcement learning can be applied, and just a general overview of reinforcement learning, if you're looking for some additional reading.

And on that note, we're going to wrap up this tutorial. Can't wait to see you next time, and until then: enjoy AI.
