1
00:00:00,000 --> 00:00:07,040
This course from Harvard University explores the concepts and algorithms at the foundation of modern
2
00:00:07,040 --> 00:00:13,920
artificial intelligence, diving into the ideas that give rise to technologies like game-playing
3
00:00:13,920 --> 00:00:20,560
engines, handwriting recognition, and machine translation. You'll gain exposure to the theory
4
00:00:20,560 --> 00:00:26,880
behind graph search algorithms, classification, optimization, reinforcement learning,
5
00:00:26,880 --> 00:00:33,200
and other topics in artificial intelligence and machine learning. Brian Yu teaches this course.
6
00:00:33,200 --> 00:00:56,640
Hello, world. This is CS50, and this is an introduction to artificial intelligence with
7
00:00:56,640 --> 00:01:02,960
Python with CS50's own Brian Yu. This course picks up where CS50 itself leaves off and explores the
8
00:01:02,960 --> 00:01:06,480
concepts and algorithms at the foundation of modern AI.
9
00:01:06,480 --> 00:01:10,080
We'll start with a look at how AI can search for solutions to problems,
10
00:01:10,080 --> 00:01:12,800
whether those problems are learning how to play a game or trying
11
00:01:12,800 --> 00:01:15,120
to find driving directions to a destination.
12
00:01:15,120 --> 00:01:19,040
We'll then look at how AI can represent information, not only knowledge that our AI
13
00:01:19,040 --> 00:01:23,440
is certain about, but also information and events about which our AI might be uncertain,
14
00:01:23,440 --> 00:01:26,240
learning how to represent that information, but more importantly,
15
00:01:26,240 --> 00:01:30,240
how to use that information to draw inferences and new conclusions as well.
16
00:01:30,240 --> 00:01:33,760
We'll explore how AI can solve various types of optimization problems,
17
00:01:33,760 --> 00:01:38,400
trying to maximize profits or minimize costs or satisfy some other constraints
18
00:01:38,400 --> 00:01:41,840
before turning our attention to the fast-growing field of machine learning,
19
00:01:41,840 --> 00:01:45,440
where we won't tell our AI exactly how to solve a problem, but instead,
20
00:01:45,440 --> 00:01:48,080
give our AI access to data and experiences
21
00:01:48,080 --> 00:01:52,000
so that our AI can learn on its own how to perform these tasks.
22
00:01:52,000 --> 00:01:55,440
In particular, we'll look at neural networks, one of the most popular tools
23
00:01:55,440 --> 00:01:59,760
in modern machine learning, inspired by the way that human brains learn and reason as well
24
00:01:59,760 --> 00:02:03,200
before finally taking a look at the world of natural language processing
25
00:02:03,200 --> 00:02:06,800
so that it's not just us humans learning how artificial intelligence is
26
00:02:06,800 --> 00:02:11,840
able to speak, but also AI learning how to understand and interpret human language as well.
27
00:02:11,840 --> 00:02:14,720
We'll explore these ideas and algorithms, and along the way,
28
00:02:14,720 --> 00:02:19,520
give you the opportunity to build your own AI programs to implement all of this and more.
29
00:02:19,520 --> 00:02:44,000
This is CS50.
30
00:02:44,000 --> 00:02:44,560
All right.
31
00:02:44,560 --> 00:02:48,080
Welcome, everyone, to an introduction to artificial intelligence with Python.
32
00:02:48,080 --> 00:02:50,240
My name is Brian Yu, and in this class, we'll
33
00:02:50,240 --> 00:02:53,200
explore some of the ideas and techniques and algorithms
34
00:02:53,200 --> 00:02:56,240
that are at the foundation of artificial intelligence.
35
00:02:56,240 --> 00:03:00,000
Now, artificial intelligence covers a wide variety of types of techniques.
36
00:03:00,000 --> 00:03:01,920
Anytime you see a computer do something that
37
00:03:01,920 --> 00:03:05,120
appears to be intelligent or rational in some way,
38
00:03:05,120 --> 00:03:07,360
like recognizing someone's face in a photo,
39
00:03:07,360 --> 00:03:09,840
or being able to play a game better than people can,
40
00:03:09,840 --> 00:03:12,960
or being able to understand human language when we talk to our phones
41
00:03:12,960 --> 00:03:16,080
and they understand what we mean and are able to respond back to us,
42
00:03:16,080 --> 00:03:19,760
these are all examples of AI, or artificial intelligence.
43
00:03:19,760 --> 00:03:24,320
And in this class, we'll explore some of the ideas that make that AI possible.
44
00:03:24,320 --> 00:03:28,000
So we'll begin our conversations with search: the problem where we have an AI,
45
00:03:28,000 --> 00:03:32,160
and we would like the AI to be able to search for solutions to some kind of problem,
46
00:03:32,160 --> 00:03:33,520
no matter what that problem might be.
47
00:03:33,520 --> 00:03:37,040
Whether it's trying to get driving directions from point A to point B,
48
00:03:37,040 --> 00:03:40,080
or trying to figure out how to play a game, given a tic-tac-toe game,
49
00:03:40,080 --> 00:03:43,040
for example, figuring out what move it ought to make.
50
00:03:43,040 --> 00:03:45,520
After that, we'll take a look at knowledge.
51
00:03:45,520 --> 00:03:48,560
Ideally, we want our AI to be able to know information,
52
00:03:48,560 --> 00:03:51,440
to be able to represent that information, and more importantly,
53
00:03:51,440 --> 00:03:53,680
to be able to draw inferences from that information,
54
00:03:53,680 --> 00:03:57,760
to be able to use the information it knows and draw additional conclusions.
55
00:03:57,760 --> 00:04:02,240
So we'll talk about how AI can be programmed in order to do just that.
56
00:04:02,240 --> 00:04:04,400
Then we'll explore the topic of uncertainty,
57
00:04:04,400 --> 00:04:08,160
talking about ideas of what happens if a computer isn't sure about a fact,
58
00:04:08,160 --> 00:04:11,120
but maybe is only sure with a certain probability.
59
00:04:11,120 --> 00:04:13,520
So we'll talk about some of the ideas behind probability,
60
00:04:13,520 --> 00:04:16,640
and how computers can begin to deal with uncertain events
61
00:04:16,640 --> 00:04:20,960
in order to be a little bit more intelligent in that sense as well.
62
00:04:20,960 --> 00:04:23,680
After that, we'll turn our attention to optimization,
63
00:04:23,680 --> 00:04:27,280
problems where the computer is trying to optimize for some sort of goal,
64
00:04:27,280 --> 00:04:29,840
especially in a situation where there might be multiple ways
65
00:04:29,840 --> 00:04:33,120
that a computer might solve a problem, but we're looking for a better way,
66
00:04:33,120 --> 00:04:36,640
or potentially the best way, if that's at all possible.
67
00:04:36,640 --> 00:04:39,680
Then we'll take a look at machine learning, or learning more generally,
68
00:04:39,680 --> 00:04:41,920
and looking at how, when we have access to data,
69
00:04:41,920 --> 00:04:45,360
our computers can be programmed to be quite intelligent by learning from data
70
00:04:45,360 --> 00:04:48,880
and learning from experience, being able to perform a task better and better
71
00:04:48,880 --> 00:04:50,880
based on greater access to data.
72
00:04:50,880 --> 00:04:54,080
So your email, for example, where your email inbox somehow knows
73
00:04:54,080 --> 00:04:57,360
which of your emails are good emails and which of your emails are spam.
74
00:04:57,360 --> 00:05:01,920
These are all examples of computers being able to learn from past experiences
75
00:05:01,920 --> 00:05:03,600
and past data.
76
00:05:03,600 --> 00:05:05,520
We'll take a look, too, at how computers are
77
00:05:05,520 --> 00:05:08,320
able to draw inspiration from human intelligence,
78
00:05:08,320 --> 00:05:10,160
looking at the structure of the human brain,
79
00:05:10,160 --> 00:05:13,920
and how neural networks can be a computer analog to that sort of idea,
80
00:05:13,920 --> 00:05:17,680
and how, by taking advantage of a certain type of structure of a computer program,
81
00:05:17,680 --> 00:05:21,040
we can write neural networks that are able to perform tasks very, very
82
00:05:21,040 --> 00:05:22,240
effectively.
83
00:05:22,240 --> 00:05:25,440
And then finally, we'll turn our attention to language, not programming
84
00:05:25,440 --> 00:05:28,320
languages, but human languages that we speak every day.
85
00:05:28,320 --> 00:05:31,280
And taking a look at the challenges that come about as a computer tries
86
00:05:31,280 --> 00:05:35,280
to understand natural language, and how some of the natural language
87
00:05:35,280 --> 00:05:39,120
processing that occurs in modern artificial intelligence actually
88
00:05:39,120 --> 00:05:40,400
works.
89
00:05:40,400 --> 00:05:43,520
But today, we'll begin our conversation with search, this problem
90
00:05:43,520 --> 00:05:47,040
of trying to figure out what to do when we have some sort of situation
91
00:05:47,040 --> 00:05:50,320
that the computer is in, some sort of environment that an agent is in,
92
00:05:50,320 --> 00:05:52,560
so to speak, and we would like for that agent
93
00:05:52,560 --> 00:05:56,480
to be able to somehow look for a solution to that problem.
94
00:05:56,480 --> 00:05:59,680
Now, these problems can come in any number of different types of formats.
95
00:05:59,680 --> 00:06:01,600
One example, for instance, might be something
96
00:06:01,600 --> 00:06:04,880
like this classic 15 puzzle with the sliding tiles that you might have seen,
97
00:06:04,880 --> 00:06:06,640
where you're trying to slide the tiles in order
98
00:06:06,640 --> 00:06:09,120
to make sure that all the numbers line up in order.
99
00:06:09,120 --> 00:06:12,000
This is an example of what you might call a search problem.
100
00:06:12,000 --> 00:06:15,600
The 15 puzzle begins in an initially mixed up state,
101
00:06:15,600 --> 00:06:18,400
and we need some way of finding moves to make in order
102
00:06:18,400 --> 00:06:20,960
to return the puzzle to its solved state.
103
00:06:20,960 --> 00:06:23,440
But there are similar problems that you can frame in other ways.
104
00:06:23,440 --> 00:06:25,600
Trying to find your way through a maze, for example,
105
00:06:25,600 --> 00:06:27,440
is another example of a search problem.
106
00:06:27,440 --> 00:06:31,040
You begin in one place, you have some goal of where you're trying to get to,
107
00:06:31,040 --> 00:06:34,320
and you need to figure out the correct sequence of actions that will take you
108
00:06:34,320 --> 00:06:36,880
from that initial state to the goal.
109
00:06:36,880 --> 00:06:38,880
And while this is a little bit abstract, any time
110
00:06:38,880 --> 00:06:40,880
we talk about maze solving in this class,
111
00:06:40,880 --> 00:06:43,440
you can translate it to something a little more real world.
112
00:06:43,440 --> 00:06:45,280
Something like driving directions.
113
00:06:45,280 --> 00:06:48,640
If you ever wonder how Google Maps is able to figure out what is the best way
114
00:06:48,640 --> 00:06:52,400
for you to get from point A to point B, and what turns to make at what time,
115
00:06:52,400 --> 00:06:56,720
depending on traffic, for example, it's often some sort of search algorithm.
116
00:06:56,720 --> 00:06:59,840
You have an AI that is trying to get from an initial position
117
00:06:59,840 --> 00:07:03,520
to some sort of goal by taking some sequence of actions.
118
00:07:03,520 --> 00:07:06,160
So we'll start our conversations today by thinking
119
00:07:06,160 --> 00:07:08,080
about these types of search problems and what
120
00:07:08,080 --> 00:07:11,680
goes into solving a search problem like this in order for an AI
121
00:07:11,680 --> 00:07:14,160
to be able to find a good solution.
122
00:07:14,160 --> 00:07:15,600
In order to do so, though, we're going to need
123
00:07:15,600 --> 00:07:19,120
to introduce a little bit of terminology, some of which I've already used.
124
00:07:19,120 --> 00:07:22,080
But the first term we'll need to think about is an agent.
125
00:07:22,080 --> 00:07:25,360
An agent is just some entity that perceives its environment.
126
00:07:25,360 --> 00:07:27,520
It somehow is able to perceive the things around it
127
00:07:27,520 --> 00:07:30,080
and act on that environment in some way.
128
00:07:30,080 --> 00:07:31,600
So in the case of the driving directions,
129
00:07:31,600 --> 00:07:34,400
your agent might be some representation of a car that
130
00:07:34,400 --> 00:07:36,640
is trying to figure out what actions to take in order
131
00:07:36,640 --> 00:07:38,160
to arrive at a destination.
132
00:07:38,160 --> 00:07:40,880
In the case of the 15 puzzle with the sliding tiles,
133
00:07:40,880 --> 00:07:43,280
the agent might be the AI or the person that
134
00:07:43,280 --> 00:07:46,720
is trying to solve that puzzle to try and figure out what tiles to move
135
00:07:46,720 --> 00:07:49,520
in order to get to that solution.
136
00:07:49,520 --> 00:07:52,160
Next, we introduce the idea of a state.
137
00:07:52,160 --> 00:07:56,560
A state is just some configuration of the agent in its environment.
138
00:07:56,560 --> 00:08:00,320
So in the 15 puzzle, for example, a state might be any one of these three.
139
00:08:00,320 --> 00:08:03,760
A state is just some configuration of the tiles.
140
00:08:03,760 --> 00:08:05,680
And each of these states is different and is
141
00:08:05,680 --> 00:08:08,240
going to require a slightly different solution.
142
00:08:08,240 --> 00:08:11,440
A different sequence of actions will be needed in each one of these
143
00:08:11,440 --> 00:08:15,120
in order to get from this initial state to the goal, which
144
00:08:15,120 --> 00:08:16,880
is where we're trying to get.
145
00:08:16,880 --> 00:08:18,640
So the initial state, then, what is that?
146
00:08:18,640 --> 00:08:21,520
The initial state is just the state where the agent begins.
147
00:08:21,520 --> 00:08:24,320
It is one such state where we're going to start from.
148
00:08:24,320 --> 00:08:27,440
And this is going to be the starting point for our search algorithm,
149
00:08:27,440 --> 00:08:28,160
so to speak.
150
00:08:28,160 --> 00:08:29,840
We're going to begin with this initial state
151
00:08:29,840 --> 00:08:32,960
and then start to reason about it, to think about what actions might we
152
00:08:32,960 --> 00:08:37,120
apply to that initial state in order to figure out how to get from the beginning
153
00:08:37,120 --> 00:08:42,080
to the end, from the initial position to whatever our goal happens to be.
154
00:08:42,080 --> 00:08:44,880
And how do we make our way from that initial position to the goal?
155
00:08:44,880 --> 00:08:47,440
Well, ultimately, it's via taking actions.
156
00:08:47,440 --> 00:08:50,880
Actions are just choices that we can make in any given state.
157
00:08:50,880 --> 00:08:54,400
And in AI, we're always going to try to formalize these ideas a little bit
158
00:08:54,400 --> 00:08:57,280
more precisely, such that we could program them a little bit more
159
00:08:57,280 --> 00:08:58,800
mathematically, so to speak.
160
00:08:58,800 --> 00:09:00,480
So this will be a recurring theme.
161
00:09:00,480 --> 00:09:04,240
And we can more precisely define actions as a function.
162
00:09:04,240 --> 00:09:07,680
We're going to effectively define a function called actions that takes an
163
00:09:07,680 --> 00:09:12,960
input, s, where s is going to be some state that exists inside of our environment.
164
00:09:12,960 --> 00:09:17,600
And actions of s is going to take the state as input and return as output
165
00:09:17,600 --> 00:09:22,000
the set of all actions that can be executed in that state.
166
00:09:22,000 --> 00:09:25,600
And so it's possible that some actions are only valid in certain states
167
00:09:25,600 --> 00:09:27,040
and not in other states.
168
00:09:27,040 --> 00:09:29,840
And we'll see examples of that soon, too.
169
00:09:29,840 --> 00:09:31,920
So in the case of the 15 puzzle, for example,
170
00:09:31,920 --> 00:09:35,600
there are generally going to be four possible actions that we can do most of
171
00:09:35,600 --> 00:09:36,160
the time.
172
00:09:36,160 --> 00:09:39,680
We can slide a tile to the right, slide a tile to the left, slide a tile up,
173
00:09:39,680 --> 00:09:41,680
or slide a tile down, for example.
174
00:09:41,680 --> 00:09:45,280
And those are going to be the actions that are available to us.
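As a concrete sketch of the ACTIONS function for the 15 puzzle: assuming a hypothetical encoding where the state is a tuple of 16 numbers in row-major order with 0 marking the blank square, the set of legal moves depends on where the blank happens to be:

```python
# A minimal sketch of ACTIONS(s) for the 15 puzzle. Hypothetical encoding:
# the state is a tuple of 16 numbers (row-major), with 0 standing in for the
# blank square. An "action" names the direction a neighboring tile slides.
def actions(state):
    """Return the set of actions that can be executed in `state`."""
    blank = state.index(0)
    row, col = divmod(blank, 4)
    moves = set()
    if row < 3:
        moves.add("up")     # tile below the blank slides up
    if row > 0:
        moves.add("down")   # tile above the blank slides down
    if col < 3:
        moves.add("left")   # tile right of the blank slides left
    if col > 0:
        moves.add("right")  # tile left of the blank slides right
    return moves
```

Note how a state with the blank in a corner permits only two of the four moves — exactly the point that some actions are only valid in certain states.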
175
00:09:45,280 --> 00:09:48,400
So somehow our AI, our program, needs some encoding
176
00:09:48,400 --> 00:09:51,600
of the state, which is often going to be in some numerical format,
177
00:09:51,600 --> 00:09:53,520
and some encoding of these actions.
178
00:09:53,520 --> 00:09:56,640
But it also needs some encoding of the relationship between these things.
179
00:09:56,640 --> 00:10:00,080
How do the states and actions relate to one another?
180
00:10:00,080 --> 00:10:04,000
And in order to do that, we'll introduce to our AI a transition model, which
181
00:10:04,000 --> 00:10:08,240
will be a description of what state we get after we perform some available
182
00:10:08,240 --> 00:10:10,800
action in some other state.
183
00:10:10,800 --> 00:10:12,960
And again, we can be a little bit more precise about this,
184
00:10:12,960 --> 00:10:17,200
define this transition model a little bit more formally, again, as a function.
185
00:10:17,200 --> 00:10:20,720
The function is going to be a function called result that this time takes two
186
00:10:20,720 --> 00:10:21,600
inputs.
187
00:10:21,600 --> 00:10:24,560
Input number one is s, some state.
188
00:10:24,560 --> 00:10:27,680
And input number two is a, some action.
189
00:10:27,680 --> 00:10:30,080
And the output of this result function
190
00:10:30,080 --> 00:10:36,320
is going to give us the state that we get after we perform action a in state s.
191
00:10:36,320 --> 00:10:39,840
So let's take a look at an example to see more precisely what this actually means.
192
00:10:39,840 --> 00:10:43,280
Here is an example of a state, of the 15 puzzle, for example.
193
00:10:43,280 --> 00:10:46,880
And here is an example of an action, sliding a tile to the right.
194
00:10:46,880 --> 00:10:50,160
What happens if we pass these as inputs to the result function?
195
00:10:50,160 --> 00:10:54,720
Again, the result function takes this board, this state, as its first input.
196
00:10:54,720 --> 00:10:57,120
And it takes an action as a second input.
197
00:10:57,120 --> 00:10:59,360
And of course, here, I'm describing things visually
198
00:10:59,360 --> 00:11:02,320
so that you can see visually what the state is and what the action is.
199
00:11:02,320 --> 00:11:04,720
In a computer, you might represent one of these actions
200
00:11:04,720 --> 00:11:06,960
as just some number that represents the action.
201
00:11:06,960 --> 00:11:08,720
Or if you're familiar with enums that allow
202
00:11:08,720 --> 00:11:10,400
you to enumerate multiple possibilities,
203
00:11:10,400 --> 00:11:11,760
it might be something like that.
204
00:11:11,760 --> 00:11:13,760
And this state might just be represented
205
00:11:13,760 --> 00:11:17,760
as an array or two-dimensional array of all of these numbers that exist.
206
00:11:17,760 --> 00:11:20,880
But here, we're going to show it visually just so you can see it.
207
00:11:20,880 --> 00:11:23,360
But when we take this state and this action,
208
00:11:23,360 --> 00:11:26,800
pass it into the result function, the output is a new state.
209
00:11:26,800 --> 00:11:30,080
The state we get after we take a tile and slide it to the right,
210
00:11:30,080 --> 00:11:32,000
and this is the state we get as a result.
211
00:11:32,000 --> 00:11:35,200
If we had a different action and a different state, for example,
212
00:11:35,200 --> 00:11:37,120
and pass that into the result function, we'd
213
00:11:37,120 --> 00:11:38,960
get a different answer altogether.
214
00:11:38,960 --> 00:11:41,280
So the result function needs to take care
215
00:11:41,280 --> 00:11:45,600
of figuring out how to take a state and take an action and get what results.
216
00:11:45,600 --> 00:11:48,320
And this is going to be our transition model that
217
00:11:48,320 --> 00:11:52,800
describes how it is that states and actions are related to each other.
218
00:11:52,800 --> 00:11:55,760
If we take this transition model and think about it more generally
219
00:11:55,760 --> 00:12:00,320
and across the entire problem, we can form what we might call a state space.
220
00:12:00,320 --> 00:12:03,520
The set of all of the states we can get from the initial state
221
00:12:03,520 --> 00:12:08,160
via any sequence of actions, by taking 0 or 1 or 2 or more actions.
222
00:12:08,160 --> 00:12:12,160
So we could draw a diagram that looks something like this, where
223
00:12:12,160 --> 00:12:15,920
every state is represented here by a game board, and there are arrows
224
00:12:15,920 --> 00:12:20,240
that connect every state to every other state we can get to from that state.
225
00:12:20,240 --> 00:12:23,280
And the state space is much larger than what you see just here.
226
00:12:23,280 --> 00:12:27,600
This is just a sample of what the state space might actually look like.
227
00:12:27,600 --> 00:12:29,840
And in general, across many search problems,
228
00:12:29,840 --> 00:12:33,680
whether they're this particular 15 puzzle or driving directions or something else,
229
00:12:33,680 --> 00:12:36,080
the state space is going to look something like this.
230
00:12:36,080 --> 00:12:40,480
We have individual states and arrows that are connecting them.
231
00:12:40,480 --> 00:12:42,560
And oftentimes, just for simplicity, we'll
232
00:12:42,560 --> 00:12:47,120
simplify our representation of this entire thing as a graph, some sequence
233
00:12:47,120 --> 00:12:50,080
of nodes and edges that connect nodes.
234
00:12:50,080 --> 00:12:52,640
But you can think of this more abstract representation
235
00:12:52,640 --> 00:12:54,160
as the exact same idea.
236
00:12:54,160 --> 00:12:56,320
Each of these little circles or nodes is going
237
00:12:56,320 --> 00:12:59,360
to represent one of the states inside of our problem.
238
00:12:59,360 --> 00:13:01,440
And the arrows here represent the actions
239
00:13:01,440 --> 00:13:04,320
that we can take in any particular state, taking us
240
00:13:04,320 --> 00:13:09,680
from one particular state to another state, for example.
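The state-space idea — every state reachable from the initial state, with arrows for the actions connecting them — can be built programmatically. Below is a sketch using a deliberately tiny toy problem (hypothetical integer states with "+1" and "*2" actions, not the 15 puzzle) so that the whole graph fits in a few lines:

```python
# A sketch of building a state space as a graph: nodes are states, edges are
# actions. Toy problem (hypothetical): states are integers, the actions are
# "+1" and "*2", and any result above 4 is out of bounds (not a valid action).
def result(s, a):
    nxt = s + 1 if a == "+1" else s * 2
    return nxt if nxt <= 4 else None

def actions(s):
    return [a for a in ("+1", "*2") if result(s, a) is not None]

def state_space(initial):
    """Return {state: {action: next_state}} for every reachable state."""
    graph, frontier = {}, [initial]
    while frontier:
        s = frontier.pop()
        if s in graph:
            continue                                   # already explored
        graph[s] = {a: result(s, a) for a in actions(s)}
        frontier.extend(graph[s].values())             # explore neighbors too
    return graph
```

For a real problem like the 15 puzzle the same construction applies, but the graph is far too large to enumerate fully — which is why search algorithms explore it lazily instead.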
241
00:13:09,680 --> 00:13:10,560
All right.
242
00:13:10,560 --> 00:13:14,320
So now we have this idea of nodes that are representing these states,
243
00:13:14,320 --> 00:13:16,560
actions that can take us from one state to another,
244
00:13:16,560 --> 00:13:19,520
and a transition model that defines what happens after we
245
00:13:19,520 --> 00:13:21,120
take a particular action.
246
00:13:21,120 --> 00:13:23,280
So the next step we need to figure out is how
247
00:13:23,280 --> 00:13:26,400
we know when the AI is done solving the problem.
248
00:13:26,400 --> 00:13:30,720
The AI needs some way to know, when it gets to the goal, that it's found the goal.
249
00:13:30,720 --> 00:13:33,920
So the next thing we'll need to encode into our artificial intelligence
250
00:13:33,920 --> 00:13:39,200
is a goal test, some way to determine whether a given state is a goal state.
251
00:13:39,200 --> 00:13:42,560
In the case of something like driving directions, it might be pretty easy.
252
00:13:42,560 --> 00:13:45,600
If you're in a state that corresponds to whatever the user typed
253
00:13:45,600 --> 00:13:48,960
in as their intended destination, well, then you know you're in a goal state.
254
00:13:48,960 --> 00:13:51,200
In the 15 puzzle, it might be checking the numbers
255
00:13:51,200 --> 00:13:52,880
to make sure they're all in ascending order.
256
00:13:52,880 --> 00:13:55,760
But the AI needs some way to encode whether or not
257
00:13:55,760 --> 00:13:58,160
any state they happen to be in is a goal.
258
00:13:58,160 --> 00:14:00,480
And some problems might have one goal, like a maze
259
00:14:00,480 --> 00:14:03,120
where you have one initial position and one ending position,
260
00:14:03,120 --> 00:14:04,240
and that's the goal.
261
00:14:04,240 --> 00:14:06,560
In other more complex problems, you might imagine
262
00:14:06,560 --> 00:14:08,240
that there are multiple possible goals.
263
00:14:08,240 --> 00:14:10,880
That there are multiple ways to solve a problem,
264
00:14:10,880 --> 00:14:13,680
and we might not care which one the computer finds,
265
00:14:13,680 --> 00:14:17,200
as long as it does find a particular goal.
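For the 15 puzzle, the goal test just described — checking that the numbers are all in ascending order — can be as simple as comparing against the solved board (again assuming a hypothetical encoding of the state as a tuple of 16 numbers with 0 as the blank):

```python
# A minimal sketch of a goal test for the 15 puzzle, assuming the state is a
# tuple of 16 numbers (row-major) with 0 standing in for the blank square.
GOAL = tuple(range(1, 16)) + (0,)  # tiles 1-15 in ascending order, blank last

def goal_test(state):
    """Return True if `state` is a goal state."""
    return state == GOAL
```

A problem with multiple acceptable goals would instead check membership in a set of goal states, or test a property of the state rather than compare against one fixed board.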
266
00:14:17,200 --> 00:14:20,800
However, sometimes the computer doesn't just care about finding a goal,
267
00:14:20,800 --> 00:14:23,840
but finding a goal well, or one with a low cost.
268
00:14:23,840 --> 00:14:26,160
And it's for that reason that the last piece of terminology
269
00:14:26,160 --> 00:14:28,240
that we'll use to define these search problems
270
00:14:28,240 --> 00:14:30,560
is something called a path cost.
271
00:14:30,560 --> 00:14:33,040
You might imagine that in the case of driving directions,
272
00:14:33,040 --> 00:14:36,560
it would be pretty annoying if I said I wanted directions from point A
273
00:14:36,560 --> 00:14:38,960
to point B, and the route that Google Maps gave me
274
00:14:38,960 --> 00:14:42,640
was a long route with lots of detours that were unnecessary that took longer
275
00:14:42,640 --> 00:14:45,360
than it should have for me to get to that destination.
276
00:14:45,360 --> 00:14:48,240
And it's for that reason that when we're formulating search problems,
277
00:14:48,240 --> 00:14:51,920
we'll often give every path some sort of numerical cost,
278
00:14:51,920 --> 00:14:56,480
some number telling us how expensive it is to take this particular option,
279
00:14:56,480 --> 00:14:59,440
and then tell our AI that instead of just finding
280
00:14:59,440 --> 00:15:02,800
a solution, some way of getting from the initial state to the goal,
281
00:15:02,800 --> 00:15:06,480
we'd really like to find one that minimizes this path cost.
282
00:15:06,480 --> 00:15:09,200
That is, one that is less expensive, or takes less time,
283
00:15:09,200 --> 00:15:12,320
or minimizes some other numerical value.
284
00:15:12,320 --> 00:15:15,520
We can represent this graphically if we take a look at this graph again,
285
00:15:15,520 --> 00:15:18,560
and imagine that each of these arrows, each of these actions
286
00:15:18,560 --> 00:15:21,360
that we can take from one state to another state,
287
00:15:21,360 --> 00:15:23,520
has some sort of number associated with it.
288
00:15:23,520 --> 00:15:26,800
That number being the path cost of this particular action,
289
00:15:26,800 --> 00:15:29,280
where the cost for one particular action
290
00:15:29,280 --> 00:15:33,280
might be more expensive than the cost for some other action, for example.
291
00:15:33,280 --> 00:15:35,920
Although this will only happen in some sorts of problems.
292
00:15:35,920 --> 00:15:38,320
In other problems, we can simplify the diagram
293
00:15:38,320 --> 00:15:42,400
and just assume that the cost of any particular action is the same.
294
00:15:42,400 --> 00:15:45,280
And this is probably the case in something like the 15 puzzle,
295
00:15:45,280 --> 00:15:47,840
for example, where it doesn't really make a difference
296
00:15:47,840 --> 00:15:49,680
whether I'm moving right or moving left.
297
00:15:49,680 --> 00:15:52,240
The only thing that matters is the total number
298
00:15:52,240 --> 00:15:56,080
of steps that I have to take to get from point A to point B.
299
00:15:56,080 --> 00:15:58,720
And each of those steps is of equal cost.
300
00:15:58,720 --> 00:16:03,040
We can just assume it's of some constant cost like one.
301
00:16:03,040 --> 00:16:07,520
And so this now forms the basis for what we might consider to be a search problem.
302
00:16:07,520 --> 00:16:11,680
A search problem has some sort of initial state, some place where we begin,
303
00:16:11,680 --> 00:16:14,160
some sort of action that we can take or multiple actions
304
00:16:14,160 --> 00:16:16,080
that we can take in any given state.
305
00:16:16,080 --> 00:16:17,680
And it has a transition model.
306
00:16:17,680 --> 00:16:21,120
Some way of defining what happens when we go from one state
307
00:16:21,120 --> 00:16:24,960
and take one action, what state do we end up with as a result.
308
00:16:24,960 --> 00:16:26,960
In addition to that, we need some goal test
309
00:16:26,960 --> 00:16:29,440
to know whether or not we've reached a goal.
310
00:16:29,440 --> 00:16:31,840
And then we need a path cost function that
311
00:16:31,840 --> 00:16:35,760
tells us for any particular path, by following some sequence of actions,
312
00:16:35,760 --> 00:16:37,520
how expensive is that path.
313
00:16:37,520 --> 00:16:41,280
What it costs in terms of money or time or some other resource
314
00:16:41,280 --> 00:16:44,160
that we are trying to minimize our usage of.
315
00:16:44,160 --> 00:16:46,880
And the goal ultimately is to find a solution.
316
00:16:46,880 --> 00:16:50,000
Where a solution in this case is just some sequence of actions
317
00:16:50,000 --> 00:16:52,960
that will take us from the initial state to the goal state.
318
00:16:52,960 --> 00:16:55,920
And ideally, we'd like to find not just any solution
319
00:16:55,920 --> 00:16:58,800
but the optimal solution, which is a solution that
320
00:16:58,800 --> 00:17:02,800
has the lowest path cost among all of the possible solutions.
321
00:17:02,800 --> 00:17:05,440
And in some cases, there might be multiple optimal solutions.
322
00:17:05,440 --> 00:17:07,440
But an optimal solution just means that there
323
00:17:07,440 --> 00:17:12,160
is no way that we could have done better in terms of finding that solution.
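Putting the pieces together, the full search-problem definition just given — initial state, actions, transition model, goal test, and path cost — can be bundled into one structure. This is an illustrative sketch; the field and function names here are my own, not from the course's distribution code:

```python
# A sketch collecting the components of a search problem into one structure.
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class SearchProblem:
    initial_state: Any
    actions: Callable[[Any], Iterable]       # ACTIONS(s): valid actions in s
    result: Callable[[Any, Any], Any]        # RESULT(s, a): transition model
    goal_test: Callable[[Any], bool]         # is s a goal state?
    step_cost: Callable[[Any, Any], float] = lambda s, a: 1  # constant cost

    def path_cost(self, steps):
        """Total cost of a path given as (state, action) pairs."""
        return sum(self.step_cost(s, a) for s, a in steps)
```

With the constant step cost of 1 (as in the 15 puzzle), the path cost is simply the number of actions taken, so the optimal solution is the shortest one.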
324
00:17:12,160 --> 00:17:13,760
So now we've defined the problem.
325
00:17:13,760 --> 00:17:15,920
And now we need to begin to figure out how it
326
00:17:15,920 --> 00:17:18,800
is that we're going to solve this kind of search problem.
327
00:17:18,800 --> 00:17:21,120
And in order to do so, you'll probably imagine
328
00:17:21,120 --> 00:17:24,640
that our computer is going to need to represent a whole bunch of data
329
00:17:24,640 --> 00:17:26,000
about this particular problem.
330
00:17:26,000 --> 00:17:28,880
We need to represent data about where we are in the problem.
331
00:17:28,880 --> 00:17:32,320
And we might need to be considering multiple different options at once.
332
00:17:32,320 --> 00:17:35,280
And oftentimes, when we're trying to package a whole bunch of data
333
00:17:35,280 --> 00:17:38,640
related to a state together, we'll do so using a data structure
334
00:17:38,640 --> 00:17:40,480
that we're going to call a node.
335
00:17:40,480 --> 00:17:42,400
A node is a data structure that is just going
336
00:17:42,400 --> 00:17:44,960
to keep track of a variety of different values.
337
00:17:44,960 --> 00:17:47,280
And specifically, in the case of a search problem,
338
00:17:47,280 --> 00:17:50,480
it's going to keep track of these four values in particular.
339
00:17:50,480 --> 00:17:54,400
Every node is going to keep track of a state, the state we're currently on.
340
00:17:54,400 --> 00:17:57,360
And every node is also going to keep track of a parent.
341
00:17:57,360 --> 00:18:00,320
A parent being the state before us or the node
342
00:18:00,320 --> 00:18:03,440
that we used in order to get to this current state.
343
00:18:03,440 --> 00:18:07,120
And this is going to be relevant because eventually, once we reach the goal node,
344
00:18:07,120 --> 00:18:10,720
once we get to the end, we want to know what sequence of actions
345
00:18:10,720 --> 00:18:12,880
we use in order to get to that goal.
346
00:18:12,880 --> 00:18:16,000
And the way we'll know that is by looking at these parents
347
00:18:16,000 --> 00:18:19,680
to keep track of what led us to the goal and what led us to that state
348
00:18:19,680 --> 00:18:22,560
and what led us to the state before that, so on and so forth,
349
00:18:22,560 --> 00:18:25,200
backtracking our way to the beginning so that we
350
00:18:25,200 --> 00:18:27,840
know the entire sequence of actions we needed in order
351
00:18:27,840 --> 00:18:30,560
to get from the beginning to the end.
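[The backtracking described here can be sketched in Python. This is a minimal illustration, not the course's distribution code; the `Node` fields and names are assumptions based on the description above.]

```python
from collections import namedtuple

# Minimal stand-in for a search node (hypothetical field names),
# keeping a parent pointer and the action that led here.
Node = namedtuple("Node", ["state", "parent", "action"])

def reconstruct_actions(goal_node):
    """Follow parent pointers from the goal back to the start, then reverse."""
    actions = []
    node = goal_node
    while node.parent is not None:  # the initial node has no parent
        actions.append(node.action)
        node = node.parent
    actions.reverse()               # we collected them goal-to-start
    return actions

# Tiny chain: start -> mid -> goal
start = Node("A", None, None)
mid = Node("B", start, "go_B")
goal = Node("C", mid, "go_C")
print(reconstruct_actions(goal))  # ['go_B', 'go_C']
```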
352
00:18:30,560 --> 00:18:33,440
The node is also going to keep track of what action we took in order
353
00:18:33,440 --> 00:18:35,920
to get from the parent to the current state.
354
00:18:35,920 --> 00:18:39,360
And the node is also going to keep track of a path cost.
355
00:18:39,360 --> 00:18:41,920
In other words, it's going to keep track of the number
356
00:18:41,920 --> 00:18:45,440
that represents how long it took to get from the initial state
357
00:18:45,440 --> 00:18:47,920
to the state that we currently happen to be at.
358
00:18:47,920 --> 00:18:49,760
And we'll see why this is relevant as we
359
00:18:49,760 --> 00:18:51,600
start to talk about some of the optimizations
360
00:18:51,600 --> 00:18:55,360
that we can make in terms of these search problems more generally.
361
00:18:55,360 --> 00:18:57,920
So this is the data structure that we're going to use in order to solve
362
00:18:57,920 --> 00:18:58,800
the problem.
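[The node data structure just described, with its four values, might look like this in Python. A rough sketch; the attribute names are assumptions, not necessarily the ones the course's code uses.]

```python
class Node:
    """Bundle the four values a search node keeps track of."""
    def __init__(self, state, parent=None, action=None, path_cost=0):
        self.state = state          # the state this node represents
        self.parent = parent        # the node we expanded to reach this one
        self.action = action        # the action taken from parent to here
        self.path_cost = path_cost  # cost from the initial state to here

# The initial node has no parent; each child extends the path cost.
start = Node("A")
child = Node("B", parent=start, action="move", path_cost=start.path_cost + 1)
```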
363
00:18:58,800 --> 00:19:00,480
And now let's talk about the approach.
364
00:19:00,480 --> 00:19:03,840
How might we actually begin to solve the problem?
365
00:19:03,840 --> 00:19:05,840
Well, as you might imagine, what we're going to do
366
00:19:05,840 --> 00:19:08,000
is we're going to start at one particular state,
367
00:19:08,000 --> 00:19:10,560
and we're just going to explore from there.
368
00:19:10,560 --> 00:19:12,560
The intuition is that from a given state,
369
00:19:12,560 --> 00:19:14,560
we have multiple options that we could take,
370
00:19:14,560 --> 00:19:16,640
and we're going to explore those options.
371
00:19:16,640 --> 00:19:18,640
And once we explore those options, we'll
372
00:19:18,640 --> 00:19:22,160
find that even more options are going to make themselves available.
373
00:19:22,160 --> 00:19:24,960
And we're going to consider all of the available options
374
00:19:24,960 --> 00:19:29,120
to be stored inside of a single data structure that we'll call the frontier.
375
00:19:29,120 --> 00:19:31,600
The frontier is going to represent all of the things
376
00:19:31,600 --> 00:19:36,640
that we could explore next that we haven't yet explored or visited.
377
00:19:36,640 --> 00:19:39,200
So in our approach, we're going to begin the search algorithm
378
00:19:39,200 --> 00:19:42,800
by starting with a frontier that just contains one state.
379
00:19:42,800 --> 00:19:45,280
The frontier is going to contain the initial state,
380
00:19:45,280 --> 00:19:47,840
because at the beginning, that's the only state we know about.
381
00:19:47,840 --> 00:19:50,160
That is the only state that exists.
382
00:19:50,160 --> 00:19:53,600
And then our search algorithm is effectively going to follow a loop.
383
00:19:53,600 --> 00:19:57,200
We're going to repeat some process again and again and again.
384
00:19:57,200 --> 00:20:01,040
The first thing we're going to do is if the frontier is empty,
385
00:20:01,040 --> 00:20:02,320
then there's no solution.
386
00:20:02,320 --> 00:20:05,120
And we can report that there is no way to get to the goal.
387
00:20:05,120 --> 00:20:06,400
And that's certainly possible.
388
00:20:06,400 --> 00:20:09,680
There are certain types of problems that an AI might try to explore
389
00:20:09,680 --> 00:20:12,640
and realize that there is no way to solve that problem.
390
00:20:12,640 --> 00:20:15,360
And that's useful information for humans to know as well.
391
00:20:15,360 --> 00:20:19,360
So if ever the frontier is empty, that means there's nothing left to explore.
392
00:20:19,360 --> 00:20:22,960
And we haven't yet found a solution, so there is no solution.
393
00:20:22,960 --> 00:20:24,960
There's nothing left to explore.
394
00:20:24,960 --> 00:20:28,720
Otherwise, what we'll do is we'll remove a node from the frontier.
395
00:20:28,720 --> 00:20:32,000
So right now at the beginning, the frontier just contains one node
396
00:20:32,000 --> 00:20:33,680
representing the initial state.
397
00:20:33,680 --> 00:20:35,360
But over time, the frontier might grow.
398
00:20:35,360 --> 00:20:36,880
It might contain multiple states.
399
00:20:36,880 --> 00:20:41,520
And so here, we're just going to remove a single node from that frontier.
400
00:20:41,520 --> 00:20:44,800
If that node happens to be a goal, then we found a solution.
401
00:20:44,800 --> 00:20:48,240
So we remove a node from the frontier and ask ourselves, is this the goal?
402
00:20:48,240 --> 00:20:51,360
And we do that by applying the goal test that we talked about earlier,
403
00:20:51,360 --> 00:20:53,120
asking if we're at the destination.
404
00:20:53,120 --> 00:20:56,960
Or asking if all the numbers of the 15 puzzle happen to be in order.
405
00:20:56,960 --> 00:20:59,760
So if the node contains the goal, we found a solution.
406
00:20:59,760 --> 00:21:00,240
Great.
407
00:21:00,240 --> 00:21:01,680
We're done.
408
00:21:01,680 --> 00:21:06,480
And otherwise, what we'll need to do is we'll need to expand the node.
409
00:21:06,480 --> 00:21:08,800
And this is a term of art in artificial intelligence.
410
00:21:08,800 --> 00:21:12,720
To expand the node just means to look at all of the neighbors of that node.
411
00:21:12,720 --> 00:21:15,440
In other words, consider all of the possible actions
412
00:21:15,440 --> 00:21:18,640
that I could take from the state that this node is representing
413
00:21:18,640 --> 00:21:21,120
and what nodes could I get to from there.
414
00:21:21,120 --> 00:21:23,360
We're going to take all of those nodes, the next nodes
415
00:21:23,360 --> 00:21:26,000
that I can get to from this current one I'm looking at,
416
00:21:26,000 --> 00:21:28,080
and add those to the frontier.
417
00:21:28,080 --> 00:21:30,240
And then we'll repeat this process.
418
00:21:30,240 --> 00:21:32,640
So at a very high level, the idea is we start
419
00:21:32,640 --> 00:21:35,200
with a frontier that contains the initial state.
420
00:21:35,200 --> 00:21:38,000
And we're constantly removing a node from the frontier,
421
00:21:38,000 --> 00:21:41,920
looking at where we can get to next and adding those nodes to the frontier,
422
00:21:41,920 --> 00:21:44,720
repeating this process over and over until either we
423
00:21:44,720 --> 00:21:47,440
remove a node from the frontier and it contains a goal,
424
00:21:47,440 --> 00:21:50,800
meaning we've solved the problem, or we run into a situation
425
00:21:50,800 --> 00:21:55,280
where the frontier is empty, at which point we're left with no solution.
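[The loop just described can be sketched in Python against the sample graph used below (A connected to B, B to C and D, C to E, D to F). The adjacency map and function names are assumptions for illustration; to keep the sketch short it tracks plain states rather than full nodes.]

```python
# Hypothetical adjacency map matching the lecture's sample graph.
GRAPH = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["F"], "E": [], "F": []}

def search(start, goal):
    frontier = [start]                 # frontier starts with the initial state
    while True:
        if not frontier:               # empty frontier: no solution exists
            return None
        state = frontier.pop()         # remove a node (order unspecified here)
        if state == goal:              # goal test
            return state
        frontier.extend(GRAPH[state])  # expand: add neighbors to the frontier

print(search("A", "E"))  # E
```

Note that this basic version can loop forever if the graph contains cycles, which is exactly the problem the lecture turns to next.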
426
00:21:55,280 --> 00:21:57,440
So let's actually try and take the pseudocode,
427
00:21:57,440 --> 00:22:02,160
put it into practice by taking a look at an example of a sample search problem.
428
00:22:02,160 --> 00:22:04,080
So right here, I have a sample graph.
429
00:22:04,080 --> 00:22:06,240
A is connected to B via this action.
430
00:22:06,240 --> 00:22:10,640
B is connected to nodes C and D. C is connected to E. D is connected to F.
431
00:22:10,640 --> 00:22:16,400
And what I'd like to do is have my AI find a path from A to E.
432
00:22:16,400 --> 00:22:20,800
We want to get from this initial state to this goal state.
433
00:22:20,800 --> 00:22:22,320
So how are we going to do that?
434
00:22:22,320 --> 00:22:25,520
Well, we're going to start with a frontier that contains the initial state.
435
00:22:25,520 --> 00:22:27,360
This is going to represent our frontier.
436
00:22:27,360 --> 00:22:29,360
So our frontier initially will just contain
437
00:22:29,360 --> 00:22:32,400
A, that initial state where we're going to begin.
438
00:22:32,400 --> 00:22:34,240
And now we'll repeat this process.
439
00:22:34,240 --> 00:22:36,240
If the frontier is empty, no solution.
440
00:22:36,240 --> 00:22:38,720
That's not a problem, because the frontier is not empty.
441
00:22:38,720 --> 00:22:42,880
So we'll remove a node from the frontier as the one to consider next.
442
00:22:42,880 --> 00:22:44,480
There's only one node in the frontier.
443
00:22:44,480 --> 00:22:46,640
So we'll go ahead and remove it from the frontier.
444
00:22:46,640 --> 00:22:51,280
But now A, this initial node, this is the node we're currently considering.
445
00:22:51,280 --> 00:22:52,400
We follow the next step.
446
00:22:52,400 --> 00:22:55,040
We ask ourselves, is this node the goal?
447
00:22:55,040 --> 00:22:55,760
No, it's not.
448
00:22:55,760 --> 00:22:56,640
A is not the goal.
449
00:22:56,640 --> 00:22:57,920
E is the goal.
450
00:22:57,920 --> 00:22:59,600
So we don't return the solution.
451
00:22:59,600 --> 00:23:02,960
So instead, we go to this last step, expand the node,
452
00:23:02,960 --> 00:23:05,760
and add the resulting nodes to the frontier.
453
00:23:05,760 --> 00:23:06,720
What does that mean?
454
00:23:06,720 --> 00:23:10,800
Well, it means take this state A and consider where we could get to next.
455
00:23:10,800 --> 00:23:14,000
And after A, what we could get to next is only B.
456
00:23:14,000 --> 00:23:16,880
So that's what we get when we expand A. We find B.
457
00:23:16,880 --> 00:23:18,800
And we add B to the frontier.
458
00:23:18,800 --> 00:23:20,400
And now B is in the frontier.
459
00:23:20,400 --> 00:23:22,080
And we repeat the process again.
460
00:23:22,080 --> 00:23:24,080
We say, all right, the frontier is not empty.
461
00:23:24,080 --> 00:23:26,240
So let's remove B from the frontier.
462
00:23:26,240 --> 00:23:28,080
B is now the node that we're considering.
463
00:23:28,080 --> 00:23:29,920
We ask ourselves, is B the goal?
464
00:23:29,920 --> 00:23:30,880
No, it's not.
465
00:23:30,880 --> 00:23:35,760
So we go ahead and expand B and add its resulting nodes to the frontier.
466
00:23:35,760 --> 00:23:37,440
What happens when we expand B?
467
00:23:37,440 --> 00:23:40,480
In other words, what nodes can we get to from B?
468
00:23:40,480 --> 00:23:43,760
Well, we can get to C and D. So we'll go ahead and add C and D
469
00:23:43,760 --> 00:23:44,800
to the frontier.
470
00:23:44,800 --> 00:23:47,200
And now we have two nodes in the frontier, C and D.
471
00:23:47,200 --> 00:23:48,880
And we repeat the process again.
472
00:23:48,880 --> 00:23:50,560
We remove a node from the frontier.
473
00:23:50,560 --> 00:23:52,960
For now, I'll do so arbitrarily just by picking C.
474
00:23:52,960 --> 00:23:56,320
We'll see why later, how choosing which node you remove from the frontier
475
00:23:56,320 --> 00:23:58,560
is actually quite an important part of the algorithm.
476
00:23:58,560 --> 00:24:02,000
But for now, I'll arbitrarily remove C, say it's not the goal.
477
00:24:02,000 --> 00:24:05,040
So we'll add E, the next one, to the frontier.
478
00:24:05,040 --> 00:24:07,200
Then let's say I remove E from the frontier.
479
00:24:07,200 --> 00:24:11,440
And now I check I'm currently looking at state E. Is it a goal state?
480
00:24:11,440 --> 00:24:15,600
It is, because I'm trying to find a path from A to E. So I would return the goal.
481
00:24:15,600 --> 00:24:19,760
And that now would be the solution, that I'm now able to return the solution.
482
00:24:19,760 --> 00:24:23,120
And I have found a path from A to E.
483
00:24:23,120 --> 00:24:26,560
So this is the general idea, the general approach of this search algorithm,
484
00:24:26,560 --> 00:24:30,080
to follow these steps, constantly removing nodes from the frontier,
485
00:24:30,080 --> 00:24:31,600
until we're able to find a solution.
486
00:24:31,600 --> 00:24:35,600
So the next question you might reasonably ask is, what could go wrong here?
487
00:24:35,600 --> 00:24:39,040
What are the potential problems with an approach like this?
488
00:24:39,040 --> 00:24:42,960
And here's one example of a problem that could arise from this sort of approach.
489
00:24:42,960 --> 00:24:47,040
Imagine this same graph, same as before, with one change.
490
00:24:47,040 --> 00:24:50,160
The change being now, instead of just an arrow from A to B,
491
00:24:50,160 --> 00:24:54,240
we also have an arrow from B to A, meaning we can go in both directions.
492
00:24:54,240 --> 00:24:57,600
And this is true in something like the 15 puzzle, where when I slide a tile
493
00:24:57,600 --> 00:25:00,640
to the right, I could then slide a tile to the left
494
00:25:00,640 --> 00:25:02,320
to get back to the original position.
495
00:25:02,320 --> 00:25:04,800
I could go back and forth between A and B.
496
00:25:04,800 --> 00:25:06,880
And that's what these double arrows symbolize,
497
00:25:06,880 --> 00:25:10,640
the idea that from one state, I can get to another, and then I can get back.
498
00:25:10,640 --> 00:25:12,880
And that's true in many search problems.
499
00:25:12,880 --> 00:25:16,240
What's going to happen if I try to apply the same approach now?
500
00:25:16,240 --> 00:25:18,480
Well, I'll begin with A, same as before.
501
00:25:18,480 --> 00:25:20,480
And I'll remove A from the frontier.
502
00:25:20,480 --> 00:25:23,200
And then I'll consider where I can get to from A.
503
00:25:23,200 --> 00:25:28,160
And after A, the only place I can get to is B. So B goes into the frontier.
504
00:25:28,160 --> 00:25:29,760
Then I'll say, all right, let's take a look at B.
505
00:25:29,760 --> 00:25:31,600
That's the only thing left in the frontier.
506
00:25:31,600 --> 00:25:33,600
Where can I get to from B?
507
00:25:33,600 --> 00:25:37,840
Before, it was just C and D. But now, because of that reverse arrow,
508
00:25:37,840 --> 00:25:43,360
I can get to A or C or D. So all three, A, C, and D, all of those
509
00:25:43,360 --> 00:25:44,560
now go into the frontier.
510
00:25:44,560 --> 00:25:48,800
They are places I can get to from B. And now I remove one from the frontier.
511
00:25:48,800 --> 00:25:53,200
And maybe I'm unlucky, and maybe I pick A. And now I'm looking at A again.
512
00:25:53,200 --> 00:25:54,880
And I consider, where can I get to from A?
513
00:25:54,880 --> 00:25:58,320
And from A, well, I can get to B. And now we start to see the problem.
514
00:25:58,320 --> 00:26:02,560
But if I'm not careful, I go from A to B, and then back to A, and then to B again.
515
00:26:02,560 --> 00:26:05,920
And I could be going in this infinite loop, where I never make any progress,
516
00:26:05,920 --> 00:26:09,200
because I'm constantly just going back and forth between two states
517
00:26:09,200 --> 00:26:10,880
that I've already seen.
518
00:26:10,880 --> 00:26:12,160
So what is the solution to this?
519
00:26:12,160 --> 00:26:14,480
We need some way to deal with this problem.
520
00:26:14,480 --> 00:26:16,320
And the way that we can deal with this problem
521
00:26:16,320 --> 00:26:20,000
is by somehow keeping track of what we've already explored.
522
00:26:20,000 --> 00:26:23,440
And the logic is going to be, well, if we've already explored the state,
523
00:26:23,440 --> 00:26:25,040
there's no reason to go back to it.
524
00:26:25,040 --> 00:26:27,120
Once we've explored a state, don't go back to it.
525
00:26:27,120 --> 00:26:29,360
Don't bother adding it to the frontier.
526
00:26:29,360 --> 00:26:31,040
There's no need to.
527
00:26:31,040 --> 00:26:33,920
So here's going to be our revised approach, a better way
528
00:26:33,920 --> 00:26:35,920
to approach this sort of search problem.
529
00:26:35,920 --> 00:26:39,520
And it's going to look very similar, just with a couple of modifications.
530
00:26:39,520 --> 00:26:43,600
We'll start with a frontier that contains the initial state, same as before.
531
00:26:43,600 --> 00:26:46,960
But now we'll start with another data structure, which
532
00:26:46,960 --> 00:26:49,840
will just be a set of nodes that we've already explored.
533
00:26:49,840 --> 00:26:51,360
So what are the states we've explored?
534
00:26:51,360 --> 00:26:52,720
Initially, it's empty.
535
00:26:52,720 --> 00:26:55,200
We have an empty explored set.
536
00:26:55,200 --> 00:26:57,040
And now we repeat.
537
00:26:57,040 --> 00:27:00,080
If the frontier is empty, no solution, same as before.
538
00:27:00,080 --> 00:27:02,000
We remove a node from the frontier.
539
00:27:02,000 --> 00:27:04,240
We check to see if it's a goal state, return the solution.
540
00:27:04,240 --> 00:27:06,400
None of this is any different so far.
541
00:27:06,400 --> 00:27:09,760
But now what we're going to do is we're going to add the node
542
00:27:09,760 --> 00:27:11,520
to the explored set.
543
00:27:11,520 --> 00:27:15,440
So if it happens to be the case that we remove a node from the frontier
544
00:27:15,440 --> 00:27:18,400
and it's not the goal, we'll add it to the explored set
545
00:27:18,400 --> 00:27:19,920
so that we know we've already explored it.
546
00:27:19,920 --> 00:27:23,680
We don't need to go back to it again if it happens to come up later.
547
00:27:23,680 --> 00:27:26,160
And then the final step, we expand the node
548
00:27:26,160 --> 00:27:28,880
and we add the resulting nodes to the frontier.
549
00:27:28,880 --> 00:27:31,680
But before, we just always added the resulting nodes to the frontier.
550
00:27:31,680 --> 00:27:34,000
We're going to be a little clever about it this time.
551
00:27:34,000 --> 00:27:36,640
We're only going to add the nodes to the frontier
552
00:27:36,640 --> 00:27:40,880
if they aren't already in the frontier and if they aren't already
553
00:27:40,880 --> 00:27:42,640
in the explored set.
554
00:27:42,640 --> 00:27:45,040
So we'll check both the frontier and the explored set,
555
00:27:45,040 --> 00:27:48,240
make sure that the node isn't already in one of those two.
556
00:27:48,240 --> 00:27:51,440
And so long as it isn't, then we'll go ahead and add it to the frontier,
557
00:27:51,440 --> 00:27:53,280
but not otherwise.
558
00:27:53,280 --> 00:27:55,120
And so that revised approach is ultimately
559
00:27:55,120 --> 00:27:58,640
what's going to help make sure that we don't go back and forth between two
560
00:27:58,640 --> 00:28:00,160
nodes.
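[The revised approach, with the explored set and the double membership check, might be sketched like this. The adjacency map (which now includes the reverse arrow from B back to A) and the names are assumptions; a real implementation would track full nodes rather than bare states.]

```python
# Hypothetical adjacency map with arrows in both directions between A and B.
GRAPH = {"A": ["B"], "B": ["A", "C", "D"], "C": ["E"], "D": ["F"], "E": [], "F": []}

def search(start, goal):
    frontier = [start]       # frontier starts with the initial state
    explored = set()         # the states we have already explored
    while frontier:          # empty frontier means there is no solution
        state = frontier.pop()
        if state == goal:
            return state
        explored.add(state)  # remember that we've explored this state
        for nxt in GRAPH[state]:
            # only add nodes that aren't already in the frontier
            # and aren't already in the explored set
            if nxt not in frontier and nxt not in explored:
                frontier.append(nxt)
    return None

print(search("A", "E"))  # E -- terminates despite the A <-> B cycle
```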
561
00:28:00,160 --> 00:28:02,800
Now, the one point that I've kind of glossed over here so far
562
00:28:02,800 --> 00:28:06,480
is this step here, removing a node from the frontier.
563
00:28:06,480 --> 00:28:08,080
Before, I just chose arbitrarily.
564
00:28:08,080 --> 00:28:10,400
Like, let's just remove a node and that's it.
565
00:28:10,400 --> 00:28:12,800
But it turns out it's actually quite important how
566
00:28:12,800 --> 00:28:17,520
we decide to structure our frontier, how we add and how we remove our nodes.
567
00:28:17,520 --> 00:28:19,440
The frontier is a data structure and we need
568
00:28:19,440 --> 00:28:21,760
to make a choice about in what order are we
569
00:28:21,760 --> 00:28:23,760
going to be removing elements.
570
00:28:23,760 --> 00:28:27,280
And one of the simplest data structures for adding and removing elements
571
00:28:27,280 --> 00:28:28,800
is something called a stack.
572
00:28:28,800 --> 00:28:33,760
And a stack is a data structure that is a last in, first out data type, which
573
00:28:33,760 --> 00:28:36,560
means the last thing that I add to the frontier
574
00:28:36,560 --> 00:28:40,400
is going to be the first thing that I remove from the frontier.
575
00:28:40,400 --> 00:28:44,320
So the most recent thing to go into the stack or the frontier in this case
576
00:28:44,320 --> 00:28:47,280
is going to be the node that I explore.
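[A stack-based frontier along these lines might look like this. A minimal sketch in the spirit of the frontier the lecture describes; the class and method names are assumptions.]

```python
class StackFrontier:
    """Last-in, first-out frontier: the most recently added node comes out first."""
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)

    def empty(self):
        return len(self.nodes) == 0

    def remove(self):
        return self.nodes.pop()  # pop from the end: last in, first out

f = StackFrontier()
for s in ["A", "B", "C"]:
    f.add(s)
print(f.remove())  # C -- the last node added is the first one removed
```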
577
00:28:47,280 --> 00:28:51,280
So let's see what happens if I apply this stack-based approach to something
578
00:28:51,280 --> 00:28:56,480
like this problem, finding a path from A to E. What's going to happen?
579
00:28:56,480 --> 00:28:58,960
Well, again, we'll start with A and we'll say, all right,
580
00:28:58,960 --> 00:29:00,640
let's go ahead and look at A first.
581
00:29:00,640 --> 00:29:04,720
And then notice this time, we've added A to the explored set.
582
00:29:04,720 --> 00:29:06,240
A is something we've now explored.
583
00:29:06,240 --> 00:29:09,040
We have this data structure that's keeping track.
584
00:29:09,040 --> 00:29:13,680
We then say from A, we can get to B. And all right, from B, what can we do?
585
00:29:13,680 --> 00:29:17,840
Well, from B, we can explore B and get to both C and D.
586
00:29:17,840 --> 00:29:21,200
So we added C and then D. So now,
587
00:29:21,200 --> 00:29:24,400
when we explore a node, we're going to treat the frontier as a stack,
588
00:29:24,400 --> 00:29:26,000
last in, first out.
589
00:29:26,000 --> 00:29:27,760
D was the last one to come in.
590
00:29:27,760 --> 00:29:30,560
So we'll go ahead and explore that next and say, all right,
591
00:29:30,560 --> 00:29:32,000
where can we get to from D?
592
00:29:32,000 --> 00:29:36,720
Well, we can get to F. And so all right, we'll put F into the frontier.
593
00:29:36,720 --> 00:29:39,040
And now, because the frontier is a stack,
594
00:29:39,040 --> 00:29:42,080
F is the most recent thing that's gone in the stack.
595
00:29:42,080 --> 00:29:43,600
So F is what we'll explore next.
596
00:29:43,600 --> 00:29:47,200
We'll explore F and say, all right, where can we get to from F?
597
00:29:47,200 --> 00:29:50,400
Well, we can't get anywhere, so nothing gets added to the frontier.
598
00:29:50,400 --> 00:29:53,280
So now, what was the new most recent thing added to the frontier?
599
00:29:53,280 --> 00:29:55,920
Well, it's now C, the only thing left in the frontier.
600
00:29:55,920 --> 00:29:59,600
We'll explore that from which we can see, all right, from C, we can get to E.
601
00:29:59,600 --> 00:30:01,280
So E goes into the frontier.
602
00:30:01,280 --> 00:30:04,560
And then we say, all right, let's look at E. And E is now the solution.
603
00:30:04,560 --> 00:30:07,120
And now, we've solved the problem.
604
00:30:07,120 --> 00:30:10,080
So when we treat the frontier like a stack, a last in,
605
00:30:10,080 --> 00:30:13,120
first out data structure, that's the result we get.
606
00:30:13,120 --> 00:30:18,880
We go from A to B to D to F. And then we sort of backed up and went down to C
607
00:30:18,880 --> 00:30:19,760
and then E.
608
00:30:19,760 --> 00:30:23,200
And it's important to get a visual sense for how this algorithm is working.
609
00:30:23,200 --> 00:30:25,840
We went very deep in this search tree, so to speak,
610
00:30:25,840 --> 00:30:28,480
all the way until the bottom where we hit a dead end.
611
00:30:28,480 --> 00:30:32,080
And then we effectively backed up and explored this other route
612
00:30:32,080 --> 00:30:33,520
that we didn't try before.
613
00:30:33,520 --> 00:30:36,400
And it's this going-very-deep-in-the-search-tree idea,
614
00:36:36,400 --> 00:36:39,920
the way the algorithm ends up working when we use a stack,
615
00:36:39,920 --> 00:36:44,000
that gives this version of the algorithm its name: depth first search.
616
00:30:44,000 --> 00:30:46,160
Depth first search is the search algorithm
617
00:30:46,160 --> 00:30:49,680
where we always explore the deepest node in the frontier.
618
00:30:49,680 --> 00:30:52,800
We keep going deeper and deeper through our search tree.
619
00:30:52,800 --> 00:30:57,520
And then if we hit a dead end, we back up and we try something else instead.
620
00:30:57,520 --> 00:31:00,560
But depth first search is just one of the possible search options
621
00:31:00,560 --> 00:31:01,600
that we could use.
622
00:31:01,600 --> 00:31:05,200
It turns out that there's another algorithm called breadth first search,
623
00:31:05,200 --> 00:31:08,880
which behaves very similarly to depth first search with one difference.
624
00:31:08,880 --> 00:31:12,400
Instead of always exploring the deepest node in the search tree,
625
00:31:12,400 --> 00:31:14,800
the way the depth first search does, breadth first search
626
00:31:14,800 --> 00:31:19,040
is always going to explore the shallowest node in the frontier.
627
00:31:19,040 --> 00:31:20,080
So what does that mean?
628
00:31:20,080 --> 00:31:24,480
Well, it means that instead of using a stack which depth first search or DFS
629
00:31:24,480 --> 00:31:27,520
used, where the most recent item added to the frontier
630
00:31:27,520 --> 00:31:32,160
is the one we'll explore next, in breadth first search or BFS,
631
00:31:32,160 --> 00:31:37,440
we'll instead use a queue, where a queue is a first in, first out data type,
632
00:31:37,440 --> 00:31:39,760
where the very first thing we add to the frontier
633
00:31:39,760 --> 00:31:43,840
is the first one we'll explore and they effectively form a line or a queue,
634
00:31:43,840 --> 00:31:49,040
where the earlier you arrive in the frontier, the earlier you get explored.
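[A queue-based frontier for BFS can be sketched the same way, swapping the removal order. Again a minimal illustration with assumed names; `collections.deque` gives an efficient pop from the front.]

```python
from collections import deque

class QueueFrontier:
    """First-in, first-out frontier: the earliest added node comes out first."""
    def __init__(self):
        self.nodes = deque()

    def add(self, node):
        self.nodes.append(node)

    def empty(self):
        return len(self.nodes) == 0

    def remove(self):
        return self.nodes.popleft()  # pop from the front: first in, first out

f = QueueFrontier()
for s in ["A", "B", "C"]:
    f.add(s)
print(f.remove())  # A -- the earliest node added is the first one removed
```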
635
00:31:49,040 --> 00:31:51,440
So what would that mean for the same exact problem,
636
00:31:51,440 --> 00:31:53,760
finding a path from A to E?
637
00:31:53,760 --> 00:31:57,680
Well, we start with A, same as before, then we'll go ahead and have explored A
638
00:31:57,680 --> 00:31:59,200
and say, where can we get to from A?
639
00:31:59,200 --> 00:32:01,920
Well, from A, we can get to B, same as before.
640
00:32:01,920 --> 00:32:04,480
From B, same as before, we can get to C and D.
641
00:32:04,480 --> 00:32:06,800
So C and D get added to the frontier.
642
00:32:06,800 --> 00:32:10,480
This time, though, we added C to the frontier before D.
643
00:32:10,480 --> 00:32:12,480
So we'll explore C first.
644
00:32:12,480 --> 00:32:14,160
So C gets explored.
645
00:32:14,160 --> 00:32:16,000
And from C, where can we get to?
646
00:32:16,000 --> 00:32:19,520
Well, we can get to E. So E gets added to the frontier.
647
00:32:19,520 --> 00:32:24,080
But because D was added to the frontier before E, we'll look at D next.
648
00:32:24,080 --> 00:32:26,400
So we'll explore D and say, where can we get to from D?
649
00:32:26,400 --> 00:32:31,440
We can get to F. And only then will we say, all right, now we can get to E.
650
00:32:31,440 --> 00:32:35,360
And so what breadth first search or BFS did is we started here,
651
00:32:35,360 --> 00:32:39,440
we looked at both C and D, and then we looked at E.
652
00:32:39,440 --> 00:32:42,640
Effectively, we're looking at things one away from the initial state,
653
00:32:42,640 --> 00:32:45,680
then two away from the initial state, and only then,
654
00:32:45,680 --> 00:32:49,760
things that are three away from the initial state, unlike depth first search,
655
00:32:49,760 --> 00:32:53,040
which just went as deep as possible into the search tree
656
00:32:53,040 --> 00:32:56,000
until it hit a dead end and then ultimately had to back up.
657
00:32:56,720 --> 00:32:59,200
So these now are two different search algorithms
658
00:32:59,200 --> 00:33:01,760
that we could apply in order to try and solve a problem.
659
00:33:01,760 --> 00:33:05,040
And let's take a look at how these would actually work in practice
660
00:33:05,040 --> 00:33:07,600
with something like maze solving, for example.
661
00:33:07,600 --> 00:33:09,200
So here's an example of a maze.
662
00:33:09,200 --> 00:33:12,400
These empty cells represent places where our agent can move.
663
00:33:12,400 --> 00:33:16,880
These darkened gray cells represent walls that the agent can't pass through.
664
00:33:16,880 --> 00:33:20,320
And ultimately, our agent, our AI, is going to try to find a way
665
00:33:20,320 --> 00:33:25,120
to get from position A to position B via some sequence of actions,
666
00:33:25,120 --> 00:33:28,000
where those actions are left, right, up, and down.
667
00:33:28,800 --> 00:33:31,200
What will depth first search do in this case?
668
00:33:31,200 --> 00:33:34,080
Well, depth first search will just follow one path.
669
00:33:34,080 --> 00:33:37,440
If it reaches a fork in the road where it has multiple different options,
670
00:33:37,440 --> 00:33:40,000
depth first search is just, in this case, going to choose one.
671
00:33:40,000 --> 00:33:41,360
It doesn't have a real preference.
672
00:33:41,360 --> 00:33:45,040
But it's going to keep following one until it hits a dead end.
673
00:33:45,040 --> 00:33:48,480
And when it hits a dead end, depth first search effectively
674
00:33:48,480 --> 00:33:52,240
goes back to the last decision point and tries the other path,
675
00:33:52,240 --> 00:33:54,240
fully exhausting this entire path.
676
00:33:54,240 --> 00:33:56,720
And when it realizes that, OK, the goal is not here,
677
00:33:56,720 --> 00:33:58,560
then it turns its attention to this path.
678
00:33:58,560 --> 00:34:00,400
It goes as deep as possible.
679
00:34:00,400 --> 00:34:04,000
When it hits a dead end, it backs up and then tries this other path,
680
00:34:04,000 --> 00:34:07,120
keeps going as deep as possible down one particular path.
681
00:34:07,120 --> 00:34:10,480
And when it realizes that that's a dead end, then it'll back up,
682
00:34:10,480 --> 00:34:13,200
and then ultimately find its way to the goal.
683
00:34:13,200 --> 00:34:16,800
And maybe you got lucky, and maybe you made a different choice earlier on.
684
00:34:16,800 --> 00:34:19,680
But ultimately, this is how depth first search is going to work.
685
00:34:19,680 --> 00:34:22,000
It's going to keep following until it hits a dead end.
686
00:34:22,000 --> 00:34:26,160
And when it hits a dead end, it backs up and looks for a different solution.
687
00:34:26,160 --> 00:34:28,160
And so one thing you might reasonably ask is,
688
00:34:28,160 --> 00:34:30,160
is this algorithm always going to work?
689
00:34:30,160 --> 00:34:33,440
Will it always actually find a way to get from the initial state
690
00:34:33,440 --> 00:34:34,480
to the goal?
691
00:34:34,480 --> 00:34:37,600
And it turns out that as long as our maze is finite,
692
00:34:37,600 --> 00:34:40,720
as long as there are only finitely many spaces where we can travel,
693
00:34:40,720 --> 00:34:44,000
then, yes, depth first search is going to find a solution.
694
00:34:44,000 --> 00:34:46,480
Because eventually, it'll just explore everything.
695
00:34:46,480 --> 00:34:49,600
If the maze happens to be infinite and there's an infinite state space,
696
00:34:49,600 --> 00:34:51,840
which does exist in certain types of problems,
697
00:34:51,840 --> 00:34:53,440
then it's a slightly different story.
698
00:34:53,440 --> 00:34:56,160
But as long as our maze has finitely many squares,
699
00:34:56,160 --> 00:34:58,400
we're going to find a solution.
700
00:34:58,400 --> 00:35:00,800
The next question, though, that we want to ask is,
701
00:35:00,800 --> 00:35:02,400
is it going to be a good solution?
702
00:35:02,400 --> 00:35:05,200
Is it the optimal solution that we can find?
703
00:35:05,200 --> 00:35:07,680
And the answer there is not necessarily.
704
00:35:07,680 --> 00:35:09,520
And let's take a look at an example of that.
705
00:35:09,520 --> 00:35:14,320
In this maze, for example, we're again trying to find our way from A to B.
706
00:35:14,320 --> 00:35:16,960
And you notice here there are multiple possible solutions.
707
00:35:16,960 --> 00:35:21,680
We could go this way or we could go up in order to make our way from A to B.
708
00:35:21,680 --> 00:35:25,600
Now, if we're lucky, depth first search will choose this way and get to B.
709
00:35:25,600 --> 00:35:28,080
But there's no reason necessarily why depth first search
710
00:35:28,080 --> 00:35:30,880
would choose between going up or going to the right.
711
00:35:30,880 --> 00:35:33,680
It's sort of an arbitrary decision point because both
712
00:35:33,680 --> 00:35:35,840
are going to be added to the frontier.
713
00:35:35,840 --> 00:35:38,720
And ultimately, if we get unlucky, depth first search
714
00:35:38,720 --> 00:35:42,000
might choose to explore this path first because it's just a random choice
715
00:35:42,000 --> 00:35:42,880
at this point.
716
00:35:42,880 --> 00:35:45,280
It'll explore, explore, explore.
717
00:35:45,280 --> 00:35:48,560
And it'll eventually find the goal, this particular path,
718
00:35:48,560 --> 00:35:50,480
when in actuality there was a better path.
719
00:35:50,480 --> 00:35:54,400
There was a more optimal solution that used fewer steps,
720
00:35:54,400 --> 00:35:58,000
assuming we're measuring the cost of a solution based on the number of steps
721
00:35:58,000 --> 00:35:59,280
that we need to take.
722
00:35:59,280 --> 00:36:01,520
So depth first search, if we're unlucky,
723
00:36:01,520 --> 00:36:05,360
might end up not finding the best solution when a better solution is
724
00:36:05,360 --> 00:36:07,200
available.
725
00:36:07,200 --> 00:36:09,680
So that's DFS, depth first search.
726
00:36:09,680 --> 00:36:12,720
How does BFS, or breadth first search, compare?
727
00:36:12,720 --> 00:36:14,960
How would it work in this particular situation?
728
00:36:14,960 --> 00:36:17,920
Well, the algorithm is going to look very different visually
729
00:36:17,920 --> 00:36:20,160
in terms of how BFS explores.
730
00:36:20,160 --> 00:36:24,640
Because BFS looks at shallower nodes first, the idea is going to be,
731
00:36:24,640 --> 00:36:29,600
BFS will first look at all of the nodes that are one away from the initial state.
732
00:36:29,600 --> 00:36:31,680
Look here and look here, for example, just
733
00:36:31,680 --> 00:36:36,000
at the two nodes that are immediately next to this initial state.
734
00:36:36,000 --> 00:36:37,840
Then it'll explore nodes that are two away,
735
00:36:37,840 --> 00:36:40,480
looking at this state and that state, for example.
736
00:36:40,480 --> 00:36:43,520
Then it'll explore nodes that are three away, this state and that state.
737
00:36:43,520 --> 00:36:47,600
Whereas depth first search just picked one path and kept following it,
738
00:36:47,600 --> 00:36:49,440
breadth first search, on the other hand,
739
00:36:49,440 --> 00:36:52,960
is exploring all of the possible paths kind of at the same time,
740
00:36:52,960 --> 00:36:56,160
bouncing back and forth between them,
741
00:36:56,160 --> 00:36:58,720
looking deeper and deeper at each one, but making sure
742
00:36:58,720 --> 00:37:01,360
to explore the shallower ones or the ones that
743
00:37:01,360 --> 00:37:04,080
are closer to the initial state earlier.
744
00:37:04,080 --> 00:37:07,200
So we'll keep following this pattern, looking at things that are four away,
745
00:37:07,200 --> 00:37:10,720
looking at things that are five away, looking at things that are six away,
746
00:37:10,720 --> 00:37:14,160
until eventually we make our way to the goal.
747
00:37:14,160 --> 00:37:17,520
And in this case, it's true we had to explore some states that ultimately
748
00:37:17,520 --> 00:37:20,960
didn't lead us anywhere, but the path that we found to the goal
749
00:37:20,960 --> 00:37:22,200
was the optimal path.
750
00:37:22,200 --> 00:37:25,880
This is the shortest way that we could get to the goal.
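The guarantee being described here can be seen in a small sketch (this is illustrative code, not the lecture's; the graph and names are made up): because breadth-first search uses a first-in-first-out queue, paths come off the frontier in order of length, so the first path that reaches the goal is a shortest one.

```python
from collections import deque

def bfs_path(graph, start, goal):
    """Breadth-first search over an adjacency-list graph.

    The frontier is a FIFO queue of whole paths, so paths are explored
    shortest-first, and the first path reaching the goal is optimal.
    """
    frontier = deque([[start]])   # each entry is a full path from start
    explored = set()
    while frontier:
        path = frontier.popleft()             # first in, first out
        node = path[-1]
        if node == goal:
            return path
        if node in explored:
            continue
        explored.add(node)
        for neighbor in graph.get(node, []):
            frontier.append(path + [neighbor])
    return None                               # no path exists

# Two routes from A to B: a short one via C, a longer one via D and E.
graph = {"A": ["D", "C"], "C": ["B"], "D": ["E"], "E": ["B"]}
print(bfs_path(graph, "A", "B"))  # ['A', 'C', 'B'] -- the shorter route
```

Depth-first search on the same graph could just as easily commit to the A-D-E-B route first, which is exactly the non-optimality discussed above.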
751
00:37:25,880 --> 00:37:28,880
And so what might happen then in a larger maze?
752
00:37:28,880 --> 00:37:30,840
Well, let's take a look at something like this
753
00:37:30,840 --> 00:37:32,880
and how breadth first search is going to behave.
754
00:37:32,880 --> 00:37:35,800
Well, breadth first search, again, will just keep following the states
755
00:37:35,800 --> 00:37:37,280
until it reaches a decision point.
756
00:37:37,280 --> 00:37:39,480
It could go either left or right.
757
00:37:39,480 --> 00:37:44,600
And while DFS just picked one and kept following that until it hit a dead end,
758
00:37:44,600 --> 00:37:47,580
BFS, on the other hand, will explore both.
759
00:37:47,580 --> 00:37:50,000
It'll say look at this node, then this node,
760
00:37:50,000 --> 00:37:52,080
and it'll look at this node, then that node.
761
00:37:52,080 --> 00:37:53,440
So on and so forth.
762
00:37:53,440 --> 00:37:57,280
And when it hits a decision point here, rather than pick one, left or
763
00:37:57,280 --> 00:38:01,040
right, and explore that path, it'll again explore both,
764
00:38:01,040 --> 00:38:03,280
alternating between them, going deeper and deeper.
765
00:38:03,280 --> 00:38:07,600
We'll explore here, and then maybe here and here, and then keep going.
766
00:38:07,600 --> 00:38:10,800
Explore here and slowly make our way, you can visually
767
00:38:10,800 --> 00:38:12,840
see, further and further out.
768
00:38:12,840 --> 00:38:16,520
Once we get to this decision point, we'll explore both up and down
769
00:38:16,520 --> 00:38:21,600
until ultimately we make our way to the goal.
770
00:38:21,600 --> 00:38:24,240
And what you'll notice is, yes, breadth first search
771
00:38:24,240 --> 00:38:28,640
did find our way from A to B by following this particular path,
772
00:38:28,640 --> 00:38:32,320
but it needed to explore a lot of states in order to do so.
773
00:38:32,320 --> 00:38:35,440
And so we see some trade offs here between DFS and BFS,
774
00:38:35,440 --> 00:38:39,440
that in DFS, there may be some cases where there are memory savings
775
00:38:39,440 --> 00:38:43,760
as compared to a breadth first approach, where breadth first search in this case
776
00:38:43,760 --> 00:38:45,240
had to explore a lot of states.
777
00:38:45,240 --> 00:38:48,480
But maybe that won't always be the case.
778
00:38:48,480 --> 00:38:51,280
So now let's actually turn our attention to some code
779
00:38:51,280 --> 00:38:52,940
and look at the code that we could actually
780
00:38:52,940 --> 00:38:56,400
write in order to implement something like depth first search or breadth
781
00:38:56,400 --> 00:39:01,000
first search in the context of solving a maze, for example.
782
00:39:01,000 --> 00:39:03,360
So I'll go ahead and go into my terminal.
783
00:39:03,360 --> 00:39:07,280
And what I have here inside of maze.py is an implementation
784
00:39:07,280 --> 00:39:09,640
of this same idea of maze solving.
785
00:39:09,640 --> 00:39:12,680
I've defined a class called node that in this case
786
00:39:12,680 --> 00:39:15,520
is keeping track of the state, the parent, in other words,
787
00:39:15,520 --> 00:39:17,960
the state before the state, and the action.
788
00:39:17,960 --> 00:39:20,120
In this case, we're not keeping track of the path cost
789
00:39:20,120 --> 00:39:22,800
because we can calculate the cost of the path at the end
790
00:39:22,800 --> 00:39:26,920
after we found our way from the initial state to the goal.
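The node class as just described is small; here's a sketch consistent with that description (the attribute names are my reading of the lecture's code, not guaranteed verbatim):

```python
class Node:
    """One node in the search tree: a state, the parent node that
    produced it, and the action taken to get here. No path cost is
    stored; it can be recovered by walking parents back to the start."""

    def __init__(self, state, parent, action):
        self.state = state
        self.parent = parent
        self.action = action

# Example: the start node has no parent and no action.
start = Node(state=(0, 0), parent=None, action=None)
child = Node(state=(0, 1), parent=start, action="right")
print(child.parent.state)  # (0, 0)
```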
791
00:39:26,920 --> 00:39:31,560
In addition to this, I've defined a class called a stack frontier.
792
00:39:31,560 --> 00:39:34,800
And if you're unfamiliar with classes, a class is a way for me
793
00:39:34,800 --> 00:39:37,960
to define a way to generate objects in Python.
794
00:39:37,960 --> 00:39:42,080
It refers to an idea of object oriented programming, where the idea here
795
00:39:42,080 --> 00:39:44,760
is that I would like to create an object that is
796
00:39:44,760 --> 00:39:46,960
able to store all of my frontier data.
797
00:39:46,960 --> 00:39:49,040
And I would like to have functions, otherwise known
798
00:39:49,040 --> 00:39:53,400
as methods, on that object that I can use to manipulate the object.
799
00:39:53,400 --> 00:39:57,120
And so what's going on here, if you're unfamiliar with the syntax,
800
00:39:57,120 --> 00:40:00,680
is I have a function that initially creates a frontier that I'm
801
00:40:00,680 --> 00:40:02,400
going to represent using a list.
802
00:40:02,400 --> 00:40:05,800
And initially, my frontier is represented by the empty list.
803
00:40:05,800 --> 00:40:08,840
There's nothing in my frontier to begin with.
804
00:40:08,840 --> 00:40:12,000
I have an add function that adds something to the frontier
805
00:40:12,000 --> 00:40:15,240
by appending it to the end of the list.
806
00:40:15,240 --> 00:40:17,880
I have a function that checks if the frontier contains
807
00:40:17,880 --> 00:40:19,400
a particular state.
808
00:40:19,400 --> 00:40:22,240
I have an empty function that checks if the frontier is empty.
809
00:40:22,240 --> 00:40:26,200
If the frontier is empty, that just means the length of the frontier is 0.
810
00:40:26,200 --> 00:40:29,560
And then I have a function for removing something from the frontier.
811
00:40:29,560 --> 00:40:32,240
I can't remove something from the frontier if the frontier is empty,
812
00:40:32,240 --> 00:40:33,800
so I check for that first.
813
00:40:33,800 --> 00:40:36,720
But otherwise, if the frontier isn't empty,
814
00:40:36,720 --> 00:40:41,680
recall that I'm implementing this frontier as a stack, a last in first
815
00:40:41,680 --> 00:40:45,640
out data structure, which means the last thing I add to the frontier,
816
00:40:45,640 --> 00:40:48,600
in other words, the last thing in the list, is the item
817
00:40:48,600 --> 00:40:51,880
that I should remove from this frontier.
818
00:40:51,880 --> 00:40:56,200
So what you'll see here is I remove the last item of the list.
819
00:40:56,200 --> 00:40:59,400
And if you index into a Python list with negative 1,
820
00:40:59,400 --> 00:41:01,080
that gets you the last item in the list.
821
00:41:01,080 --> 00:41:04,600
Since 0 is the first item, negative 1 kind of wraps around
822
00:41:04,600 --> 00:41:07,400
and gets you to the last item in the list.
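A quick illustration of that indexing behavior:

```python
items = ["first", "middle", "last"]
print(items[0])    # "first" -- index 0 is the first item
print(items[-1])   # "last"  -- -1 wraps around to the last item
print(items[:-1])  # ["first", "middle"] -- everything except the last
```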
823
00:41:07,400 --> 00:41:09,320
So we take that last item.
824
00:41:09,320 --> 00:41:10,360
We call it node.
825
00:41:10,360 --> 00:41:12,640
We update the frontier here on line 28 to say,
826
00:41:12,640 --> 00:41:16,040
go ahead and remove that node from the frontier.
827
00:41:16,040 --> 00:41:18,720
And then we return the node as a result.
828
00:41:18,720 --> 00:41:23,080
So this class here effectively implements the idea of a frontier.
829
00:41:23,080 --> 00:41:25,400
It gives me a way to add something to a frontier
830
00:41:25,400 --> 00:41:29,440
and a way to remove something from the frontier as a stack.
831
00:41:29,440 --> 00:41:31,960
I've also, just for good measure, implemented
832
00:41:31,960 --> 00:41:36,000
an alternative version of the same thing called a queue frontier, which
833
00:41:36,000 --> 00:41:39,200
in parentheses you'll see here, it inherits from a stack frontier,
834
00:41:39,200 --> 00:41:42,680
meaning it's going to do all the same things that the stack frontier did,
835
00:41:42,680 --> 00:41:45,560
except the way we remove a node from the frontier
836
00:41:45,560 --> 00:41:47,000
is going to be slightly different.
837
00:41:47,000 --> 00:41:50,480
Instead of removing from the end of the list the way we would in a stack,
838
00:41:50,480 --> 00:41:53,160
we're instead going to remove from the beginning of the list.
839
00:41:53,160 --> 00:41:58,080
self.frontier[0] will get me the first node in the frontier, the first one
840
00:41:58,080 --> 00:42:00,440
that was added, and that is going to be the one
841
00:42:00,440 --> 00:42:03,440
that we return in the case of a queue.
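Putting the two frontier classes together in one sketch (close to what's described here, though not guaranteed to match the lecture's file line for line):

```python
class StackFrontier:
    """A frontier backed by a Python list, removing from the end:
    last in, first out, which is what depth-first search uses."""

    def __init__(self):
        self.frontier = []          # empty to begin with

    def add(self, node):
        self.frontier.append(node)

    def contains_state(self, state):
        return any(node.state == state for node in self.frontier)

    def empty(self):
        return len(self.frontier) == 0

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        node = self.frontier[-1]    # -1 wraps around to the last item
        self.frontier = self.frontier[:-1]
        return node


class QueueFrontier(StackFrontier):
    """Inherits everything from StackFrontier, but removes from the
    beginning of the list: first in, first out, for breadth-first search."""

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        node = self.frontier[0]     # the first node that was added
        self.frontier = self.frontier[1:]
        return node


# A minimal stand-in node with just a state attribute, for demonstration.
class _Node:
    def __init__(self, state):
        self.state = state

stack, queue = StackFrontier(), QueueFrontier()
for s in (1, 2, 3):
    stack.add(_Node(s))
    queue.add(_Node(s))
print(stack.remove().state)  # 3 -- last in, first out
print(queue.remove().state)  # 1 -- first in, first out
```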
842
00:42:03,440 --> 00:42:06,360
Then under here, I have a definition of a class called maze.
843
00:42:06,360 --> 00:42:11,080
This is going to handle the process of taking a maze-like text
844
00:42:11,080 --> 00:42:13,360
file, and figuring out how to solve it.
845
00:42:13,360 --> 00:42:16,960
So it will take as input a text file that looks something like this,
846
00:42:16,960 --> 00:42:20,720
for example, where we see hash marks that are here representing walls,
847
00:42:20,720 --> 00:42:23,880
and I have the character A representing the starting position
848
00:42:23,880 --> 00:42:27,840
and the character B representing the ending position.
849
00:42:27,840 --> 00:42:30,840
And you can take a look at the code for parsing this text file right now.
850
00:42:30,840 --> 00:42:32,360
That's the less interesting part.
851
00:42:32,360 --> 00:42:35,440
The more interesting part is this solve function here,
852
00:42:35,440 --> 00:42:37,440
which is going to figure out
853
00:42:37,440 --> 00:42:41,160
how to actually get from point A to point B.
854
00:42:41,160 --> 00:42:44,160
And here we see an implementation of the exact same idea
855
00:42:44,160 --> 00:42:45,800
we saw from a moment ago.
856
00:42:45,800 --> 00:42:48,240
We're going to keep track of how many states we've explored,
857
00:42:48,240 --> 00:42:50,440
just so we can report that data later.
858
00:42:50,440 --> 00:42:55,680
But I start with a node that represents just the start state.
859
00:42:55,680 --> 00:43:00,000
And I start with a frontier that, in this case, is a stack frontier.
860
00:43:00,000 --> 00:43:02,000
And given that I'm treating my frontier as a stack,
861
00:43:02,000 --> 00:43:06,160
you might imagine that the algorithm I'm using here is now depth-first search,
862
00:43:06,160 --> 00:43:11,120
because depth-first search, or DFS, uses a stack as its data structure.
863
00:43:11,120 --> 00:43:16,320
And initially, this frontier is just going to contain the start state.
864
00:43:16,320 --> 00:43:19,280
We initialize an explored set that initially is empty.
865
00:43:19,280 --> 00:43:21,320
There's nothing we've explored so far.
866
00:43:21,320 --> 00:43:25,920
And now here's our loop, that notion of repeating something again and again.
867
00:43:25,920 --> 00:43:29,560
First, we check if the frontier is empty by calling that empty function
868
00:43:29,560 --> 00:43:31,800
that we saw the implementation of a moment ago.
869
00:43:31,800 --> 00:43:34,080
And if the frontier is indeed empty, we'll
870
00:43:34,080 --> 00:43:37,040
go ahead and raise an exception, or a Python error, to say,
871
00:43:37,040 --> 00:43:41,040
sorry, there is no solution to this problem.
872
00:43:41,040 --> 00:43:44,520
Otherwise, we'll go ahead and remove a node from the frontier
873
00:43:44,520 --> 00:43:48,920
by calling frontier.remove, and update the number of states we've explored,
874
00:43:48,920 --> 00:43:51,400
because now we've explored one additional state.
875
00:43:51,400 --> 00:43:55,240
So we say self.num_explored plus equals 1, adding 1
876
00:43:55,240 --> 00:43:57,800
to the number of states we've explored.
877
00:43:57,800 --> 00:44:00,080
Once we remove a node from the frontier,
878
00:44:00,080 --> 00:44:02,360
recall that the next step is to see whether or not
879
00:44:02,360 --> 00:44:04,320
it's the goal, the goal test.
880
00:44:04,320 --> 00:44:06,840
And in the case of the maze, the goal is pretty easy.
881
00:44:06,840 --> 00:44:11,080
I check to see whether the state of the node is equal to the goal.
882
00:44:11,080 --> 00:44:13,080
Initially, when I set up the maze, I set up
883
00:44:13,080 --> 00:44:15,760
this value called goal, which is a property of the maze,
884
00:44:15,760 --> 00:44:19,280
so I can just check to see if the node is actually the goal.
885
00:44:19,280 --> 00:44:22,040
And if it is the goal, then what I want to do
886
00:44:22,040 --> 00:44:26,400
is backtrack my way towards figuring out what actions I took in order
887
00:44:26,400 --> 00:44:28,360
to get to this goal.
888
00:44:28,360 --> 00:44:29,440
And how do I do that?
889
00:44:29,440 --> 00:44:33,400
Well, recall that every node stores its parent, the node that came before it
890
00:44:33,400 --> 00:44:37,000
that we used to get to this node, and also the action used in order to get
891
00:44:37,000 --> 00:44:37,680
there.
892
00:44:37,680 --> 00:44:40,800
So I can create this loop where I'm constantly just looking
893
00:44:40,800 --> 00:44:44,480
at the parent of every node and keeping track for all of the parents
894
00:44:44,480 --> 00:44:47,920
what action I took to get from the parent to this current node.
895
00:44:47,920 --> 00:44:50,280
So this loop is going to keep repeating this process
896
00:44:50,280 --> 00:44:52,400
of looking through all of the parent nodes
897
00:44:52,400 --> 00:44:54,680
until we get back to the initial state, which
898
00:44:54,680 --> 00:44:59,080
has no parent, where node.parent is going to be equal to none.
899
00:44:59,080 --> 00:45:01,960
As I do so, I'm going to be building up the list of all of the actions
900
00:45:01,960 --> 00:45:05,600
that I'm following and the list of all the cells that are part of the solution.
901
00:45:05,600 --> 00:45:08,240
But I'll reverse them, because when I build it up,
902
00:45:08,240 --> 00:45:10,960
I'm going from the goal back to the initial state,
903
00:45:10,960 --> 00:45:14,040
building the sequence of actions from the goal to the initial state,
904
00:45:14,040 --> 00:45:16,920
and I want to reverse them in order to get the sequence of actions
905
00:45:16,920 --> 00:45:19,640
from the initial state to the goal.
906
00:45:19,640 --> 00:45:23,400
And that is ultimately going to be the solution.
907
00:45:23,400 --> 00:45:27,320
So all of that happens if the current state is equal to the goal.
908
00:45:27,320 --> 00:45:29,320
And otherwise, if it's not the goal, well,
909
00:45:29,320 --> 00:45:32,920
then I'll go ahead and add this state to the explored set to say,
910
00:45:32,920 --> 00:45:34,280
I've explored this state now.
911
00:45:34,280 --> 00:45:37,520
No need to go back to it if I come across it in the future.
912
00:45:37,520 --> 00:45:42,840
And then this logic here implements the idea of adding neighbors to the frontier.
913
00:45:42,840 --> 00:45:44,840
I'm saying, look at all of my neighbors, and I
914
00:45:44,840 --> 00:45:47,560
implemented a function called neighbors that you can take a look at.
915
00:45:47,560 --> 00:45:49,720
And for each of those neighbors, I'm going to check,
916
00:45:49,720 --> 00:45:51,880
is the state already in the frontier?
917
00:45:51,880 --> 00:45:54,440
Is the state already in the explored set?
918
00:45:54,440 --> 00:45:58,600
And if it's not in either of those, then I'll go ahead and add this new child
919
00:45:58,600 --> 00:46:01,320
node, this new node, to the frontier.
920
00:46:01,320 --> 00:46:03,040
So there's a fair amount of syntax here,
921
00:46:03,040 --> 00:46:05,960
but the key here is not to understand all the nuances of the syntax.
922
00:46:05,960 --> 00:46:08,760
So feel free to take a closer look at this file on your own
923
00:46:08,760 --> 00:46:10,640
to get a sense for how it is working.
924
00:46:10,640 --> 00:46:13,120
But the key is to see how this is an implementation
925
00:46:13,120 --> 00:46:16,960
of the same pseudocode, the same idea that we were describing a moment ago
926
00:46:16,960 --> 00:46:19,880
on the screen when we were looking at the steps
927
00:46:19,880 --> 00:46:23,640
that we might follow in order to solve this kind of search problem.
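The whole loop just described can be condensed into a sketch (a simplification, not maze.py verbatim: a deque stands in for the frontier classes, nodes are plain tuples, and the frontier-membership check is folded into the explored-set check):

```python
from collections import deque

def solve(start, goal, neighbors, use_queue=True):
    """Generic search loop: use_queue=True removes from the front (BFS),
    False removes from the back (DFS). neighbors(state) yields
    (action, next_state) pairs. Returns the list of actions taken."""
    frontier = deque([(start, None, None)])   # (state, parent_node, action)
    explored = set()
    while frontier:
        node = frontier.popleft() if use_queue else frontier.pop()
        state = node[0]
        if state == goal:
            # Backtrack through parents, then reverse: start -> goal.
            actions = []
            while node[1] is not None:
                actions.append(node[2])
                node = node[1]
            actions.reverse()
            return actions
        if state in explored:
            continue
        explored.add(state)
        for action, next_state in neighbors(state):
            if next_state not in explored:
                frontier.append((next_state, node, action))
    raise Exception("no solution")

# Tiny example: a short route (right, right) and a longer one (up, up, right).
edges = {"A": [("right", "C"), ("up", "D")],
         "C": [("right", "B")],
         "D": [("up", "E")],
         "E": [("right", "B")]}
print(solve("A", "B", lambda s: edges.get(s, []), use_queue=True))
```

Flipping use_queue to False turns the same loop into depth-first search, which on this example commits to the longer route first; that one-flag swap mirrors swapping StackFrontier for QueueFrontier in the lecture's code.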
928
00:46:23,640 --> 00:46:25,560
So now let's actually see this in action.
929
00:46:25,560 --> 00:46:31,560
I'll go ahead and run maze.py on maze1.txt, for example.
930
00:46:31,560 --> 00:46:34,200
And what we'll see is here, we have a printout
931
00:46:34,200 --> 00:46:36,400
of what the maze initially looked like.
932
00:46:36,400 --> 00:46:39,040
And then here down below is after we've solved it.
933
00:46:39,040 --> 00:46:41,480
We had to explore 11 states in order to do it,
934
00:46:41,480 --> 00:46:45,040
and we found a path from A to B. And in this program,
935
00:46:45,040 --> 00:46:48,160
I just happened to generate a graphical representation of this as well.
936
00:46:48,160 --> 00:46:50,440
So I can open up maze.png, which is generated
937
00:46:50,440 --> 00:46:54,840
by this program, that shows you where in the darker color here are the walls,
938
00:46:54,840 --> 00:46:56,880
red is the initial state, green is the goal,
939
00:46:56,880 --> 00:46:58,960
and yellow is the path that was followed.
940
00:46:58,960 --> 00:47:03,240
We found a path from the initial state to the goal.
941
00:47:03,240 --> 00:47:06,080
But now let's take a look at a more sophisticated maze
942
00:47:06,080 --> 00:47:08,160
to see what might happen instead.
943
00:47:08,160 --> 00:47:10,880
Let's look now at maze2.txt.
944
00:47:10,880 --> 00:47:11,760
We're now here.
945
00:47:11,760 --> 00:47:13,040
We have a much larger maze.
946
00:47:13,040 --> 00:47:16,320
Again, we're trying to find our way from point A to point B.
947
00:47:16,320 --> 00:47:19,560
But now you'll imagine that depth-first search might not be so lucky.
948
00:47:19,560 --> 00:47:22,040
It might not get the goal on the first try.
949
00:47:22,040 --> 00:47:26,000
It might have to follow one path, then backtrack and explore something else
950
00:47:26,000 --> 00:47:28,120
a little bit later.
951
00:47:28,120 --> 00:47:29,240
So let's try this.
952
00:47:29,240 --> 00:47:34,960
We'll run python maze.py on maze2.txt, this time trying this other maze.
953
00:47:34,960 --> 00:47:38,160
And now, depth-first search is able to find a solution.
954
00:47:38,160 --> 00:47:42,080
Here, as indicated by the stars, is a way to get from A to B.
955
00:47:42,080 --> 00:47:45,480
And we can represent this visually by opening up this maze.
956
00:47:45,480 --> 00:47:48,040
Here's what that maze looks like, and highlighted in yellow
957
00:47:48,040 --> 00:47:52,320
is the path that was found from the initial state to the goal.
958
00:47:52,320 --> 00:47:57,360
But how many states did we have to explore before we found that path?
959
00:47:57,360 --> 00:47:59,440
Well, recall that in my program, I was keeping
960
00:47:59,440 --> 00:48:02,560
track of the number of states that we've explored so far.
961
00:48:02,560 --> 00:48:05,920
And so I can go back to the terminal and see that, all right,
962
00:48:05,920 --> 00:48:12,040
in order to solve this problem, we had to explore 399 different states.
963
00:48:12,040 --> 00:48:14,880
And in fact, if I make one small modification of the program
964
00:48:14,880 --> 00:48:17,960
and tell the program at the end when we output this image,
965
00:48:17,960 --> 00:48:21,200
I added an argument called show_explored.
966
00:48:21,200 --> 00:48:26,000
And if I set show_explored equal to True and rerun this program,
967
00:48:26,000 --> 00:48:30,680
python maze.py, running it on maze2, and then I open the maze, what you'll see
968
00:48:30,680 --> 00:48:33,520
here is highlighted in red are all of the states
969
00:48:33,520 --> 00:48:37,560
that had to be explored to get from the initial state to the goal.
970
00:48:37,560 --> 00:48:41,200
Depth-first search, or DFS, didn't find its way to the goal right away.
971
00:48:41,200 --> 00:48:44,200
It made a choice to first explore this direction.
972
00:48:44,200 --> 00:48:46,040
And when it explored this direction, it had
973
00:48:46,040 --> 00:48:49,040
to follow every conceivable path all the way to the very end,
974
00:48:49,040 --> 00:48:52,400
even this long and winding one, in order to realize that, you know what?
975
00:48:52,400 --> 00:48:53,480
That's a dead end.
976
00:48:53,480 --> 00:48:55,720
And instead, the program needed to backtrack.
977
00:48:55,720 --> 00:48:58,680
After going this direction, it must have gone this direction.
978
00:48:58,680 --> 00:49:01,440
It got lucky here by just not choosing this path,
979
00:49:01,440 --> 00:49:05,360
but it got unlucky here, exploring this direction, exploring a bunch of states
980
00:49:05,360 --> 00:49:07,680
it didn't need to, and then likewise exploring
981
00:49:07,680 --> 00:49:10,000
all of this top part of the graph when it probably
982
00:49:10,000 --> 00:49:12,240
didn't need to do that either.
983
00:49:12,240 --> 00:49:16,720
So all in all, depth-first search here was really not performing optimally,
984
00:49:16,720 --> 00:49:19,000
probably exploring more states than it needed to.
985
00:49:19,000 --> 00:49:22,640
It finds an optimal solution, the best path to the goal,
986
00:49:22,640 --> 00:49:25,520
but the number of states needed to explore in order to do so,
987
00:49:25,520 --> 00:49:29,080
the number of steps I had to take, that was much higher.
988
00:49:29,080 --> 00:49:30,080
So let's compare.
989
00:49:30,080 --> 00:49:35,160
How would breadth-first search, or BFS, do on this exact same maze instead?
990
00:49:35,160 --> 00:49:37,640
And in order to do so, it's a very easy change.
991
00:49:37,640 --> 00:49:42,560
The algorithm for DFS and BFS is identical with the exception
992
00:49:42,560 --> 00:49:47,000
of what data structure we use to represent the frontier,
993
00:49:47,000 --> 00:49:51,560
that in DFS, I used a stack frontier, last in, first out,
994
00:49:51,560 --> 00:49:57,320
whereas in BFS, I'm going to use a queue frontier, first in, first out,
995
00:49:57,320 --> 00:50:00,640
where the first thing I add to the frontier is the first thing that I
996
00:50:00,640 --> 00:50:01,600
remove.
997
00:50:01,600 --> 00:50:06,680
So I'll go back to the terminal, rerun this program on the same maze,
998
00:50:06,680 --> 00:50:08,800
and now you'll see that the number of states
999
00:50:08,800 --> 00:50:13,200
we had to explore was only 77 as compared to almost 400
1000
00:50:13,200 --> 00:50:15,040
when we used depth-first search.
1001
00:50:15,040 --> 00:50:16,360
And we can see exactly why.
1002
00:50:16,360 --> 00:50:21,000
We can see what happened if we open up maze.png now and take a look.
1003
00:50:21,000 --> 00:50:25,560
Again, yellow highlight is the solution that breadth-first search found,
1004
00:50:25,560 --> 00:50:29,040
which incidentally is the same solution that depth-first search found.
1005
00:50:29,040 --> 00:50:31,360
They're both finding the best solution.
1006
00:50:31,360 --> 00:50:33,640
But notice all the white unexplored cells.
1007
00:50:33,640 --> 00:50:37,000
There were far fewer states that needed to be explored in order
1008
00:50:37,000 --> 00:50:41,000
to make our way to the goal because breadth-first search operates
1009
00:50:41,000 --> 00:50:42,000
a little more shallowly.
1010
00:50:42,000 --> 00:50:45,080
It's exploring things that are close to the initial state
1011
00:50:45,080 --> 00:50:48,160
without exploring things that are further away.
1012
00:50:48,160 --> 00:50:51,240
So if the goal is not too far away, then breadth-first search
1013
00:50:51,240 --> 00:50:53,960
can actually behave quite effectively on a maze that
1014
00:50:53,960 --> 00:50:56,760
looks a little something like this.
1015
00:50:56,760 --> 00:51:01,760
Now, in this case, both BFS and DFS ended up finding the same solution,
1016
00:51:01,760 --> 00:51:03,560
but that won't always be the case.
1017
00:51:03,560 --> 00:51:06,320
And in fact, let's take a look at one more example.
1018
00:51:06,320 --> 00:51:09,400
For instance, maze3.txt.
1019
00:51:09,400 --> 00:51:12,980
In maze3.txt, notice that here there are multiple ways
1020
00:51:12,980 --> 00:51:16,440
that you could get from A to B. It's a relatively small maze,
1021
00:51:16,440 --> 00:51:18,080
but let's look at what happens.
1022
00:51:18,080 --> 00:51:21,560
And I'll go ahead and turn off show_explored
1023
00:51:21,560 --> 00:51:24,320
so we just see the solution.
1024
00:51:24,320 --> 00:51:30,640
If I use BFS, breadth-first search, to solve maze3.txt,
1025
00:51:30,640 --> 00:51:33,840
well, then we find a solution, and if I open up the maze,
1026
00:51:33,840 --> 00:51:35,560
here is the solution that we found.
1027
00:51:35,560 --> 00:51:36,640
It is the optimal one.
1028
00:51:36,640 --> 00:51:39,720
With just four steps, we can get from the initial state
1029
00:51:39,720 --> 00:51:43,080
to what the goal happens to be.
1030
00:51:43,080 --> 00:51:47,920
But what happens if we tried to use depth-first search or DFS instead?
1031
00:51:47,920 --> 00:51:52,560
Well, again, I'll go back up to my queue frontier, where queue frontier means
1032
00:51:52,560 --> 00:51:57,320
that we're using breadth-first search, and I'll change it to a stack frontier,
1033
00:51:57,320 --> 00:52:00,880
which means that now we'll be using depth-first search.
1034
00:52:00,880 --> 00:52:06,520
I'll rerun python maze.py, and now you'll see that we find the solution,
1035
00:52:06,520 --> 00:52:09,000
but it is not the optimal solution.
1036
00:52:09,000 --> 00:52:11,760
This instead is what our algorithm finds,
1037
00:52:11,760 --> 00:52:14,160
and maybe depth-first search would have found the solution.
1038
00:52:14,160 --> 00:52:17,400
It's possible, but it's not guaranteed. If we just
1039
00:52:17,400 --> 00:52:21,320
happen to be unlucky and choose this state instead of that state,
1040
00:52:21,320 --> 00:52:24,000
then depth-first search might find a longer route
1041
00:52:24,000 --> 00:52:27,280
to get from the initial state to the goal.
1042
00:52:27,280 --> 00:52:30,320
So we do see some trade-offs here, where depth-first search might not
1043
00:52:30,320 --> 00:52:32,360
find the optimal solution.
1044
00:52:32,360 --> 00:52:35,120
So at that point, it seems like breadth-first search is pretty good.
1045
00:52:35,120 --> 00:52:38,960
Is that the best we can do, where it's going to find us the optimal solution,
1046
00:52:38,960 --> 00:52:41,360
and we don't have to worry about situations
1047
00:52:41,360 --> 00:52:44,560
where we might end up finding a longer path to the solution
1048
00:52:44,560 --> 00:52:46,440
than what actually exists?
1049
00:52:46,440 --> 00:52:49,320
But in a case like this, where the goal is far away from the initial state,
1050
00:52:49,320 --> 00:52:51,520
and we might have to take lots of steps in order
1051
00:52:51,520 --> 00:52:55,000
to get from the initial state to the goal, what ended up happening
1052
00:52:55,000 --> 00:52:59,560
is that this algorithm, BFS, ended up exploring basically the entire graph,
1053
00:52:59,560 --> 00:53:01,920
having to go through the entire maze in order
1054
00:53:01,920 --> 00:53:05,960
to find its way from the initial state to the goal state.
1055
00:53:05,960 --> 00:53:08,120
What we'd ultimately like is for our algorithm
1056
00:53:08,120 --> 00:53:10,800
to be a little bit more intelligent.
1057
00:53:10,800 --> 00:53:13,800
And now what would it mean for our algorithm to be a little bit more
1058
00:53:13,800 --> 00:53:16,000
intelligent in this case?
1059
00:53:16,000 --> 00:53:18,680
Well, let's look back to where breadth-first search might
1060
00:53:18,680 --> 00:53:20,440
have been able to make a different decision
1061
00:53:20,440 --> 00:53:23,880
and consider human intuition in this process as well.
1062
00:53:23,880 --> 00:53:26,280
What might a human do when solving this maze
1063
00:53:26,280 --> 00:53:30,640
that is different than what BFS ultimately chose to do?
1064
00:53:30,640 --> 00:53:35,160
Well, the very first decision point that BFS made was right here,
1065
00:53:35,160 --> 00:53:38,400
when it made five steps and ended up in a position
1066
00:53:38,400 --> 00:53:39,680
where it had a fork in the road.
1067
00:53:39,680 --> 00:53:41,880
It could either go left or it could go right.
1068
00:53:41,880 --> 00:53:44,000
In these initial couple steps, there was no choice.
1069
00:53:44,000 --> 00:53:46,840
There was only one action that could be taken from each of those states.
1070
00:53:46,840 --> 00:53:49,200
And so the search algorithm did the only thing
1071
00:53:49,200 --> 00:53:53,000
that any search algorithm could do, which is keep following one state
1072
00:53:53,000 --> 00:53:54,560
after the next.
1073
00:53:54,560 --> 00:53:57,840
But this decision point is where things get a little bit interesting.
1074
00:53:57,840 --> 00:54:01,000
Depth-first search, that very first search algorithm we looked at,
1075
00:54:01,000 --> 00:54:04,880
chose to say, let's pick one path and exhaust that path.
1076
00:54:04,880 --> 00:54:07,520
See if anything that way has the goal.
1077
00:54:07,520 --> 00:54:09,920
And if not, then let's try the other way.
1078
00:54:09,920 --> 00:54:12,720
Breadth-first search took the alternative approach of saying,
1079
00:54:12,720 --> 00:54:16,480
you know what, let's explore things that are shallow, close to us first.
1080
00:54:16,480 --> 00:54:20,080
Look left and right, then back left and back right, so on and so forth,
1081
00:54:20,080 --> 00:54:24,800
alternating between our options in the hopes of finding something nearby.
1082
00:54:24,800 --> 00:54:27,520
But ultimately, what might a human do if confronted
1083
00:54:27,520 --> 00:54:30,400
with a situation like this of go left or go right?
1084
00:54:30,400 --> 00:54:33,280
Well, a human might visually see that, all right, I'm
1085
00:54:33,280 --> 00:54:36,080
trying to get to state b, which is way up there,
1086
00:54:36,080 --> 00:54:39,600
and going right just feels like it's closer to the goal.
1087
00:54:39,600 --> 00:54:42,240
It feels like going right should be better than going left
1088
00:54:42,240 --> 00:54:45,440
because I'm making progress towards getting to that goal.
1089
00:54:45,440 --> 00:54:48,480
Now, of course, there are a couple of assumptions that I'm making here.
1090
00:54:48,480 --> 00:54:51,840
I'm making the assumption that we can represent this grid
1091
00:54:51,840 --> 00:54:55,040
as like a two-dimensional grid where I know the coordinates of everything.
1092
00:54:55,040 --> 00:55:00,120
I know that a is in coordinate 0, 0, and b is in some other coordinate pair,
1093
00:55:00,120 --> 00:55:01,640
and I know what coordinate I'm at now.
1094
00:55:01,640 --> 00:55:05,640
So I can calculate that, yeah, going this way, that is closer to the goal.
1095
00:55:05,640 --> 00:55:08,840
And that might be a reasonable assumption for some types of search problems,
1096
00:55:08,840 --> 00:55:10,240
but maybe not in others.
1097
00:55:10,240 --> 00:55:12,840
But for now, we'll go ahead and assume that,
1098
00:55:12,840 --> 00:55:15,480
that I know what my current coordinate pair is,
1099
00:55:15,480 --> 00:55:19,840
and I know the coordinate, x, y, of the goal that I'm trying to get to.
1100
00:55:19,840 --> 00:55:22,520
And in this situation, I'd like an algorithm
1101
00:55:22,520 --> 00:55:25,240
that is a little bit more intelligent, that somehow knows
1102
00:55:25,240 --> 00:55:28,320
that I should be making progress towards the goal,
1103
00:55:28,320 --> 00:55:31,880
and this is probably the way to do that because in a maze,
1104
00:55:31,880 --> 00:55:34,640
moving in the coordinate direction of the goal
1105
00:55:34,640 --> 00:55:37,920
is usually, though not always, a good thing.
1106
00:55:37,920 --> 00:55:40,640
And so here we draw a distinction between two different types
1107
00:55:40,640 --> 00:55:45,200
of search algorithms, uninformed search and informed search.
1108
00:55:45,200 --> 00:55:49,480
Uninformed search algorithms are algorithms like DFS and BFS,
1109
00:55:49,480 --> 00:55:51,440
the two algorithms that we just looked at, which
1110
00:55:51,440 --> 00:55:55,720
are search strategies that don't use any problem-specific knowledge
1111
00:55:55,720 --> 00:55:57,560
to be able to solve the problem.
1112
00:55:57,560 --> 00:56:01,280
DFS and BFS didn't really care about the structure of the maze
1113
00:56:01,280 --> 00:56:05,400
or anything about the nature of a maze in order to solve the problem.
1114
00:56:05,400 --> 00:56:08,720
They just look at the actions available and choose from those actions,
1115
00:56:08,720 --> 00:56:11,480
and it doesn't matter whether it's a maze or some other problem,
1116
00:56:11,480 --> 00:56:14,200
the solution or the way that it tries to solve the problem
1117
00:56:14,200 --> 00:56:17,520
is really fundamentally going to be the same.
1118
00:56:17,520 --> 00:56:19,920
What we're going to take a look at now is an improvement
1119
00:56:19,920 --> 00:56:21,520
upon uninformed search.
1120
00:56:21,520 --> 00:56:24,000
We're going to take a look at informed search.
1121
00:56:24,000 --> 00:56:26,440
Informed search algorithms are going to be search strategies
1122
00:56:26,440 --> 00:56:29,680
that use knowledge specific to the problem
1123
00:56:29,680 --> 00:56:31,960
to be able to better find a solution.
1124
00:56:31,960 --> 00:56:35,440
And in the case of a maze, this problem-specific knowledge
1125
00:56:35,440 --> 00:56:40,400
is something like if I'm in a square that is geographically closer to the goal,
1126
00:56:40,400 --> 00:56:45,880
that is better than being in a square that is geographically further away.
1127
00:56:45,880 --> 00:56:49,400
And this is something we can only know by thinking about this problem
1128
00:56:49,400 --> 00:56:54,000
and reasoning about what knowledge might be helpful for our AI agent
1129
00:56:54,000 --> 00:56:56,360
to know a little something about.
1130
00:56:56,360 --> 00:56:58,600
There are a number of different types of informed search.
1131
00:56:58,600 --> 00:57:01,600
Specifically, first, we're going to look at a particular type of search
1132
00:57:01,600 --> 00:57:05,720
algorithm called greedy best-first search.
1133
00:57:05,720 --> 00:57:08,880
Greedy best-first search, often abbreviated G-BFS,
1134
00:57:08,880 --> 00:57:13,160
is a search algorithm that instead of expanding the deepest node like DFS
1135
00:57:13,160 --> 00:57:16,680
or the shallowest node like BFS, this algorithm
1136
00:57:16,680 --> 00:57:22,160
is always going to expand the node that it thinks is closest to the goal.
1137
00:57:22,160 --> 00:57:24,600
Now, the search algorithm isn't going to know for sure
1138
00:57:24,600 --> 00:57:27,040
whether it is the closest thing to the goal.
1139
00:57:27,040 --> 00:57:29,720
Because if we knew what was closest to the goal all the time,
1140
00:57:29,720 --> 00:57:31,600
then we would already have a solution.
1141
00:57:31,600 --> 00:57:33,360
With knowledge of what is close to the goal,
1142
00:57:33,360 --> 00:57:36,760
we could just follow those steps in order to get from the initial position
1143
00:57:36,760 --> 00:57:37,960
to the solution.
1144
00:57:37,960 --> 00:57:39,600
But if we don't know the solution, meaning
1145
00:57:39,600 --> 00:57:42,560
we don't know exactly what's closest to the goal,
1146
00:57:42,560 --> 00:57:46,200
instead we can use an estimate of what's closest to the goal,
1147
00:57:46,200 --> 00:57:50,680
otherwise known as a heuristic, just some way of estimating whether or not
1148
00:57:50,680 --> 00:57:51,960
we're close to the goal.
1149
00:57:51,960 --> 00:57:54,640
And we'll do so using a heuristic function conventionally
1150
00:57:54,640 --> 00:57:58,520
called h of n that takes a state as input and returns
1151
00:57:58,520 --> 00:58:03,000
our estimate of how close we are to the goal.
1152
00:58:03,000 --> 00:58:05,160
So what might this heuristic function actually
1153
00:58:05,160 --> 00:58:08,240
look like in the case of a maze solving algorithm?
1154
00:58:08,240 --> 00:58:11,600
When we're trying to solve a maze, what does the heuristic look like?
1155
00:58:11,600 --> 00:58:14,160
Well, the heuristic needs to answer a question
1156
00:58:14,160 --> 00:58:17,800
between these two cells, C and D, which one is better?
1157
00:58:17,800 --> 00:58:22,280
Which one would I rather be in if I'm trying to find my way to the goal?
1158
00:58:22,280 --> 00:58:24,440
Well, any human could probably look at this and tell you,
1159
00:58:24,440 --> 00:58:26,400
you know what, D looks like it's better.
1160
00:58:26,400 --> 00:58:29,680
Even if the maze is convoluted and you haven't thought about all the walls,
1161
00:58:29,680 --> 00:58:31,520
D is probably better.
1162
00:58:31,520 --> 00:58:32,760
And why is D better?
1163
00:58:32,760 --> 00:58:35,480
Well, because if you ignore the wall, so let's just pretend
1164
00:58:35,480 --> 00:58:40,440
the walls don't exist for a moment and relax the problem, so to speak,
1165
00:58:40,440 --> 00:58:44,800
D, just in terms of coordinate pairs, is closer to this goal.
1166
00:58:44,800 --> 00:58:49,080
It's fewer steps that I would need to take to get to the goal as compared to C,
1167
00:58:49,080 --> 00:58:50,320
even if you ignore the walls.
1168
00:58:50,320 --> 00:58:55,320
If you just know the xy-coordinate of C and the xy-coordinate of the goal,
1169
00:58:55,320 --> 00:58:57,600
and likewise you know the xy-coordinate of D,
1170
00:58:57,600 --> 00:59:00,520
you can calculate that D, just geographically,
1171
00:59:00,520 --> 00:59:03,320
ignoring the walls, looks like it's better.
1172
00:59:03,320 --> 00:59:05,820
And so this is the heuristic function that we're going to use.
1173
00:59:05,820 --> 00:59:08,160
And it's something called the Manhattan distance,
1174
00:59:08,160 --> 00:59:12,440
one specific type of heuristic, where the heuristic is how many squares
1175
00:59:12,440 --> 00:59:15,080
vertically and horizontally I would need to travel,
1176
00:59:15,080 --> 00:59:18,320
so not allowing myself to go diagonally, just either up or right
1177
00:59:18,320 --> 00:59:19,480
or left or down.
1178
00:59:19,480 --> 00:59:24,160
How many steps do I need to take to get from each of these cells to the goal?
1179
00:59:24,160 --> 00:59:27,040
Well, as it turns out, D is much closer.
1180
00:59:27,040 --> 00:59:28,040
There are fewer steps.
1181
00:59:28,040 --> 00:59:31,920
It only needs to take six steps in order to get to that goal.
1182
00:59:31,920 --> 00:59:33,760
Again, here, ignoring the walls.
1183
00:59:33,760 --> 00:59:35,960
We've relaxed the problem a little bit.
1184
00:59:35,960 --> 00:59:38,600
We're just concerned with if you do the math
1185
00:59:38,600 --> 00:59:41,880
to subtract the x values from each other and the y values from each other,
1186
00:59:41,880 --> 00:59:44,200
what is our estimate of how far we are away?
1187
00:59:44,200 --> 00:59:49,140
We can estimate that D is closer to the goal than C is.
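As a quick sketch in Python (this is illustrative, not the course's own distribution code), that relaxed Manhattan distance estimate is only a couple of lines:

```python
def manhattan_distance(state, goal):
    """Relaxed estimate h(n): ignore the walls and count how many
    squares vertically plus horizontally separate us from the goal."""
    (x1, y1), (x2, y2) = state, goal
    return abs(x1 - x2) + abs(y1 - y2)
```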
1188
00:59:49,140 --> 00:59:51,000
And so now we have an approach.
1189
00:59:51,000 --> 00:59:54,040
We have a way of picking which node to remove from the frontier.
1190
00:59:54,040 --> 00:59:56,080
And at each stage in our algorithm, we're
1191
00:59:56,080 --> 00:59:57,760
going to remove a node from the frontier.
1192
00:59:57,760 --> 01:00:00,920
We're going to explore the node that has the smallest
1193
01:00:00,920 --> 01:00:04,120
value for this heuristic function, the node that has the smallest
1194
01:00:04,120 --> 01:00:06,720
Manhattan distance to the goal.
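Sketched in Python, the frontier becomes a priority queue ordered by h(n) alone. This outline is an assumption about one reasonable implementation, not the course's source code; `neighbors` and `h` are assumed to be supplied by the caller:

```python
import heapq

def greedy_best_first_search(start, goal, neighbors, h):
    """Always expand the frontier node with the smallest heuristic
    value h(n). `neighbors(state)` yields reachable states; `h(state)`
    estimates the remaining distance to the goal."""
    frontier = [(h(start), start, [start])]  # (h value, state, path so far)
    explored = set()
    while frontier:
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in explored:
            continue
        explored.add(state)
        for nxt in neighbors(state):
            if nxt not in explored:
                heapq.heappush(frontier, (h(nxt), nxt, path + [nxt]))
    return None  # no solution
```

Note that the path this returns depends entirely on how good the estimates are, which is exactly why the algorithm is not guaranteed to be optimal.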
1195
01:00:06,720 --> 01:00:08,560
And so what would this actually look like?
1196
01:00:08,560 --> 01:00:11,440
Well, let me first label this graph, label this maze,
1197
01:00:11,440 --> 01:00:14,680
with a number representing the value of this heuristic function,
1198
01:00:14,680 --> 01:00:18,080
the value of the Manhattan distance from any of these cells.
1199
01:00:18,080 --> 01:00:21,200
So from this cell, for example, we're one away from the goal.
1200
01:00:21,200 --> 01:00:24,560
From this cell, we're two away from the goal, three away, four away.
1201
01:00:24,560 --> 01:00:27,120
Here, we're five away because we have to go one to the right
1202
01:00:27,120 --> 01:00:28,400
and then four up.
1203
01:00:28,400 --> 01:00:32,160
From somewhere like here, the Manhattan distance is two.
1204
01:00:32,160 --> 01:00:35,720
We're only two squares away from the goal geographically,
1205
01:00:35,720 --> 01:00:39,000
even though in practice, we're going to have to take a longer path.
1206
01:00:39,000 --> 01:00:40,080
But we don't know that yet.
1207
01:00:40,080 --> 01:00:42,920
The heuristic is just some easy way to estimate
1208
01:00:42,920 --> 01:00:44,560
how far we are away from the goal.
1209
01:00:44,560 --> 01:00:47,560
And maybe our heuristic is overly optimistic.
1210
01:00:47,560 --> 01:00:49,800
It thinks that, yeah, we're only two steps away.
1211
01:00:49,800 --> 01:00:53,680
When in practice, when you consider the walls, it might be more steps.
1212
01:00:53,680 --> 01:00:57,800
So the important thing here is that the heuristic isn't a guarantee of how
1213
01:00:57,800 --> 01:00:59,400
many steps it's going to take.
1214
01:00:59,400 --> 01:01:01,040
It is estimating.
1215
01:01:01,040 --> 01:01:03,040
It's an attempt at trying to approximate.
1216
01:01:03,040 --> 01:01:06,240
And it does seem generally the case that the squares that
1217
01:01:06,240 --> 01:01:10,120
look closer to the goal have smaller values for the heuristic function
1218
01:01:10,120 --> 01:01:13,120
than squares that are further away.
1219
01:01:13,120 --> 01:01:18,240
So now, using greedy best-first search, what might this algorithm actually do?
1220
01:01:18,240 --> 01:01:21,520
Well, again, for these first five steps, there's not much of a choice.
1221
01:01:21,520 --> 01:01:23,840
We start at this initial state a, and we say, all right,
1222
01:01:23,840 --> 01:01:26,440
we have to explore these five states.
1223
01:01:26,440 --> 01:01:28,040
But now we have a decision point.
1224
01:01:28,040 --> 01:01:30,760
Now we have a choice between going left and going right.
1225
01:01:30,760 --> 01:01:34,080
And before, when DFS and BFS would just pick arbitrarily,
1226
01:01:34,080 --> 01:01:37,760
because it just depends on the order you throw these two nodes into the frontier,
1227
01:01:37,760 --> 01:01:40,880
and we didn't specify what order you put them into the frontier,
1228
01:01:40,880 --> 01:01:45,520
only the order you take them out, here we can look at 13 and 11
1229
01:01:45,520 --> 01:01:50,800
and say that, all right, this square is a distance of 11 away from the goal
1230
01:01:50,800 --> 01:01:53,440
according to our heuristic, according to our estimate.
1231
01:01:53,440 --> 01:01:57,720
And this one, we estimate to be 13 away from the goal.
1232
01:01:57,720 --> 01:02:00,800
So between those two options, between these two choices,
1233
01:02:00,800 --> 01:02:02,280
I'd rather have the 11.
1234
01:02:02,280 --> 01:02:06,280
I'd rather be 11 steps away from the goal, so I'll go to the right.
1235
01:02:06,280 --> 01:02:09,800
We're able to make an informed decision, because we know a little something
1236
01:02:09,800 --> 01:02:11,840
more about this problem.
1237
01:02:11,840 --> 01:02:13,960
So then we keep following, 10, 9, 8.
1238
01:02:13,960 --> 01:02:17,920
Between the two 7s, we don't really have much of a way to decide between those.
1239
01:02:17,920 --> 01:02:20,040
So then we do just have to make an arbitrary choice.
1240
01:02:20,040 --> 01:02:21,840
And you know what, maybe we choose wrong.
1241
01:02:21,840 --> 01:02:26,240
But that's OK, because now we can still say, all right, let's try this 7.
1242
01:02:26,240 --> 01:02:29,280
We follow the 7, then the 6, and we have to make this choice,
1243
01:02:29,280 --> 01:02:31,800
even though it increases the value of the heuristic function.
1244
01:02:31,800 --> 01:02:36,440
But now we have another decision point, between 6 and 8, and between those two.
1245
01:02:36,440 --> 01:02:39,520
And really, we're also considering this 13, but that's much higher.
1246
01:02:39,520 --> 01:02:43,560
Between 6, 8, and 13, well, the 6 is the smallest value,
1247
01:02:43,560 --> 01:02:45,040
so we'd rather take the 6.
1248
01:02:45,040 --> 01:02:48,600
We're able to make an informed decision that going this way to the right
1249
01:02:48,600 --> 01:02:51,000
is probably better than going down.
1250
01:02:51,000 --> 01:02:53,000
So we turn this way, we go to 5.
1251
01:02:53,000 --> 01:02:55,320
And now we find a decision point where we'll actually
1252
01:02:55,320 --> 01:02:57,360
make a decision that we might not want to make,
1253
01:02:57,360 --> 01:03:00,440
but there's unfortunately not too much of a way around this.
1254
01:03:00,440 --> 01:03:01,800
We see 4 and 6.
1255
01:03:01,800 --> 01:03:03,760
4 looks closer to the goal, right?
1256
01:03:03,760 --> 01:03:06,320
It's going up, and the goal is further up.
1257
01:03:06,320 --> 01:03:09,840
So we end up taking that route, which ultimately leads us to a dead end.
1258
01:03:09,840 --> 01:03:13,120
But that's OK, because we can still say, all right, now let's try the 6.
1259
01:03:13,120 --> 01:03:17,400
And now follow this route that will ultimately lead us to the goal.
1260
01:03:17,400 --> 01:03:20,480
And so this now is how greedy best-first search
1261
01:03:20,480 --> 01:03:22,640
might try to approach this problem by saying,
1262
01:03:22,640 --> 01:03:26,240
whenever we have a decision between multiple nodes that we could explore,
1263
01:03:26,240 --> 01:03:30,480
let's explore the node that has the smallest value of h of n,
1264
01:03:30,480 --> 01:03:35,360
this heuristic function that is estimating how far I have to go.
1265
01:03:35,360 --> 01:03:37,560
And it just so happens that in this case, we end up
1266
01:03:37,560 --> 01:03:41,200
doing better in terms of the number of states we needed to explore
1267
01:03:41,200 --> 01:03:42,560
than BFS needed to.
1268
01:03:42,560 --> 01:03:46,120
BFS explored all of this section and all of that section,
1269
01:03:46,120 --> 01:03:49,640
but we were able to eliminate that by taking advantage of this heuristic,
1270
01:03:49,640 --> 01:03:56,360
this knowledge about how close we are to the goal or some estimate of that idea.
1271
01:03:56,360 --> 01:03:57,480
So this seems much better.
1272
01:03:57,480 --> 01:04:01,080
So wouldn't we always prefer an algorithm like this over an algorithm
1273
01:04:01,080 --> 01:04:03,040
like breadth-first search?
1274
01:04:03,040 --> 01:04:05,560
Well, maybe one thing to take into consideration
1275
01:04:05,560 --> 01:04:09,600
is that we need to come up with a good heuristic. How good the heuristic is
1276
01:04:09,600 --> 01:04:11,840
is going to affect how good this algorithm is.
1277
01:04:11,840 --> 01:04:16,000
And coming up with a good heuristic can oftentimes be challenging.
1278
01:04:16,000 --> 01:04:18,440
But the other thing to consider is to ask the question,
1279
01:04:18,440 --> 01:04:22,720
just as we did with the prior two algorithms, is this algorithm optimal?
1280
01:04:22,720 --> 01:04:28,400
Will it always find the shortest path from the initial state to the goal?
1281
01:04:28,400 --> 01:04:32,320
And to answer that question, let's take a look at this example for a moment.
1282
01:04:32,320 --> 01:04:33,600
Take a look at this example.
1283
01:04:33,600 --> 01:04:36,120
Again, we're trying to get from A to B. And again,
1284
01:04:36,120 --> 01:04:40,160
I've labeled each of the cells with their Manhattan distance from the goal:
1285
01:04:40,160 --> 01:04:42,480
the number of squares, up and to the right,
1286
01:04:42,480 --> 01:04:46,680
you would need to travel in order to get from that square to the goal.
1287
01:04:46,680 --> 01:04:49,560
And let's think about, would greedy best-first search
1288
01:04:49,560 --> 01:04:55,520
that always picks the smallest number end up finding the optimal solution?
1289
01:04:55,520 --> 01:04:57,080
What is the shortest solution?
1290
01:04:57,080 --> 01:04:59,560
And would this algorithm find it?
1291
01:04:59,560 --> 01:05:04,360
And the important thing to realize is that right here is the decision point.
1292
01:05:04,360 --> 01:05:06,840
We're estimated to be 12 away from the goal.
1293
01:05:06,840 --> 01:05:08,360
And we have two choices.
1294
01:05:08,360 --> 01:05:11,840
We can go to the left, which we estimate to be 13 away from the goal.
1295
01:05:11,840 --> 01:05:15,840
Or we can go up, where we estimate it to be 11 away from the goal.
1296
01:05:15,840 --> 01:05:18,720
And between those two, greedy best-first search
1297
01:05:18,720 --> 01:05:23,120
is going to say the 11 looks better than the 13.
1298
01:05:23,120 --> 01:05:26,040
And in doing so, greedy best-first search will end up
1299
01:05:26,040 --> 01:05:28,960
finding this path to the goal.
1300
01:05:28,960 --> 01:05:31,120
But it turns out this path is not optimal.
1301
01:05:31,120 --> 01:05:33,600
There is a way to get to the goal using fewer steps.
1302
01:05:33,600 --> 01:05:38,520
And it's actually this way, this way that ultimately involved fewer steps,
1303
01:05:38,520 --> 01:05:43,480
even though it meant at this moment choosing the worse option of the two,
1304
01:05:43,480 --> 01:05:47,280
or what we estimated to be the worse option based on the heuristic.
1305
01:05:47,280 --> 01:05:50,040
And so this is what we mean by this is a greedy algorithm.
1306
01:05:50,040 --> 01:05:52,600
It's making the best decision locally.
1307
01:05:52,600 --> 01:05:55,800
At this decision point, it looks like it's better to go here
1308
01:05:55,800 --> 01:05:57,480
than it is to go to the 13.
1309
01:05:57,480 --> 01:06:00,200
But in the big picture, it's not necessarily optimal.
1310
01:06:00,200 --> 01:06:03,200
That it might find a solution when in actuality,
1311
01:06:03,200 --> 01:06:06,200
there was a better solution available.
1312
01:06:06,200 --> 01:06:09,360
So we would like some way to solve this problem.
1313
01:06:09,360 --> 01:06:12,000
We like the idea of this heuristic, of being
1314
01:06:12,000 --> 01:06:16,280
able to estimate the path, the distance between us and the goal.
1315
01:06:16,280 --> 01:06:18,440
And that helps us to be able to make better decisions
1316
01:06:18,440 --> 01:06:23,160
and to eliminate having to search through entire parts of this state space.
1317
01:06:23,160 --> 01:06:27,080
But we would like to modify the algorithm so that we can achieve optimality,
1318
01:06:27,080 --> 01:06:28,760
so that it can be optimal.
1319
01:06:28,760 --> 01:06:30,120
And what is the way to do this?
1320
01:06:30,120 --> 01:06:31,960
What is the intuition here?
1321
01:06:31,960 --> 01:06:34,480
Well, let's take a look at this problem.
1322
01:06:34,480 --> 01:06:37,200
In this initial problem, greedy best-first search
1323
01:06:37,200 --> 01:06:40,240
found us this solution here, this long path.
1324
01:06:40,240 --> 01:06:43,440
And the reason why it wasn't great is because, yes, the heuristic numbers
1325
01:06:43,440 --> 01:06:44,960
went down pretty low.
1326
01:06:44,960 --> 01:06:47,320
But later on, they started to build back up.
1327
01:06:47,320 --> 01:06:52,000
They built back 8, 9, 10, 11, all the way up to 12 in this case.
1328
01:06:52,000 --> 01:06:55,440
And so how might we go about trying to improve this algorithm?
1329
01:06:55,440 --> 01:06:59,240
Well, one thing that we might realize is that if we go all the way
1330
01:06:59,240 --> 01:07:03,440
through this algorithm, through this path, and we end up going to the 12,
1331
01:07:03,440 --> 01:07:06,600
and we've had to take this many steps, who knows how many steps that is,
1332
01:07:06,600 --> 01:07:11,440
just to get to this 12, we could have also, as an alternative,
1333
01:07:11,440 --> 01:07:16,320
taken much fewer steps, just six steps, and ended up at this 13 here.
1334
01:07:16,320 --> 01:07:19,840
And yes, 13 is more than 12, so it looks like it's not as good.
1335
01:07:19,840 --> 01:07:22,120
But it required far fewer steps.
1336
01:07:22,120 --> 01:07:25,680
It only took six steps to get to this 13 versus many more steps
1337
01:07:25,680 --> 01:07:27,160
to get to this 12.
1338
01:07:27,160 --> 01:07:30,320
And while greedy best-first search says, oh, well, 12 is better than 13,
1339
01:07:30,320 --> 01:07:33,920
so pick the 12, we might more intelligently say,
1340
01:07:33,920 --> 01:07:37,240
I'd rather be somewhere that heuristically looks
1341
01:07:37,240 --> 01:07:42,160
like it takes slightly longer if I can get there much more quickly.
1342
01:07:42,160 --> 01:07:45,120
And we're going to encode that idea, this general idea,
1343
01:07:45,120 --> 01:07:49,200
into a more formal algorithm known as A star search.
1344
01:07:49,200 --> 01:07:51,280
A star search is going to solve this problem
1345
01:07:51,280 --> 01:07:54,120
by instead of just considering the heuristic,
1346
01:07:54,120 --> 01:07:58,800
also considering how long it took us to get to any particular state.
1347
01:07:58,800 --> 01:08:01,040
So the distinction is this: in greedy best-first search,
1348
01:08:01,040 --> 01:08:04,120
if I am in a state right now, the only thing I care about
1349
01:08:04,120 --> 01:08:07,240
is, what is the estimated distance, the heuristic value,
1350
01:08:07,240 --> 01:08:09,160
between me and the goal?
1351
01:08:09,160 --> 01:08:11,800
Whereas A star search will take into consideration
1352
01:08:11,800 --> 01:08:13,440
two pieces of information.
1353
01:08:13,440 --> 01:08:17,280
It'll take into consideration, how far do I estimate I am from the goal?
1354
01:08:17,280 --> 01:08:21,200
But also, how far did I have to travel in order to get here?
1355
01:08:21,200 --> 01:08:23,640
Because that is relevant, too.
1356
01:08:23,640 --> 01:08:26,160
So A star search will solve search problems by expanding the node
1357
01:08:26,160 --> 01:08:30,200
with the lowest value of g of n plus h of n.
1358
01:08:30,200 --> 01:08:33,800
h of n is that same heuristic that we were talking about a moment ago that's
1359
01:08:33,800 --> 01:08:35,720
going to vary based on the problem.
1360
01:08:35,720 --> 01:08:40,320
But g of n is going to be the cost to reach the node, how many steps
1361
01:08:40,320 --> 01:08:45,520
I had to take, in this case, to get to my current position.
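As a sketch (again an illustrative outline, not the course's distribution code), the only change from the greedy approach is that the priority queue is ordered by g(n) plus h(n), where g counts the steps taken so far; unit step costs are assumed here:

```python
import heapq

def a_star_search(start, goal, neighbors, h):
    """Expand the node with the smallest g(n) + h(n), where g(n) is
    the number of steps taken to reach n (unit step costs assumed)."""
    frontier = [(h(start), 0, start, [start])]  # (g + h, g, state, path)
    explored = set()
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in explored:
            continue
        explored.add(state)
        for nxt in neighbors(state):
            if nxt not in explored:
                heapq.heappush(frontier,
                               (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # no solution
```

With an admissible, consistent heuristic like Manhattan distance, the first path popped off the frontier at the goal is an optimal one.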
1362
01:08:45,520 --> 01:08:48,200
So what does that search algorithm look like in practice?
1363
01:08:48,200 --> 01:08:49,760
Well, let's take a look.
1364
01:08:49,760 --> 01:08:51,280
Again, we've got the same maze.
1365
01:08:51,280 --> 01:08:54,160
And again, I've labeled them with their Manhattan distance.
1366
01:08:54,160 --> 01:08:57,400
This value is the h of n value, the heuristic
1367
01:08:57,400 --> 01:09:02,400
estimate of how far each of these squares is away from the goal.
1368
01:09:02,400 --> 01:09:04,520
But now, as we begin to explore states, we
1369
01:09:04,520 --> 01:09:08,560
care not just about this heuristic value, but also about g of n,
1370
01:09:08,560 --> 01:09:11,680
the number of steps I had to take in order to get there.
1371
01:09:11,680 --> 01:09:14,280
And I care about summing those two numbers together.
1372
01:09:14,280 --> 01:09:15,400
So what does that look like?
1373
01:09:15,400 --> 01:09:19,000
On this very first step, I have taken one step.
1374
01:09:19,000 --> 01:09:22,280
And now I am estimated to be 16 steps away from the goal.
1375
01:09:22,280 --> 01:09:25,400
So the total value here is 17.
1376
01:09:25,400 --> 01:09:26,520
Then I take one more step.
1377
01:09:26,520 --> 01:09:28,160
I've now taken two steps.
1378
01:09:28,160 --> 01:09:32,800
And I estimate myself to be 15 away from the goal, again, a total value of 17.
1379
01:09:32,800 --> 01:09:34,360
Now I've taken three steps.
1380
01:09:34,360 --> 01:09:37,600
And I'm estimated to be 14 away from the goal, so on and so forth.
1381
01:09:37,600 --> 01:09:39,880
Four steps, an estimate of 13.
1382
01:09:39,880 --> 01:09:41,960
Five steps, estimate of 12.
1383
01:09:41,960 --> 01:09:44,120
And now here's a decision point.
1384
01:09:44,120 --> 01:09:48,880
I could either have taken six steps, with a heuristic of 13,
1385
01:09:48,880 --> 01:09:52,600
for a total of 19, or I could have taken six steps
1386
01:09:52,600 --> 01:09:57,840
with a heuristic of 11, for a total of 17.
1387
01:09:57,840 --> 01:10:03,200
So between 19 and 17, I'd rather take the 17, the 6 plus 11.
1388
01:10:03,200 --> 01:10:05,200
So so far, no different than what we saw before.
1389
01:10:05,200 --> 01:10:08,280
We're still taking this option because it appears to be better.
1390
01:10:08,280 --> 01:10:11,280
And I keep taking this option because it appears to be better.
1391
01:10:11,280 --> 01:10:15,720
But it's right about here that things get a little bit different.
1392
01:10:15,720 --> 01:10:21,760
Now I could have taken 15 steps, with an estimated distance of 6 from the goal.
1393
01:10:21,760 --> 01:10:24,880
So 15 plus 6, total value of 21.
1394
01:10:24,880 --> 01:10:28,000
Alternatively, I could have taken six steps,
1395
01:10:28,000 --> 01:10:30,800
because this square is five steps in, so this one is six steps in,
1396
01:10:30,800 --> 01:10:33,480
with a heuristic value of 13 as my estimate.
1397
01:10:33,480 --> 01:10:36,320
So 6 plus 13, that's 19.
1398
01:10:36,320 --> 01:10:41,720
So here, we would evaluate g of n plus h of n to be 19, 6 plus 13.
1399
01:10:41,720 --> 01:10:46,560
Whereas here, we would be 15 plus 6, or 21.
1400
01:10:46,560 --> 01:10:49,840
And so the intuition is 19 less than 21, pick here.
1401
01:10:49,840 --> 01:10:55,360
But the idea is ultimately I'd rather have taken fewer steps and be at a 13,
1402
01:10:55,360 --> 01:10:59,160
than have taken 15 steps and be at a 6, because it
1403
01:10:59,160 --> 01:11:01,560
means I've had to take more steps in order to get there.
1404
01:11:01,560 --> 01:11:04,640
Maybe there's a better path this way.
1405
01:11:04,640 --> 01:11:07,200
So instead, we'll explore this route.
1406
01:11:07,200 --> 01:11:11,040
Now if we go one more, this is seven steps plus 14 is 21.
1407
01:11:11,040 --> 01:11:12,960
So between those two, it's sort of a toss-up.
1408
01:11:12,960 --> 01:11:15,120
We might end up exploring that one anyways.
1409
01:11:15,120 --> 01:11:19,280
But after that, as the totals down that path start to get bigger,
1410
01:11:19,280 --> 01:11:21,720
and the heuristic values down this path start to get smaller,
1411
01:11:21,720 --> 01:11:25,240
you'll find that we'll actually keep exploring down this path.
1412
01:11:25,240 --> 01:11:28,400
And you can do the math to see that at every decision point,
1413
01:11:28,400 --> 01:11:31,240
A star search is going to make a choice based
1414
01:11:31,240 --> 01:11:35,200
on the sum of how many steps it took me to get to my current position,
1415
01:11:35,200 --> 01:11:39,320
and then how far I estimate I am from the goal.
1416
01:11:39,320 --> 01:11:41,920
So while we did have to explore some of these states,
1417
01:11:41,920 --> 01:11:46,640
the ultimate solution we found was, in fact, an optimal solution.
1418
01:11:46,640 --> 01:11:50,960
It did find us the quickest possible way to get from the initial state
1419
01:11:50,960 --> 01:11:51,960
to the goal.
1420
01:11:51,960 --> 01:11:55,240
And it turns out that A star is an optimal search algorithm
1421
01:11:55,240 --> 01:11:57,440
under certain conditions.
1422
01:11:57,440 --> 01:12:02,160
So the first condition is that h of n, my heuristic, needs to be admissible.
1423
01:12:02,160 --> 01:12:04,120
What does it mean for a heuristic to be admissible?
1424
01:12:04,120 --> 01:12:08,840
Well, a heuristic is admissible if it never overestimates the true cost.
1425
01:12:08,840 --> 01:12:12,560
h of n always needs to either get it exactly right
1426
01:12:12,560 --> 01:12:16,680
in terms of how far away I am, or it needs to underestimate.
1427
01:12:16,680 --> 01:12:20,800
So we saw an example from before where the heuristic value was much smaller
1428
01:12:20,800 --> 01:12:22,520
than the actual cost it would take.
1429
01:12:22,520 --> 01:12:26,280
That's totally fine, but the heuristic value should never overestimate.
1430
01:12:26,280 --> 01:12:30,720
It should never think that I'm further away from the goal than I actually am.
1431
01:12:30,720 --> 01:12:34,840
And meanwhile, to make a stronger statement, h of n also needs to be
1432
01:12:34,840 --> 01:12:36,160
consistent.
1433
01:12:36,160 --> 01:12:37,960
And what does it mean for it to be consistent?
1434
01:12:37,960 --> 01:12:41,760
Mathematically, it means that for every node, which we'll call n,
1435
01:12:41,760 --> 01:12:43,840
and successor, the node after me, that I'll
1436
01:12:43,840 --> 01:12:48,920
call n prime, where it takes a cost of c to make that step,
1437
01:12:48,920 --> 01:12:52,720
the heuristic value of n needs to be less than or equal to the heuristic
1438
01:12:52,720 --> 01:12:55,240
value of n prime plus the cost.
1439
01:12:55,240 --> 01:12:58,160
So it's a lot of math, but in words what that ultimately means
1440
01:12:58,160 --> 01:13:01,040
is that if I am here at this state right now,
1441
01:13:01,040 --> 01:13:03,640
the heuristic value from me to the goal shouldn't
1442
01:13:03,640 --> 01:13:07,080
be more than the heuristic value of my successor,
1443
01:13:07,080 --> 01:13:10,200
the next place I could go to, plus however much
1444
01:13:10,200 --> 01:13:14,600
it would cost me to just make that step from one step to the next step.
1445
01:13:14,600 --> 01:13:18,600
And so this is just making sure that my heuristic is consistent between all
1446
01:13:18,600 --> 01:13:20,240
of these steps that I might take.
1447
01:13:20,240 --> 01:13:22,680
So as long as this is true, then A star search
1448
01:13:22,680 --> 01:13:25,600
is going to find me an optimal solution.
1449
01:13:25,600 --> 01:13:28,760
And this is where much of the challenge of solving these search problems
1450
01:13:28,760 --> 01:13:32,120
can sometimes come in, that A star search is an algorithm that is known
1451
01:13:32,120 --> 01:13:34,120
and you could write the code fairly easily,
1452
01:13:34,120 --> 01:13:35,800
but it's choosing the heuristic
1453
01:13:35,800 --> 01:13:37,400
that can be the interesting challenge.
1454
01:13:37,400 --> 01:13:39,680
The better the heuristic is, the better I'll
1455
01:13:39,680 --> 01:13:43,000
be able to solve the problem, and the fewer states I'll have to explore.
1456
01:13:43,000 --> 01:13:46,320
And I need to make sure that the heuristic satisfies
1457
01:13:46,320 --> 01:13:48,680
these particular constraints.
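The two constraints just described can be checked concretely. As a minimal sketch (not the course's distribution code), consider the Manhattan-distance heuristic for a grid maze where every step costs 1: it never overestimates the true cost, and moving one square changes the estimate by at most the step cost, which is exactly the consistency condition h(n) ≤ h(n') + c.

```python
def manhattan(state, goal):
    """Heuristic h(n): straight-line grid distance, ignoring walls.
    With unit step costs this never overestimates the true cost."""
    (r1, c1), (r2, c2) = state, goal
    return abs(r1 - r2) + abs(c1 - c2)

goal = (0, 0)
n = (3, 4)        # current node (hypothetical maze coordinates)
n_prime = (2, 4)  # a successor one step away, so step cost c = 1

# Consistency check for this pair: h(n) <= h(n') + c
assert manhattan(n, goal) <= manhattan(n_prime, goal) + 1
```

The check holds for every neighbor pair, because one grid step changes the Manhattan distance by exactly 1 in the best case.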
1458
01:13:48,680 --> 01:13:52,040
So all in all, these are some of the examples of search algorithms
1459
01:13:52,040 --> 01:13:55,200
that might work, and certainly there are many more than just this.
1460
01:13:55,200 --> 01:13:58,720
A star, for example, does have a tendency to use quite a bit of memory.
1461
01:13:58,720 --> 01:14:01,560
So there are alternative approaches to A star
1462
01:14:01,560 --> 01:14:04,680
that ultimately use less memory than this version of A star
1463
01:14:04,680 --> 01:14:07,960
happens to use, and there are other search algorithms
1464
01:14:07,960 --> 01:14:11,640
that are optimized for other cases as well.
1465
01:14:11,640 --> 01:14:14,600
But now so far, we've only been looking at search algorithms
1466
01:14:14,600 --> 01:14:17,040
where there is one agent.
1467
01:14:17,040 --> 01:14:19,640
I am trying to find a solution to a problem.
1468
01:14:19,640 --> 01:14:22,080
I am trying to navigate my way through a maze.
1469
01:14:22,080 --> 01:14:24,080
I am trying to solve a 15 puzzle.
1470
01:14:24,080 --> 01:14:28,360
I am trying to find driving directions from point A to point B.
1471
01:14:28,360 --> 01:14:30,600
Sometimes in search situations, though, we'll
1472
01:14:30,600 --> 01:14:34,560
enter an adversarial situation, where I am an agent trying
1473
01:14:34,560 --> 01:14:36,120
to make intelligent decisions.
1474
01:14:36,120 --> 01:14:39,240
And there's someone else who is fighting against me, so to speak,
1475
01:14:39,240 --> 01:14:41,320
that has opposite objectives, someone where
1476
01:14:41,320 --> 01:14:45,320
I am trying to succeed, someone else that wants me to fail.
1477
01:14:45,320 --> 01:14:49,840
And this is most popular in something like a game, a game like Tic Tac Toe,
1478
01:14:49,840 --> 01:14:53,520
where we've got this 3 by 3 grid, and x and o take turns,
1479
01:14:53,520 --> 01:14:56,280
either writing an x or an o in any one of these squares.
1480
01:14:56,280 --> 01:14:59,760
And the goal is to get three x's in a row if you're the x player,
1481
01:14:59,760 --> 01:15:02,760
or three o's in a row if you're the o player.
1482
01:15:02,760 --> 01:15:05,320
And computers have gotten quite good at playing games,
1483
01:15:05,320 --> 01:15:08,520
Tic Tac Toe very easily, but even more complex games.
1484
01:15:08,520 --> 01:15:12,480
And so you might imagine, what does an intelligent decision in a game
1485
01:15:12,480 --> 01:15:13,440
look like?
1486
01:15:13,440 --> 01:15:17,280
So maybe x makes an initial move in the middle, and o plays up here.
1487
01:15:17,280 --> 01:15:20,480
What does an intelligent move for x now become?
1488
01:15:20,480 --> 01:15:22,520
Where should you move if you were x?
1489
01:15:22,520 --> 01:15:24,840
And it turns out there are a couple of possibilities.
1490
01:15:24,840 --> 01:15:27,240
But if an AI is playing this game optimally,
1491
01:15:27,240 --> 01:15:30,160
then the AI might play somewhere like the upper right,
1492
01:15:30,160 --> 01:15:34,200
where in this situation, o has the opposite objective of x.
1493
01:15:34,200 --> 01:15:37,920
x is trying to win the game to get three in a row diagonally here.
1494
01:15:37,920 --> 01:15:41,440
And o, having the opposite objective, is trying to stop that.
1495
01:15:41,440 --> 01:15:44,000
And so o is going to place here to try to block.
1496
01:15:44,000 --> 01:15:46,400
But now, x has a pretty clever move.
1497
01:15:46,400 --> 01:15:51,000
x can make a move like this, where now x has two possible ways
1498
01:15:51,000 --> 01:15:52,200
that x can win the game.
1499
01:15:52,200 --> 01:15:55,200
x could win the game by getting three in a row across here.
1500
01:15:55,200 --> 01:15:58,520
Or x could win the game by getting three in a row vertically this way.
1501
01:15:58,520 --> 01:16:00,680
So it doesn't matter where o makes their next move.
1502
01:16:00,680 --> 01:16:04,360
o could play here, for example, blocking the three in a row horizontally.
1503
01:16:04,360 --> 01:16:09,360
But then x is going to win the game by getting a three in a row vertically.
1504
01:16:09,360 --> 01:16:11,360
And so there's a fair amount of reasoning that's
1505
01:16:11,360 --> 01:16:14,400
going on here in order for the computer to be able to solve a problem.
1506
01:16:14,400 --> 01:16:17,720
And it's similar in spirit to the problems we've looked at so far.
1507
01:16:17,720 --> 01:16:19,280
There are actions.
1508
01:16:19,280 --> 01:16:21,680
There's some sort of state of the board and some transition
1509
01:16:21,680 --> 01:16:23,360
from one action to the next.
1510
01:16:23,360 --> 01:16:25,640
But it's different in the sense that this is now
1511
01:16:25,640 --> 01:16:29,440
not just a classical search problem, but an adversarial search problem.
1512
01:16:29,440 --> 01:16:32,960
That I am the x player trying to find the best moves to make,
1513
01:16:32,960 --> 01:16:36,680
but I know that there is some adversary that is trying to stop me.
1514
01:16:36,680 --> 01:16:41,280
So we need some sort of algorithm to deal with these adversarial type of search
1515
01:16:41,280 --> 01:16:42,560
situations.
1516
01:16:42,560 --> 01:16:44,520
And the algorithm we're going to take a look at
1517
01:16:44,520 --> 01:16:47,780
is an algorithm called Minimax, which works very well
1518
01:16:47,780 --> 01:16:51,000
for these deterministic games where there are two players.
1519
01:16:51,000 --> 01:16:52,800
It can work for other types of games as well.
1520
01:16:52,800 --> 01:16:55,440
But we'll look right now at games where I make a move,
1521
01:16:55,440 --> 01:16:56,880
then my opponent makes a move.
1522
01:16:56,880 --> 01:17:00,400
And I am trying to win, and my opponent is trying to win also.
1523
01:17:00,400 --> 01:17:04,120
Or in other words, my opponent is trying to get me to lose.
1524
01:17:04,120 --> 01:17:07,120
And so what do we need in order to make this algorithm work?
1525
01:17:07,120 --> 01:17:10,960
Well, any time we try and translate this human concept of playing a game,
1526
01:17:10,960 --> 01:17:14,100
winning and losing to a computer, we want to translate it
1527
01:17:14,100 --> 01:17:16,360
in terms that the computer can understand.
1528
01:17:16,360 --> 01:17:19,880
And ultimately, the computer really just understands the numbers.
1529
01:17:19,880 --> 01:17:23,920
And so we want some way of translating a game of x's and o's on a grid
1530
01:17:23,920 --> 01:17:26,640
to something numerical, something the computer can understand.
1531
01:17:26,640 --> 01:17:30,480
The computer doesn't normally understand notions of win or lose.
1532
01:17:30,480 --> 01:17:34,560
But it does understand the concept of bigger and smaller.
1533
01:17:34,560 --> 01:17:38,240
And so what we might do is we might take each of the possible ways
1534
01:17:38,240 --> 01:17:43,280
that a tic-tac-toe game can unfold and assign a value or a utility
1535
01:17:43,280 --> 01:17:45,240
to each one of those possible ways.
1536
01:17:45,240 --> 01:17:47,960
And in a tic-tac-toe game, and in many types of games,
1537
01:17:47,960 --> 01:17:49,960
there are three possible outcomes.
1538
01:17:49,960 --> 01:17:54,360
The outcomes are o wins, x wins, or nobody wins.
1539
01:17:54,360 --> 01:17:58,560
So player one wins, player two wins, or nobody wins.
1540
01:17:58,560 --> 01:18:02,840
And for now, let's go ahead and assign each of these possible outcomes
1541
01:18:02,840 --> 01:18:04,040
a different value.
1542
01:18:04,040 --> 01:18:07,400
We'll say o winning, that'll have a value of negative 1.
1543
01:18:07,400 --> 01:18:09,800
Nobody winning, that'll have a value of 0.
1544
01:18:09,800 --> 01:18:13,000
And x winning, that will have a value of 1.
1545
01:18:13,000 --> 01:18:17,000
So we've just assigned numbers to each of these three possible outcomes.
1546
01:18:17,000 --> 01:18:22,440
And now we have two players, we have the x player and the o player.
1547
01:18:22,440 --> 01:18:26,360
And we're going to go ahead and call the x player the max player.
1548
01:18:26,360 --> 01:18:29,160
And we'll call the o player the min player.
1549
01:18:29,160 --> 01:18:32,080
And the reason why is because in the minimax algorithm,
1550
01:18:32,080 --> 01:18:37,520
the max player, which in this case is x, is aiming to maximize the score.
1551
01:18:37,520 --> 01:18:40,880
These are the possible options for the score, negative 1, 0, and 1.
1552
01:18:40,880 --> 01:18:44,560
x wants to maximize the score, meaning if at all possible,
1553
01:18:44,560 --> 01:18:48,040
x would like this situation, where x wins the game,
1554
01:18:48,040 --> 01:18:49,760
and we give it a score of 1.
1555
01:18:49,760 --> 01:18:54,000
But if this isn't possible, if x needs to choose between these two options,
1556
01:18:54,000 --> 01:18:58,080
negative 1, meaning o winning, or 0, meaning nobody winning,
1557
01:18:58,080 --> 01:19:01,720
x would rather that nobody wins, score of 0,
1558
01:19:01,720 --> 01:19:04,400
than a score of negative 1, o winning.
1559
01:19:04,400 --> 01:19:07,240
So this notion of winning and losing and tying
1560
01:19:07,240 --> 01:19:12,240
has been reduced mathematically to just this idea of trying to maximize the score.
1561
01:19:12,240 --> 01:19:16,080
The x player always wants the score to be bigger.
1562
01:19:16,080 --> 01:19:19,040
And on the flip side, the min player, in this case o,
1563
01:19:19,040 --> 01:19:20,760
is aiming to minimize the score.
1564
01:19:20,760 --> 01:19:25,640
The o player wants the score to be as small as possible.
1565
01:19:25,640 --> 01:19:29,000
So now we've taken this game of x's and o's and winning and losing
1566
01:19:29,000 --> 01:19:30,760
and turned it into something mathematical,
1567
01:19:30,760 --> 01:19:33,480
something where x is trying to maximize the score,
1568
01:19:33,480 --> 01:19:35,640
o is trying to minimize the score.
1569
01:19:35,640 --> 01:19:37,760
Let's now look at all of the parts of the game
1570
01:19:37,760 --> 01:19:40,800
that we need in order to encode it in an AI
1571
01:19:40,800 --> 01:19:44,880
so that an AI can play a game like tic-tac-toe.
1572
01:19:44,880 --> 01:19:46,920
So the game is going to need a couple of things.
1573
01:19:46,920 --> 01:19:50,680
We'll need some sort of initial state that will, in this case, call s0,
1574
01:19:50,680 --> 01:19:54,880
which is how the game begins, like an empty tic-tac-toe board, for example.
1575
01:19:54,880 --> 01:20:00,080
We'll also need a function called player, where the player function
1576
01:20:00,080 --> 01:20:04,280
is going to take as input a state here represented by s.
1577
01:20:04,280 --> 01:20:09,600
And the output of the player function is going to be which player's turn is it.
1578
01:20:09,600 --> 01:20:12,520
We need to be able to give a tic-tac-toe board to the computer,
1579
01:20:12,520 --> 01:20:16,600
run it through a function, and that function tells us whose turn it is.
1580
01:20:16,600 --> 01:20:19,040
We'll need some notion of actions that we can take.
1581
01:20:19,040 --> 01:20:21,120
We'll see examples of that in just a moment.
1582
01:20:21,120 --> 01:20:24,080
We need some notion of a transition model, same as before.
1583
01:20:24,080 --> 01:20:26,320
If I have a state and I take an action, I
1584
01:20:26,320 --> 01:20:29,200
need to know what results as a consequence of it.
1585
01:20:29,200 --> 01:20:31,960
I need some way of knowing when the game is over.
1586
01:20:31,960 --> 01:20:34,040
So this is equivalent to kind of like a goal test,
1587
01:20:34,040 --> 01:20:36,480
but I need some terminal test, some way to check
1588
01:20:36,480 --> 01:20:40,760
to see if a state is a terminal state, where a terminal state means the game is
1589
01:20:40,760 --> 01:20:41,440
over.
1590
01:20:41,440 --> 01:20:44,960
In a classic game of tic-tac-toe, a terminal state
1591
01:20:44,960 --> 01:20:47,480
means either someone has gotten three in a row
1592
01:20:47,480 --> 01:20:50,200
or all of the squares of the tic-tac-toe board are filled.
1593
01:20:50,200 --> 01:20:52,920
Either of those conditions make it a terminal state.
1594
01:20:52,920 --> 01:20:55,840
In a game of chess, it might be something like when there is checkmate
1595
01:20:55,840 --> 01:21:00,440
or if checkmate is no longer possible, that becomes a terminal state.
1596
01:21:00,440 --> 01:21:04,560
And then finally, we'll need a utility function, a function that takes a state
1597
01:21:04,560 --> 01:21:08,040
and gives us a numerical value for that terminal state, some way of saying
1598
01:21:08,040 --> 01:21:10,680
if x wins the game, that has a value of 1.
1599
01:21:10,680 --> 01:21:13,200
If o has won the game, that has a value of negative 1.
1600
01:21:13,200 --> 01:21:16,520
If nobody has won the game, that has a value of 0.
1601
01:21:16,520 --> 01:21:18,840
So let's take a look at each of these in turn.
1602
01:21:18,840 --> 01:21:23,240
The initial state, we can just represent in tic-tac-toe as the empty game board.
1603
01:21:23,240 --> 01:21:24,480
This is where we begin.
1604
01:21:24,480 --> 01:21:27,200
It's the place from which we begin this search.
1605
01:21:27,200 --> 01:21:29,600
And again, I'll be representing these things visually,
1606
01:21:29,600 --> 01:21:32,120
but you can imagine this really just being like an array
1607
01:21:32,120 --> 01:21:36,240
or a two-dimensional array of all of these possible squares.
1608
01:21:36,240 --> 01:21:39,640
Then we need the player function that, again, takes a state
1609
01:21:39,640 --> 01:21:41,360
and tells us whose turn it is.
1610
01:21:41,360 --> 01:21:44,800
Assuming x makes the first move, if I have an empty game board,
1611
01:21:44,800 --> 01:21:47,640
then my player function is going to return x.
1612
01:21:47,640 --> 01:21:49,840
And if I have a game board where x has made a move,
1613
01:21:49,840 --> 01:21:52,520
then my player function is going to return o.
1614
01:21:52,520 --> 01:21:54,960
The player function takes a tic-tac-toe game board
1615
01:21:54,960 --> 01:21:58,320
and tells us whose turn it is.
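The player function just described can be sketched in a few lines. This is a hedged illustration, not the course's distribution code: it assumes the board is a 3x3 list of lists holding "X", "O", or None, and that x always moves first, as stated above.

```python
def player(board):
    """Return which player ("X" or "O") moves next in this state."""
    x_count = sum(row.count("X") for row in board)
    o_count = sum(row.count("O") for row in board)
    # X moves first, so X is to move whenever the counts are equal.
    return "X" if x_count == o_count else "O"

empty = [[None] * 3 for _ in range(3)]
assert player(empty) == "X"   # empty board: x moves first
```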
1616
01:21:58,320 --> 01:22:01,000
Next up, we'll consider the actions function.
1617
01:22:01,000 --> 01:22:04,080
The actions function, much like it did in classical search,
1618
01:22:04,080 --> 01:22:08,000
takes a state and gives us the set of all of the possible actions
1619
01:22:08,000 --> 01:22:10,520
we can take in that state.
1620
01:22:10,520 --> 01:22:15,480
So let's imagine it's o's turn to move in a game board that looks like this.
1621
01:22:15,480 --> 01:22:18,240
What happens when we pass it into the actions function?
1622
01:22:18,240 --> 01:22:22,120
So the actions function takes this state of the game as input,
1623
01:22:22,120 --> 01:22:25,000
and the output is a set of possible actions.
1624
01:22:25,000 --> 01:22:27,560
It's a set: I could move in the upper left
1625
01:22:27,560 --> 01:22:29,720
or I could move in the bottom middle.
1626
01:22:29,720 --> 01:22:31,720
So those are the two possible action choices
1627
01:22:31,720 --> 01:22:36,320
that I have when I begin in this particular state.
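As a sketch of the actions function under the same assumed representation (a 3x3 list of lists with "X", "O", or None), the set of possible actions is just the set of empty squares as (row, column) pairs:

```python
def actions(board):
    """Return the set of all possible moves (i, j) in this state."""
    return {(i, j)
            for i in range(3)
            for j in range(3)
            if board[i][j] is None}

# A board with only two open squares yields exactly two actions.
board = [["X", "O", "X"],
         ["X", "O", None],
         [None, "X", "O"]]
assert actions(board) == {(1, 2), (2, 0)}
```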
1628
01:22:36,320 --> 01:22:39,240
Now, just as before, when we had states and actions,
1629
01:22:39,240 --> 01:22:41,600
we need some sort of transition model to tell us
1630
01:22:41,600 --> 01:22:45,520
when we take this action in the state, what is the new state that we get.
1631
01:22:45,520 --> 01:22:48,200
And here, we define that using the result function
1632
01:22:48,200 --> 01:22:51,600
that takes a state as input as well as an action.
1633
01:22:51,600 --> 01:22:54,640
And when we apply the result function to this state,
1634
01:22:54,640 --> 01:22:58,040
saying that let's let o move in this upper left corner,
1635
01:22:58,040 --> 01:23:01,480
the new state we get is this resulting state where o is in the upper left
1636
01:23:01,480 --> 01:23:02,040
corner.
1637
01:23:02,040 --> 01:23:04,800
And now, this seems obvious to someone who knows how to play tic-tac-toe.
1638
01:23:04,800 --> 01:23:06,840
Of course, you play in the upper left corner.
1639
01:23:06,840 --> 01:23:07,960
That's the board you get.
1640
01:23:07,960 --> 01:23:11,360
But all of this information needs to be encoded into the AI.
1641
01:23:11,360 --> 01:23:14,120
The AI doesn't know how to play tic-tac-toe until you
1642
01:23:14,120 --> 01:23:17,280
tell the AI how the rules of tic-tac-toe work.
1643
01:23:17,280 --> 01:23:19,760
And this function, defining this function here,
1644
01:23:19,760 --> 01:23:23,200
allows us to tell the AI how this game actually works
1645
01:23:23,200 --> 01:23:27,320
and how actions actually affect the outcome of the game.
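A sketch of that transition model follows, again assuming the 3x3 list-of-lists board. One simplification to flag: the lecture's result(s, a) takes only a state and an action and would derive whose mark to place via the player function; passing the mark explicitly here keeps the sketch self-contained. Copying the board rather than mutating it matters, because search needs the original state intact to explore other branches.

```python
import copy

def result(board, action, mark):
    """Transition model: the new board that results from placing
    `mark` ("X" or "O") at `action`, a (row, column) pair."""
    i, j = action
    if board[i][j] is not None:
        raise ValueError("square is already taken")
    new_board = copy.deepcopy(board)  # leave the original state untouched
    new_board[i][j] = mark
    return new_board
```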
1646
01:23:27,320 --> 01:23:29,720
So the AI needs to know how the game works.
1647
01:23:29,720 --> 01:23:32,240
The AI also needs to know when the game is over,
1648
01:23:32,240 --> 01:23:36,640
and we do that by defining a function called terminal that takes as input a state s,
1649
01:23:36,640 --> 01:23:39,360
such that if we take a game that is not yet over,
1650
01:23:39,360 --> 01:23:42,280
pass it into the terminal function, the output is false.
1651
01:23:42,280 --> 01:23:43,680
The game is not over.
1652
01:23:43,680 --> 01:23:47,320
But if we take a game that is over because x has gotten three in a row
1653
01:23:47,320 --> 01:23:50,400
along that diagonal, pass that into the terminal function,
1654
01:23:50,400 --> 01:23:55,040
then the output is going to be true because the game now is, in fact, over.
1655
01:23:55,040 --> 01:23:58,160
And finally, we've told the AI how the game works
1656
01:23:58,160 --> 01:24:01,320
in terms of what moves can be made and what happens when you make those moves.
1657
01:24:01,320 --> 01:24:03,320
We've told the AI when the game is over.
1658
01:24:03,320 --> 01:24:07,400
Now we need to tell the AI what the value of each of those states is.
1659
01:24:07,400 --> 01:24:11,320
And we do that by defining this utility function that takes a state s
1660
01:24:11,320 --> 01:24:14,880
and tells us the score or the utility of that state.
1661
01:24:14,880 --> 01:24:18,880
So again, we said that if x wins the game, that utility is a value of 1,
1662
01:24:18,880 --> 01:24:23,480
whereas if o wins the game, then the utility of that is negative 1.
1663
01:24:23,480 --> 01:24:26,360
And the AI needs to know, for each of these terminal states
1664
01:24:26,360 --> 01:24:30,840
where the game is over, what is the utility of that state?
1665
01:24:30,840 --> 01:24:34,560
So if I give you a game board like this where the game is, in fact, over,
1666
01:24:34,560 --> 01:24:38,840
and I ask the AI to tell me what the value of that state is, it could do so.
1667
01:24:38,840 --> 01:24:42,000
The value of the state is 1.
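The terminal and utility functions just described can be sketched together, assuming the same 3x3 list-of-lists board; the winner helper is an invented convenience for the sketch, not something the lecture names.

```python
def winner(board):
    """Return "X" or "O" if someone has three in a row, else None."""
    lines = [[(i, 0), (i, 1), (i, 2)] for i in range(3)]      # rows
    lines += [[(0, j), (1, j), (2, j)] for j in range(3)]     # columns
    lines += [[(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)]]
    for line in lines:
        marks = {board[i][j] for i, j in line}
        if len(marks) == 1 and None not in marks:
            return marks.pop()
    return None

def terminal(board):
    """True if the game is over: someone won, or every square is filled."""
    if winner(board) is not None:
        return True
    return all(square is not None for row in board for square in row)

def utility(board):
    """Score of a terminal board: 1 if X won, -1 if O won, 0 otherwise."""
    w = winner(board)
    return 1 if w == "X" else -1 if w == "O" else 0
```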
1668
01:24:42,000 --> 01:24:46,400
Where things get interesting, though, is if the game is not yet over.
1669
01:24:46,400 --> 01:24:49,480
Let's imagine a game board like this, where in the middle of the game,
1670
01:24:49,480 --> 01:24:52,000
it's o's turn to make a move.
1671
01:24:52,000 --> 01:24:54,120
So how do we know it's o's turn to make a move?
1672
01:24:54,120 --> 01:24:56,480
We can calculate that using the player function.
1673
01:24:56,480 --> 01:25:00,120
We can say player of s, pass in the state, o is the answer.
1674
01:25:00,120 --> 01:25:02,280
So we know it's o's turn to move.
1675
01:25:02,280 --> 01:25:06,800
And now, what is the value of this board and what action should o take?
1676
01:25:06,800 --> 01:25:08,080
Well, that's going to depend.
1677
01:25:08,080 --> 01:25:09,880
We have to do some calculation here.
1678
01:25:09,880 --> 01:25:13,600
And this is where the minimax algorithm really comes in.
1679
01:25:13,600 --> 01:25:16,720
Recall that x is trying to maximize the score, which
1680
01:25:16,720 --> 01:25:19,840
means that o is trying to minimize the score.
1681
01:25:19,840 --> 01:25:22,800
So o would like to minimize the total value
1682
01:25:22,800 --> 01:25:25,000
that we get at the end of the game.
1683
01:25:25,000 --> 01:25:27,440
And because this game isn't over yet, we don't really
1684
01:25:27,440 --> 01:25:30,960
know just yet what the value of this game board is.
1685
01:25:30,960 --> 01:25:34,320
We have to do some calculation in order to figure that out.
1686
01:25:34,320 --> 01:25:36,560
And so how do we do that kind of calculation?
1687
01:25:36,560 --> 01:25:39,160
Well, in order to do so, we're going to consider,
1688
01:25:39,160 --> 01:25:41,800
just as we might in a classical search situation,
1689
01:25:41,800 --> 01:25:46,160
what actions could happen next and what states will that take us to.
1690
01:25:46,160 --> 01:25:50,120
And it turns out that in this position, there are only two open squares,
1691
01:25:50,120 --> 01:25:54,760
which means there are only two open places where o can make a move.
1692
01:25:54,760 --> 01:25:57,200
o could either make a move in the upper left
1693
01:25:57,200 --> 01:26:00,280
or o can make a move in the bottom middle.
1694
01:26:00,280 --> 01:26:03,080
And minimax doesn't know right out of the box which of those moves
1695
01:26:03,080 --> 01:26:04,360
is going to be better.
1696
01:26:04,360 --> 01:26:06,640
So it's going to consider both.
1697
01:26:06,640 --> 01:26:08,560
But now, we sort of run into the same situation.
1698
01:26:08,560 --> 01:26:11,280
Now, I have two more game boards, neither of which is over.
1699
01:26:11,280 --> 01:26:12,720
What happens next?
1700
01:26:12,720 --> 01:26:14,520
And now, it's in this sense that minimax is
1701
01:26:14,520 --> 01:26:16,800
what we'll call a recursive algorithm.
1702
01:26:16,800 --> 01:26:20,480
It's going to now repeat the exact same process,
1703
01:26:20,480 --> 01:26:23,760
although now considering it from the opposite perspective.
1704
01:26:23,760 --> 01:26:27,400
It's as if I am now going to put myself, if I am the o player,
1705
01:26:27,400 --> 01:26:31,680
I'm going to put myself in my opponent's shoes, my opponent as the x player,
1706
01:26:31,680 --> 01:26:36,160
and consider what would my opponent do if they were in this position?
1707
01:26:36,160 --> 01:26:40,200
What would my opponent do, the x player, if they were in that position?
1708
01:26:40,200 --> 01:26:41,600
And what would then happen?
1709
01:26:41,600 --> 01:26:44,400
Well, the other player, my opponent, the x player,
1710
01:26:44,400 --> 01:26:46,920
is trying to maximize the score, whereas I
1711
01:26:46,920 --> 01:26:49,400
am trying to minimize the score as the o player.
1712
01:26:49,400 --> 01:26:53,680
So x is trying to find the maximum possible value that they can get.
1713
01:26:53,680 --> 01:26:55,520
And so what's going to happen?
1714
01:26:55,520 --> 01:26:58,920
Well, from this board position, x only has one choice.
1715
01:26:58,920 --> 01:27:01,720
x is going to play here, and they're going to get three in a row.
1716
01:27:01,720 --> 01:27:05,680
And we know that that board, x winning, that has a value of 1.
1717
01:27:05,680 --> 01:27:09,360
If x wins the game, the value of that game board is 1.
1718
01:27:09,360 --> 01:27:14,120
And so from this position, if this state can only ever
1719
01:27:14,120 --> 01:27:16,720
lead to this state, it's the only possible option,
1720
01:27:16,720 --> 01:27:21,120
and this state has a value of 1, then the maximum possible value
1721
01:27:21,120 --> 01:27:24,560
that the x player can get from this game board is also 1.
1722
01:27:24,560 --> 01:27:27,800
From here, the only place we can get is to a game with a value of 1,
1723
01:27:27,800 --> 01:27:31,400
so this game board also has a value of 1.
1724
01:27:31,400 --> 01:27:33,680
Now we consider this one over here.
1725
01:27:33,680 --> 01:27:34,960
What's going to happen now?
1726
01:27:34,960 --> 01:27:36,480
Well, x needs to make a move.
1727
01:27:36,480 --> 01:27:39,680
The only move x can make is in the upper left, so x will go there.
1728
01:27:39,680 --> 01:27:41,400
And in this game, no one wins the game.
1729
01:27:41,400 --> 01:27:42,960
Nobody has three in a row.
1730
01:27:42,960 --> 01:27:45,760
And so the value of that game board is 0.
1731
01:27:45,760 --> 01:27:47,040
Nobody has won.
1732
01:27:47,040 --> 01:27:50,280
And so again, by the same logic, if from this board position
1733
01:27:50,280 --> 01:27:53,920
the only place we can get to is a board where the value is 0,
1734
01:27:53,920 --> 01:27:57,440
then this state must also have a value of 0.
1735
01:27:57,440 --> 01:28:01,520
And now here comes the choice part, the idea of trying to minimize.
1736
01:28:01,520 --> 01:28:05,760
I, as the o player, now know that if I make this choice moving in the upper
1737
01:28:05,760 --> 01:28:09,320
left, that is going to result in a game with a value of 1,
1738
01:28:09,320 --> 01:28:11,400
assuming everyone plays optimally.
1739
01:28:11,400 --> 01:28:13,200
And if I instead play in the lower middle,
1740
01:28:13,200 --> 01:28:15,480
choose this fork in the road, that is going
1741
01:28:15,480 --> 01:28:17,640
to result in a game board with a value of 0.
1742
01:28:17,640 --> 01:28:18,760
I have two options.
1743
01:28:18,760 --> 01:28:22,400
I have a 1 and a 0 to choose from, and I need to pick.
1744
01:28:22,400 --> 01:28:25,200
And as the min player, I would rather choose
1745
01:28:25,200 --> 01:28:27,000
the option with the minimum value.
1746
01:28:27,000 --> 01:28:29,220
So whenever a player has multiple choices,
1747
01:28:29,220 --> 01:28:32,160
the min player will choose the option with the smallest value.
1748
01:28:32,160 --> 01:28:34,880
The max player will choose the option with the largest value.
1749
01:28:34,880 --> 01:28:37,520
Between the 1 and the 0, the 0 is smaller,
1750
01:28:37,520 --> 01:28:40,760
meaning I'd rather tie the game than lose the game.
1751
01:28:40,760 --> 01:28:44,200
And so this game board will say also has a value of 0,
1752
01:28:44,200 --> 01:28:48,400
because if I am playing optimally, I will pick this fork in the road.
1753
01:28:48,400 --> 01:28:53,000
I'll place my o here to block x's 3 in a row, x will move in the upper left,
1754
01:28:53,000 --> 01:28:56,440
and the game will be over, and no one will have won the game.
1755
01:28:56,440 --> 01:29:00,400
So this is now the logic of minimax, to consider all of the possible options
1756
01:29:00,400 --> 01:29:03,280
that I can take, all of the actions that I can take,
1757
01:29:03,280 --> 01:29:05,540
and then to put myself in my opponent's shoes.
1758
01:29:05,540 --> 01:29:08,680
I decide what move I'm going to make now by considering
1759
01:29:08,680 --> 01:29:11,000
what move my opponent will make on the next turn.
1760
01:29:11,000 --> 01:29:14,360
And to do that, I consider what move I would make on the turn after that,
1761
01:29:14,360 --> 01:29:17,240
so on and so forth, until I get all the way down
1762
01:29:17,240 --> 01:29:21,200
to the end of the game, to one of these so-called terminal states.
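The shape of that recursion, which checks for a terminal state, otherwise recurses over every action, and lets the mover take the max or min, can be shown on a toy game far smaller than tic-tac-toe. This is an invented stand-in, not from the course: players alternately take 1 or 2 stones, and whoever takes the last stone wins.

```python
def minimax_value(stones, max_to_move):
    """Value of the toy game under optimal play by both sides.
    Returns 1 if the MAX player wins, -1 if the MIN player wins."""
    if stones == 0:
        # Terminal state: the previous player took the last stone and
        # won, so the player now to move has lost.
        return -1 if max_to_move else 1
    values = [minimax_value(stones - take, not max_to_move)
              for take in (1, 2) if take <= stones]
    # The max player picks the largest value, the min player the smallest.
    return max(values) if max_to_move else min(values)
```

With 4 stones the first player can force a win by taking 1, leaving the opponent a losing position of 3; this is the same backward reasoning from terminal states that minimax applies to the much larger tic-tac-toe tree.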
1763
01:29:21,200 --> 01:29:25,280
In fact, this very decision point, where I am trying to decide as the o player
1764
01:29:25,280 --> 01:29:27,360
what to make a decision about, might have just
1765
01:29:27,360 --> 01:29:31,640
been a part of the logic that the x player, my opponent, was using,
1766
01:29:31,640 --> 01:29:32,520
the move before me.
1767
01:29:32,520 --> 01:29:35,320
This might be part of some larger tree, where
1768
01:29:35,320 --> 01:29:37,720
x is trying to make a move in this situation,
1769
01:29:37,720 --> 01:29:40,300
and needs to pick between three different options in order
1770
01:29:40,300 --> 01:29:42,560
to make a decision about what to happen.
1771
01:29:42,560 --> 01:29:45,400
And the further and further away we are from the end of the game,
1772
01:29:45,400 --> 01:29:47,240
the deeper this tree has to go.
1773
01:29:47,240 --> 01:29:51,760
Because every level in this tree is going to correspond to one move,
1774
01:29:51,760 --> 01:29:55,040
one move or action that I take, one move or action
1775
01:29:55,040 --> 01:29:58,480
that my opponent takes, in order to decide what happens.
1776
01:29:58,480 --> 01:30:02,120
And in fact, it turns out that if I am the x player in this position,
1777
01:30:02,120 --> 01:30:05,480
and I recursively do the logic, I see that I have three choices,
1778
01:30:05,480 --> 01:30:08,240
one of which leads to a value of 0.
1779
01:30:08,240 --> 01:30:12,040
If I play here, and if everyone plays optimally, the game will be a tie.
1780
01:30:12,040 --> 01:30:17,120
If I play here, then o is going to win, and I'll lose playing optimally.
1781
01:30:17,120 --> 01:30:21,520
Or here, where I, the x player, can win, well between a score of 0,
1782
01:30:21,520 --> 01:30:25,200
and negative 1, and 1, I'd rather pick the board with a value of 1,
1783
01:30:25,200 --> 01:30:27,320
because that's the maximum value I can get.
1784
01:30:27,320 --> 01:30:31,680
And so this board would also have a maximum value of 1.
1785
01:30:31,680 --> 01:30:35,160
And so this tree can get very, very deep, especially as the game
1786
01:30:35,160 --> 01:30:37,520
starts to have more and more moves.
1787
01:30:37,520 --> 01:30:39,600
And this logic works not just for tic-tac-toe,
1788
01:30:39,600 --> 01:30:41,880
but any of these sorts of games, where I make a move,
1789
01:30:41,880 --> 01:30:44,120
my opponent makes a move, and ultimately, we
1790
01:30:44,120 --> 01:30:46,600
have these adversarial objectives.
1791
01:30:46,600 --> 01:30:50,480
And we can simplify the diagram into a diagram that looks like this.
1792
01:30:50,480 --> 01:30:53,360
This is a more abstract version of the minimax tree,
1793
01:30:53,360 --> 01:30:56,040
where these are each states, but I'm no longer representing them
1794
01:30:56,040 --> 01:30:57,960
as exactly like tic-tac-toe boards.
1795
01:30:57,960 --> 01:31:01,840
This is just representing some generic game that might be tic-tac-toe,
1796
01:31:01,840 --> 01:31:04,280
might be some other game altogether.
1797
01:31:04,280 --> 01:31:06,720
Any of these green arrows that are pointing up,
1798
01:31:06,720 --> 01:31:08,720
that represents a maximizing state.
1799
01:31:08,720 --> 01:31:11,440
I would like the score to be as big as possible.
1800
01:31:11,440 --> 01:31:13,560
And any of these red arrows pointing down,
1801
01:31:13,560 --> 01:31:16,720
those are minimizing states, where the player is the min player,
1802
01:31:16,720 --> 01:31:20,320
and they are trying to make the score as small as possible.
1803
01:31:20,320 --> 01:31:24,320
So if you imagine in this situation, I am the maximizing player, this player
1804
01:31:24,320 --> 01:31:26,600
here, and I have three choices.
1805
01:31:26,600 --> 01:31:30,320
One choice gives me a score of 5, one choice gives me a score of 3,
1806
01:31:30,320 --> 01:31:32,360
and one choice gives me a score of 9.
1807
01:31:32,360 --> 01:31:36,200
Well, then between those three choices, my best option
1808
01:31:36,200 --> 01:31:38,920
is to choose this 9 over here, the score that
1809
01:31:38,920 --> 01:31:42,120
maximizes my options out of all the three options.
1810
01:31:42,120 --> 01:31:46,720
And so I can give this state a value of 9, because among my three options,
1811
01:31:46,720 --> 01:31:50,480
that is the best choice that I have available to me.
1812
01:31:50,480 --> 01:31:51,960
So that's my decision now.
1813
01:31:51,960 --> 01:31:55,640
You imagine it's like one move away from the end of the game.
1814
01:31:55,640 --> 01:31:57,800
But then you could also ask a reasonable question,
1815
01:31:57,800 --> 01:32:01,480
what might my opponent do two moves away from the end of the game?
1816
01:32:01,480 --> 01:32:03,160
My opponent is the minimizing player.
1817
01:32:03,160 --> 01:32:05,840
They are trying to make the score as small as possible.
1818
01:32:05,840 --> 01:32:09,840
Imagine what would have happened if they had to pick which choice to make.
1819
01:32:09,840 --> 01:32:13,520
One choice leads us to this state, where I, the maximizing player,
1820
01:32:13,520 --> 01:32:16,960
am going to opt for 9, the biggest score that I can get.
1821
01:32:16,960 --> 01:32:21,280
And 1 leads to this state, where I, the maximizing player,
1822
01:32:21,280 --> 01:32:25,040
would choose 8, which is then the largest score that I can get.
1823
01:32:25,040 --> 01:32:28,920
Now the minimizing player, forced to choose between a 9 or an 8,
1824
01:32:28,920 --> 01:32:31,200
is going to choose the smallest possible score,
1825
01:32:31,200 --> 01:32:33,240
which in this case is an 8.
1826
01:32:33,240 --> 01:32:35,480
And that is then how this process would unfold,
1827
01:32:35,480 --> 01:32:39,160
that the minimizing player in this case considers both of their options,
1828
01:32:39,160 --> 01:32:43,720
and then all of the options that would happen as a result of that.
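That two-ply reasoning can be checked in a couple of lines of Python. The 5, 3, 9 scores and the 8 come from the example above; the second branch's other leaves aren't stated in the walkthrough, so only its maximum, 8, is used here:

```python
# The max player's value for each of the min player's two successor
# states, taken from the example: max(5, 3, 9) = 9 in one branch, and
# 8 in the other (only that branch's maximum is given above).
branch_values = [max([5, 3, 9]), 8]

# The minimizing player, forced to choose between a 9 and an 8,
# takes the smaller one.
forced_outcome = min(branch_values)
```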
1829
01:32:43,720 --> 01:32:47,560
So this now is a general picture of what the minimax algorithm looks like.
1830
01:32:47,560 --> 01:32:50,760
Let's now try to formalize it using a little bit of pseudocode.
1831
01:32:50,760 --> 01:32:53,880
So what exactly is happening in the minimax algorithm?
1832
01:32:53,880 --> 01:32:57,640
Well, given a state s, we need to decide what ought to happen.
1833
01:32:57,640 --> 01:33:00,960
If it's the max player's turn,
1834
01:33:00,960 --> 01:33:05,360
then max is going to pick an action a in actions of s.
1835
01:33:05,360 --> 01:33:08,240
Recall that actions is a function that takes a state
1836
01:33:08,240 --> 01:33:11,080
and gives me back all of the possible actions that I can take.
1837
01:33:11,080 --> 01:33:15,000
It tells me all of the moves that are possible.
1838
01:33:15,000 --> 01:33:19,560
The max player is going to specifically pick an action a in this set of actions
1839
01:33:19,560 --> 01:33:26,360
that gives me the highest value of min value of result of s and a.
1840
01:33:26,360 --> 01:33:27,480
So what does that mean?
1841
01:33:27,480 --> 01:33:30,120
Well, it means that I want to make the option that
1842
01:33:30,120 --> 01:33:34,000
gives me the highest score of all of the actions a.
1843
01:33:34,000 --> 01:33:35,760
But what score is that going to have?
1844
01:33:35,760 --> 01:33:38,920
To calculate that, I need to know what my opponent, the min player,
1845
01:33:38,920 --> 01:33:44,520
is going to do if they try to minimize the value of the state that results.
1846
01:33:44,520 --> 01:33:48,400
So we say, what state results after I take this action?
1847
01:33:48,400 --> 01:33:53,720
And what happens when the min player tries to minimize the value of that state?
1848
01:33:53,720 --> 01:33:56,440
I consider that for all of my possible options.
1849
01:33:56,440 --> 01:33:58,960
And after I've considered that for all of my possible options,
1850
01:33:58,960 --> 01:34:02,800
I pick the action a that has the highest value.
1851
01:34:02,800 --> 01:34:06,240
Likewise, the min player is going to do the same thing but backwards.
1852
01:34:06,240 --> 01:34:09,320
They're also going to consider what are all of the possible actions they
1853
01:34:09,320 --> 01:34:10,960
can take if it's their turn.
1854
01:34:10,960 --> 01:34:12,880
And they're going to pick the action a that
1855
01:34:12,880 --> 01:34:16,160
has the smallest possible value of all the options.
1856
01:34:16,160 --> 01:34:19,120
And the way they know what the smallest possible value of all the options
1857
01:34:19,120 --> 01:34:24,480
is, is by considering what the max player is going to do, by saying,
1858
01:34:24,480 --> 01:34:27,880
what's the result of applying this action to the current state?
1859
01:34:27,880 --> 01:34:29,800
And then what would the max player try to do?
1860
01:34:29,800 --> 01:34:34,040
What value would the max player calculate for that particular state?
1861
01:34:34,040 --> 01:34:36,400
So everyone makes their decision based on trying
1862
01:34:36,400 --> 01:34:39,920
to estimate what the other person would do.
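As a sketch in Python, that decision rule might look like the following. The dictionary game and the helper definitions are illustrative stand-ins for the lecture's actions/result interface, in a game so short that the min player's reply ends it:

```python
# Toy game: max picks a branch, then min picks a leaf score inside it.
# The numbers here are illustrative, not from a real game.
game = {"left": [4, 8, 5], "middle": [9, 3, 7], "right": [2, 4, 6]}

def actions(state):
    # All moves available in this state.
    return list(state.keys())

def result(state, action):
    # The state that results from taking `action` in `state`.
    return state[action]

def min_value(leaves):
    # One move from the end, so min simply takes the smallest leaf.
    return min(leaves)

def minimax_decision(state):
    # Max picks the action a that maximizes min_value(result(state, a)).
    return max(actions(state), key=lambda a: min_value(result(state, a)))
```

Here min would hold max to 4, 3, or 2 in the three branches, so the maximizer's best choice is the branch worth 4.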
1863
01:34:39,920 --> 01:34:43,680
And now we need to turn our attention to these two functions, max value
1864
01:34:43,680 --> 01:34:44,680
and min value.
1865
01:34:44,680 --> 01:34:47,840
How do you actually calculate the value of a state
1866
01:34:47,840 --> 01:34:50,000
if you're trying to maximize its value?
1867
01:34:50,000 --> 01:34:52,080
And how do you calculate the value of a state
1868
01:34:52,080 --> 01:34:53,840
if you're trying to minimize the value?
1869
01:34:53,840 --> 01:34:56,880
If you can do that, then we have an entire implementation
1870
01:34:56,880 --> 01:34:58,960
of this minimax algorithm.
1871
01:34:58,960 --> 01:34:59,640
So let's try it.
1872
01:34:59,640 --> 01:35:03,960
Let's try and implement this max value function that takes a state
1873
01:35:03,960 --> 01:35:06,680
and returns as output the value of that state
1874
01:35:06,680 --> 01:35:10,000
if I'm trying to maximize the value of the state.
1875
01:35:10,000 --> 01:35:13,000
Well, the first thing I can check for is to see if the game is over.
1876
01:35:13,000 --> 01:35:14,920
Because if the game is over, in other words,
1877
01:35:14,920 --> 01:35:18,120
if the state is a terminal state, then this is easy.
1878
01:35:18,120 --> 01:35:21,080
I already have this utility function that tells me
1879
01:35:21,080 --> 01:35:22,440
what the value of the board is.
1880
01:35:22,440 --> 01:35:26,280
If the game is over, I just check, did x win, did o win, is it a tie?
1881
01:35:26,280 --> 01:35:30,320
And this utility function just knows what the value of the state is.
1882
01:35:30,320 --> 01:35:32,480
What's trickier is if the game isn't over.
1883
01:35:32,480 --> 01:35:35,360
Because then I need to do this recursive reasoning about thinking,
1884
01:35:35,360 --> 01:35:39,000
what is my opponent going to do on the next move?
1885
01:35:39,000 --> 01:35:41,960
And I want to calculate the value of this state.
1886
01:35:41,960 --> 01:35:45,120
And I want the value of the state to be as high as possible.
1887
01:35:45,120 --> 01:35:48,280
And I'll keep track of that value in a variable called v.
1888
01:35:48,280 --> 01:35:50,720
And if I want the value to be as high as possible,
1889
01:35:50,720 --> 01:35:53,480
I need to give v an initial value.
1890
01:35:53,480 --> 01:35:57,640
And initially, I'll just go ahead and set it to be as low as possible.
1891
01:35:57,640 --> 01:36:00,800
Because I don't know what options are available to me yet.
1892
01:36:00,800 --> 01:36:04,760
So initially, I'll set v equal to negative infinity, which
1893
01:36:04,760 --> 01:36:06,040
seems a little bit strange.
1894
01:36:06,040 --> 01:36:08,040
But the idea here is I want the value initially
1895
01:36:08,040 --> 01:36:09,840
to be as low as possible.
1896
01:36:09,840 --> 01:36:12,360
Because as I consider my actions, I'm always
1897
01:36:12,360 --> 01:36:16,360
going to try and do better than v. And if I set v to negative infinity,
1898
01:36:16,360 --> 01:36:19,000
I know I can always do better than that.
1899
01:36:19,000 --> 01:36:21,280
So now I consider my actions.
1900
01:36:21,280 --> 01:36:22,880
And this is going to be some kind of loop
1901
01:36:22,880 --> 01:36:26,240
where for every action in actions of state,
1902
01:36:26,240 --> 01:36:29,000
recall actions as a function that takes my state
1903
01:36:29,000 --> 01:36:32,720
and gives me all the possible actions that I can use in that state.
1904
01:36:32,720 --> 01:36:37,600
So for each one of those actions, I want to compare it to v and say,
1905
01:36:37,600 --> 01:36:44,360
all right, v is going to be equal to the maximum of v and this expression.
1906
01:36:44,360 --> 01:36:46,160
So what is this expression?
1907
01:36:46,160 --> 01:36:50,800
Well, first it is get the result of taking the action in the state
1908
01:36:50,800 --> 01:36:54,320
and then get the min value of that.
1909
01:36:54,320 --> 01:36:58,240
In other words, let's say I want to find out from that state
1910
01:36:58,240 --> 01:37:00,760
what is the best that the min player can do because they're
1911
01:37:00,760 --> 01:37:02,560
going to try and minimize the score.
1912
01:37:02,560 --> 01:37:06,360
So whatever the resulting score is of the min value of that state,
1913
01:37:06,360 --> 01:37:10,040
compare it to my current best value and just pick the maximum of those two
1914
01:37:10,040 --> 01:37:12,640
because I am trying to maximize the value.
1915
01:37:12,640 --> 01:37:14,720
In short, what these three lines of code are doing
1916
01:37:14,720 --> 01:37:18,520
is going through all of my possible actions and asking the question,
1917
01:37:18,520 --> 01:37:24,040
how do I maximize the score given what my opponent is going to try to do?
1918
01:37:24,040 --> 01:37:26,800
After this entire loop, I can just return v
1919
01:37:26,800 --> 01:37:30,280
and that is now the value of that particular state.
1920
01:37:30,280 --> 01:37:32,800
And for the min player, it's the exact opposite of this,
1921
01:37:32,800 --> 01:37:35,080
the same logic just backwards.
1922
01:37:35,080 --> 01:37:37,080
To calculate the minimum value of a state,
1923
01:37:37,080 --> 01:37:38,920
first we check if it's a terminal state.
1924
01:37:38,920 --> 01:37:41,120
If it is, we return its utility.
1925
01:37:41,120 --> 01:37:45,440
Otherwise, we're going to now try to minimize the value of the state
1926
01:37:45,440 --> 01:37:47,440
given all of my possible actions.
1927
01:37:47,440 --> 01:37:50,920
So I need an initial value for v, the value of the state.
1928
01:37:50,920 --> 01:37:53,800
And initially, I'll set it to infinity because I
1929
01:37:53,800 --> 01:37:56,440
know I can always get something less than infinity.
1930
01:37:56,440 --> 01:38:00,040
So by starting with v equals infinity, I make sure that the very first action
1931
01:38:00,040 --> 01:38:03,680
I find, that will be less than this value of v.
1932
01:38:03,680 --> 01:38:07,200
And then I do the same thing, loop over all of my possible actions.
1933
01:38:07,200 --> 01:38:10,760
And for each of the results that we could get when the max player makes
1934
01:38:10,760 --> 01:38:15,280
their decision, let's take the minimum of that and the current value of v.
1935
01:38:15,280 --> 01:38:19,360
So after all is said and done, I get the smallest possible value of v
1936
01:38:19,360 --> 01:38:22,520
that I then return back to the user.
1937
01:38:22,520 --> 01:38:25,160
So that, in effect, is the pseudocode for Minimax.
1938
01:38:25,160 --> 01:38:28,120
That is how we take a game and figure out what the best move to make
1939
01:38:28,120 --> 01:38:32,480
is by recursively using these max value and min value functions,
1940
01:38:32,480 --> 01:38:36,920
where max value calls min value, min value calls max value back and forth,
1941
01:38:36,920 --> 01:38:39,680
all the way until we reach a terminal state, at which point
1942
01:38:39,680 --> 01:38:45,080
our algorithm can simply return the utility of that particular state.
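Put together in Python, the two mutually recursive functions might look like this sketch. The list-based stand-in game at the bottom (a terminal state is just a number equal to its own utility) is an assumption for demonstration, not the course's tic-tac-toe code:

```python
import math

def max_value(state):
    """Value of `state` when the maximizing player moves next."""
    if terminal(state):
        return utility(state)
    v = -math.inf  # start as low as possible; any real option beats this
    for action in actions(state):
        v = max(v, min_value(result(state, action)))
    return v

def min_value(state):
    """Value of `state` when the minimizing player moves next."""
    if terminal(state):
        return utility(state)
    v = math.inf  # start as high as possible
    for action in actions(state):
        v = min(v, max_value(result(state, action)))
    return v

# Stand-in game interface: a state is either a number (terminal, and
# its own utility) or a list of successor states.
def terminal(state):
    return not isinstance(state, list)

def utility(state):
    return state

def actions(state):
    return range(len(state))

def result(state, action):
    return state[action]
```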
1943
01:38:45,080 --> 01:38:48,760
So what you might imagine is that this is going to start to be a long process,
1944
01:38:48,760 --> 01:38:51,160
especially as games start to get more complex,
1945
01:38:51,160 --> 01:38:54,720
as we start to add more moves and more possible options and games that
1946
01:38:54,720 --> 01:38:56,840
might last quite a bit longer.
1947
01:38:56,840 --> 01:39:00,520
So the next question to ask is, what sort of optimizations can we make here?
1948
01:39:00,520 --> 01:39:05,360
How can we do better in order to use less space or take less time
1949
01:39:05,360 --> 01:39:08,120
to be able to solve this kind of problem?
1950
01:39:08,120 --> 01:39:10,880
And we'll take a look at a couple of possible optimizations.
1951
01:39:10,880 --> 01:39:13,360
But for one, we'll take a look at this example.
1952
01:39:13,360 --> 01:39:15,880
Again, returning to these up arrows and down arrows,
1953
01:39:15,880 --> 01:39:20,200
let's imagine that I now am the max player, this green arrow.
1954
01:39:20,200 --> 01:39:23,400
I am trying to make this score as high as possible.
1955
01:39:23,400 --> 01:39:26,480
And this is an easy game where there are just two moves.
1956
01:39:26,480 --> 01:39:29,120
I make a move, one of these three options.
1957
01:39:29,120 --> 01:39:32,040
And then my opponent makes a move, one of these three options,
1958
01:39:32,040 --> 01:39:33,440
based on what move I make.
1959
01:39:33,440 --> 01:39:36,480
And as a result, we get some value.
1960
01:39:36,480 --> 01:39:39,600
Let's look at the order in which I do these calculations
1961
01:39:39,600 --> 01:39:41,760
and figure out if there are any optimizations I
1962
01:39:41,760 --> 01:39:44,600
might be able to make to this calculation process.
1963
01:39:44,600 --> 01:39:47,240
I'm going to have to look at these states one at a time.
1964
01:39:47,240 --> 01:39:49,600
So let's say I start here on the left and say, all right,
1965
01:39:49,600 --> 01:39:52,960
now I'm going to consider, what will the min player, my opponent,
1966
01:39:52,960 --> 01:39:54,400
try to do here?
1967
01:39:54,400 --> 01:39:57,960
Well, the min player is going to look at all three of their possible actions
1968
01:39:57,960 --> 01:40:00,400
and look at their value, because these are terminal states.
1969
01:40:00,400 --> 01:40:01,560
They're the end of the game.
1970
01:40:01,560 --> 01:40:04,980
And so they'll see, all right, this node is a value of four, value of eight,
1971
01:40:04,980 --> 01:40:06,560
value of five.
1972
01:40:06,560 --> 01:40:08,940
And the min player is going to say, well, all right,
1973
01:40:08,940 --> 01:40:13,160
between these three options, four, eight, and five, I'll take the smallest one.
1974
01:40:13,160 --> 01:40:14,200
I'll take the four.
1975
01:40:14,200 --> 01:40:16,760
So this state now has a value of four.
1976
01:40:16,760 --> 01:40:20,200
Then I, as the max player, say, all right, if I take this action,
1977
01:40:20,200 --> 01:40:21,320
it will have a value of four.
1978
01:40:21,320 --> 01:40:23,400
That's the best that I can do, because min player
1979
01:40:23,400 --> 01:40:25,920
is going to try and minimize my score.
1980
01:40:25,920 --> 01:40:27,400
So now what if I take this option?
1981
01:40:27,400 --> 01:40:28,760
We'll explore this next.
1982
01:40:28,760 --> 01:40:32,360
And now explore what the min player would do if I choose this action.
1983
01:40:32,360 --> 01:40:35,400
And the min player is going to say, all right, what are the three options?
1984
01:40:35,400 --> 01:40:39,800
The min player has options between nine, three, and seven.
1985
01:40:39,800 --> 01:40:42,660
And so three is the smallest among nine, three, and seven.
1986
01:40:42,660 --> 01:40:45,720
So we'll go ahead and say this state has a value of three.
1987
01:40:45,720 --> 01:40:49,520
So now I, as the max player, have explored two of my three options.
1988
01:40:49,520 --> 01:40:53,560
I know that one of my options will guarantee me a score of four, at least.
1989
01:40:53,560 --> 01:40:57,240
And one of my options will guarantee me a score of three.
1990
01:40:57,240 --> 01:41:00,320
And now I consider my third option and say, all right, what happens here?
1991
01:41:00,320 --> 01:41:01,240
Same exact logic.
1992
01:41:01,240 --> 01:41:04,280
The min player is going to look at these three states, two, four, and six.
1993
01:41:04,280 --> 01:41:06,400
I'll say the minimum possible option is two.
1994
01:41:06,400 --> 01:41:08,640
So the min player wants the two.
1995
01:41:08,640 --> 01:41:11,920
Now I, as the max player, have calculated all of the information
1996
01:41:11,920 --> 01:41:15,400
by looking two layers deep, by looking at all of these nodes.
1997
01:41:15,400 --> 01:41:18,920
And I can now say, between the four, the three, and the two, you know what?
1998
01:41:18,920 --> 01:41:20,840
I'd rather take the four.
1999
01:41:20,840 --> 01:41:24,360
Because if I choose this option, if my opponent plays optimally,
2000
01:41:24,360 --> 01:41:26,400
they will try and get me to the four.
2001
01:41:26,400 --> 01:41:27,760
But that's the best I can do.
2002
01:41:27,760 --> 01:41:29,960
I can't guarantee a higher score.
2003
01:41:29,960 --> 01:41:32,840
Because if I pick either of these two options, I might get a three
2004
01:41:32,840 --> 01:41:33,920
or I might get a two.
2005
01:41:33,920 --> 01:41:36,440
And it's true that down here is a nine.
2006
01:41:36,440 --> 01:41:38,760
And that's the highest score out of any of the scores.
2007
01:41:38,760 --> 01:41:40,760
So I might be tempted to say, you know what?
2008
01:41:40,760 --> 01:41:43,520
Maybe I should take this option because I might get the nine.
2009
01:41:43,520 --> 01:41:46,120
But if the min player is playing intelligently,
2010
01:41:46,120 --> 01:41:48,800
if they're making the best moves at each possible option
2011
01:41:48,800 --> 01:41:52,520
they have when they get to make a choice, I'll be left with a three.
2012
01:41:52,520 --> 01:41:54,640
Whereas I could, playing optimally,
2013
01:41:54,640 --> 01:41:58,040
have guaranteed that I would get the four.
2014
01:41:58,040 --> 01:42:01,560
So that is, in effect, the logic that I would use as a min and max player
2015
01:42:01,560 --> 01:42:05,040
trying to maximize my score from that node there.
2016
01:42:05,040 --> 01:42:08,360
But it turns out it took quite a bit of computation for me to figure that out.
2017
01:42:08,360 --> 01:42:10,240
I had to reason through all of these nodes
2018
01:42:10,240 --> 01:42:11,840
in order to draw this conclusion.
2019
01:42:11,840 --> 01:42:14,920
And this is for a pretty simple game where I have three choices,
2020
01:42:14,920 --> 01:42:18,400
my opponent has three choices, and then the game's over.
2021
01:42:18,400 --> 01:42:21,160
So what I'd like to do is come up with some way to optimize this.
2022
01:42:21,160 --> 01:42:24,520
Maybe I don't need to do all of this calculation
2023
01:42:24,520 --> 01:42:28,080
to still reach the conclusion that, you know what, this action to the left,
2024
01:42:28,080 --> 01:42:29,960
that's the best that I could do.
2025
01:42:29,960 --> 01:42:33,840
Let's go ahead and try again and try to be a little more intelligent about how
2026
01:42:33,840 --> 01:42:36,200
I go about doing this.
2027
01:42:36,200 --> 01:42:38,600
So first, I start the exact same way.
2028
01:42:38,600 --> 01:42:40,320
I don't know what to do initially, so I just
2029
01:42:40,320 --> 01:42:45,080
have to consider one of the options and consider what the min player might do.
2030
01:42:45,080 --> 01:42:47,720
Min has three options, four, eight, and five.
2031
01:42:47,720 --> 01:42:51,640
And between those three options, min says four is the best they can do
2032
01:42:51,640 --> 01:42:54,520
because they want to try to minimize the score.
2033
01:42:54,520 --> 01:42:58,120
Now I, the max player, will consider my second option,
2034
01:42:58,120 --> 01:43:02,880
making this move here, and considering what my opponent would do in response.
2035
01:43:02,880 --> 01:43:04,600
What will the min player do?
2036
01:43:04,600 --> 01:43:07,720
Well, the min player is going to, from that state, look at their options.
2037
01:43:07,720 --> 01:43:12,040
And I would say, all right, nine is an option, three is an option.
2038
01:43:12,040 --> 01:43:14,360
And if I am doing the math from this initial state,
2039
01:43:14,360 --> 01:43:17,560
doing all this calculation, when I see a three,
2040
01:43:17,560 --> 01:43:20,400
that should immediately be a red flag for me.
2041
01:43:20,400 --> 01:43:23,040
Because when I see a three down here at this state,
2042
01:43:23,040 --> 01:43:28,000
I know that the value of this state is going to be at most three.
2043
01:43:28,000 --> 01:43:30,760
It's going to be three or something less than three,
2044
01:43:30,760 --> 01:43:34,180
even though I haven't yet looked at this last action or even further actions
2045
01:43:34,180 --> 01:43:37,000
if there were more actions that could be taken here.
2046
01:43:37,000 --> 01:43:37,920
How do I know that?
2047
01:43:37,920 --> 01:43:42,120
Well, I know that the min player is going to try to minimize my score.
2048
01:43:42,120 --> 01:43:44,640
And if they see a three, the only way this
2049
01:43:44,640 --> 01:43:47,640
could be something other than a three is if this remaining thing
2050
01:43:47,640 --> 01:43:50,840
that I haven't yet looked at is less than three, which
2051
01:43:50,840 --> 01:43:54,960
means there is no way for this value to be anything more than three
2052
01:43:54,960 --> 01:43:57,520
because the min player can already guarantee a three
2053
01:43:57,520 --> 01:44:01,080
and they are trying to minimize my score.
2054
01:44:01,080 --> 01:44:02,400
So what does that tell me?
2055
01:44:02,400 --> 01:44:04,880
Well, it tells me that if I choose this action,
2056
01:44:04,880 --> 01:44:09,400
my score is going to be three or maybe even less than three if I'm unlucky.
2057
01:44:09,400 --> 01:44:13,720
But I already know that this action will guarantee me a four.
2058
01:44:13,720 --> 01:44:17,360
And so given that I know that this action guarantees me a score of four
2059
01:44:17,360 --> 01:44:20,280
and this action means I can't do better than three,
2060
01:44:20,280 --> 01:44:22,440
if I'm trying to maximize my options, there
2061
01:44:22,440 --> 01:44:25,440
is no need for me to consider this triangle here.
2062
01:44:25,440 --> 01:44:28,120
There is no value, no number that could go here
2063
01:44:28,120 --> 01:44:30,880
that would change my mind between these two options.
2064
01:44:30,880 --> 01:44:34,600
I'm always going to opt for this path that gets me a four as opposed
2065
01:44:34,600 --> 01:44:39,880
to this path where the best I can do is a three if my opponent plays optimally.
2066
01:44:39,880 --> 01:44:43,080
And this is going to be true for all the future states that I look at too.
2067
01:44:43,080 --> 01:44:45,600
That if I look over here at what min player might do over here,
2068
01:44:45,600 --> 01:44:50,640
if I see that this state is a two, I know that this state is at most a two
2069
01:44:50,640 --> 01:44:54,640
because the only way this value could be something other than two
2070
01:44:54,640 --> 01:44:57,960
is if one of these remaining states is less than a two
2071
01:44:57,960 --> 01:45:00,640
and so the min player would opt for that instead.
2072
01:45:00,640 --> 01:45:03,600
So even without looking at these remaining states,
2073
01:45:03,600 --> 01:45:08,760
I as the maximizing player can know that choosing this path to the left
2074
01:45:08,760 --> 01:45:13,400
is going to be better than choosing either of those two paths to the right
2075
01:45:13,400 --> 01:45:16,080
because this one can't be better than three.
2076
01:45:16,080 --> 01:45:17,960
This one can't be better than two.
2077
01:45:17,960 --> 01:45:21,440
And so four in this case is the best that I can do.
2078
01:45:21,440 --> 01:45:23,360
So I can make this cut, and I can say now
2079
01:45:23,360 --> 01:45:25,720
that this state has a value of four.
2080
01:45:25,720 --> 01:45:27,840
So in order to do this type of calculation,
2081
01:45:27,840 --> 01:45:31,120
I was doing a little bit more bookkeeping, keeping track of things,
2082
01:45:31,120 --> 01:45:34,680
keeping track all the time of what is the best that I can do,
2083
01:45:34,680 --> 01:45:37,280
what is the worst that I can do, and for each of these states
2084
01:45:37,280 --> 01:45:41,440
saying, all right, well, if I already know that I can get a four,
2085
01:45:41,440 --> 01:45:44,200
then if the best I can do at this state is a three,
2086
01:45:44,200 --> 01:45:48,440
no reason for me to consider it, I can effectively prune this leaf
2087
01:45:48,440 --> 01:45:51,160
and anything below it from the tree.
2088
01:45:51,160 --> 01:45:54,560
And it's for that reason this approach, this optimization to minimax,
2089
01:45:54,560 --> 01:45:56,640
is called alpha-beta pruning.
2090
01:45:56,640 --> 01:45:58,600
Alpha and beta stand for these two values
2091
01:45:58,600 --> 01:46:01,100
that you'll have to keep track of: the best you can do so far
2092
01:46:01,100 --> 01:46:02,720
and the worst you can do so far.
2093
01:46:02,720 --> 01:46:07,200
And pruning is the idea of if I have a big, long, deep search tree,
2094
01:46:07,200 --> 01:46:09,280
I might be able to search it more efficiently
2095
01:46:09,280 --> 01:46:11,200
if I don't need to search through everything,
2096
01:46:11,200 --> 01:46:15,240
if I can remove some of the nodes to try and optimize the way that I
2097
01:46:15,240 --> 01:46:18,320
look through this entire search space.
2098
01:46:18,320 --> 01:46:21,640
So alpha-beta pruning can definitely save us a lot of time
2099
01:46:21,640 --> 01:46:25,600
as we go about the search process by making our searches more efficient.
2100
01:46:25,600 --> 01:46:29,880
But even then, it's still not great as games get more complex.
2101
01:46:29,880 --> 01:46:33,120
Tic-tac-toe, fortunately, is a relatively simple game.
2102
01:46:33,120 --> 01:46:35,880
And we might reasonably ask a question like,
2103
01:46:35,880 --> 01:46:39,560
how many total possible tic-tac-toe games are there?
2104
01:46:39,560 --> 01:46:40,640
You can think about it.
2105
01:46:40,640 --> 01:46:43,760
You can try and estimate how many moves are there at any given point,
2106
01:46:43,760 --> 01:46:45,640
how many moves long can the game last.
2107
01:46:45,640 --> 01:46:52,280
It turns out there are about 255,000 possible tic-tac-toe games
2108
01:46:52,280 --> 01:46:53,920
that can be played.
2109
01:46:53,920 --> 01:46:56,360
But compare that to a more complex game, something
2110
01:46:56,360 --> 01:46:58,200
like a game of chess, for example.
2111
01:46:58,200 --> 01:47:01,960
Far more pieces, far more moves, games that last much longer.
2112
01:47:01,960 --> 01:47:05,040
How many total possible chess games could there be?
2113
01:47:05,040 --> 01:47:08,760
It turns out that after just four moves each, four moves by the white player,
2114
01:47:08,760 --> 01:47:10,600
four moves by the black player, that there are
2115
01:47:10,600 --> 01:47:15,960
288 billion possible chess games that can result from that situation,
2116
01:47:15,960 --> 01:47:17,440
after just four moves each.
2117
01:47:17,440 --> 01:47:20,080
And going even further, if you look at entire chess games
2118
01:47:20,080 --> 01:47:23,520
and how many possible chess games there could be as a result there,
2119
01:47:23,520 --> 01:47:27,560
there are more than 10 to the 29,000 possible chess games,
2120
01:47:27,560 --> 01:47:30,560
far more chess games than could ever be considered.
2121
01:47:30,560 --> 01:47:33,400
And this is a pretty big problem for the Minimax algorithm,
2122
01:47:33,400 --> 01:47:36,440
because the Minimax algorithm starts with an initial state,
2123
01:47:36,440 --> 01:47:39,520
considers all the possible actions, and all the possible actions
2124
01:47:39,520 --> 01:47:44,660
after that, all the way until we get to the end of the game.
2125
01:47:44,660 --> 01:47:46,920
And that's going to be a problem if the computer is going
2126
01:47:46,920 --> 01:47:51,000
to need to look through this many states, which is far more than any computer
2127
01:47:51,000 --> 01:47:54,920
could ever do in any reasonable amount of time.
2128
01:47:54,920 --> 01:47:57,040
So what do we do in order to solve this problem?
2129
01:47:57,040 --> 01:47:59,120
Instead of looking through all these states which
2130
01:47:59,120 --> 01:48:02,560
is totally intractable for a computer, we need some better approach.
2131
01:48:02,560 --> 01:48:05,760
And it turns out that better approach generally takes the form of something
2132
01:48:05,760 --> 01:48:09,240
called depth-limited Minimax, where normally Minimax
2133
01:48:09,240 --> 01:48:10,760
is depth-unlimited.
2134
01:48:10,760 --> 01:48:13,360
We just keep going layer after layer, move after move,
2135
01:48:13,360 --> 01:48:15,080
until we get to the end of the game.
2136
01:48:15,080 --> 01:48:17,800
Depth-limited Minimax is instead going to say,
2137
01:48:17,800 --> 01:48:21,040
you know what, after a certain number of moves, maybe I'll look 10 moves ahead,
2138
01:48:21,040 --> 01:48:23,540
maybe I'll look 12 moves ahead, but after that point,
2139
01:48:23,540 --> 01:48:26,680
I'm going to stop and not consider additional moves that
2140
01:48:26,680 --> 01:48:30,400
might come after that, just because it would be computationally intractable
2141
01:48:30,400 --> 01:48:34,080
to consider all of those possible options.
2142
01:48:34,080 --> 01:48:36,880
But what do we do after we get 10 or 12 moves deep
2143
01:48:36,880 --> 01:48:40,120
when we arrive at a situation where the game's not over?
2144
01:48:40,120 --> 01:48:43,640
Minimax still needs a way to assign a score to that game board or game
2145
01:48:43,640 --> 01:48:47,280
state to figure out what its current value is, which is easy to do
2146
01:48:47,280 --> 01:48:51,720
if the game is over, but not so easy to do if the game is not yet over.
2147
01:48:51,720 --> 01:48:54,120
So in order to do that, we need to add one additional feature
2148
01:48:54,120 --> 01:48:57,760
to depth-limited Minimax called an evaluation function, which
2149
01:48:57,760 --> 01:49:01,920
is just some function that is going to estimate the expected utility
2150
01:49:01,920 --> 01:49:04,200
of a game from a given state.
2151
01:49:04,200 --> 01:49:07,160
So in a game like chess, if you imagine that a game value of 1
2152
01:49:07,160 --> 01:49:12,120
means white wins, negative 1 means black wins, 0 means it's a draw,
2153
01:49:12,120 --> 01:49:15,440
then you might imagine that a score of 0.8
2154
01:49:15,440 --> 01:49:19,160
means white is very likely to win, though certainly not guaranteed.
2155
01:49:19,160 --> 01:49:21,440
And you would have an evaluation function
2156
01:49:21,440 --> 01:49:25,640
that estimates how good the game state happens to be.
2157
01:49:25,640 --> 01:49:28,880
And depending on how good that evaluation function is,
2158
01:49:28,880 --> 01:49:32,240
that is ultimately what's going to constrain how good the AI is.
2159
01:49:32,240 --> 01:49:36,120
The better the AI is at estimating how good or how bad
2160
01:49:36,120 --> 01:49:38,600
any particular game state is, the better the AI
2161
01:49:38,600 --> 01:49:40,840
is going to be able to play that game.
2162
01:49:40,840 --> 01:49:44,160
If the evaluation function is worse, not as good at estimating
2163
01:49:44,160 --> 01:49:47,840
what the expected utility is, then it's going to be a whole lot harder.
2164
01:49:47,840 --> 01:49:51,280
And you can imagine trying to come up with these evaluation functions.
2165
01:49:51,280 --> 01:49:54,040
In chess, for example, you might write an evaluation function
2166
01:49:54,040 --> 01:49:56,360
based on how many pieces you have as compared
2167
01:49:56,360 --> 01:49:59,640
to how many pieces your opponent has, because each one has a value.
2168
01:49:59,640 --> 01:50:02,240
And your evaluation function probably needs
2169
01:50:02,240 --> 01:50:04,280
to be a little bit more complicated than that
2170
01:50:04,280 --> 01:50:08,160
to consider other possible situations that might arise as well.
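The material-count idea described above can be sketched in a few lines of Python. This is only an illustration, not code from the course; the piece weights are the conventional chess values, which is an assumption on my part:

```python
# A minimal sketch of a chess evaluation function based on material count.
# Positive scores favor White, negative favor Black. The piece weights are
# the conventional ones (an assumption, not taken from the lecture).
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def evaluate(board):
    """Estimate the utility of a board given as a list of piece codes,
    e.g. ["P", "p", "Q", "r"]; uppercase = White, lowercase = Black."""
    score = 0
    for piece in board:
        value = PIECE_VALUES.get(piece.upper(), 0)  # kings contribute 0 here
        score += value if piece.isupper() else -value
    return score

print(evaluate(["P", "P", "Q", "p", "r"]))  # prints 5
```

A real evaluation function would, as the lecture notes, need to weigh far more than raw material, but this shows the shape of the idea: a function from game state to an estimated utility.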
2171
01:50:08,160 --> 01:50:11,640
And there are many other variants on Minimax that add additional features
2172
01:50:11,640 --> 01:50:15,840
in order to help it perform better under these larger, more computationally
2173
01:50:15,840 --> 01:50:18,400
intractable situations where we couldn't possibly
2174
01:50:18,400 --> 01:50:20,760
explore all of the possible moves.
2175
01:50:20,760 --> 01:50:25,240
So we need to figure out how to use evaluation functions and other techniques
2176
01:50:25,240 --> 01:50:28,520
to be able to play these games ultimately better.
2177
01:50:28,520 --> 01:50:31,600
But this now was a look at this kind of adversarial search, these search
2178
01:50:31,600 --> 01:50:35,000
problems where we have situations where I am trying
2179
01:50:35,000 --> 01:50:37,480
to play against some sort of opponent.
2180
01:50:37,480 --> 01:50:40,000
And these search problems show up all over the place
2181
01:50:40,000 --> 01:50:41,720
throughout artificial intelligence.
2182
01:50:41,720 --> 01:50:44,840
We've been talking a lot today about more classical search problems,
2183
01:50:44,840 --> 01:50:48,120
like trying to find directions from one location to another.
2184
01:50:48,120 --> 01:50:51,480
But any time an AI is faced with trying to make a decision,
2185
01:50:51,480 --> 01:50:54,600
like what do I do now in order to do something that is rational,
2186
01:50:54,600 --> 01:50:57,360
or do something that is intelligent, or trying to play a game,
2187
01:50:57,360 --> 01:51:00,000
like figuring out what move to make, these sort of algorithms
2188
01:51:00,000 --> 01:51:01,760
can really come in handy.
2189
01:51:01,760 --> 01:51:04,560
It turns out that for tic-tac-toe, the solution is pretty simple
2190
01:51:04,560 --> 01:51:05,760
because it's a small game.
2191
01:51:05,760 --> 01:51:08,600
XKCD has famously put together a web comic
2192
01:51:08,600 --> 01:51:12,120
where he will tell you exactly what move to make as the optimal move to make
2193
01:51:12,120 --> 01:51:14,440
no matter what your opponent happens to do.
2194
01:51:14,440 --> 01:51:17,680
This type of thing is not quite as possible for a much larger game
2195
01:51:17,680 --> 01:51:21,520
like Checkers or Chess, for example, where chess is totally computationally
2196
01:51:21,520 --> 01:51:25,480
intractable for most computers to be able to explore all the possible states.
2197
01:51:25,480 --> 01:51:29,800
So we really need our AI to be far more intelligent about how
2198
01:51:29,800 --> 01:51:31,880
they go about trying to deal with these problems
2199
01:51:31,880 --> 01:51:35,560
and how they go about taking this environment that they find themselves in
2200
01:51:35,560 --> 01:51:38,880
and ultimately searching for one of these solutions.
2201
01:51:38,880 --> 01:51:41,840
So this, then, was a look at search in artificial intelligence.
2202
01:51:41,840 --> 01:51:43,760
Next time, we'll take a look at knowledge,
2203
01:51:43,760 --> 01:51:47,360
thinking about how it is that our AIs are able to know information, reason
2204
01:51:47,360 --> 01:51:51,320
about that information, and draw conclusions, all in our look at AI
2205
01:51:51,320 --> 01:51:52,880
and the principles behind it.
2206
01:51:52,880 --> 01:51:55,840
We'll see you next time.
2207
01:51:55,840 --> 01:51:58,800
["INTRO MUSIC"]
2208
01:52:13,800 --> 01:52:16,000
All right, welcome back, everyone, to an introduction
2209
01:52:16,000 --> 01:52:18,160
to artificial intelligence with Python.
2210
01:52:18,160 --> 01:52:20,840
Last time, we took a look at search problems, in particular,
2211
01:52:20,840 --> 01:52:24,280
where we have AI agents that are trying to solve some sort of problem
2212
01:52:24,280 --> 01:52:26,680
by taking actions in some sort of environment,
2213
01:52:26,680 --> 01:52:30,720
whether that environment is trying to take actions by playing moves in a game
2214
01:52:30,720 --> 01:52:32,760
or whether those actions are something like trying
2215
01:52:32,760 --> 01:52:35,680
to figure out where to make turns in order to get driving directions
2216
01:52:35,680 --> 01:52:38,400
from point A to point B. This time, we're
2217
01:52:38,400 --> 01:52:42,160
going to turn our attention more generally to just this idea of knowledge,
2218
01:52:42,160 --> 01:52:44,920
the idea that a lot of intelligence is based on knowledge,
2219
01:52:44,920 --> 01:52:47,200
especially if we think about human intelligence.
2220
01:52:47,200 --> 01:52:48,840
People know information.
2221
01:52:48,840 --> 01:52:50,600
We know facts about the world.
2222
01:52:50,600 --> 01:52:52,720
And using that information that we know, we're
2223
01:52:52,720 --> 01:52:55,520
able to draw conclusions, reason about the information
2224
01:52:55,520 --> 01:52:58,440
that we know in order to figure out how to do something
2225
01:52:58,440 --> 01:53:00,680
or figure out some other piece of information
2226
01:53:00,680 --> 01:53:05,080
that we conclude based on the information we already have available to us.
2227
01:53:05,080 --> 01:53:07,360
What we'd like to focus on now is the ability
2228
01:53:07,360 --> 01:53:11,360
to take this idea of knowledge and being able to reason based on knowledge
2229
01:53:11,360 --> 01:53:14,280
and apply those ideas to artificial intelligence.
2230
01:53:14,280 --> 01:53:16,200
In particular, we're going to be building what
2231
01:53:16,200 --> 01:53:19,200
are known as knowledge-based agents, agents that
2232
01:53:19,200 --> 01:53:23,040
are able to reason and act by representing knowledge internally.
2233
01:53:23,040 --> 01:53:25,960
Somehow inside of our AI, they have some understanding
2234
01:53:25,960 --> 01:53:27,960
of what it means to know something.
2235
01:53:27,960 --> 01:53:30,600
And ideally, they have some algorithms or some techniques
2236
01:53:30,600 --> 01:53:34,440
they can use based on that knowledge that they know in order to figure out
2237
01:53:34,440 --> 01:53:38,560
the solution to a problem or figure out some additional piece of information
2238
01:53:38,560 --> 01:53:40,800
that can be helpful in some sense.
2239
01:53:40,800 --> 01:53:43,120
So what do we mean by reasoning based on knowledge
2240
01:53:43,120 --> 01:53:44,680
to be able to draw conclusions?
2241
01:53:44,680 --> 01:53:47,960
Well, let's look at a simple example drawn from the world of Harry Potter.
2242
01:53:47,960 --> 01:53:50,600
We take one sentence that we know to be true.
2243
01:53:50,600 --> 01:53:55,080
Imagine if it didn't rain, then Harry visited Hagrid today.
2244
01:53:55,080 --> 01:53:57,840
So one fact that we might know about the world.
2245
01:53:57,840 --> 01:53:59,160
And then we take another fact.
2246
01:53:59,160 --> 01:54:02,960
Harry visited Hagrid or Dumbledore today, but not both.
2247
01:54:02,960 --> 01:54:05,560
So it tells us something about the world, that Harry either visited
2248
01:54:05,560 --> 01:54:09,600
Hagrid but not Dumbledore, or Harry visited Dumbledore but not Hagrid.
2249
01:54:09,600 --> 01:54:12,120
And now we have a third piece of information about the world
2250
01:54:12,120 --> 01:54:14,720
that Harry visited Dumbledore today.
2251
01:54:14,720 --> 01:54:17,920
So we now have three pieces of information, three facts.
2252
01:54:17,920 --> 01:54:21,600
Inside of a knowledge base, so to speak, information that we know.
2253
01:54:21,600 --> 01:54:23,760
And now we, as humans, can try and reason about this
2254
01:54:23,760 --> 01:54:27,640
and figure out, based on this information, what additional information
2255
01:54:27,640 --> 01:54:29,280
can we begin to conclude?
2256
01:54:29,280 --> 01:54:31,440
And well, looking at these last two statements,
2257
01:54:31,440 --> 01:54:35,240
Harry either visited Hagrid or Dumbledore but not both,
2258
01:54:35,240 --> 01:54:38,140
and we know that Harry visited Dumbledore today, well,
2259
01:54:38,140 --> 01:54:40,680
then it's pretty reasonable that we could draw the conclusion that,
2260
01:54:40,680 --> 01:54:43,800
you know what, Harry must not have visited Hagrid today.
2261
01:54:43,800 --> 01:54:46,520
Because based on a combination of these two statements,
2262
01:54:46,520 --> 01:54:50,560
we can draw this inference, so to speak, a conclusion that Harry did not
2263
01:54:50,560 --> 01:54:52,120
visit Hagrid today.
2264
01:54:52,120 --> 01:54:54,560
But it turns out we can even do a little bit better than that,
2265
01:54:54,560 --> 01:54:57,720
get some more information by taking a look at this first statement
2266
01:54:57,720 --> 01:54:59,180
and reasoning about that.
2267
01:54:59,180 --> 01:55:01,920
This first statement says, if it didn't rain,
2268
01:55:01,920 --> 01:55:04,200
then Harry visited Hagrid today.
2269
01:55:04,200 --> 01:55:05,080
So what does that mean?
2270
01:55:05,080 --> 01:55:09,080
In all cases where it didn't rain, then we know that Harry visited Hagrid.
2271
01:55:09,080 --> 01:55:12,680
But if we also know now that Harry did not visit Hagrid,
2272
01:55:12,680 --> 01:55:15,540
then that tells us something about our initial premise
2273
01:55:15,540 --> 01:55:16,760
that we were thinking about.
2274
01:55:16,760 --> 01:55:21,200
In particular, it tells us that it did rain today, because we can reason,
2275
01:55:21,200 --> 01:55:24,240
if it didn't rain, that Harry would have visited Hagrid.
2276
01:55:24,240 --> 01:55:28,840
But we know for a fact that Harry did not visit Hagrid today.
2277
01:55:28,840 --> 01:55:31,760
So it's this kind of reasoning, this sort of logical reasoning,
2278
01:55:31,760 --> 01:55:33,960
where we use logic based on the information
2279
01:55:33,960 --> 01:55:38,040
that we know in order to take information and reach conclusions that
2280
01:55:38,040 --> 01:55:40,880
is going to be the focus of what we're going to be talking about today.
2281
01:55:40,880 --> 01:55:43,600
How can we make our artificial intelligence
2282
01:55:43,600 --> 01:55:47,220
logical so that they can perform the same kinds of deduction,
2283
01:55:47,220 --> 01:55:50,640
the same kinds of reasoning that we've been doing so far?
2284
01:55:50,640 --> 01:55:53,200
Of course, humans reason about logic generally
2285
01:55:53,200 --> 01:55:54,760
in terms of human language.
2286
01:55:54,760 --> 01:55:58,640
That I just now was speaking in English, talking in English about these
2287
01:55:58,640 --> 01:56:01,080
sentences and trying to reason through how it
2288
01:56:01,080 --> 01:56:02,600
is that they relate to one another.
2289
01:56:02,600 --> 01:56:05,000
We're going to need to be a little bit more formal when
2290
01:56:05,000 --> 01:56:07,440
we turn our attention to computers and being
2291
01:56:07,440 --> 01:56:11,200
able to encode this notion of logic and truthhood and falsehood
2292
01:56:11,200 --> 01:56:12,640
inside of a machine.
2293
01:56:12,640 --> 01:56:16,040
So we're going to need to introduce a few more terms and a few symbols that
2294
01:56:16,040 --> 01:56:18,440
will help us reason through this idea of logic
2295
01:56:18,440 --> 01:56:20,480
inside of an artificial intelligence.
2296
01:56:20,480 --> 01:56:22,840
And we'll begin with the idea of a sentence.
2297
01:56:22,840 --> 01:56:24,880
Now, a sentence in a natural language like English
2298
01:56:24,880 --> 01:56:28,040
is just something that I'm saying, like what I'm saying right now.
2299
01:56:28,040 --> 01:56:32,920
In the context of AI, though, a sentence is just an assertion about the world
2300
01:56:32,920 --> 01:56:36,740
in what we're going to call a knowledge representation language,
2301
01:56:36,740 --> 01:56:40,940
some way of representing knowledge inside of our computers.
2302
01:56:40,940 --> 01:56:44,680
And the way that we're going to spend most of today reasoning about knowledge
2303
01:56:44,680 --> 01:56:47,600
is through a type of logic known as propositional logic.
2304
01:56:47,600 --> 01:56:50,800
There are a number of different types of logic, some of which we'll touch on.
2305
01:56:50,800 --> 01:56:54,680
But propositional logic is based on a logic of propositions,
2306
01:56:54,680 --> 01:56:56,640
or just statements about the world.
2307
01:56:56,640 --> 01:57:01,040
And so we begin in propositional logic with a notion of propositional symbols.
2308
01:57:01,040 --> 01:57:04,080
We will have certain symbols that are oftentimes just letters,
2309
01:57:04,080 --> 01:57:07,760
something like P or Q or R, where each of those symbols
2310
01:57:07,760 --> 01:57:11,840
is going to represent some fact or sentence about the world.
2311
01:57:11,840 --> 01:57:15,800
So P, for example, might represent the fact that it is raining.
2312
01:57:15,800 --> 01:57:19,200
And so P is going to be a symbol that represents that idea.
2313
01:57:19,200 --> 01:57:22,960
And Q, for example, might represent Harry visited Hagrid today.
2314
01:57:22,960 --> 01:57:26,600
Each of these propositional symbols represents some sentence
2315
01:57:26,600 --> 01:57:29,320
or some fact about the world.
2316
01:57:29,320 --> 01:57:32,400
But in addition to just having individual facts about the world,
2317
01:57:32,400 --> 01:57:36,040
we want some way to connect these propositional symbols together
2318
01:57:36,040 --> 01:57:39,520
in order to reason more complexly about other facts that
2319
01:57:39,520 --> 01:57:42,200
might exist inside of the world in which we're reasoning.
2320
01:57:42,200 --> 01:57:45,240
So in order to do that, we'll need to introduce some additional symbols
2321
01:57:45,240 --> 01:57:47,600
that are known as logical connectives.
2322
01:57:47,600 --> 01:57:49,840
Now, there are a number of these logical connectives.
2323
01:57:49,840 --> 01:57:52,920
But five of the most important, and the ones we're going to focus on today,
2324
01:57:52,920 --> 01:57:56,520
are these five up here, each represented by a logical symbol.
2325
01:57:56,520 --> 01:58:00,520
Not is represented by this symbol here. And is represented
2326
02:00:00,520 --> 02:00:04,600
as sort of an upside-down V. Or is represented by a V shape.
2327
01:58:04,600 --> 01:58:07,600
Implication, and we'll talk about what that means in just a moment,
2328
01:58:07,600 --> 01:58:09,320
is represented by an arrow.
2329
01:58:09,320 --> 01:58:12,520
And biconditional, again, we'll talk about what that means in a moment,
2330
01:58:12,520 --> 01:58:14,560
is represented by these double arrows.
2331
01:58:14,560 --> 01:58:17,200
But these five logical connectives are the main ones
2332
01:58:17,200 --> 01:58:20,280
we're going to be focusing on in terms of thinking about how
2333
01:58:20,280 --> 01:58:22,920
it is that a computer can reason about facts
2334
01:58:22,920 --> 01:58:26,560
and draw conclusions based on the facts that it knows.
2335
01:58:26,560 --> 01:58:28,200
But in order to get there, we need to take
2336
01:58:28,200 --> 01:58:30,360
a look at each of these logical connectives
2337
01:58:30,360 --> 01:58:34,040
and build up an understanding for what it is that they actually mean.
2338
01:58:34,040 --> 01:58:38,200
So let's go ahead and begin with the not symbol, so this not symbol here.
2339
01:58:38,200 --> 01:58:41,160
And what we're going to show for each of these logical connectives
2340
01:58:41,160 --> 01:58:43,880
is what we're going to call a truth table, a table that
2341
01:58:43,880 --> 01:58:47,640
demonstrates what this word not means when we attach it
2342
01:58:47,640 --> 01:58:52,560
to a propositional symbol or any sentence inside of our logical language.
2343
01:58:52,560 --> 01:58:56,880
And so the truth table for not is shown right here.
2344
01:58:56,880 --> 01:59:01,560
If P, some propositional symbol, or some other sentence even, is false,
2345
01:59:01,560 --> 01:59:04,600
then not P is true.
2346
01:59:04,600 --> 01:59:08,960
And if P is true, then not P is false.
2347
01:59:08,960 --> 01:59:11,200
So you can imagine that placing this not symbol
2348
01:59:11,200 --> 01:59:14,080
in front of some sentence of propositional logic
2349
01:59:14,080 --> 01:59:16,200
just says the opposite of that.
2350
01:59:16,200 --> 01:59:19,840
So if, for example, P represented it is raining,
2351
01:59:19,840 --> 01:59:23,880
then not P would represent the idea that it is not raining.
2352
01:59:23,880 --> 01:59:27,560
And as you might expect, if P is false, meaning if the sentence,
2353
01:59:27,560 --> 01:59:32,920
it is raining, is false, well then the sentence not P must be true.
2354
01:59:32,920 --> 01:59:36,240
The sentence that it is not raining is therefore true.
2355
01:59:36,240 --> 01:59:40,000
So not, you can imagine, just takes whatever is in P and it inverts it.
2356
01:59:40,000 --> 01:59:43,440
It turns false into true and true into false,
2357
01:59:43,440 --> 01:59:46,520
much analogously to what the English word not means,
2358
01:59:46,520 --> 01:59:51,200
just taking whatever comes after it and inverting it to mean the opposite.
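This truth table maps directly onto Python's own `not` operator; the following is a quick sketch of my own, not code from the course:

```python
# Truth table for logical "not": it simply inverts a truth value,
# turning False into True and True into False.
for p in (False, True):
    print(f"P={p!s:5}  not P={(not p)!s}")
```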
2359
01:59:51,200 --> 01:59:53,760
Next up, and also very English-like, is this idea
2360
01:59:53,760 --> 01:59:58,160
of and represented by this upside-down V shape or this point shape.
2361
01:59:58,160 --> 02:00:01,440
And as opposed to just taking a single argument the way not does,
2362
02:00:01,440 --> 02:00:07,040
we have P and we have not P. And, by contrast, is going to combine two different sentences
2363
02:00:07,040 --> 02:00:09,120
in propositional logic together.
2364
02:00:09,120 --> 02:00:12,480
So I might have one sentence P and another sentence Q,
2365
02:00:12,480 --> 02:00:16,800
and I want to combine them together to say P and Q.
2366
02:00:16,800 --> 02:00:19,760
And the general logic for what P and Q means
2367
02:00:19,760 --> 02:00:22,520
is it means that both of its operands are true.
2368
02:00:22,520 --> 02:00:26,600
P is true and also Q is true.
2369
02:00:26,600 --> 02:00:29,160
And so here's what that truth table looks like.
2370
02:00:29,160 --> 02:00:33,800
This time we have two variables, P and Q. And when we have two variables, each
2371
02:00:33,800 --> 02:00:36,920
of which can be in two possible states, true or false,
2372
02:00:36,920 --> 02:00:41,320
that leads to two squared or four possible combinations
2373
02:00:41,320 --> 02:00:42,640
of truth and falsehood.
2374
02:00:42,640 --> 02:00:45,000
So we have P is false and Q is false.
2375
02:00:45,000 --> 02:00:47,040
We have P is false and Q is true.
2376
02:00:47,040 --> 02:00:48,680
P is true and Q is false.
2377
02:00:48,680 --> 02:00:51,080
And then P and Q both are true.
2378
02:00:51,080 --> 02:00:55,520
And those are the only four possibilities for what P and Q could mean.
2379
02:00:55,520 --> 02:00:59,400
And in each of those situations, this third column here, P and Q,
2380
02:00:59,400 --> 02:01:03,760
is telling us a little bit about what it actually means for P and Q to be true.
2381
02:01:03,760 --> 02:01:08,040
And we see that the only case where P and Q is true is in this fourth row
2382
02:01:08,040 --> 02:01:12,840
here, where P happens to be true, Q also happens to be true.
2383
02:01:12,840 --> 02:01:18,080
And in all other situations, P and Q is going to evaluate to false.
2384
02:01:18,080 --> 02:01:21,600
So this, again, is much in line with what our intuition of and might mean.
2385
02:01:21,600 --> 02:01:29,320
If I say P and Q, I probably mean that I expect both P and Q to be true.
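The four rows of that truth table can be generated with Python's `and` operator; this is an illustrative sketch of mine, not material from the course:

```python
from itertools import product

# Truth table for "and": the result is true only in the one row
# where both operands are true.
for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5}  Q={q!s:5}  P and Q={(p and q)!s}")
```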
2386
02:01:29,320 --> 02:01:32,320
Next up, also potentially consistent with what we mean,
2387
02:01:32,320 --> 02:01:37,720
is this word or, represented by this V shape, sort of an upside down and symbol.
2388
02:01:37,720 --> 02:01:41,560
And or, as the name might suggest, is true if either of its arguments
2389
02:01:41,560 --> 02:01:47,440
are true, as long as P is true or Q is true, then P or Q is going to be true.
2390
02:01:47,440 --> 02:01:50,960
Which means the only time that P or Q is false
2391
02:01:50,960 --> 02:01:53,440
is if both of its operands are false.
2392
02:01:53,440 --> 02:01:58,760
If P is false and Q is false, then P or Q is going to be false.
2393
02:01:58,760 --> 02:02:03,160
But in all other cases, at least one of the operands is true.
2394
02:02:03,160 --> 02:02:08,600
Maybe they're both true, in which case P or Q is going to evaluate to true.
2395
02:02:08,600 --> 02:02:10,880
Now, this is mostly consistent with the way
2396
02:02:10,880 --> 02:02:14,200
that most people might use the word or, in the sense of speaking the word
2397
02:02:14,200 --> 02:02:17,080
or in normal English, though there is sometimes when we might say
2398
02:02:17,080 --> 02:02:21,440
or, where we mean P or Q, but not both, where we mean, sort of,
2399
02:02:21,440 --> 02:02:23,480
it can only be one or the other.
2400
02:02:23,480 --> 02:02:26,560
It's important to note that this symbol here, this or,
2401
02:02:26,560 --> 02:02:30,360
means P or Q or both, that those are totally OK.
2402
02:02:30,360 --> 02:02:33,120
As long as either or both of them are true,
2403
02:02:33,120 --> 02:02:36,320
then the or is going to evaluate to be true, as well.
2404
02:02:36,320 --> 02:02:38,760
It's only in the case where all of the operands
2405
02:02:38,760 --> 02:02:43,320
are false that P or Q ultimately evaluates to false, as well.
2406
02:02:43,320 --> 02:02:46,760
In logic, there's another symbol known as the exclusive or,
2407
02:02:46,760 --> 02:02:51,160
which encodes this idea of exclusivity of one or the other, but not both.
2408
02:02:51,160 --> 02:02:53,080
But we're not going to be focusing on that today.
2409
02:02:53,080 --> 02:02:56,720
Whenever we talk about or, we're always talking about either or both,
2410
02:02:56,720 --> 02:03:01,520
in this case, as represented by this truth table here.
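The contrast between inclusive or and exclusive or is easy to see side by side in Python; a sketch of my own (for booleans, `!=` behaves as exclusive or):

```python
from itertools import product

# Inclusive "or" (true if either or both operands are true) versus
# exclusive or (true if exactly one operand is true).
for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5} Q={q!s:5}  P or Q={(p or q)!s:5}  P xor Q={(p != q)!s}")
```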
2411
02:03:01,520 --> 02:03:04,840
So that now is not an and an or.
2412
02:03:04,840 --> 02:03:07,280
And next up is what we might call implication,
2413
02:03:07,280 --> 02:03:09,280
as denoted by this arrow symbol.
2414
02:03:09,280 --> 02:03:13,200
So we have P and Q. And this sentence here we'll generally
2415
02:03:13,200 --> 02:03:16,240
read as P implies Q.
2416
02:03:16,240 --> 02:03:23,400
And what P implies Q means is that if P is true, then Q is also true.
2417
02:03:23,400 --> 02:03:27,760
So I might say something like, if it is raining, then I will be indoors.
2418
02:03:27,760 --> 02:03:31,840
Meaning, it is raining implies I will be indoors,
2419
02:03:31,840 --> 02:03:34,640
as the logical sentence that I'm saying there.
2420
02:03:34,640 --> 02:03:37,760
And the truth table for this can sometimes be a little bit tricky.
2421
02:03:37,760 --> 02:03:44,280
So obviously, if P is true and Q is true, then P implies Q. That's true.
2422
02:03:44,280 --> 02:03:46,120
That definitely makes sense.
2423
02:03:46,120 --> 02:03:50,640
And it should also stand to reason that when P is true and Q is false,
2424
02:03:50,640 --> 02:03:52,600
then P implies Q is false.
2425
02:03:52,600 --> 02:03:57,400
Because if I said to you, if it is raining, then I will be indoors.
2426
02:03:57,400 --> 02:04:01,000
And it is raining, but I'm not indoors?
2427
02:04:01,000 --> 02:04:04,680
Well, then it would seem to be that my original statement was not true.
2428
02:04:04,680 --> 02:04:09,360
P implies Q means that if P is true, then Q also needs to be true.
2429
02:04:09,360 --> 02:04:13,200
And if it's not, well, then the statement is false.
2430
02:04:13,200 --> 02:04:17,560
What's also worth noting, though, is what happens when P is false.
2431
02:04:17,560 --> 02:04:22,280
When P is false, the implication makes no claim at all.
2432
02:04:22,280 --> 02:04:26,680
If I say something like, if it is raining, then I will be indoors.
2433
02:04:26,680 --> 02:04:28,640
And it turns out it's not raining.
2434
02:04:28,640 --> 02:04:31,040
Then in that case, I am not making any statement
2435
02:04:31,040 --> 02:04:33,880
as to whether or not I will be indoors or not.
2436
02:04:33,880 --> 02:04:37,720
P implies Q just means that if P is true, Q must be true.
2437
02:04:37,720 --> 02:04:42,040
But if P is not true, then we make no claim about whether or not Q
2438
02:04:42,040 --> 02:04:43,040
is true at all.
2439
02:04:43,040 --> 02:04:46,840
So in either case, if P is false, it doesn't matter what Q is.
2440
02:04:46,840 --> 02:04:50,560
Whether it's false or true, we're not making any claim about Q whatsoever.
2441
02:04:50,560 --> 02:04:53,640
We can still evaluate the implication to true.
2442
02:04:53,640 --> 02:04:56,600
The only way that the implication is ever false
2443
02:04:56,600 --> 02:05:01,680
is if our premise, P, is true, but the conclusion that we're drawing Q
2444
02:05:01,680 --> 02:05:03,040
happens to be false.
2445
02:05:03,040 --> 02:05:09,400
So in that case, we would say P does not imply Q.
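One way to sanity-check this trickier table (a sketch of mine, not from the lecture) is to use the standard equivalence that P implies Q is the same as (not P) or Q, which makes the "vacuously true" rows where P is false come out true:

```python
from itertools import product

def implies(p, q):
    # P -> Q is logically equivalent to (not P) or Q: it is false only
    # when P is true and Q is false, and vacuously true whenever P is false.
    return (not p) or q

for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5} Q={q!s:5}  P -> Q = {implies(p, q)!s}")
```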
2446
02:05:09,400 --> 02:05:13,200
Finally, the last connective that we'll discuss is this bi-conditional.
2447
02:05:13,200 --> 02:05:15,400
You can think of a bi-conditional as a condition
2448
02:05:15,400 --> 02:05:17,480
that goes in both directions.
2449
02:05:17,480 --> 02:05:20,440
So originally, when I said something like, if it is raining,
2450
02:05:20,440 --> 02:05:22,520
then I will be indoors.
2451
02:05:22,520 --> 02:05:24,920
I didn't say what would happen if it wasn't raining.
2452
02:05:24,920 --> 02:05:27,360
Maybe I'll be indoors, maybe I'll be outdoors.
2453
02:05:27,360 --> 02:05:31,440
This bi-conditional, you can read as an if and only if.
2454
02:05:31,440 --> 02:05:36,960
So I can say, I will be indoors if and only if it is raining,
2455
02:05:36,960 --> 02:05:39,560
meaning if it is raining, then I will be indoors.
2456
02:05:39,560 --> 02:05:43,560
And if I am indoors, it's reasonable to conclude that it is also raining.
2457
02:05:43,560 --> 02:05:48,640
So this bi-conditional is only true when P and Q are the same.
2458
02:05:48,640 --> 02:05:53,440
So if P is true and Q is true, then this bi-conditional is also true.
2459
02:05:53,440 --> 02:05:56,000
P implies Q, but also the reverse is true.
2460
02:05:56,000 --> 02:06:01,160
Q also implies P. So if P and Q both happen to be false,
2461
02:06:01,160 --> 02:06:02,440
we would still say it's true.
2462
02:06:02,440 --> 02:06:04,640
But in any of these other two situations,
2463
02:06:04,640 --> 02:06:08,920
this P if and only if Q is going to ultimately evaluate to false.
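Since the biconditional is true exactly when P and Q agree, for boolean values it is just equality; a small sketch of my own:

```python
from itertools import product

# The biconditional P <-> Q is true exactly when P and Q have the same
# truth value, which for Python booleans is ordinary equality.
for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5} Q={q!s:5}  P <-> Q = {(p == q)!s}")
```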
2464
02:06:08,920 --> 02:06:11,200
So a lot of trues and falses going on there,
2465
02:06:11,200 --> 02:06:13,840
but these five basic logical connectives
2466
02:06:13,840 --> 02:06:16,960
are going to form the core of the language of propositional logic,
2467
02:06:16,960 --> 02:06:20,040
the language that we're going to use in order to describe ideas,
2468
02:06:20,040 --> 02:06:21,960
and the language that we're going to use in order
2469
02:06:21,960 --> 02:06:26,520
to reason about those ideas in order to draw conclusions.
2470
02:06:26,520 --> 02:06:29,000
So let's now take a look at some of the additional terms
2471
02:06:29,000 --> 02:06:31,280
that we'll need to know about in order to go about trying
2472
02:06:31,280 --> 02:06:33,740
to form this language of propositional logic
2473
02:06:33,740 --> 02:06:37,600
and writing AI that's actually able to understand this sort of logic.
2474
02:06:37,600 --> 02:06:40,200
The next thing we're going to need is the notion of what
2475
02:06:40,200 --> 02:06:42,480
is actually true about the world.
2476
02:06:42,480 --> 02:06:46,880
We have a whole bunch of propositional symbols, P and Q and R and maybe others,
2477
02:06:46,880 --> 02:06:50,120
but we need some way of knowing what actually is true in the world.
2478
02:06:50,120 --> 02:06:51,200
Is P true or false?
2479
02:06:51,200 --> 02:06:52,580
Is Q true or false?
2480
02:06:52,580 --> 02:06:54,360
So on and so forth.
2481
02:06:54,360 --> 02:06:57,440
And to do that, we'll introduce the notion of a model.
2482
02:06:57,440 --> 02:07:02,320
A model just assigns a truth value, where a truth value is either true
2483
02:07:02,320 --> 02:07:05,680
or false, to every propositional symbol.
2484
02:07:05,680 --> 02:07:09,400
In other words, it's creating what we might call a possible world.
2485
02:07:09,400 --> 02:07:10,840
So let me give an example.
2486
02:07:10,840 --> 02:07:15,320
If, for example, I have two propositional symbols, P is it is raining
2487
02:07:15,320 --> 02:07:21,000
and Q is it is a Tuesday, a model just takes each of these two symbols
2488
02:07:21,000 --> 02:07:24,720
and assigns a truth value to them, either true or false.
2489
02:07:24,720 --> 02:07:26,040
So here's a sample model.
2490
02:07:26,040 --> 02:07:29,400
In this model, in other words, in this possible world,
2491
02:07:29,400 --> 02:07:33,920
it is possible that P is true, meaning it is raining, and Q is false,
2492
02:07:33,920 --> 02:07:36,000
meaning it is not a Tuesday.
2493
02:07:36,000 --> 02:07:39,240
But there are other possible worlds or other models as well.
2494
02:07:39,240 --> 02:07:41,920
There is some model where both of these variables are true,
2495
02:07:41,920 --> 02:07:44,320
some model where both of these variables are false.
2496
02:07:44,320 --> 02:07:48,320
In fact, if there are n variables that are propositional symbols like this
2497
02:07:48,320 --> 02:07:51,720
that are either true or false, then the number of possible models
2498
02:07:51,720 --> 02:07:55,600
is 2 to the n, because each of these possible
2499
02:07:55,600 --> 02:08:00,080
variables within my model could be set to either true or false
2500
02:08:00,080 --> 02:08:03,840
if I don't know any information about it.
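That 2 to the n count of possible models can be enumerated directly; here is a sketch of mine (the dictionary encoding of a model is my own choice, not the course's representation):

```python
from itertools import product

# Enumerate every possible model over n propositional symbols.
# Each model assigns True or False to every symbol, so there are 2**n.
symbols = ["P", "Q"]  # P: "it is raining", Q: "it is a Tuesday"

models = [dict(zip(symbols, values))
          for values in product((True, False), repeat=len(symbols))]

for model in models:
    print(model)
print(f"{len(models)} models = 2**{len(symbols)}")
```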
2501
02:08:03,840 --> 02:08:07,040
So now that I have the symbols and the connectives
2502
02:08:07,040 --> 02:08:11,080
that I'm going to need in order to construct these parts of knowledge,
2503
02:08:11,080 --> 02:08:13,400
we need some way to represent that knowledge.
2504
02:08:13,400 --> 02:08:15,880
And to do so, we're going to allow our AI access
2505
02:08:15,880 --> 02:08:18,400
to what we'll call a knowledge base.
2506
02:08:18,400 --> 02:08:21,880
And a knowledge base is really just a set of sentences
2507
02:08:21,880 --> 02:08:24,200
that our AI knows to be true.
2508
02:08:24,200 --> 02:08:27,160
Some set of sentences in propositional logic
2509
02:08:27,160 --> 02:08:30,960
that are things that our AI knows about the world.
2510
02:08:30,960 --> 02:08:35,360
And so we might tell our AI some information, information about a situation
2511
02:08:35,360 --> 02:08:38,200
that it finds itself in, or about a problem
2512
02:08:38,200 --> 02:08:39,880
that it happens to be trying to solve.
2513
02:08:39,880 --> 02:08:41,720
And we would give that information to the AI
2514
02:08:41,720 --> 02:08:44,920
that the AI would store inside of its knowledge base.
2515
02:08:44,920 --> 02:08:47,440
And what happens next is the AI would like
2516
02:08:47,440 --> 02:08:49,880
to use that information in the knowledge base
2517
02:08:49,880 --> 02:08:53,440
to be able to draw conclusions about the rest of the world.
2518
02:08:53,440 --> 02:08:55,200
And what do those conclusions look like?
2519
02:08:55,200 --> 02:08:56,960
Well, to understand those conclusions, we'll
2520
02:08:56,960 --> 02:08:59,840
need to introduce one more idea, one more symbol.
2521
02:08:59,840 --> 02:09:02,600
And that is the notion of entailment.
2522
02:09:02,600 --> 02:09:06,500
So this sentence here, with this double turnstile and these Greek letters,
2523
02:09:06,500 --> 02:09:08,960
this is the Greek letter alpha and the Greek letter beta.
2524
02:09:08,960 --> 02:09:12,920
And we read this as alpha entails beta.
2525
02:09:12,920 --> 02:09:17,320
And alpha and beta here are just sentences in propositional logic.
2526
02:09:17,320 --> 02:09:20,680
And what this means is that alpha entails beta
2527
02:09:20,680 --> 02:09:23,360
means that in every model, in other words,
2528
02:09:23,360 --> 02:09:28,960
in every possible world in which sentence alpha is true,
2529
02:09:28,960 --> 02:09:31,600
then sentence beta is also true.
2530
02:09:31,600 --> 02:09:35,520
So if something entails something else, if alpha entails beta,
2531
02:09:35,520 --> 02:09:40,520
it means that if I know alpha to be true, then beta must therefore also
2532
02:09:40,520 --> 02:09:41,320
be true.
2533
02:09:41,320 --> 02:09:47,840
So if my alpha is something like I know that it is a Tuesday in January,
2534
02:09:47,840 --> 02:09:52,600
then a reasonable beta might be something like I know that it is January.
2535
02:09:52,600 --> 02:09:55,520
Because in all worlds where it is a Tuesday in January,
2536
02:09:55,520 --> 02:09:59,200
I know for sure that it must be January, just by definition.
2537
02:09:59,200 --> 02:10:01,940
This first statement or sentence about the world
2538
02:10:01,940 --> 02:10:03,840
entails the second statement.
2539
02:10:03,840 --> 02:10:07,440
And we can reasonably use deduction based on that first sentence
2540
02:10:07,440 --> 02:10:12,340
to figure out that the second sentence is, in fact, true as well.
2541
02:10:12,340 --> 02:10:14,840
And ultimately, it's this idea of entailment
2542
02:10:14,840 --> 02:10:17,240
that we're going to try and encode into our computer.
2543
02:10:17,240 --> 02:10:20,200
We want our AI agent to be able to figure out
2544
02:10:20,200 --> 02:10:22,040
what the possible entailments are.
2545
02:10:22,040 --> 02:10:26,080
We want our AI to be able to take these three sentences, sentences like,
2546
02:10:26,080 --> 02:10:28,480
if it didn't rain, Harry visited Hagrid.
2547
02:10:28,480 --> 02:10:31,440
That Harry visited Hagrid or Dumbledore, but not both.
2548
02:10:31,440 --> 02:10:33,160
And that Harry visited Dumbledore.
2549
02:10:33,160 --> 02:10:36,040
And just using that information, we'd like our AI
2550
02:10:36,040 --> 02:10:41,040
to be able to infer or figure out that using these three sentences inside
2551
02:10:41,040 --> 02:10:44,080
of a knowledge base, we can draw some conclusions.
2552
02:10:44,080 --> 02:10:47,520
In particular, we can draw the conclusions here that, one,
2553
02:10:47,520 --> 02:10:49,520
Harry did not visit Hagrid today.
2554
02:10:49,520 --> 02:10:53,840
And we can draw the entailment, too, that it did, in fact, rain today.
2555
02:10:53,840 --> 02:10:56,120
And this process is known as inference.
2556
02:10:56,120 --> 02:10:58,320
And that's what we're going to be focusing on today,
2557
02:10:58,320 --> 02:11:01,920
this process of deriving new sentences from old ones,
2558
02:11:01,920 --> 02:11:04,240
that I give you these three sentences, you put them
2559
02:11:04,240 --> 02:11:06,480
in the knowledge base in, say, the AI.
2560
02:11:06,480 --> 02:11:09,680
And the AI is able to use some sort of inference algorithm
2561
02:11:09,680 --> 02:11:14,040
to figure out that these two sentences must also be true.
2562
02:11:14,040 --> 02:11:16,600
And that is how we define inference.
2563
02:11:16,600 --> 02:11:18,760
So let's take a look at an inference example
2564
02:11:18,760 --> 02:11:22,520
to see how we might actually go about inferring things in a human sense
2565
02:11:22,520 --> 02:11:24,360
before we take a more algorithmic approach
2566
02:11:24,360 --> 02:11:27,360
to see how we could encode this idea of inference in AI.
2567
02:11:27,360 --> 02:11:30,920
And we'll see there are a number of ways that we can actually achieve this.
2568
02:11:30,920 --> 02:11:33,920
So again, we'll deal with a couple of propositional symbols.
2569
02:11:33,920 --> 02:11:37,640
We'll deal with P, Q, and R. P is it is a Tuesday.
2570
02:11:37,640 --> 02:11:39,200
Q is it is raining.
2571
02:11:39,200 --> 02:11:42,720
And R is Harry will go for a run, three propositional symbols
2572
02:11:42,720 --> 02:11:44,600
that we are just defining to mean this.
2573
02:11:44,600 --> 02:11:47,400
We're not saying anything yet about whether they're true or false.
2574
02:11:47,400 --> 02:11:50,240
We're just defining what they are.
2575
02:11:50,240 --> 02:11:53,960
Now, we'll give ourselves or an AI access to a knowledge base,
2576
02:11:53,960 --> 02:11:57,600
abbreviated to KB, the knowledge that we know about the world.
2577
02:11:57,600 --> 02:11:59,440
We know this statement.
2578
02:11:59,440 --> 02:11:59,920
All right.
2579
02:11:59,920 --> 02:12:00,880
So let's try to parse it.
2580
02:12:00,880 --> 02:12:02,840
The parentheses here are just used for precedence,
2581
02:12:02,840 --> 02:12:05,280
so we can see what associates with what.
2582
02:12:05,280 --> 02:12:11,720
But you would read this as P and not Q implies R.
2583
02:12:11,720 --> 02:12:12,240
All right.
2584
02:12:12,240 --> 02:12:13,040
So what does that mean?
2585
02:12:13,040 --> 02:12:14,520
Let's put it piece by piece.
2586
02:12:14,520 --> 02:12:16,880
P is it is a Tuesday.
2587
02:12:16,880 --> 02:12:21,600
Q is it is raining, so not Q is it is not raining,
2588
02:12:21,600 --> 02:12:25,080
and implies R is Harry will go for a run.
2589
02:12:25,080 --> 02:12:28,080
So the way to read this entire sentence in human natural language
2590
02:12:28,080 --> 02:12:33,240
at least is if it is a Tuesday and it is not raining,
2591
02:12:33,240 --> 02:12:35,600
then Harry will go for a run.
2592
02:12:35,600 --> 02:12:37,800
So if it is a Tuesday and it is not raining,
2593
02:12:37,800 --> 02:12:39,520
then Harry will go for a run.
2594
02:12:39,520 --> 02:12:41,600
And that is now inside of our knowledge base.
2595
02:12:41,600 --> 02:12:43,600
And let's now imagine that our knowledge base has
2596
02:12:43,600 --> 02:12:45,520
two other pieces of information as well.
2597
02:12:45,520 --> 02:12:49,720
It has information that P is true, that it is a Tuesday.
2598
02:12:49,720 --> 02:12:53,880
And we also have the information not Q, that it is not raining,
2599
02:12:53,880 --> 02:12:57,120
that this sentence Q, it is raining, happens to be false.
2600
02:12:57,120 --> 02:12:59,800
And those are the three sentences that we have access to.
2601
02:12:59,800 --> 02:13:05,520
P and not Q implies R, P and not Q. Using that information,
2602
02:13:05,520 --> 02:13:08,000
we should be able to draw some inferences.
2603
02:13:08,000 --> 02:13:14,120
P and not Q is only true if both P and not Q are true.
2604
02:13:14,120 --> 02:13:18,120
All right, we know that P is true and we know that not Q is true.
2605
02:13:18,120 --> 02:13:20,600
So we know that this whole expression is true.
2606
02:13:20,600 --> 02:13:24,000
And the definition of implication is if this whole thing on the left
2607
02:13:24,000 --> 02:13:27,080
is true, then this thing on the right must also be true.
2608
02:13:27,080 --> 02:13:31,480
So if we know that P and not Q is true, then R must be true as well.
2609
02:13:31,480 --> 02:13:34,160
So the inference we should be able to draw from all of this
2610
02:13:34,160 --> 02:13:38,200
is that R is true and we know that Harry will go for a run
2611
02:13:38,200 --> 02:13:40,560
by taking this knowledge inside of our knowledge base
2612
02:13:40,560 --> 02:13:43,760
and being able to reason based on that idea.
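The reasoning just walked through can be checked mechanically. This sketch (my own encoding, not the lecture's code) fixes P true and Q false, then asks which values of R are consistent with the whole knowledge base:

```python
# Facts from the knowledge base: P is true (it is a Tuesday),
# Q is false (it is not raining).
P, Q = True, False

# Keep only the values of R consistent with every sentence:
# (P and not Q) implies R  --  written as (not premise) or R,
# together with P and not Q.
consistent = [R for R in (True, False)
              if ((not (P and not Q)) or R) and P and (not Q)]

print(consistent)  # [True] -- R is forced: Harry will go for a run
```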
2613
02:13:43,760 --> 02:13:46,480
And so this ultimately is the beginning of what
2614
02:13:46,480 --> 02:13:49,480
we might consider to be some sort of inference algorithm,
2615
02:13:49,480 --> 02:13:52,840
some process that we can use to try and figure out
2616
02:13:52,840 --> 02:13:55,040
whether or not we can draw some conclusion.
2617
02:13:55,040 --> 02:13:58,040
And ultimately, what these inference algorithms are going to answer
2618
02:13:58,040 --> 02:14:00,880
is the central question about entailment.
2619
02:14:00,880 --> 02:14:02,720
Given some query about the world, something
2620
02:14:02,720 --> 02:14:06,120
we're wondering about the world, and we'll call that query alpha,
2621
02:14:06,120 --> 02:14:09,120
the question we want to ask using these inference algorithms
2622
02:14:09,120 --> 02:14:14,680
is does KB, our knowledge base, entail alpha?
2623
02:14:14,680 --> 02:14:16,640
In other words, using only the information
2624
02:14:16,640 --> 02:14:20,200
we know inside of our knowledge base, the knowledge that we have access to,
2625
02:14:20,200 --> 02:14:24,200
can we conclude that this sentence alpha is true?
2626
02:14:24,200 --> 02:14:26,840
And that's ultimately what we would like to do.
2627
02:14:26,840 --> 02:14:28,040
So how can we do that?
2628
02:14:28,040 --> 02:14:30,200
How can we go about writing an algorithm that
2629
02:14:30,200 --> 02:14:33,920
can look at this knowledge base and figure out whether or not this query
2630
02:14:33,920 --> 02:14:35,720
alpha is actually true?
2631
02:14:35,720 --> 02:14:39,000
Well, it turns out there are a couple of different algorithms for doing so.
2632
02:14:39,000 --> 02:14:43,120
And one of the simplest, perhaps, is known as model checking.
2633
02:14:43,120 --> 02:14:45,640
Now, remember that a model is just some assignment
2634
02:14:45,640 --> 02:14:49,120
of all of the propositional symbols inside of our language to a truth
2635
02:14:49,120 --> 02:14:51,080
value, true or false.
2636
02:14:51,080 --> 02:14:53,560
And you can think of a model as a possible world,
2637
02:14:53,560 --> 02:14:55,980
that there are many possible worlds where different things might
2638
02:14:55,980 --> 02:14:59,080
be true or false, and we can enumerate all of them.
2639
02:14:59,080 --> 02:15:02,480
And the model checking algorithm does exactly that.
2640
02:15:02,480 --> 02:15:04,600
So what does our model checking algorithm do?
2641
02:15:04,600 --> 02:15:08,280
Well, if we wanted to determine if our knowledge base entails
2642
02:15:08,280 --> 02:15:13,000
some query alpha, then we are going to enumerate all possible models.
2643
02:15:13,000 --> 02:15:16,760
In other words, consider all possible values of true and false
2644
02:15:16,760 --> 02:15:21,240
for our variables, all possible states in which our world can be in.
2645
02:15:21,240 --> 02:15:25,760
And if in every model where our knowledge base is true,
2646
02:15:25,760 --> 02:15:30,480
alpha is also true, then we know that the knowledge base entails alpha.
2647
02:15:30,480 --> 02:15:32,320
So let's take a closer look at that sentence
2648
02:15:32,320 --> 02:15:34,120
and try and figure out what it actually means.
2649
02:15:34,120 --> 02:15:38,120
If we know that in every model, in other words, in every possible world,
2650
02:15:38,120 --> 02:15:41,520
no matter what assignment of true and false to variables you give,
2651
02:15:41,520 --> 02:15:44,440
if we know that whenever our knowledge is true, what
2652
02:15:44,440 --> 02:15:49,400
we know to be true is true, that this query alpha is also true,
2653
02:15:49,400 --> 02:15:52,960
well, then it stands to reason that as long as our knowledge base is true,
2654
02:15:52,960 --> 02:15:56,080
then alpha must also be true.
2655
02:15:56,080 --> 02:15:58,600
And so this is going to form the foundation of our model checking
2656
02:15:58,600 --> 02:15:59,280
algorithm.
2657
02:15:59,280 --> 02:16:01,720
We're going to enumerate all of the possible worlds
2658
02:16:01,720 --> 02:16:05,720
and ask ourselves whenever the knowledge base is true, is alpha true?
2659
02:16:05,720 --> 02:16:09,320
And if that's the case, then we know alpha to be true.
2660
02:16:09,320 --> 02:16:11,520
And otherwise, there is no entailment.
2661
02:16:11,520 --> 02:16:14,960
Our knowledge base does not entail alpha.
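That loop over possible worlds can be sketched directly in Python. In this version (hypothetical names; sentences are modeled as plain functions from a model to a truth value, which is simpler than the class-based library shown later), entailment fails as soon as we find one world where the knowledge base holds but the query does not:

```python
from itertools import product

def model_check(knowledge, query, symbols):
    """Return True if `knowledge` entails `query`: in every model
    (assignment of True/False to symbols) where the knowledge base
    is true, the query must be true as well."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if knowledge(model) and not query(model):
            return False  # a world where the KB holds but the query fails
    return True

# Example: KB = {(P and not Q) -> R, P, not Q}; query = R.
kb = lambda m: (((not (m["P"] and not m["Q"])) or m["R"])
                and m["P"] and not m["Q"])
query = lambda m: m["R"]

print(model_check(kb, query, ["P", "Q", "R"]))  # True
```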
2662
02:16:14,960 --> 02:16:15,440
All right.
2663
02:16:15,440 --> 02:16:17,300
So this is a little bit abstract, but let's
2664
02:16:17,300 --> 02:16:20,960
take a look at an example to try and put real propositional symbols
2665
02:16:20,960 --> 02:16:22,160
to this idea.
2666
02:16:22,160 --> 02:16:24,560
So again, we'll work with the same example.
2667
02:16:24,560 --> 02:16:29,200
P is it is a Tuesday, Q is it is raining, R is Harry will go for a run.
2668
02:16:29,200 --> 02:16:32,000
Our knowledge base contains these pieces of information.
2669
02:16:32,000 --> 02:16:35,280
P and not Q implies R. We also know P.
2670
02:16:35,280 --> 02:16:38,840
It is a Tuesday and not Q. It is not raining.
2671
02:16:38,840 --> 02:16:43,520
And our query, our alpha in this case, the thing we want to ask is R.
2672
02:16:43,520 --> 02:16:45,520
We want to know, is it guaranteed?
2673
02:16:45,520 --> 02:16:49,120
Is it entailed that Harry will go for a run?
2674
02:16:49,120 --> 02:16:52,480
So the first step is to enumerate all of the possible models.
2675
02:16:52,480 --> 02:16:55,800
We have three propositional symbols here, P, Q, and R,
2676
02:16:55,800 --> 02:16:59,800
which means we have 2 to the third power, or eight possible models.
2677
02:16:59,800 --> 02:17:04,680
All false; false, false, true; false, true, false; false, true, true; et cetera.
2678
02:17:04,680 --> 02:17:09,560
Eight possible ways you could assign true and false to all of these symbols.
2679
02:17:09,560 --> 02:17:13,920
And we might ask in each one of them, is the knowledge base true?
2680
02:17:13,920 --> 02:17:15,840
Here are the set of things that we know.
2681
02:17:15,840 --> 02:17:20,160
To which of these worlds could this knowledge base possibly apply?
2682
02:17:20,160 --> 02:17:22,960
In which world is this knowledge base true?
2683
02:17:22,960 --> 02:17:26,240
Well, in the knowledge base, for example, we know P.
2684
02:17:26,240 --> 02:17:31,680
We know it is a Tuesday, which means we know that these first four rows
2685
02:17:31,680 --> 02:17:35,080
where P is false, none of those are going to be true
2686
02:17:35,080 --> 02:17:37,680
or are going to work for this particular knowledge base.
2687
02:17:37,680 --> 02:17:40,840
Our knowledge base is not true in those worlds.
2688
02:17:40,840 --> 02:17:46,200
Likewise, we also know not Q. We know that it is not raining.
2689
02:17:46,200 --> 02:17:51,120
So any of these models where Q is true, like these two and these two here,
2690
02:17:51,120 --> 02:17:55,360
those aren't going to work either because we know that Q is not true.
2691
02:17:55,360 --> 02:18:00,240
And finally, we also know that P and not Q implies R,
2692
02:18:00,240 --> 02:18:04,680
which means that when P is true, as P is here, and Q is false,
2693
02:18:04,680 --> 02:18:08,800
Q is false in these two, then R must be true.
2694
02:18:08,800 --> 02:18:14,520
And if ever P is true, Q is false, but R is also false,
2695
02:18:14,520 --> 02:18:17,760
well, that doesn't satisfy this implication here.
2696
02:18:17,760 --> 02:18:21,840
That implication does not hold true under those situations.
2697
02:18:21,840 --> 02:18:24,080
So we could say that for our knowledge base,
2698
02:18:24,080 --> 02:18:27,240
we can conclude under which of these possible worlds
2699
02:18:27,240 --> 02:18:30,440
is our knowledge base true and under which of the possible worlds
2700
02:18:30,440 --> 02:18:31,880
is our knowledge base false.
2701
02:18:31,880 --> 02:18:35,040
And it turns out there is only one possible world
2702
02:18:35,040 --> 02:18:37,160
where our knowledge base is actually true.
2703
02:18:37,160 --> 02:18:39,280
In some cases, there might be multiple possible worlds
2704
02:18:39,280 --> 02:18:40,640
where the knowledge base is true.
2705
02:18:40,640 --> 02:18:44,880
But in this case, it just so happens that there's only one, one possible world
2706
02:18:44,880 --> 02:18:48,400
where we can definitively say something about our knowledge base.
2707
02:18:48,400 --> 02:18:50,920
And in this case, we would look at the query.
2708
02:18:50,920 --> 02:18:56,120
The query is R: is R true? It is, and so as a result,
2709
02:18:56,120 --> 02:18:58,840
we can draw that conclusion.
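The eight-row table can be reproduced with a short enumeration (a sketch of my own, not the lecture's code), confirming that exactly one model satisfies the knowledge base, and that R is true there:

```python
from itertools import product

satisfying = []
for P, Q, R in product([True, False], repeat=3):
    kb = (((not (P and not Q)) or R)  # (P and not Q) implies R
          and P                       # it is a Tuesday
          and not Q)                  # it is not raining
    if kb:
        satisfying.append((P, Q, R))

print(satisfying)  # only one model survives, and R is true in it
```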
2710
02:18:58,840 --> 02:19:01,120
And so this is the idea of model checking.
2711
02:19:01,120 --> 02:19:04,600
Enumerate all the possible models and look in those possible models
2712
02:19:04,600 --> 02:19:08,000
to see whether or not, if our knowledge base is true,
2713
02:19:08,000 --> 02:19:11,600
is the query in question true as well.
2714
02:19:11,600 --> 02:19:14,800
So let's now take a look at how we might actually go about writing this
2715
02:19:14,800 --> 02:19:16,360
in a programming language like Python.
2716
02:19:16,360 --> 02:19:18,280
Take a look at some actual code that would
2717
02:19:18,280 --> 02:19:21,400
encode this notion of propositional symbols and logic
2718
02:19:21,400 --> 02:19:25,560
and these connectives like and and or and not and implication and so forth
2719
02:19:25,560 --> 02:19:28,160
and see what that code might actually look like.
2720
02:19:28,160 --> 02:19:30,960
So I've written in advance a logic library that's
2721
02:19:30,960 --> 02:19:33,480
more detailed than we need to worry about entirely today.
2722
02:19:33,480 --> 02:19:37,480
But the important thing is that we have one class for every type
2723
02:19:37,480 --> 02:19:40,600
of logical symbol or connective that we might have.
2724
02:19:40,600 --> 02:19:44,040
So we just have one class for logical symbols, for example,
2725
02:19:44,040 --> 02:19:46,720
where every symbol is going to represent and store
2726
02:19:46,720 --> 02:19:49,360
some name for that particular symbol.
2727
02:19:49,360 --> 02:19:52,920
And we also have a class for not that takes an operand.
2728
02:19:52,920 --> 02:19:56,320
So we might say not one symbol to say something is not true
2729
02:19:56,320 --> 02:19:58,200
or some other sentence is not true.
2730
02:19:58,200 --> 02:20:02,000
We have one for and, one for or, so on and so forth.
2731
02:20:02,000 --> 02:20:03,760
And I'll just demonstrate how this works.
2732
02:20:03,760 --> 02:20:07,480
And you can take a look at the actual logic.py later on.
2733
02:20:07,480 --> 02:20:11,200
But I'll go ahead and call this file harry.py.
2734
02:20:11,200 --> 02:20:15,080
We're going to store information about this world of Harry Potter,
2735
02:20:15,080 --> 02:20:16,320
for example.
2736
02:20:16,320 --> 02:20:19,140
So I'll go ahead and import from my logic module.
2737
02:20:19,140 --> 02:20:20,640
I'll import everything.
2738
02:20:20,640 --> 02:20:25,520
And in this library, in order to create a symbol, you use capital-S Symbol.
2739
02:20:25,520 --> 02:20:30,720
And I'll create a symbol for rain, to mean it is raining, for example.
2740
02:20:30,720 --> 02:20:35,240
And I'll create a symbol for Hagrid, to mean Harry visited Hagrid,
2741
02:20:35,240 --> 02:20:36,920
is what this symbol is going to mean.
2742
02:20:36,920 --> 02:20:38,960
So this symbol means it is raining.
2743
02:20:38,960 --> 02:20:41,960
This symbol means Harry visited Hagrid.
2744
02:20:41,960 --> 02:20:49,760
And I'll add another symbol called Dumbledore for Harry visited Dumbledore.
2745
02:20:49,760 --> 02:20:52,400
Now, I'd like to save these symbols so that I can use them later
2746
02:20:52,400 --> 02:20:54,000
as I do some logical analysis.
2747
02:20:54,000 --> 02:20:56,760
So I'll go ahead and save each one of them inside of a variable.
2748
02:20:56,760 --> 02:21:02,560
So like rain, Hagrid, and Dumbledore, so you could call the variables anything.
2749
02:21:02,560 --> 02:21:04,360
And now that I have these logical symbols,
2750
02:21:04,360 --> 02:21:07,880
I can use logical connectives to combine them together.
2751
02:21:07,880 --> 02:21:14,680
So for example, if I have a sentence like And of rain and Hagrid,
2752
02:21:14,680 --> 02:21:18,560
for example, which is not necessarily true, but just for demonstration,
2753
02:21:18,560 --> 02:21:22,160
I can now try and print out sentence.formula, which
2754
02:21:22,160 --> 02:21:25,440
is a function I wrote that takes a sentence in propositional logic
2755
02:21:25,440 --> 02:21:27,560
and just prints it out so that we, the programmers,
2756
02:21:27,560 --> 02:21:32,000
can now see this in order to get an understanding for how it actually works.
2757
02:21:32,000 --> 02:21:36,280
So if I run python harry.py, what we'll see
2758
02:21:36,280 --> 02:21:40,040
is this sentence in propositional logic, rain and Hagrid.
2759
02:21:40,040 --> 02:21:44,360
This is the logical representation of what we have here in our Python program
2760
02:21:44,360 --> 02:21:48,040
of saying and whose arguments are rain and Hagrid.
2761
02:21:48,040 --> 02:21:51,800
So we're saying rain and Hagrid by encoding that idea.
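To make the idea concrete without the course's logic.py, here is a stripped-down sketch of the same object-oriented pattern (my own minimal classes, not the actual library): each connective is a class that stores its arguments and can print itself as a formula.

```python
class Symbol:
    """A propositional symbol with a human-readable name."""
    def __init__(self, name):
        self.name = name
    def formula(self):
        return self.name

class And:
    """Conjunction of one or more sentences."""
    def __init__(self, *conjuncts):
        self.conjuncts = conjuncts
    def formula(self):
        # Join the sub-formulas with the logical-and symbol.
        return " ∧ ".join(c.formula() for c in self.conjuncts)

rain = Symbol("rain")
hagrid = Symbol("hagrid")
sentence = And(rain, hagrid)
print(sentence.formula())  # rain ∧ hagrid
```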
2762
02:21:51,800 --> 02:21:54,680
And this is quite common in Python object-oriented programming,
2763
02:21:54,680 --> 02:21:56,680
where you have a number of different classes,
2764
02:21:56,680 --> 02:22:01,000
and you pass arguments into them in order to create a new And object,
2765
02:22:01,000 --> 02:22:03,800
for example, in order to represent this idea.
2766
02:22:03,800 --> 02:22:07,480
But now what I'd like to do is somehow encode the knowledge
2767
02:22:07,480 --> 02:22:09,600
that I have about the world in order to solve
2768
02:22:09,600 --> 02:22:11,560
that problem from the beginning of class, where
2769
02:22:11,560 --> 02:22:14,360
we talked about trying to figure out who Harry visited
2770
02:22:14,360 --> 02:22:17,240
and trying to figure out if it's raining or if it's not raining.
2771
02:22:17,240 --> 02:22:19,520
And so what knowledge do I have?
2772
02:22:19,520 --> 02:22:22,600
I'll go ahead and create a new variable called knowledge.
2773
02:22:22,600 --> 02:22:23,440
And what do I know?
2774
02:22:23,440 --> 02:22:25,920
Well, I know the very first sentence that we talked about
2775
02:22:25,920 --> 02:22:30,840
was the idea that if it is not raining, then Harry will visit Hagrid.
2776
02:22:30,840 --> 02:22:33,720
So all right, how do I encode the idea that it is not raining?
2777
02:22:33,720 --> 02:22:36,640
Well, I can use not and then the rain symbol.
2778
02:22:36,640 --> 02:22:39,240
So here's me saying that it is not raining.
2779
02:22:39,240 --> 02:22:42,400
And now the implication is that if it is not raining,
2780
02:22:42,400 --> 02:22:45,040
then Harry visited Hagrid.
2781
02:22:45,040 --> 02:22:48,520
So I'll wrap this inside of an implication to say,
2782
02:22:48,520 --> 02:22:52,240
if it is not raining, this first argument to the implication
2783
02:22:52,240 --> 02:22:56,640
well, then Harry visited Hagrid.
2784
02:22:56,640 --> 02:23:00,400
So I'm saying implication, the premise is that it's not raining.
2785
02:23:00,400 --> 02:23:04,000
And if it is not raining, then Harry visited Hagrid.
2786
02:23:04,000 --> 02:23:07,640
And I can print out knowledge.formula to see the logical formula
2787
02:23:07,640 --> 02:23:09,600
equivalent of that same idea.
2788
02:23:09,600 --> 02:23:11,760
So I run Python of harry.py.
2789
02:23:11,760 --> 02:23:13,840
And this is the logical formula that we see
2790
02:23:13,840 --> 02:23:16,080
as a result, which is a text-based version of what
2791
02:23:16,080 --> 02:23:18,760
we were looking at before, that if it is not raining,
2792
02:23:18,760 --> 02:23:23,160
then that implies that Harry visited Hagrid.
2793
02:23:23,160 --> 02:23:26,560
But there was additional information that we had access to as well.
2794
02:23:26,560 --> 02:23:31,640
In this case, we had access to the fact that Harry visited either Hagrid
2795
02:23:31,640 --> 02:23:32,920
or Dumbledore.
2796
02:23:32,920 --> 02:23:34,520
So how do I encode that?
2797
02:23:34,520 --> 02:23:36,520
Well, this means that in my knowledge, I've really
2798
02:23:36,520 --> 02:23:38,400
got multiple pieces of knowledge going on.
2799
02:23:38,400 --> 02:23:41,520
I know one thing and another thing and another thing.
2800
02:23:41,520 --> 02:23:44,920
So I'll go ahead and wrap all of my knowledge inside of an and.
2801
02:23:44,920 --> 02:23:47,480
And I'll move things on to new lines just for good measure.
2802
02:23:47,480 --> 02:23:49,120
But I know multiple things.
2803
02:23:49,120 --> 02:23:52,600
So I'm saying knowledge is an and of multiple different sentences.
2804
02:23:52,600 --> 02:23:55,800
I know multiple different sentences to be true.
2805
02:23:55,800 --> 02:23:59,280
One such sentence that I know to be true is this implication,
2806
02:23:59,280 --> 02:24:02,600
that if it is not raining, then Harry visited Hagrid.
2807
02:24:02,600 --> 02:24:08,640
Another such sentence that I know to be true is Or of Hagrid and Dumbledore.
2808
02:24:08,640 --> 02:24:12,400
In other words, Hagrid or Dumbledore is true,
2809
02:24:12,400 --> 02:24:16,440
because I know that Harry visited Hagrid or Dumbledore.
2810
02:24:16,440 --> 02:24:17,800
But I know more than that, actually.
2811
02:24:17,800 --> 02:24:22,000
That initial sentence from before said that Harry visited Hagrid or Dumbledore,
2812
02:24:22,000 --> 02:24:23,320
but not both.
2813
02:24:23,320 --> 02:24:26,560
So now I want a sentence that will encode the idea that Harry didn't
2814
02:24:26,560 --> 02:24:29,840
visit both Hagrid and Dumbledore.
2815
02:24:29,840 --> 02:24:33,120
Well, the notion of Harry visiting Hagrid and Dumbledore
2816
02:24:33,120 --> 02:24:38,400
would be represented like this: And of Hagrid and Dumbledore.
2817
02:24:38,400 --> 02:24:41,600
And if that is not true, if I want to say not that,
2818
02:24:41,600 --> 02:24:46,000
then I'll just wrap this whole thing inside of a not.
2819
02:24:46,000 --> 02:24:50,040
So now these three lines, line 8 says that if it is not raining,
2820
02:24:50,040 --> 02:24:51,720
then Harry visited Hagrid.
2821
02:24:51,720 --> 02:24:55,760
Line 9 says Harry visited Hagrid or Dumbledore.
2822
02:24:55,760 --> 02:25:01,000
And line 10 says Harry didn't visit both Hagrid and Dumbledore,
2823
02:25:01,000 --> 02:25:04,840
that it is not true that both the Hagrid symbol and the Dumbledore
2824
02:25:04,840 --> 02:25:05,920
symbol are true.
2825
02:25:05,920 --> 02:25:08,240
Only one of them can be true.
2826
02:25:08,240 --> 02:25:11,360
And finally, the last piece of information that I knew
2827
02:25:11,360 --> 02:25:15,280
was the fact that Harry visited Dumbledore.
2828
02:25:15,280 --> 02:25:18,800
So these now are the pieces of knowledge that I know, one sentence
2829
02:25:18,800 --> 02:25:21,400
and another sentence and another and another.
2830
02:25:21,400 --> 02:25:24,600
And I can print out what I know just to see it a little bit more visually.
2831
02:25:24,600 --> 02:25:28,640
And here now is a logical representation of the information
2832
02:25:28,640 --> 02:25:31,320
that my computer is now internally representing
2833
02:25:31,320 --> 02:25:33,720
using these various different Python objects.
2834
02:25:33,720 --> 02:25:37,120
And again, take a look at logic.py if you want to take a look at how exactly
2835
02:25:37,120 --> 02:25:40,240
it's implementing this, but no need to worry too much about all of the details
2836
02:25:40,240 --> 02:25:40,840
there.
2837
02:25:40,840 --> 02:25:44,800
We're here saying that if it is not raining, then Harry visited Hagrid.
2838
02:25:44,800 --> 02:25:47,880
We're saying that Hagrid or Dumbledore is true.
2839
02:25:47,880 --> 02:25:52,560
And we're saying it is not the case that Hagrid and Dumbledore is true,
2840
02:25:52,560 --> 02:25:54,120
that they're not both true.
2841
02:25:54,120 --> 02:25:57,280
And we also know that Dumbledore is true.
2842
02:25:57,280 --> 02:26:01,200
So this long logical sentence represents our knowledge base.
2843
02:26:01,200 --> 02:26:03,600
It is the thing that we know.
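Those four sentences can also be encoded without the course library. In this self-contained sketch (my own encoding, using a plain boolean function for the knowledge base rather than logic.py classes), enumerating all models shows that the knowledge base entails that it rained:

```python
from itertools import product

def kb(rain, hagrid, dumbledore):
    return (((not (not rain)) or hagrid)     # not rain -> Harry visited Hagrid
            and (hagrid or dumbledore)       # visited Hagrid or Dumbledore
            and not (hagrid and dumbledore)  # but not both
            and dumbledore)                  # visited Dumbledore

# Entailment: in every model where the KB is true, is rain true?
entails_rain = all(rain
                   for rain, hagrid, dumbledore
                   in product([True, False], repeat=3)
                   if kb(rain, hagrid, dumbledore))

print(entails_rain)  # True -- it must have been raining
```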
2844
02:26:03,600 --> 02:26:06,600
And now what we'd like to do is we'd like to use model checking
2845
02:26:06,600 --> 02:26:10,320
to ask a query, to ask a question like, based on this information,
2846
02:26:10,320 --> 02:26:12,160
do I know whether or not it's raining?
2847
02:26:12,160 --> 02:26:15,200
And we as humans were able to logic our way through it and figure out that,
2848
02:26:15,200 --> 02:26:18,040
all right, based on these sentences, we can conclude this and that
2849
02:26:18,040 --> 02:26:20,400
to figure out that, yes, it must have been raining.
2850
02:26:20,400 --> 02:26:23,640
But now we'd like for the computer to do that as well.
2851
02:26:23,640 --> 02:26:26,000
So let's take a look at the model checking algorithm
2852
02:26:26,000 --> 02:26:27,840
that is going to follow that same pattern
2853
02:26:27,840 --> 02:26:30,400
that we drew out in pseudocode a moment ago.
2854
02:26:30,400 --> 02:26:32,400
So I've defined a function here in logic.py
2855
02:26:32,400 --> 02:26:35,880
that you can take a look at called model check.
2856
02:26:35,880 --> 02:26:39,480
Model check takes two arguments, the knowledge that I already know,
2857
02:26:39,480 --> 02:26:41,000
and the query.
2858
02:26:41,000 --> 02:26:43,440
And the idea is, in order to do model checking,
2859
02:26:43,440 --> 02:26:46,160
I need to enumerate all of the possible models.
2860
02:26:46,160 --> 02:26:49,280
And for each of the possible models, I need to ask myself,
2861
02:26:49,280 --> 02:26:50,960
is the knowledge base true?
2862
02:26:50,960 --> 02:26:52,800
And is the query true?
2863
02:26:52,800 --> 02:26:54,560
So the first thing I need to do is somehow
2864
02:26:54,560 --> 02:26:57,040
enumerate all of the possible models, meaning
2865
02:26:57,040 --> 02:26:59,720
for all possible symbols that exist, I need
2866
02:26:59,720 --> 02:27:02,440
to assign true and false to each one of them
2867
02:27:02,440 --> 02:27:05,200
and see whether or not it's still true.
2868
02:27:05,200 --> 02:27:07,520
And so here is the way we're going to do that.
2869
02:27:07,520 --> 02:27:08,800
We're going to start.
2870
02:27:08,800 --> 02:27:10,840
So I've defined another helper function internally
2871
02:27:10,840 --> 02:27:12,400
that we'll get to in just a moment.
2872
02:27:12,400 --> 02:27:17,080
But this function starts by getting all of the symbols in both the knowledge
2873
02:27:17,080 --> 02:27:20,000
and the query, by figuring out what symbols am I dealing with.
2874
02:27:20,000 --> 02:27:24,080
In this case, the symbols I'm dealing with are rain and Hagrid and Dumbledore,
2875
02:27:24,080 --> 02:27:26,600
but there might be other symbols depending on the problem.
2876
02:27:26,600 --> 02:27:29,680
And we'll take a look soon at some examples of situations
2877
02:27:29,680 --> 02:27:32,880
where ultimately we're going to need some additional symbols in order
2878
02:27:32,880 --> 02:27:34,720
to represent the problem.
2879
02:27:34,720 --> 02:27:38,200
And then we're going to run this check all function, which
2880
02:27:38,200 --> 02:27:41,960
is a helper function that's basically going to recursively call itself
2881
02:27:41,960 --> 02:27:46,880
checking every possible configuration of propositional symbols.
2882
02:27:46,880 --> 02:27:51,080
So we start out by looking at this check all function.
2883
02:27:51,080 --> 02:27:52,320
And what do we do?
2884
02:27:52,320 --> 02:27:57,040
So if not symbols, that means we've finished assigning all of the symbols.
2885
02:27:57,040 --> 02:27:58,840
We've assigned every symbol a value.
2886
02:27:58,840 --> 02:28:03,280
So far we haven't done that, but if we ever do, then we check.
2887
02:28:03,280 --> 02:28:05,440
In this model, is the knowledge true?
2888
02:28:05,440 --> 02:28:06,720
That's what this line is saying.
2889
02:28:06,720 --> 02:28:10,000
If we evaluate the knowledge propositional logic formula
2890
02:28:10,000 --> 02:28:14,280
using the model's assignment of truth values, is the knowledge true?
2891
02:28:14,280 --> 02:28:19,480
If the knowledge is true, then we should return true only if the query is true.
2892
02:28:19,480 --> 02:28:22,080
Because if the knowledge is true, we want the query
2893
02:28:22,080 --> 02:28:25,400
to be true as well in order for there to be entailment.
2894
02:28:25,400 --> 02:28:29,040
Otherwise, there won't be an entailment
2895
02:28:29,040 --> 02:28:33,000
if there's ever a situation where what we know in our knowledge is true,
2896
02:28:33,000 --> 02:28:36,200
but the query, the thing we're asking, happens to be false.
2897
02:28:36,200 --> 02:28:38,120
So this line here is checking that same idea
2898
02:28:38,120 --> 02:28:44,000
that in all worlds where the knowledge is true, the query must also be true.
2899
02:28:44,000 --> 02:28:47,720
Otherwise, we can just return true because if the knowledge isn't true,
2900
02:28:47,720 --> 02:28:48,720
then we don't care.
2901
02:28:48,720 --> 02:28:50,520
This is equivalent to when we were enumerating
2902
02:28:50,520 --> 02:28:52,240
this table from a moment ago.
2903
02:28:52,240 --> 02:28:56,080
In all situations where the knowledge base wasn't true, all of these seven
2904
02:28:56,080 --> 02:29:00,160
rows here, we didn't care whether or not our query was true or not.
2905
02:29:00,160 --> 02:29:03,080
We only care to check whether the query is true
2906
02:29:03,080 --> 02:29:06,920
when the knowledge base is actually true, which was just this green highlighted
2907
02:29:06,920 --> 02:29:08,840
row right there.
2908
02:29:08,840 --> 02:29:12,560
So that logic is encoded using that statement there.
2909
02:29:12,560 --> 02:29:15,200
And otherwise, if we haven't assigned symbols yet,
2910
02:29:15,200 --> 02:29:18,200
which at the start we haven't, then the first thing we do
2911
02:29:18,200 --> 02:29:20,640
is pop one of the symbols.
2912
02:29:20,640 --> 02:29:23,520
I make a copy of the symbols first just to save an existing copy.
2913
02:29:23,520 --> 02:29:26,200
But I pop one symbol off of the remaining symbols
2914
02:29:26,200 --> 02:29:29,080
so that I just pick one symbol at random.
2915
02:29:29,080 --> 02:29:33,680
And I create one copy of the model where that symbol is true.
2916
02:29:33,680 --> 02:29:38,040
And I create a second copy of the model where that symbol is false.
2917
02:29:38,040 --> 02:29:41,480
So I now have two copies of the model, one where the symbol is true
2918
02:29:41,480 --> 02:29:43,200
and one where the symbol is false.
2919
02:29:43,200 --> 02:29:47,920
And I need to make sure that this entailment holds in both of those models.
2920
02:29:47,920 --> 02:29:52,200
So I recursively check all on the model where the statement is true
2921
02:29:52,200 --> 02:29:57,080
and check all on the model where the statement is false.
2922
02:29:57,080 --> 02:29:59,120
So again, you can take a look at that function
2923
02:29:59,120 --> 02:30:02,120
to try to get a sense for how exactly this logic is working.
2924
02:30:02,120 --> 02:30:03,960
But in effect, what it's doing is recursively
2925
02:30:03,960 --> 02:30:07,000
calling this check all function again and again and again.
2926
02:30:07,000 --> 02:30:09,160
And on every level of the recursion, we're
2927
02:30:09,160 --> 02:30:13,280
saying let's pick a new symbol that we haven't yet assigned,
2928
02:30:13,280 --> 02:30:16,000
assign it to true and assign it to false,
2929
02:30:16,000 --> 02:30:19,360
and then check to make sure that the entailment holds in both cases.
2930
02:30:19,360 --> 02:30:22,160
Because ultimately, I need to check every possible world.
2931
02:30:22,160 --> 02:30:24,360
I need to take every combination of symbols
2932
02:30:24,360 --> 02:30:27,520
and try every combination of true and false
2933
02:30:27,520 --> 02:30:31,960
in order to figure out whether the entailment relation actually holds.
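Here is a compact sketch of that recursion — not the course's actual logic.py implementation, just the same idea, with a "sentence" represented as a plain Python function that maps a model (a dict of symbol names to booleans) to True or False:

```python
# A sketch of model checking by recursive enumeration (assumed stand-in
# for the lecture's implementation, not the actual logic.py source).

def model_check(knowledge, query, symbols):
    """Return True iff `knowledge` entails `query` over `symbols`."""

    def check_all(symbols, model):
        # Base case: every symbol has been assigned a truth value.
        if not symbols:
            # In any world where the knowledge holds, the query must too.
            return query(model) if knowledge(model) else True
        # Recursive case: pick one unassigned symbol and branch on it.
        remaining = symbols.copy()
        p = remaining.pop()
        return (check_all(remaining, {**model, p: True})
                and check_all(remaining, {**model, p: False}))

    return check_all(list(symbols), {})

# P and (P implies Q) entails Q, but tells us nothing about R:
kb = lambda m: m["P"] and (not m["P"] or m["Q"])
print(model_check(kb, lambda m: m["Q"], ["P", "Q", "R"]))  # True
print(model_check(kb, lambda m: m["R"], ["P", "Q", "R"]))  # False
```

Each level of the recursion branches twice, so with n symbols the function ends up checking 2 to the power n models — which is why model checking gets expensive as the number of symbols grows.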
2934
02:30:31,960 --> 02:30:34,320
So that function we've written for you.
2935
02:30:34,320 --> 02:30:37,720
But in order to use that function inside of harry.py,
2936
02:30:37,720 --> 02:30:39,520
what I'll write is something like this.
2937
02:30:39,520 --> 02:30:43,300
I would like to model check based on the knowledge.
2938
02:30:43,300 --> 02:30:46,240
And then I provide as a second argument what the query is,
2939
02:30:46,240 --> 02:30:48,120
what the thing I want to ask is.
2940
02:30:48,120 --> 02:30:51,880
And what I want to ask in this case is, is it raining?
2941
02:30:51,880 --> 02:30:54,040
So model check again takes two arguments.
2942
02:30:54,040 --> 02:30:57,480
The first argument is the information that I know, this knowledge,
2943
02:30:57,480 --> 02:31:01,960
which in this case is this information that was given to me at the beginning.
2944
02:31:01,960 --> 02:31:06,800
And the second argument, rain, is encoding the idea of the query.
2945
02:31:06,800 --> 02:31:07,720
What am I asking?
2946
02:31:07,720 --> 02:31:10,120
I would like to ask, based on this knowledge,
2947
02:31:10,120 --> 02:31:13,360
do I know for sure that it is raining?
2948
02:31:13,360 --> 02:31:17,200
And I can try and print out the result of that.
2949
02:31:17,200 --> 02:31:20,680
And when I run this program, I see that the answer is true.
2950
02:31:20,680 --> 02:31:23,200
That based on this information, I can conclusively
2951
02:31:23,200 --> 02:31:26,800
say that it is raining, because using this model checking algorithm,
2952
02:31:26,800 --> 02:31:30,920
we were able to check that in every world where this knowledge is true,
2953
02:31:30,920 --> 02:31:31,720
it is raining.
2954
02:31:31,720 --> 02:31:35,240
In other words, there is no world where this knowledge is true,
2955
02:31:35,240 --> 02:31:36,680
and it is not raining.
2956
02:31:36,680 --> 02:31:41,200
So you can conclude that it is, in fact, raining.
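That model check call can be sketched end to end like this. The knowledge sentences come from earlier in the lecture and are an assumption here, written as plain boolean expressions rather than the logic.py classes:

```python
from itertools import product

# Assumed knowledge from earlier in the lecture:
#   If it didn't rain, Harry visited Hagrid.
#   Harry visited Hagrid or Dumbledore, but not both.
#   Harry visited Dumbledore.
symbols = ["rain", "hagrid", "dumbledore"]

def knowledge(m):
    return ((m["rain"] or m["hagrid"])                 # not rain -> hagrid
            and (m["hagrid"] or m["dumbledore"])       # one or the other
            and not (m["hagrid"] and m["dumbledore"])  # but not both
            and m["dumbledore"])

def model_check(knowledge, query):
    # Entailment: the query holds in every model where the knowledge holds.
    return all(query(dict(zip(symbols, values)))
               for values in product([True, False], repeat=len(symbols))
               if knowledge(dict(zip(symbols, values))))

print(model_check(knowledge, lambda m: m["rain"]))  # True: it must be raining
```

Only one of the eight models satisfies the knowledge (rain true, Hagrid false, Dumbledore true), and it is raining in that model, so the entailment holds.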
2957
02:31:41,200 --> 02:31:43,640
And this sort of logic can be applied to a number
2958
02:31:43,640 --> 02:31:47,200
of different types of problems, that if confronted with a problem where
2959
02:31:47,200 --> 02:31:50,880
some sort of logical deduction can be used in order to try to solve it,
2960
02:31:50,880 --> 02:31:54,080
you might try thinking about what propositional symbols you might
2961
02:31:54,080 --> 02:31:56,440
need in order to represent that information,
2962
02:31:56,440 --> 02:31:58,880
and what statements in propositional logic
2963
02:31:58,880 --> 02:32:03,400
you might use in order to encode that information which you know.
2964
02:32:03,400 --> 02:32:05,720
And this process of trying to take a problem
2965
02:32:05,720 --> 02:32:08,420
and figure out what propositional symbols to use in order
2966
02:32:08,420 --> 02:32:11,520
to encode that idea, or how to represent it logically,
2967
02:32:11,520 --> 02:32:13,640
is known as knowledge engineering.
2968
02:32:13,640 --> 02:32:16,520
That software engineers and AI engineers will take a problem
2969
02:32:16,520 --> 02:32:19,000
and try and figure out how to distill it down
2970
02:32:19,000 --> 02:32:22,640
into knowledge that is representable by a computer.
2971
02:32:22,640 --> 02:32:25,240
And if we can take any general purpose problem, some problem
2972
02:32:25,240 --> 02:32:27,200
that we find in the human world, and turn it
2973
02:32:27,200 --> 02:32:30,160
into a problem that computers know how to solve
2974
02:32:30,160 --> 02:32:32,600
by using any number of different variables, well,
2975
02:32:32,600 --> 02:32:35,320
then we can take a computer that is able to do something
2976
02:32:35,320 --> 02:32:37,960
like model checking or some other inference algorithm
2977
02:32:37,960 --> 02:32:41,960
and actually figure out how to solve that problem.
2978
02:32:41,960 --> 02:32:45,320
So now we'll take a look at two or three examples of knowledge engineering
2979
02:32:45,320 --> 02:32:47,960
and practice, of taking some problem and figuring out
2980
02:32:47,960 --> 02:32:51,760
how we can apply logical symbols and use logical formulas
2981
02:32:51,760 --> 02:32:53,960
to be able to encode that idea.
2982
02:32:53,960 --> 02:32:57,040
And we'll start with a very popular board game in the US and the UK
2983
02:32:57,040 --> 02:32:58,040
known as Clue.
2984
02:32:58,040 --> 02:33:00,880
Now, in the game of Clue, there's a number of different factors
2985
02:33:00,880 --> 02:33:01,720
that are going on.
2986
02:33:01,720 --> 02:33:04,360
But the basic premise of the game, if you've never played it before,
2987
02:33:04,360 --> 02:33:06,200
is that there are a number of different people.
2988
02:33:06,200 --> 02:33:09,120
For now, we'll just use three, Colonel Mustard, Professor Plum,
2989
02:33:09,120 --> 02:33:10,160
and Miss Scarlet.
2990
02:33:10,160 --> 02:33:12,800
There are a number of different rooms, like a ballroom, a kitchen,
2991
02:33:12,800 --> 02:33:13,480
and a library.
2992
02:33:13,480 --> 02:33:17,400
And there are a number of different weapons, a knife, a revolver, and a wrench.
2993
02:33:17,400 --> 02:33:21,880
And three of these, one person, one room, and one weapon,
2994
02:33:21,880 --> 02:33:26,720
is the solution to the mystery, the murderer and what room they were in
2995
02:33:26,720 --> 02:33:28,520
and what weapon they happened to use.
2996
02:33:28,520 --> 02:33:30,640
And what happens at the beginning of the game
2997
02:33:30,640 --> 02:33:32,800
is that all these cards are randomly shuffled together.
2998
02:33:32,800 --> 02:33:35,360
And three of them, one person, one room, and one weapon,
2999
02:33:35,360 --> 02:33:37,800
are placed into a sealed envelope that we don't get to see.
3000
02:33:37,800 --> 02:33:41,360
And we would like to figure out, using some sort of logical process,
3001
02:33:41,360 --> 02:33:45,560
what's inside the envelope, which person, which room, and which weapon.
3002
02:33:45,560 --> 02:33:50,000
And we do so by looking at some, but not all, of these cards here,
3003
02:33:50,000 --> 02:33:54,920
by looking at these cards to try and figure out what might be going on.
3004
02:33:54,920 --> 02:33:56,480
And so this is a very popular game.
3005
02:33:56,480 --> 02:33:58,280
But let's now try and formalize it and see
3006
02:33:58,280 --> 02:34:01,960
if we could train a computer to be able to play this game by reasoning
3007
02:34:01,960 --> 02:34:04,120
through it logically.
3008
02:34:04,120 --> 02:34:06,620
So in order to do this, we'll begin by thinking about what
3009
02:34:06,620 --> 02:34:09,480
propositional symbols we're ultimately going to need.
3010
02:34:09,480 --> 02:34:12,560
Remember, again, that propositional symbols are just some symbol,
3011
02:34:12,560 --> 02:34:17,560
some variable, that can be either true or false in the world.
3012
02:34:17,560 --> 02:34:20,040
And so in this case, the propositional symbols
3013
02:34:20,040 --> 02:34:25,000
are really just going to correspond to each of the possible things that
3014
02:34:25,000 --> 02:34:26,480
could be inside the envelope.
3015
02:34:26,480 --> 02:34:29,360
Mustard is a propositional symbol that, in this case,
3016
02:34:29,360 --> 02:34:32,640
will just be true if Colonel Mustard is inside the envelope,
3017
02:34:32,640 --> 02:34:35,200
if he is the murderer, and false otherwise.
3018
02:34:35,200 --> 02:34:38,520
And likewise for Plum, for Professor Plum, and Scarlet, for Miss Scarlet.
3019
02:34:38,520 --> 02:34:41,600
And likewise for each of the rooms and for each of the weapons.
3020
02:34:41,600 --> 02:34:46,120
We have one propositional symbol for each of these ideas.
3021
02:34:46,120 --> 02:34:48,560
Then using those propositional symbols, we
3022
02:34:48,560 --> 02:34:52,320
can begin to create logical sentences, create knowledge
3023
02:34:52,320 --> 02:34:54,320
that we know about the world.
3024
02:34:54,320 --> 02:34:57,520
So for example, we know that someone is the murderer,
3025
02:34:57,520 --> 02:35:00,560
that one of the three people is, in fact, the murderer.
3026
02:35:00,560 --> 02:35:01,880
And how would we encode that?
3027
02:35:01,880 --> 02:35:04,280
Well, we don't know for sure who the murderer is.
3028
02:35:04,280 --> 02:35:09,320
But we know it is one person or the second person or the third person.
3029
02:35:09,320 --> 02:35:10,760
So I could say something like this.
3030
02:35:10,760 --> 02:35:13,760
Mustard or Plum or Scarlet.
3031
02:35:13,760 --> 02:35:17,280
And this piece of knowledge encodes that one of these three people
3032
02:35:17,280 --> 02:35:17,960
is the murderer.
3033
02:35:17,960 --> 02:35:22,680
We don't know which, but one of these three things must be true.
3034
02:35:22,680 --> 02:35:24,320
What other information do we know?
3035
02:35:24,320 --> 02:35:26,640
Well, we know that, for example, one of the rooms
3036
02:35:26,640 --> 02:35:28,960
must have been the room in the envelope.
3037
02:35:28,960 --> 02:35:33,120
The crime was committed either in the ballroom or the kitchen or the library.
3038
02:35:33,120 --> 02:35:34,720
Again, right now, we don't know which.
3039
02:35:34,720 --> 02:35:36,440
But this is knowledge we know at the outset,
3040
02:35:36,440 --> 02:35:40,440
knowledge that one of these three must be inside the envelope.
3041
02:35:40,440 --> 02:35:42,480
And likewise, we can say the same thing about the weapon,
3042
02:35:42,480 --> 02:35:45,480
that it was either the knife or the revolver or the wrench,
3043
02:35:45,480 --> 02:35:48,480
that one of those weapons must have been the weapon of choice
3044
02:35:48,480 --> 02:35:51,640
and therefore the weapon in the envelope.
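Those three sentences might be written down like this, using small stand-in classes whose names mirror the lecture's library (the implementation here is assumed, not the actual logic.py source):

```python
# Minimal stand-ins for the lecture's Symbol/Or/And classes (a sketch).
class Symbol:
    def __init__(self, name): self.name = name
    def evaluate(self, model): return model[self.name]
    def formula(self): return self.name

class Or:
    def __init__(self, *args): self.args = list(args)
    def evaluate(self, model): return any(a.evaluate(model) for a in self.args)
    def formula(self): return "(" + " ∨ ".join(a.formula() for a in self.args) + ")"

class And:
    def __init__(self, *args): self.args = list(args)
    def add(self, arg): self.args.append(arg)  # append one more sentence
    def evaluate(self, model): return all(a.evaluate(model) for a in self.args)
    def formula(self): return " ∧ ".join(a.formula() for a in self.args)

mustard, plum, scarlet = Symbol("mustard"), Symbol("plum"), Symbol("scarlet")
ballroom, kitchen, library = Symbol("ballroom"), Symbol("kitchen"), Symbol("library")
knife, revolver, wrench = Symbol("knife"), Symbol("revolver"), Symbol("wrench")

# Initial knowledge: one person, one room, one weapon is in the envelope.
knowledge = And(
    Or(mustard, plum, scarlet),
    Or(ballroom, kitchen, library),
    Or(knife, revolver, wrench),
)
print(knowledge.formula())
```

Evaluating `knowledge` against any complete assignment of the nine symbols tells you whether that world is consistent with what you know so far.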
3045
02:35:51,640 --> 02:35:53,680
And then as the game progresses, the gameplay
3046
02:35:53,680 --> 02:35:55,840
works by people get various different cards.
3047
02:35:55,840 --> 02:35:59,000
And using those cards, you can deduce information.
3048
02:35:59,000 --> 02:36:01,080
That if someone gives you a card, for example,
3049
02:36:01,080 --> 02:36:04,200
I have the Professor Plum card in my hand,
3050
02:36:04,200 --> 02:36:07,960
then I know the Professor Plum card can't be inside the envelope.
3051
02:36:07,960 --> 02:36:11,320
I know that Professor Plum is not the criminal,
3052
02:36:11,320 --> 02:36:15,040
so I know a piece of information like not Plum, for example.
3053
02:36:15,040 --> 02:36:18,440
I know that Professor Plum has to be false.
3054
02:36:18,440 --> 02:36:21,360
This propositional symbol is not true.
3055
02:36:21,360 --> 02:36:24,760
And sometimes I might not know for sure that a particular card is not
3056
02:36:24,760 --> 02:36:27,080
in the middle, but sometimes someone will make a guess
3057
02:36:27,080 --> 02:36:30,280
and I'll know that one of three possibilities is not true.
3058
02:36:30,280 --> 02:36:33,600
Someone will guess Colonel Mustard in the library with the revolver
3059
02:36:33,600 --> 02:36:35,000
or something to that effect.
3060
02:36:35,000 --> 02:36:38,080
And in that case, a card might be revealed that I don't see.
3061
02:36:38,080 --> 02:36:43,040
But if it is a card and it is either Colonel Mustard or the revolver
3062
02:36:43,040 --> 02:36:46,760
or the library, then I know that at least one of them
3063
02:36:46,760 --> 02:36:47,760
can't be in the middle.
3064
02:36:47,760 --> 02:36:51,240
So I know something like it is either not Mustard
3065
02:36:51,240 --> 02:36:55,360
or it is not the library or it is not the revolver.
3066
02:36:55,360 --> 02:36:57,200
Now maybe multiple of these are not true,
3067
02:36:57,200 --> 02:37:01,200
but I know that at least one of Mustard, Library, and Revolver
3068
02:37:01,200 --> 02:37:03,920
must, in fact, be false.
3069
02:37:03,920 --> 02:37:07,920
And so this now is a propositional logic representation
3070
02:37:07,920 --> 02:37:10,640
of this game of Clue, a way of encoding the knowledge that we
3071
02:37:10,640 --> 02:37:13,560
know inside this game using propositional logic
3072
02:37:13,560 --> 02:37:15,960
that a computer algorithm, something like model checking
3073
02:37:15,960 --> 02:37:19,920
that we saw a moment ago, can actually look at and understand.
3074
02:37:19,920 --> 02:37:21,920
So let's now take a look at some code to see
3075
02:37:21,920 --> 02:37:26,920
how this algorithm might actually work in practice.
3076
02:37:26,920 --> 02:37:30,000
All right, so I'm now going to open up a file called clue.py, which
3077
02:37:30,000 --> 02:37:31,200
I've started already.
3078
02:37:31,200 --> 02:37:33,520
And what we'll see here is I've defined a couple of things.
3079
02:37:33,520 --> 02:37:35,720
I've defined some symbols initially. Notice I
3080
02:37:35,720 --> 02:37:38,680
have a symbol for Colonel Mustard, a symbol for Professor Plum,
3081
02:37:38,680 --> 02:37:40,480
a symbol for Miss Scarlet, all of which
3082
02:37:40,480 --> 02:37:42,600
I've put inside of this list of characters.
3083
02:37:42,600 --> 02:37:45,400
I have a symbol for Ballroom and Kitchen and Library
3084
02:37:45,400 --> 02:37:46,960
inside of a list of rooms.
3085
02:37:46,960 --> 02:37:49,440
And then I have symbols for Knife and Revolver and Wrench.
3086
02:37:49,440 --> 02:37:50,760
These are my weapons.
3087
02:37:50,760 --> 02:37:53,760
And so all of these characters and rooms and weapons altogether,
3088
02:37:53,760 --> 02:37:55,840
those are my symbols.
3089
02:37:55,840 --> 02:37:59,200
And now I also have this check knowledge function.
3090
02:37:59,200 --> 02:38:02,760
And what the check knowledge function does is it takes my knowledge
3091
02:38:02,760 --> 02:38:07,280
and it's going to try and draw conclusions about what I know.
3092
02:38:07,280 --> 02:38:10,920
So for example, we'll loop over all of the possible symbols
3093
02:38:10,920 --> 02:38:13,680
and we'll check, do I know that that symbol is true?
3094
02:38:13,680 --> 02:38:15,960
And a symbol is going to be something like Professor Plum
3095
02:38:15,960 --> 02:38:17,400
or the Knife or the Library.
3096
02:38:17,400 --> 02:38:19,520
And if I know that it is true, in other words,
3097
02:38:19,520 --> 02:38:22,400
I know that it must be the card in the envelope,
3098
02:38:22,400 --> 02:38:24,880
then I'm going to print out using a function called
3099
02:38:24,880 --> 02:38:26,720
cprint, which prints things in color.
3100
02:38:26,720 --> 02:38:28,520
I'm going to print out the word yes, and I'm
3101
02:38:28,520 --> 02:38:32,240
going to print that in green, just to make it very clear to us.
3102
02:38:32,240 --> 02:38:35,160
If we're not sure that the symbol is true,
3103
02:38:35,160 --> 02:38:38,640
maybe I can check to see if I'm sure that the symbol is not true.
3104
02:38:38,640 --> 02:38:42,560
Like if I know for sure that it is not Professor Plum, for example.
3105
02:38:42,560 --> 02:38:44,840
And I do that by running model check again,
3106
02:38:44,840 --> 02:38:48,320
this time checking if my knowledge is not the symbol,
3107
02:38:48,320 --> 02:38:52,200
if I know for sure that the symbol is not true.
3108
02:38:52,200 --> 02:38:55,160
And if I don't know for sure that the symbol is not true,
3109
02:38:55,160 --> 02:38:59,480
because I say if not model check, meaning I'm not sure that the symbol is
3110
02:38:59,480 --> 02:39:03,080
false, well, then I'll go ahead and print out maybe next to the symbol.
3111
02:39:03,080 --> 02:39:07,920
Because maybe the symbol is true, maybe it's not, I don't actually know.
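The yes/maybe/no logic just described can be sketched like this, again with a toy model_check standing in for the real library and the symbols trimmed to just the three people:

```python
from itertools import product

SYMBOLS = ["mustard", "plum", "scarlet"]  # trimmed to just the people

def model_check(knowledge, query):
    models = [dict(zip(SYMBOLS, values))
              for values in product([True, False], repeat=len(SYMBOLS))]
    return all(query(m) for m in models if knowledge(m))

def check_knowledge(knowledge):
    """Label each symbol YES (entailed), NO (negation entailed), or MAYBE."""
    results = {}
    for s in SYMBOLS:
        if model_check(knowledge, lambda m: m[s]):
            results[s] = "YES"            # must be in the envelope
        elif not model_check(knowledge, lambda m: not m[s]):
            results[s] = "MAYBE"          # neither it nor its negation entailed
        else:
            results[s] = "NO"             # its negation is entailed
    return results

# One of the three people did it, and we hold the Mustard and Plum cards:
kb = lambda m: ((m["mustard"] or m["plum"] or m["scarlet"])
                and not m["mustard"] and not m["plum"])
print(check_knowledge(kb))  # {'mustard': 'NO', 'plum': 'NO', 'scarlet': 'YES'}
```

The MAYBE branch is the interesting one: a symbol is a maybe precisely when neither the symbol nor its negation is entailed by the knowledge.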
3112
02:39:07,920 --> 02:39:10,280
So what knowledge do I actually have?
3113
02:39:10,280 --> 02:39:12,360
Well, let's try and represent my knowledge now.
3114
02:39:12,360 --> 02:39:16,440
So my knowledge is, I know a couple of things, so I'll put them in an and.
3115
02:39:16,440 --> 02:39:20,280
And I know that one of the three people must be the criminal.
3116
02:39:20,280 --> 02:39:23,920
So I know Or(mustard, plum, scarlet).
3117
02:39:23,920 --> 02:39:26,920
This is my way of encoding that it is either Colonel Mustard or Professor
3118
02:39:26,920 --> 02:39:28,680
Plum or Miss Scarlet.
3119
02:39:28,680 --> 02:39:31,080
I know that it must have happened in one of the rooms.
3120
02:39:31,080 --> 02:39:36,040
So I know Or(ballroom, kitchen, library), for example.
3121
02:39:36,040 --> 02:39:38,800
And I know that one of the weapons must have been used as well.
3122
02:39:38,800 --> 02:39:43,320
So I know Or(knife, revolver, wrench).
3123
02:39:43,320 --> 02:39:45,280
So that might be my initial knowledge, that I
3124
02:39:45,280 --> 02:39:47,040
know that it must have been one of the people,
3125
02:39:47,040 --> 02:39:48,840
I know it must have been in one of the rooms,
3126
02:39:48,840 --> 02:39:51,800
and I know that it must have been one of the weapons.
3127
02:39:51,800 --> 02:39:54,120
And I can see what that knowledge looks like as a formula
3128
02:39:54,120 --> 02:39:56,920
by printing out knowledge.formula.
3129
02:39:56,920 --> 02:39:58,960
So I'll run python clue.py.
3130
02:39:58,960 --> 02:40:02,440
And here now is the information that I know in logical format.
3131
02:40:02,440 --> 02:40:05,800
I know that it is Colonel Mustard or Professor Plum or Miss Scarlet.
3132
02:40:05,800 --> 02:40:08,760
And I know that it is the ballroom, the kitchen, or the library.
3133
02:40:08,760 --> 02:40:11,800
And I know that it is the knife, the revolver, or the wrench.
3134
02:40:11,800 --> 02:40:13,800
But I don't know much more than that.
3135
02:40:13,800 --> 02:40:16,240
I can't really draw any firm conclusions.
3136
02:40:16,240 --> 02:40:19,320
And in fact, we can see that if I try and do,
3137
02:40:19,320 --> 02:40:24,000
let me go ahead and run my knowledge check function on my knowledge.
3138
02:40:24,000 --> 02:40:27,880
Check knowledge, rather,
3139
02:40:27,880 --> 02:40:31,240
is this function that I just wrote that looks over all of the symbols
3140
02:40:31,240 --> 02:40:33,600
and tries to see what conclusions I can actually
3141
02:40:33,600 --> 02:40:36,280
draw about any of the symbols.
3142
02:40:36,280 --> 02:40:41,600
So I'll go ahead and run clue.py and see what it is that I know.
3143
02:40:41,600 --> 02:40:43,840
And it seems that I don't really know anything for sure.
3144
02:40:43,840 --> 02:40:47,200
I have all three people are maybes, all three of the rooms are maybes,
3145
02:40:47,200 --> 02:40:48,840
all three of the weapons are maybes.
3146
02:40:48,840 --> 02:40:52,120
I don't really know anything for certain just yet.
3147
02:40:52,120 --> 02:40:54,720
But now let me try and add some additional information
3148
02:40:54,720 --> 02:40:57,400
and see if additional information, additional knowledge,
3149
02:40:57,400 --> 02:41:00,560
can help us to logically reason our way through this process.
3150
02:41:00,560 --> 02:41:02,640
And we are just going to provide the information.
3151
02:41:02,640 --> 02:41:05,760
Our AI is going to take care of doing the inference
3152
02:41:05,760 --> 02:41:09,200
and figuring out what conclusions it's able to draw.
3153
02:41:09,200 --> 02:41:11,200
So I start with some cards.
3154
02:41:11,200 --> 02:41:12,720
And those cards tell me something.
3155
02:41:12,720 --> 02:41:15,600
So if I have the Colonel Mustard card, for example,
3156
02:41:15,600 --> 02:41:19,480
I know that the mustard symbol must be false.
3157
02:41:19,480 --> 02:41:22,480
In other words, mustard is not the one in the envelope,
3158
02:41:22,480 --> 02:41:23,680
is not the criminal.
3159
02:41:23,680 --> 02:41:26,840
Well, knowledge supports something:
3160
02:41:26,840 --> 02:41:30,240
every And in this library supports .add,
3161
02:41:30,240 --> 02:41:32,280
which is a way of adding knowledge or adding
3162
02:41:32,280 --> 02:41:35,480
an additional logical sentence to an And clause.
3163
02:41:35,480 --> 02:41:40,280
So I can say knowledge.add(Not(mustard)).
3164
02:41:40,280 --> 02:41:42,920
I happen to know, because I have the mustard card,
3165
02:41:42,920 --> 02:41:44,960
that Colonel Mustard is not the suspect.
3166
02:41:44,960 --> 02:41:46,840
And maybe I have a couple of other cards too.
3167
02:41:46,840 --> 02:41:49,200
Maybe I also have a card for the kitchen.
3168
02:41:49,200 --> 02:41:50,760
So I know it's not the kitchen.
3169
02:41:50,760 --> 02:41:54,480
And maybe I have another card that says that it is not the revolver.
3170
02:41:54,480 --> 02:41:57,480
So I have three cards: Colonel Mustard, the kitchen, and the revolver.
3171
02:41:57,480 --> 02:42:01,880
And I encode that into my AI this way by saying, it's not Colonel Mustard,
3172
02:42:01,880 --> 02:42:04,400
it's not the kitchen, and it's not the revolver.
3173
02:42:04,400 --> 02:42:06,320
And I know those to be true.
3174
02:42:06,320 --> 02:42:09,640
So now, when I rerun clue.py, we'll see that I've
3175
02:42:09,640 --> 02:42:12,240
been able to eliminate some possibilities.
3176
02:42:12,240 --> 02:42:15,920
Before, I wasn't sure if it was the knife or the revolver or the wrench.
3177
02:42:15,920 --> 02:42:18,760
The knife was a maybe, the revolver was a maybe, the wrench was a maybe.
3178
02:42:18,760 --> 02:42:21,080
Now I'm down to just the knife and the wrench.
3179
02:42:21,080 --> 02:42:23,080
Between those two, I don't know which one it is.
3180
02:42:23,080 --> 02:42:24,160
They're both maybes.
3181
02:42:24,160 --> 02:42:27,040
But I've been able to eliminate the revolver, which
3182
02:42:27,040 --> 02:42:31,840
is one that I know to be false, because I have the revolver card.
3183
02:42:31,840 --> 02:42:34,640
And so additional information might be acquired
3184
02:42:34,640 --> 02:42:36,080
over the course of this game.
3185
02:42:36,080 --> 02:42:41,280
And we would represent that just by adding knowledge to our knowledge set
3186
02:42:41,280 --> 02:42:43,320
or knowledge base that we've been building here.
3187
02:42:43,320 --> 02:42:46,120
So if, for example, we additionally got the information
3188
02:42:46,120 --> 02:42:49,320
that someone made a guess, someone guessed like Miss Scarlet
3189
02:42:49,320 --> 02:42:51,000
in the library with the wrench.
3190
02:42:51,000 --> 02:42:53,880
And we know that a card was revealed, which
3191
02:42:53,880 --> 02:42:56,520
means that one of those three cards, either Miss Scarlet
3192
02:42:56,520 --> 02:42:59,760
or the library or the wrench, one of those at minimum
3193
02:42:59,760 --> 02:43:02,400
must not be inside of the envelope.
3194
02:43:02,400 --> 02:43:05,640
So I could add some knowledge, say knowledge.add.
3195
02:43:05,640 --> 02:43:09,080
And I'm going to add an or clause, because I don't know for sure which one
3196
02:43:09,080 --> 02:43:12,200
it's not, but I know one of them is not in the envelope.
3197
02:43:12,200 --> 02:43:15,600
So it's either not Scarlet, or it's not the library,
3198
02:43:15,600 --> 02:43:17,080
and Or supports multiple arguments.
3199
02:43:17,080 --> 02:43:20,600
I can say it's also or not the wrench.
3200
02:43:20,600 --> 02:43:23,320
So of those three, Scarlet, library, and wrench,
3201
02:43:23,320 --> 02:43:25,280
At least one of those needs to be false.
3202
02:43:25,280 --> 02:43:26,320
I don't know which, though.
3203
02:43:26,320 --> 02:43:27,240
Maybe it's multiple.
3204
02:43:27,240 --> 02:43:32,280
Maybe it's just one, but at least one I know needs to hold.
3205
02:43:32,280 --> 02:43:35,120
And so now if I rerun clue.py, I don't actually
3206
02:43:35,120 --> 02:43:37,560
have any additional information just yet.
3207
02:43:37,560 --> 02:43:38,880
Nothing I can say conclusively.
3208
02:43:38,880 --> 02:43:41,960
I still know that maybe it's Professor Plum, maybe it's Miss Scarlet.
3209
02:43:41,960 --> 02:43:44,520
I haven't eliminated any options.
3210
02:43:44,520 --> 02:43:46,520
But let's imagine that I get some more information,
3211
02:43:46,520 --> 02:43:50,360
that someone shows me the Professor Plum card, for example.
3212
02:43:50,360 --> 02:43:57,040
So I say, all right, let's go back here: knowledge.add(Not(plum)).
3213
02:43:57,040 --> 02:43:58,440
So I have the Professor Plum card.
3214
02:43:58,440 --> 02:44:00,600
I know that Professor Plum is not in the middle.
3215
02:44:00,600 --> 02:44:02,400
I rerun clue.py.
3216
02:44:02,400 --> 02:44:04,920
And right now, I'm able to draw some conclusions.
3217
02:44:04,920 --> 02:44:07,160
Now I've been able to eliminate Professor Plum,
3218
02:44:07,160 --> 02:44:10,320
and the only remaining person it could be is Miss Scarlet.
3219
02:44:10,320 --> 02:44:14,320
So I know, yes, Miss Scarlet, this variable must be true.
3220
02:44:14,320 --> 02:44:17,720
And I've been able to infer that based on the information I already had.
3221
02:44:17,720 --> 02:44:20,640
Now between the ballroom and the library and the knife and the wrench,
3222
02:44:20,640 --> 02:44:22,600
for those two, I'm still not sure.
3223
02:44:22,600 --> 02:44:25,200
So let's add one more piece of information.
3224
02:44:25,200 --> 02:44:28,200
Let's say that I know that it's not the ballroom.
3225
02:44:28,200 --> 02:44:30,960
Someone has shown me the ballroom card, so I know it's not the ballroom.
3226
02:44:30,960 --> 02:44:33,960
Which means at this point, I should be able to conclude that it's the library.
3227
02:44:33,960 --> 02:44:35,040
Let's see.
3228
02:44:35,040 --> 02:44:40,000
I'll say knowledge.add(Not(ballroom)).
3229
02:44:40,000 --> 02:44:43,080
And we'll go ahead and run that.
3230
02:44:43,080 --> 02:44:46,240
And it turns out that after all of this, not only can I conclude that I
3231
02:44:46,240 --> 02:44:49,840
know that it's the library, but I also know that the weapon was the knife.
3232
02:44:49,840 --> 02:44:52,720
And that might have been an inference that was a little bit trickier, something
3233
02:44:52,720 --> 02:44:55,160
I wouldn't have realized immediately, but the AI,
3234
02:44:55,160 --> 02:44:58,320
via this model checking algorithm, is able to draw that conclusion,
3235
02:44:58,320 --> 02:45:02,400
that we know for sure that it must be Miss Scarlet in the library with the knife.
3236
02:45:02,400 --> 02:45:03,800
And how did we know that?
3237
02:45:03,800 --> 02:45:07,480
Well, we know it from this or clause up here,
3238
02:45:07,480 --> 02:45:11,520
that we know that it's either not Scarlet, or it's not the library,
3239
02:45:11,520 --> 02:45:13,440
or it's not the wrench.
3240
02:45:13,440 --> 02:45:16,000
And given that we know that it is Miss Scarlet,
3241
02:45:16,000 --> 02:45:20,200
and we know that it is the library, then the only remaining option for the weapon
3242
02:45:20,200 --> 02:45:24,360
is that it is not the wrench, which means that it must be the knife.
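The whole chain of deductions can be reproduced in a small self-contained sketch, with plain dictionaries and lambdas standing in for the lecture's logic.py classes:

```python
from itertools import product

symbols = ["mustard", "plum", "scarlet",
           "ballroom", "kitchen", "library",
           "knife", "revolver", "wrench"]

def entails(clauses, query):
    """True iff every model satisfying all clauses also satisfies the query."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(clause(model) for clause in clauses) and not query(model):
            return False
    return True

clauses = [
    lambda m: m["mustard"] or m["plum"] or m["scarlet"],      # some person
    lambda m: m["ballroom"] or m["kitchen"] or m["library"],  # some room
    lambda m: m["knife"] or m["revolver"] or m["wrench"],     # some weapon
    lambda m: not m["mustard"],    # cards in our own hand
    lambda m: not m["kitchen"],
    lambda m: not m["revolver"],
    # the refuted guess: Scarlet, library, wrench can't all be in the envelope
    lambda m: not m["scarlet"] or not m["library"] or not m["wrench"],
    lambda m: not m["plum"],       # shown the Professor Plum card
    lambda m: not m["ballroom"],   # shown the ballroom card
]

for s in ["scarlet", "library", "knife"]:
    print(s, entails(clauses, lambda m: m[s]))
```

With all nine pieces of knowledge in place, the enumeration confirms the conclusion from the lecture: Miss Scarlet, in the library, with the knife.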
3243
02:45:24,360 --> 02:45:26,760
So we as humans now can go back and reason through that,
3244
02:45:26,760 --> 02:45:28,920
even though it might not have been immediately clear.
3245
02:45:28,920 --> 02:45:32,840
And that's one of the advantages of using an AI or some sort of algorithm
3246
02:45:32,840 --> 02:45:36,520
in order to do this, is that the computer can exhaust all of these possibilities
3247
02:45:36,520 --> 02:45:40,720
and try and figure out what the solution actually should be.
3248
02:45:40,720 --> 02:45:43,240
And so for that reason, it's often helpful to be
3249
02:45:43,240 --> 02:45:45,040
able to represent knowledge in this way.
3250
02:45:45,040 --> 02:45:47,280
Knowledge engineering, some situation where
3251
02:45:47,280 --> 02:45:50,240
we can use a computer to be able to represent knowledge
3252
02:45:50,240 --> 02:45:52,760
and draw conclusions based on that knowledge.
3253
02:45:52,760 --> 02:45:56,440
And any time we can translate something into propositional logic symbols
3254
02:45:56,440 --> 02:45:59,400
like this, this type of approach can be useful.
3255
02:45:59,400 --> 02:46:01,360
So you might be familiar with logic puzzles,
3256
02:46:01,360 --> 02:46:04,120
where you have to puzzle your way through trying to figure something out.
3257
02:46:04,120 --> 02:46:06,520
This is what a classic logic puzzle might look like.
3258
02:46:06,520 --> 02:46:09,640
Something like Gilderoy, Minerva, Pomona, and Horace each
3259
02:46:09,640 --> 02:46:14,000
belong to a different one of the four houses, Gryffindor, Hufflepuff, Ravenclaw,
3260
02:46:14,000 --> 02:46:15,080
and Slytherin.
3261
02:46:15,080 --> 02:46:16,640
And then we have some information.
3262
02:46:16,640 --> 02:46:20,160
That Gilderoy belongs to Gryffindor or Ravenclaw, Pomona
3263
02:46:20,160 --> 02:46:24,360
does not belong in Slytherin, and Minerva does belong to Gryffindor.
3264
02:46:24,360 --> 02:46:26,200
So we have a couple pieces of information.
3265
02:46:26,200 --> 02:46:28,200
And using that information, we need to be
3266
02:46:28,200 --> 02:46:31,200
able to draw some conclusions about which person should
3267
02:46:31,200 --> 02:46:33,240
be assigned to which house.
3268
02:46:33,240 --> 02:46:37,600
And again, we can use the exact same idea to try and implement this notion.
3269
02:46:37,600 --> 02:46:39,720
So we need some propositional symbols.
3270
02:46:39,720 --> 02:46:41,440
And in this case, the propositional symbols
3271
02:46:41,440 --> 02:46:43,800
are going to get a little more complex, although we'll
3272
02:46:43,800 --> 02:46:46,480
see ways to make this a little bit cleaner later on.
3273
02:46:46,480 --> 02:46:51,960
But we'll need 16 propositional symbols, one for each person and house.
3274
02:46:51,960 --> 02:46:54,560
So we need to say, remember, every propositional symbol
3275
02:46:54,560 --> 02:46:56,480
is either true or false.
3276
02:46:56,480 --> 02:46:59,280
So Gilderoy Gryffindor is either true or false.
3277
02:46:59,280 --> 02:47:01,480
Either he's in Gryffindor or he is not.
3278
02:47:01,480 --> 02:47:03,720
Likewise, Gilderoy Hufflepuff also true or false.
3279
02:47:03,720 --> 02:47:05,880
Either it is true or it's false.
3280
02:47:05,880 --> 02:47:09,760
And that's true for every combination of person and house
3281
02:47:09,760 --> 02:47:10,880
that we could come up with.
3282
02:47:10,880 --> 02:47:14,880
We have some sort of propositional symbol for each one of those.
3283
02:47:14,880 --> 02:47:17,200
Using this type of knowledge, we can then
3284
02:47:17,200 --> 02:47:20,560
begin to think about what types of logical sentences
3285
02:47:20,560 --> 02:47:22,440
we can say about the puzzle.
3286
02:47:22,440 --> 02:47:25,480
Before we even think about the information we were
3287
02:47:25,480 --> 02:47:28,000
given, we can think about the premise of the problem,
3288
02:47:28,000 --> 02:47:31,560
that every person is assigned to a different house.
3289
02:47:31,560 --> 02:47:32,760
So what does that tell us?
3290
02:47:32,760 --> 02:47:34,440
Well, it tells us sentences like this.
3291
02:47:34,440 --> 02:47:39,880
It tells us like Pomona Slytherin implies not Pomona Hufflepuff.
3292
02:47:39,880 --> 02:47:42,240
Something like if Pomona is in Slytherin,
3293
02:47:42,240 --> 02:47:44,680
then we know that Pomona is not in Hufflepuff.
3294
02:47:44,680 --> 02:47:48,480
And we know this for all four people and for all combinations of houses,
3295
02:47:48,480 --> 02:47:51,320
that no matter what person you pick, if they're in one house,
3296
02:47:51,320 --> 02:47:53,600
then they're not in some other house.
3297
02:47:53,600 --> 02:47:56,120
So I'll probably have a whole bunch of knowledge statements
3298
02:47:56,120 --> 02:47:59,040
that are of this form, that if we know Pomona is in Slytherin,
3299
02:47:59,040 --> 02:48:01,760
then we know Pomona is not in Hufflepuff.
3300
02:48:01,760 --> 02:48:04,120
We were also given the information that each person
3301
02:48:04,120 --> 02:48:05,720
is in a different house.
3302
02:48:05,720 --> 02:48:08,560
So I also have pieces of knowledge that look something like this.
3303
02:48:08,560 --> 02:48:13,200
Minerva Ravenclaw implies not Gilderoy Ravenclaw.
3304
02:48:13,200 --> 02:48:16,600
If they're all in different houses, then if Minerva is in Ravenclaw,
3305
02:48:16,600 --> 02:48:20,040
then we know that Gilderoy is not in Ravenclaw as well.
3306
02:48:20,040 --> 02:48:22,040
And I have a whole bunch of similar sentences
3307
02:48:22,040 --> 02:48:26,320
like this that are expressing that idea for other people and other houses
3308
02:48:26,320 --> 02:48:27,480
as well.
3309
02:48:27,480 --> 02:48:29,760
And so in addition to sentences of these form,
3310
02:48:29,760 --> 02:48:32,120
I also have the knowledge that was given to me.
3311
02:48:32,120 --> 02:48:35,880
Information like Gilderoy was in Gryffindor or in Ravenclaw
3312
02:48:35,880 --> 02:48:39,640
that would be represented like this, Gilderoy Gryffindor or Gilderoy
3313
02:48:39,640 --> 02:48:40,640
Ravenclaw.
3314
02:48:40,640 --> 02:48:42,920
And then using these sorts of sentences,
3315
02:48:42,920 --> 02:48:46,720
I can begin to draw some conclusions about the world.
3316
02:48:46,720 --> 02:48:48,120
So let's see an example of this.
3317
02:48:48,120 --> 02:48:50,800
We'll go ahead and actually try and implement this logic puzzle
3318
02:48:50,800 --> 02:48:53,360
to see if we can figure out what the answer is.
3319
02:48:53,360 --> 02:48:56,680
I'll go ahead and open up puzzle.py, where I've already
3320
02:48:56,680 --> 02:48:58,840
started to implement this sort of idea.
3321
02:48:58,840 --> 02:49:01,880
I've defined a list of people and a list of houses.
3322
02:49:01,880 --> 02:49:06,760
And I've so far created one symbol for every person and for every house.
3323
02:49:06,760 --> 02:49:09,600
That's what this double for loop is doing, looping over all people,
3324
02:49:09,600 --> 02:49:13,560
looping over all houses, creating a new symbol for each of them.
3325
02:49:13,560 --> 02:49:16,240
And then I've added some information.
3326
02:49:16,240 --> 02:49:19,320
I know that every person belongs to a house,
3327
02:49:19,320 --> 02:49:24,200
so I've added the information for every person that person Gryffindor
3328
02:49:24,200 --> 02:49:28,240
or person Hufflepuff or person Ravenclaw or person Slytherin,
3329
02:49:28,240 --> 02:49:30,820
that one of those four things must be true.
3330
02:49:30,820 --> 02:49:33,220
Every person belongs to a house.
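The setup described here can be sketched in plain Python, with one string symbol per person-house pair and one "belongs to some house" disjunction per person (lists of symbols stand in for the lecture's logic-library objects, whose exact API isn't shown here):

```python
people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# one propositional symbol per person-house combination, e.g. "MinervaGryffindor"
symbols = [f"{person}{house}" for person in people for house in houses]

# "every person belongs to a house": a four-way or over that person's symbols
belongs_somewhere = [[f"{person}{house}" for house in houses] for person in people]
```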
3331
02:49:33,220 --> 02:49:34,840
What other information do I know?
3332
02:49:34,840 --> 02:49:37,960
I also know that there's only one house per person,
3333
02:49:37,960 --> 02:49:41,680
so no person belongs to multiple houses.
3334
02:49:41,680 --> 02:49:42,840
So how does this work?
3335
02:49:42,840 --> 02:49:44,720
Well, this is going to be true for all people.
3336
02:49:44,720 --> 02:49:47,080
So I'll loop over every person.
3337
02:49:47,080 --> 02:49:51,080
And then I need to loop over all different pairs of houses.
3338
02:49:51,080 --> 02:49:54,840
The idea is I want to encode the idea that if Minerva is in Gryffindor,
3339
02:49:54,840 --> 02:49:57,480
then Minerva can't be in Ravenclaw.
3340
02:49:57,480 --> 02:49:59,760
So I'll loop over all houses, h1.
3341
02:49:59,760 --> 02:50:02,580
And I'll loop over all houses again, h2.
3342
02:50:02,580 --> 02:50:06,200
And as long as they're different, h1 not equal to h2,
3343
02:50:06,200 --> 02:50:09,200
then I'll add to my knowledge base this piece of information.
3344
02:50:09,200 --> 02:50:14,320
That implication, in other words, an if-then: if the person is in h1,
3345
02:50:14,320 --> 02:50:18,560
then I know that they are not in house h2.
3346
02:50:18,560 --> 02:50:22,160
So these lines here are encoding the notion that for every person,
3347
02:50:22,160 --> 02:50:25,920
if they belong to house one, then they are not in house two.
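That double loop might look something like this sketch, with pair tuples standing in for the logic library's implication objects:

```python
people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# for every person: if they are in h1, then they are not in any different h2
implications = []
for person in people:
    for h1 in houses:
        for h2 in houses:
            if h1 != h2:
                implications.append((f"{person}{h1}", ("not", f"{person}{h2}")))
```

That yields 4 people times 4 × 3 ordered house pairs, so 48 implications in all.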
3348
02:50:25,920 --> 02:50:27,880
And the other piece of logic we need to encode
3349
02:50:27,880 --> 02:50:30,920
is the idea that every house can only have one person.
3350
02:50:30,920 --> 02:50:33,120
In other words, if Pomona is in Hufflepuff,
3351
02:50:33,120 --> 02:50:35,960
then nobody else is allowed to be in Hufflepuff either.
3352
02:50:35,960 --> 02:50:37,960
And that's the same logic, but sort of backwards.
3353
02:50:37,960 --> 02:50:42,840
I loop over all of the houses and loop over all different pairs of people.
3354
02:50:42,840 --> 02:50:45,600
So I loop over people once, loop over people again,
3355
02:50:45,600 --> 02:50:50,120
and only do this when the people are different, p1 not equal to p2.
3356
02:50:50,120 --> 02:50:54,960
And I add the knowledge that if, as given by the implication,
3357
02:50:54,960 --> 02:50:58,360
if person one belongs to the house, then it
3358
02:50:58,360 --> 02:51:03,880
is not the case that person two belongs to the same house.
3359
02:51:03,880 --> 02:51:05,800
So here I'm just encoding the knowledge that
3360
02:51:05,800 --> 02:51:07,880
represents the problem's constraints.
3361
02:51:07,880 --> 02:51:09,760
I know that everyone's in a different house.
3362
02:51:09,760 --> 02:51:12,800
I know that any person can only belong to one house.
3363
02:51:12,800 --> 02:51:17,480
And I can now take my knowledge and try and print out the information
3364
02:51:17,480 --> 02:51:18,600
that I happen to know.
3365
02:51:18,600 --> 02:51:22,120
So I'll go ahead and print out knowledge.formula,
3366
02:51:22,120 --> 02:51:24,880
just to see this in action, and I'll go ahead and skip this for now.
3367
02:51:24,880 --> 02:51:26,840
But we'll come back to this in a second.
3368
02:51:26,840 --> 02:51:31,840
Let's print out the knowledge that I know by running Python puzzle.py.
3369
02:51:31,840 --> 02:51:34,320
It's a lot of information, a lot that I have to scroll through,
3370
02:51:34,320 --> 02:51:36,960
because there are 16 different variables all going on.
3371
02:51:36,960 --> 02:51:39,320
But the basic idea, if we scroll up to the very top,
3372
02:51:39,320 --> 02:51:41,040
is I see my initial information.
3373
02:51:41,040 --> 02:51:44,560
Gilderoy is either in Gryffindor, or Gilderoy is in Hufflepuff,
3374
02:51:44,560 --> 02:51:48,040
or Gilderoy is in Ravenclaw, or Gilderoy is in Slytherin,
3375
02:51:48,040 --> 02:51:50,920
and then way more information as well.
3376
02:51:50,920 --> 02:51:54,040
So this is quite messy, more than we really want to be looking at.
3377
02:51:54,040 --> 02:51:55,920
And soon, too, we'll see ways of representing
3378
02:51:55,920 --> 02:51:58,200
this a little bit more nicely using logic.
3379
02:51:58,200 --> 02:52:00,400
But for now, we can just say these are the variables
3380
02:52:00,400 --> 02:52:01,520
that we're dealing with.
3381
02:52:01,520 --> 02:52:05,560
And now we'd like to add some information.
3382
02:52:05,560 --> 02:52:09,560
So the information we're going to add is Gilderoy is in Gryffindor,
3383
02:52:09,560 --> 02:52:10,680
or he is in Ravenclaw.
3384
02:52:10,680 --> 02:52:12,680
So that knowledge was given to us.
3385
02:52:12,680 --> 02:52:15,520
So I'll go ahead and say knowledge.add.
3386
02:52:15,520 --> 02:52:26,400
And I know that it's either Gilderoy Gryffindor or Gilderoy Ravenclaw.
3387
02:52:26,400 --> 02:52:29,280
One of those two things must be true.
3388
02:52:29,280 --> 02:52:32,200
I also know that Pomona was not in Slytherin,
3389
02:52:32,200 --> 02:52:37,680
so I can say knowledge.add not this symbol, not the Pomona-Slytherin
3390
02:52:37,680 --> 02:52:38,720
symbol.
3391
02:52:38,720 --> 02:52:42,240
And then I can add the knowledge that Minerva is in Gryffindor
3392
02:52:42,240 --> 02:52:46,760
by adding the symbol Minerva Gryffindor.
3393
02:52:46,760 --> 02:52:49,200
So those are the pieces of knowledge that I know.
3394
02:52:49,200 --> 02:52:52,920
And this loop here at the bottom just loops over all of my symbols,
3395
02:52:52,920 --> 02:52:56,040
checks to see if the knowledge entails that symbol
3396
02:52:56,040 --> 02:52:58,520
by calling this model check function again.
3397
02:52:58,520 --> 02:53:03,600
And if it does, if we know the symbol is true, we print out the symbol.
3398
02:53:03,600 --> 02:53:07,000
So now I can run python puzzle.py, and Python
3399
02:53:07,000 --> 02:53:08,880
is going to solve this puzzle for me.
3400
02:53:08,880 --> 02:53:11,520
We're able to conclude that Gilderoy belongs to Ravenclaw,
3401
02:53:11,520 --> 02:53:15,480
Pomona belongs to Hufflepuff, Minerva to Gryffindor, and Horace to Slytherin
3402
02:53:15,480 --> 02:53:18,120
just by encoding this knowledge inside the computer,
3403
02:53:18,120 --> 02:53:20,360
although it was quite tedious to do in this case.
3404
02:53:20,360 --> 02:53:24,880
And as a result, we were able to get the conclusion from that as well.
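The whole puzzle can be reproduced with a brute-force model check in plain Python, enumerating all 2^16 truth assignments over the person-house symbols (a sketch of the idea, not the lecture's logic library):

```python
from itertools import product

people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
symbols = [(p, h) for p in people for h in houses]

def satisfies(model):
    # structural constraints: each person in exactly one house,
    # and no house holding more than one person
    for p in people:
        if sum(model[(p, h)] for h in houses) != 1:
            return False
    for h in houses:
        if sum(model[(p, h)] for p in people) > 1:
            return False
    # the three given clues
    return ((model[("Gilderoy", "Gryffindor")] or model[("Gilderoy", "Ravenclaw")])
            and not model[("Pomona", "Slytherin")]
            and model[("Minerva", "Gryffindor")])

# model checking: enumerate every possible world, keep the consistent ones
consistent = []
for values in product([True, False], repeat=len(symbols)):
    model = dict(zip(symbols, values))
    if satisfies(model):
        consistent.append(model)

# a symbol is entailed when it is true in every consistent model
entailed = [s for s in symbols if all(m[s] for m in consistent)]
print(entailed)
```

Only one world survives, so the four true symbols — Gilderoy in Ravenclaw, Pomona in Hufflepuff, Minerva in Gryffindor, Horace in Slytherin — are all entailed.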
3405
02:53:24,880 --> 02:53:27,240
And you can imagine this being applied to many sorts
3406
02:53:27,240 --> 02:53:29,000
of different deductive situations.
3407
02:53:29,000 --> 02:53:31,120
So not only these situations where we're trying
3408
02:53:31,120 --> 02:53:33,640
to deal with Harry Potter characters in this puzzle,
3409
02:53:33,640 --> 02:53:35,800
but if you've ever played games like Mastermind, where
3410
02:53:35,800 --> 02:53:39,040
you're trying to figure out which order different colors go in
3411
02:53:39,040 --> 02:53:40,840
and trying to make predictions about it, I
3412
02:53:40,840 --> 02:53:44,600
could tell you, for example, let's play a simplified version of Mastermind
3413
02:53:44,600 --> 02:53:47,760
where there are four colors, red, blue, green, and yellow,
3414
02:53:47,760 --> 02:53:51,000
and they're in some order, but I'm not telling you what order.
3415
02:53:51,000 --> 02:53:53,080
You just have to make a guess, and I'll tell you
3416
02:53:53,080 --> 02:53:55,400
of red, blue, green, and yellow how many of the four
3417
02:53:55,400 --> 02:53:57,320
you got in the right position.
3418
02:53:57,320 --> 02:53:59,480
So a simplified version of this game, you
3419
02:53:59,480 --> 02:54:01,800
might make a guess like red, blue, green, yellow,
3420
02:54:01,800 --> 02:54:05,320
and I would tell you something like two of those four
3421
02:54:05,320 --> 02:54:08,040
are in the correct position, but the other two are not.
3422
02:54:08,040 --> 02:54:10,560
And then you could reasonably make a guess and say, all right,
3423
02:54:10,560 --> 02:54:13,000
look at this, blue, red, green, yellow.
3424
02:54:13,000 --> 02:54:16,040
Try switching two of them around, and this time maybe I tell you,
3425
02:54:16,040 --> 02:54:19,480
you know what, none of those are in the correct position.
3426
02:54:19,480 --> 02:54:23,000
And the question then is, all right, what is the correct order
3427
02:54:23,000 --> 02:54:24,360
of these four colors?
3428
02:54:24,360 --> 02:54:26,240
And we as humans could begin to reason this through.
3429
02:54:26,240 --> 02:54:28,760
All right, well, if none of these were correct,
3430
02:54:28,760 --> 02:54:31,280
but two of these were correct, well, it must have been
3431
02:54:31,280 --> 02:54:34,560
because I switched the red and the blue, which means red and blue here
3432
02:54:34,560 --> 02:54:37,440
must be correct, which means green and yellow are probably not correct.
3433
02:54:37,440 --> 02:54:40,400
You can begin to do this sort of deductive reasoning.
3434
02:54:40,400 --> 02:54:42,840
And we can also equivalently try and take this
3435
02:54:42,840 --> 02:54:45,400
and encode it inside of our computer as well.
3436
02:54:45,400 --> 02:54:48,000
And it's going to be very similar to the logic puzzle
3437
02:54:48,000 --> 02:54:49,480
that we just did a moment ago.
3438
02:54:49,480 --> 02:54:52,520
So I won't spend too much time on this code because it is fairly similar.
3439
02:54:52,520 --> 02:54:54,920
But again, we have a whole bunch of colors
3440
02:54:54,920 --> 02:54:58,600
and four different positions in which those colors can be.
3441
02:54:58,600 --> 02:55:00,440
And then we have some additional knowledge.
3442
02:55:00,440 --> 02:55:02,120
And I encode all of that knowledge.
3443
02:55:02,120 --> 02:55:04,960
And you can take a look at this code on your own time.
3444
02:55:04,960 --> 02:55:07,880
But I just want to demonstrate that when we run this code,
3445
02:55:07,880 --> 02:55:12,720
run python mastermind.py and see what we get,
3446
02:55:12,720 --> 02:55:16,880
we ultimately are able to compute red in the 0 position,
3447
02:55:16,880 --> 02:55:19,460
blue in the 1 position, yellow in the 2 position,
3448
02:55:19,460 --> 02:55:24,160
and green in the 3 position as the ordering of those symbols.
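Since the four colors are in some order, one simple way to sketch this (differing from the lecture's propositional encoding) is to enumerate the 24 permutations and keep those consistent with the feedback from the two guesses described above:

```python
from itertools import permutations

colors = ["red", "blue", "green", "yellow"]

def in_position(guess, order):
    """How many colors of the guess are in the correct position."""
    return sum(g == o for g, o in zip(guess, order))

guess1 = ["red", "blue", "green", "yellow"]   # feedback: exactly 2 correct
guess2 = ["blue", "red", "green", "yellow"]   # feedback: 0 correct

solutions = [order for order in permutations(colors)
             if in_position(guess1, order) == 2 and in_position(guess2, order) == 0]
print(solutions)   # the single consistent ordering
```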
3449
02:55:24,160 --> 02:55:25,840
Now, ultimately, what you might have noticed
3450
02:55:25,840 --> 02:55:28,560
is this process was taking quite a long time.
3451
02:55:28,560 --> 02:55:32,360
And in fact, model checking is not a particularly efficient algorithm, right?
3452
02:55:32,360 --> 02:55:34,320
What I need to do in order to model check
3453
02:55:34,320 --> 02:55:36,800
is take all of my possible different variables
3454
02:55:36,800 --> 02:55:39,480
and enumerate all of the possibilities that they could be in.
3455
02:55:39,480 --> 02:55:44,040
If I have n variables, I have 2 to the n possible worlds
3456
02:55:44,040 --> 02:55:45,840
that I need to be looking through in order
3457
02:55:45,840 --> 02:55:48,060
to perform this model checking algorithm.
3458
02:55:48,060 --> 02:55:50,440
And this is probably not tractable, especially
3459
02:55:50,440 --> 02:55:53,480
as we start to get to much larger and larger sets of data
3460
02:55:53,480 --> 02:55:56,320
where you have many, many more variables that are at play.
3461
02:55:56,320 --> 02:55:59,240
Right here, we only have a relatively small number of variables.
3462
02:55:59,240 --> 02:56:01,560
So this sort of approach can actually work.
3463
02:56:01,560 --> 02:56:04,800
But as the number of variables increases, model checking
3464
02:56:04,800 --> 02:56:07,240
becomes less and less good of a way of trying
3465
02:56:07,240 --> 02:56:09,560
to solve these sorts of problems.
3466
02:56:09,560 --> 02:56:12,240
So while it might have been OK for something like Mastermind
3467
02:56:12,240 --> 02:56:15,280
to conclude that this is indeed the correct sequence where all four
3468
02:56:15,280 --> 02:56:17,720
are in the correct position, what we'd like to do
3469
02:56:17,720 --> 02:56:21,760
is come up with some better ways to be able to make inferences rather than
3470
02:56:21,760 --> 02:56:24,600
just enumerate all of the possibilities.
3471
02:56:24,600 --> 02:56:26,960
And to do so, what we'll transition to next
3472
02:56:26,960 --> 02:56:29,880
is the idea of inference rules, some sort of rules
3473
02:56:29,880 --> 02:56:33,200
that we can apply to take knowledge that already exists
3474
02:56:33,200 --> 02:56:36,000
and translate it into new forms of knowledge.
3475
02:56:36,000 --> 02:56:38,440
And the general way we'll structure an inference rule
3476
02:56:38,440 --> 02:56:40,840
is by having a horizontal line here.
3477
02:56:40,840 --> 02:56:44,240
Anything above the line is going to represent a premise, something
3478
02:56:44,240 --> 02:56:45,960
that we know to be true.
3479
02:56:45,960 --> 02:56:48,680
And then anything below the line will be the conclusion
3480
02:56:48,680 --> 02:56:53,360
that we can arrive at after we apply the logic from the inference rule
3481
02:56:53,360 --> 02:56:54,640
that we're going to demonstrate.
3482
02:56:54,640 --> 02:56:56,140
So we'll do some of these inference rules
3483
02:56:56,140 --> 02:56:59,040
by demonstrating them in English first, but then translating them
3484
02:56:59,040 --> 02:57:01,120
into the world of propositional logic so you
3485
02:57:01,120 --> 02:57:04,800
can see what those inference rules actually look like.
3486
02:57:04,800 --> 02:57:07,000
So for example, let's imagine that I have access
3487
02:57:07,000 --> 02:57:08,720
to two pieces of information.
3488
02:57:08,720 --> 02:57:11,320
I know, for example, that if it is raining,
3489
02:57:11,320 --> 02:57:14,120
then Harry is inside, for example.
3490
02:57:14,120 --> 02:57:16,960
And let's say I also know it is raining.
3491
02:57:16,960 --> 02:57:19,460
Then most of us could reasonably look at this information
3492
02:57:19,460 --> 02:57:23,880
and conclude that, all right, Harry must be inside.
3493
02:57:23,880 --> 02:57:27,160
This inference rule is known as modus ponens,
3494
02:57:27,160 --> 02:57:29,920
and it's phrased more formally in logic as this.
3495
02:57:29,920 --> 02:57:35,480
If we know that alpha implies beta, in other words, if alpha, then beta,
3496
02:57:35,480 --> 02:57:38,380
and we also know that alpha is true, then we
3497
02:57:38,380 --> 02:57:41,560
should be able to conclude that beta is also true.
3498
02:57:41,560 --> 02:57:45,640
We can apply this inference rule to take these two pieces of information
3499
02:57:45,640 --> 02:57:47,860
and generate this new piece of information.
3500
02:57:47,860 --> 02:57:51,080
Notice that this is a totally different approach from the model checking
3501
02:57:51,080 --> 02:57:54,520
approach, where the approach was look at all of the possible worlds
3502
02:57:54,520 --> 02:57:56,720
and see what's true in each of these worlds.
3503
02:57:56,720 --> 02:57:59,240
Here, we're not dealing with any specific world.
3504
02:57:59,240 --> 02:58:01,560
We're just dealing with the knowledge that we know
3505
02:58:01,560 --> 02:58:04,480
and what conclusions we can arrive at based on that knowledge.
3506
02:58:04,480 --> 02:58:10,040
That I know that A implies B, and I know A, and the conclusion is B.
3507
02:58:10,040 --> 02:58:12,680
And this should seem like a relatively obvious rule.
3508
02:58:12,680 --> 02:58:16,160
But of course, if alpha, then beta, and we know alpha,
3509
02:58:16,160 --> 02:58:19,160
then we should be able to conclude that beta is also true.
3510
02:58:19,160 --> 02:58:21,400
And that's going to be true for many, maybe even
3511
02:58:21,400 --> 02:58:23,560
all of the inference rules that we'll take a look at.
3512
02:58:23,560 --> 02:58:25,320
You should be able to look at them and say,
3513
02:58:25,320 --> 02:58:27,360
yeah, of course that's going to be true.
3514
02:58:27,360 --> 02:58:30,440
But it's putting these all together, figuring out the right combination
3515
02:58:30,440 --> 02:58:32,920
of inference rules that can be applied that ultimately
3516
02:58:32,920 --> 02:58:38,440
is going to allow us to generate interesting knowledge inside of our AI.
3517
02:58:38,440 --> 02:58:41,440
So that's modus ponens, this application of implication,
3518
02:58:41,440 --> 02:58:44,640
that if we know alpha and we know that alpha implies beta,
3519
02:58:44,640 --> 02:58:47,280
then we can conclude beta.
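A minimal sketch of applying modus ponens repeatedly, with plain strings as symbols and a made-up `("implies", a, b)` tuple encoding for implications:

```python
def apply_modus_ponens(kb):
    """Repeatedly add b to the knowledge whenever ('implies', a, b) and a are both known."""
    derived = set(kb)
    changed = True
    while changed:
        changed = False
        for fact in list(derived):
            if isinstance(fact, tuple) and fact[0] == "implies":
                _, a, b = fact
                if a in derived and b not in derived:
                    derived.add(b)
                    changed = True
    return derived

# if it is raining, then Harry is inside; and it is raining
kb = {("implies", "rain", "harry inside"), "rain"}
```

Calling `apply_modus_ponens(kb)` would then also contain "harry inside", and chains of implications get followed too.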
3520
02:58:47,280 --> 02:58:48,760
Let's take a look at another example.
3521
02:58:48,760 --> 02:58:52,560
Fairly straightforward, something like Harry is friends with Ron and Hermione.
3522
02:58:52,560 --> 02:58:54,800
Based on that information, we can reasonably
3523
02:58:54,800 --> 02:58:56,920
conclude Harry is friends with Hermione.
3524
02:58:56,920 --> 02:58:58,760
That must also be true.
3525
02:58:58,760 --> 02:59:01,880
And this inference rule is known as and elimination.
3526
02:59:01,880 --> 02:59:06,920
And what and elimination says is that if we have a situation where alpha
3527
02:59:06,920 --> 02:59:11,560
and beta are both true, I have information alpha and beta,
3528
02:59:11,560 --> 02:59:14,440
well then, just alpha is true.
3529
02:59:14,440 --> 02:59:16,560
Or likewise, just beta is true.
3530
02:59:16,560 --> 02:59:19,800
That if I know that both parts are true, then one of those parts
3531
02:59:19,800 --> 02:59:21,040
must also be true.
3532
02:59:21,040 --> 02:59:24,360
Again, something obvious from the point of view of human intuition,
3533
02:59:24,360 --> 02:59:27,160
but a computer needs to be told this kind of information.
3534
02:59:27,160 --> 02:59:28,960
To be able to apply the inference rule, we
3535
02:59:28,960 --> 02:59:32,200
need to tell the computer that this is an inference rule that you can apply,
3536
02:59:32,200 --> 02:59:35,160
so the computer has access to it and is able to use it
3537
02:59:35,160 --> 02:59:39,880
in order to translate information from one form to another.
3538
02:59:39,880 --> 02:59:42,720
In addition to that, let's take a look at another example of an inference
3539
02:59:42,720 --> 02:59:48,600
rule, something like it is not true that Harry did not pass the test.
3540
02:59:48,600 --> 02:59:50,000
Bit of a tricky sentence to parse.
3541
02:59:50,000 --> 02:59:50,840
I'll read it again.
3542
02:59:50,840 --> 02:59:54,960
It is not true, or it is false, that Harry did not pass the test.
3543
02:59:54,960 --> 02:59:58,800
Well, if it is false that Harry did not pass the test,
3544
02:59:58,800 --> 03:00:02,840
then the only reasonable conclusion is that Harry did pass the test.
3545
03:00:02,840 --> 03:00:05,120
And so this, instead of being and elimination,
3546
03:00:05,120 --> 03:00:07,560
is what we call double negation elimination.
3547
03:00:07,560 --> 03:00:10,360
That if we have two negatives inside of our premise,
3548
03:00:10,360 --> 03:00:12,080
then we can just remove them altogether.
3549
03:00:12,080 --> 03:00:13,120
They cancel each other out.
3550
03:00:13,120 --> 03:00:17,440
One turns true to false, and the other one turns false back into true.
3551
03:00:17,440 --> 03:00:19,300
Phrased a little bit more formally, we say
3552
03:00:19,300 --> 03:00:23,800
that if the premise is not alpha, then the conclusion
3553
03:00:23,800 --> 03:00:25,780
we can draw is just alpha.
3554
03:00:25,780 --> 03:00:28,400
We can say that alpha is true.
3555
03:00:28,400 --> 03:00:30,280
We'll take a look at a couple more of these.
3556
03:00:30,280 --> 03:00:33,960
If I have it is raining, then Harry is inside.
3557
03:00:33,960 --> 03:00:35,920
How do I reframe this?
3558
03:00:35,920 --> 03:00:37,800
Well, this one is a little bit trickier.
3559
03:00:37,800 --> 03:00:41,080
But if I know if it is raining, then Harry is inside,
3560
03:00:41,080 --> 03:00:43,960
then I conclude one of two things must be true.
3561
03:00:43,960 --> 03:00:48,280
Either it is not raining, or Harry is inside.
3562
03:00:48,280 --> 03:00:49,280
Now, this one's trickier.
3563
03:00:49,280 --> 03:00:50,820
So let's think about it a little bit.
3564
03:00:50,820 --> 03:00:54,400
This first premise here, if it is raining, then Harry is inside,
3565
03:00:54,400 --> 03:00:59,200
is saying that if I know that it is raining, then Harry must be inside.
3566
03:00:59,200 --> 03:01:01,840
So what is the other possible case?
3567
03:01:01,840 --> 03:01:06,760
Well, if Harry is not inside, then I know that it must not be raining.
3568
03:01:06,760 --> 03:01:09,640
So one of those two situations must be true.
3569
03:01:09,640 --> 03:01:14,800
Either it's not raining, or it is raining, in which case Harry is inside.
3570
03:01:14,800 --> 03:01:18,280
So the conclusion I can draw is either it is not raining,
3571
03:01:18,280 --> 03:01:22,840
or it is raining, so therefore, Harry is inside.
3572
03:01:22,840 --> 03:01:28,000
And so this is a way to translate if-then statements into or statements.
3573
03:01:28,000 --> 03:01:31,000
And this is known as implication elimination.
3574
03:01:31,000 --> 03:01:33,360
And this is similar to what we actually did in the beginning
3575
03:01:33,360 --> 03:01:35,840
when we were first looking at those very first sentences
3576
03:01:35,840 --> 03:01:37,960
about Harry and Hagrid and Dumbledore.
3577
03:01:37,960 --> 03:01:39,800
And phrased a little bit more formally, this
3578
03:01:39,800 --> 03:01:43,560
says that if I have the implication, alpha implies beta,
3579
03:01:43,560 --> 03:01:49,120
that I can draw the conclusion that either not alpha or beta,
3580
03:01:49,120 --> 03:01:50,760
because there are only two possibilities.
3581
03:01:50,760 --> 03:01:54,040
Either alpha is true or alpha is not true.
3582
03:01:54,040 --> 03:01:57,320
So one of those possibilities is alpha is not true.
3583
03:01:57,320 --> 03:02:00,320
But if alpha is true, well, then we can draw the conclusion
3584
03:02:00,320 --> 03:02:01,560
that beta must be true.
3585
03:02:01,560 --> 03:02:07,920
So either alpha is not true or alpha is true, in which case beta is also true.
3586
03:02:07,920 --> 03:02:12,440
So this is one way to turn an implication into just a statement about or.
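We can verify implication elimination over all four truth assignments, treating "alpha implies beta" as false only in the one case where alpha is true and beta is false:

```python
from itertools import product

for alpha, beta in product([True, False], repeat=2):
    # "alpha implies beta" is false only when alpha holds but beta does not
    implication = not (alpha and not beta)
    assert implication == ((not alpha) or beta)
```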
3587
03:02:12,440 --> 03:02:14,560
In addition to eliminating implications,
3588
03:02:14,560 --> 03:02:17,640
we can also eliminate biconditionals as well.
3589
03:02:17,640 --> 03:02:19,960
So let's take an English example, something like,
3590
03:02:19,960 --> 03:02:23,600
it is raining if and only if Harry is inside.
3591
03:02:23,600 --> 03:02:26,960
And this if and only if really sounds like that biconditional,
3592
03:02:26,960 --> 03:02:31,200
that double arrow sign that we saw in propositional logic not too long ago.
3593
03:02:31,200 --> 03:02:33,800
And what does this actually mean if we were to translate this?
3594
03:02:33,800 --> 03:02:37,760
Well, this means that if it is raining, then Harry is inside.
3595
03:02:37,760 --> 03:02:40,200
And if Harry is inside, then it is raining,
3596
03:02:40,200 --> 03:02:43,040
that this implication goes both ways.
3597
03:02:43,040 --> 03:02:45,960
And this is what we would call biconditional elimination,
3598
03:02:45,960 --> 03:02:50,040
that I can take a biconditional, a if and only if b,
3599
03:02:50,040 --> 03:02:56,360
and translate that into something like this, a implies b, and b implies a.
3600
03:02:56,360 --> 03:03:00,400
So many of these inference rules are taking logic that uses certain symbols
3601
03:03:00,400 --> 03:03:03,960
and turning them into different symbols, taking an implication
3602
03:03:03,960 --> 03:03:06,680
and turning it into an or, or taking a biconditional
3603
03:03:06,680 --> 03:03:08,640
and turning it into implication.
3604
03:03:08,640 --> 03:03:11,640
And another example of it would be something like this.
3605
03:03:11,640 --> 03:03:16,200
It is not true that both Harry and Ron passed the test.
3606
03:03:16,200 --> 03:03:17,880
Well, all right, how do we translate that?
3607
03:03:17,880 --> 03:03:18,920
What does that mean?
3608
03:03:18,920 --> 03:03:22,920
Well, if it is not true that both of them passed the test, well,
3609
03:03:22,920 --> 03:03:25,080
then the reasonable conclusion we might draw
3610
03:03:25,080 --> 03:03:28,040
is that at least one of them didn't pass the test.
3611
03:03:28,040 --> 03:03:31,280
So the conclusion is either Harry did not pass the test
3612
03:03:31,280 --> 03:03:33,640
or Ron did not pass the test, or both.
3613
03:03:33,640 --> 03:03:35,240
This is not an exclusive or.
3614
03:03:35,240 --> 03:03:40,480
But if it is true that it is not true that both Harry and Ron passed the test,
3615
03:03:40,480 --> 03:03:45,240
well, then either Harry didn't pass the test or Ron didn't pass the test.
3616
03:03:45,240 --> 03:03:48,000
And this type of law is one of De Morgan's laws.
3617
03:03:48,000 --> 03:03:52,160
Quite famous in logic where the idea is that we can turn an and into an or.
3618
03:03:52,160 --> 03:03:56,360
We can take this and, that both Harry and Ron passed the test,
3619
03:03:56,360 --> 03:03:59,920
and turn it into an or by moving the nots around.
3620
03:03:59,920 --> 03:04:03,360
So if it is not true that Harry and Ron passed the test,
3621
03:04:03,360 --> 03:04:05,800
well, then either Harry did not pass the test
3622
03:04:05,800 --> 03:04:08,880
or Ron did not pass the test either.
3623
03:04:08,880 --> 03:04:12,280
And the way we frame that more formally using logic is to say this.
3624
03:04:12,280 --> 03:04:20,400
If it is not true that alpha and beta, well, then either not alpha or not beta.
3625
03:04:20,400 --> 03:04:22,320
The way I like to think about this is that if you
3626
03:04:22,320 --> 03:04:25,240
have a negation in front of an and expression,
3627
03:04:25,240 --> 03:04:27,880
you move the negation inwards, so to speak,
3628
03:04:27,880 --> 03:04:31,920
moving the negation into each of these individual sentences
3629
03:04:31,920 --> 03:04:34,720
and then flip the and into an or.
3630
03:04:34,720 --> 03:04:37,800
So the negation moves inwards and the and flips into an or.
3631
03:04:37,800 --> 03:04:43,600
So I go from not (a and b) to not a or not b.
3632
03:04:43,600 --> 03:04:45,640
And there's actually a reverse of De Morgan's law
3633
03:04:45,640 --> 03:04:48,320
that goes in the other direction for something like this.
3634
03:04:48,320 --> 03:04:52,240
If I say it is not true that Harry or Ron passed the test,
3635
03:04:52,240 --> 03:04:56,160
meaning neither of them passed the test, well, then the conclusion I can draw
3636
03:04:56,160 --> 03:05:01,040
is that Harry did not pass the test and Ron did not pass the test.
3637
03:05:01,040 --> 03:05:04,160
So in this case, instead of turning an and into an or,
3638
03:05:04,160 --> 03:05:06,560
we're turning an or into an and.
3639
03:05:06,560 --> 03:05:07,760
But the idea is the same.
3640
03:05:07,760 --> 03:05:10,880
And this, again, is another example of De Morgan's laws.
3641
03:05:10,880 --> 03:05:15,720
And the way that works is that if I have not (a or b) this time,
3642
03:05:15,720 --> 03:05:17,080
the same logic is going to apply.
3643
03:05:17,080 --> 03:05:19,240
I'm going to move the negation inwards.
3644
03:05:19,240 --> 03:05:22,640
And I'm going to flip this time, flip the or into an and.
3645
03:05:22,640 --> 03:05:28,520
So if not (alpha or beta), meaning it is not true that alpha or beta,
3646
03:05:28,520 --> 03:05:34,200
then I can say not alpha and not beta, moving the negation inwards
3647
03:05:34,200 --> 03:05:36,120
in order to make that conclusion.
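Both directions of De Morgan's laws can be checked mechanically by enumerating every truth assignment. This is a small sketch in Python, not part of the course's own code, with an illustrative helper named `equivalent`:

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Return True if formulas f and g agree on every truth assignment."""
    return all(f(*vals) == g(*vals)
               for vals in product([True, False], repeat=n_vars))

# not (a and b)  is equivalent to  (not a) or (not b)
law1 = equivalent(lambda a, b: not (a and b),
                  lambda a, b: (not a) or (not b), 2)

# not (a or b)  is equivalent to  (not a) and (not b)
law2 = equivalent(lambda a, b: not (a or b),
                  lambda a, b: (not a) and (not b), 2)

print(law1, law2)  # True True
```

Because there are only two symbols, checking all four assignments is enough to establish each equivalence.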
3648
03:05:36,120 --> 03:05:38,840
So those are De Morgan's laws and a couple other inference rules
3649
03:05:38,840 --> 03:05:40,680
that are worth just taking a look at.
3650
03:05:40,680 --> 03:05:43,360
One is the distributive law that works this way.
3651
03:05:43,360 --> 03:05:49,600
So if I have alpha and (beta or gamma), well, then much in the same way
3652
03:05:49,600 --> 03:05:52,640
that in math you can use distributive laws to distribute
3653
03:05:52,640 --> 03:05:55,440
operations like addition and multiplication,
3654
03:05:55,440 --> 03:06:01,120
I can do a similar thing here, where I can say if alpha and (beta or gamma),
3655
03:06:01,120 --> 03:06:06,600
then I can say something like (alpha and beta) or (alpha and gamma),
3656
03:06:06,600 --> 03:06:11,200
that I've been able to distribute this and sign throughout this expression.
3657
03:06:11,200 --> 03:06:13,200
So this is an example of the distributive property
3658
03:06:13,200 --> 03:06:16,960
or the distributive law as applied to logic in much the same way
3659
03:06:16,960 --> 03:06:19,800
that you would distribute a multiplication over the addition
3660
03:06:19,800 --> 03:06:22,160
of something, for example.
3661
03:06:22,160 --> 03:06:23,760
This works the other way too.
3662
03:06:23,760 --> 03:06:27,960
So if, for example, I have alpha or (beta and gamma),
3663
03:06:27,960 --> 03:06:30,280
I can distribute the or throughout the expression.
3664
03:06:30,280 --> 03:06:34,440
I can say (alpha or beta) and (alpha or gamma).
3665
03:06:34,440 --> 03:06:36,520
So the distributive law works in that way too.
3666
03:06:36,520 --> 03:06:40,320
And it's helpful if I want to take an or and move it into the expression.
3667
03:06:40,320 --> 03:06:43,160
And we'll see an example soon of why it is that we might actually
3668
03:06:43,160 --> 03:06:46,400
care to do something like that.
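Both forms of the distributive law can likewise be verified by brute force over all eight assignments of the three symbols. A quick sketch, not from the course itself:

```python
from itertools import product

# and distributes over or: a and (b or c) == (a and b) or (a and c)
ok_and_over_or = all(
    (a and (b or c)) == ((a and b) or (a and c))
    for a, b, c in product([True, False], repeat=3)
)

# or distributes over and: a or (b and c) == (a or b) and (a or c)
ok_or_over_and = all(
    (a or (b and c)) == ((a or b) and (a or c))
    for a, b, c in product([True, False], repeat=3)
)

print(ok_and_over_or, ok_or_over_and)  # True True
```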
3669
03:06:46,400 --> 03:06:49,640
All right, so now we've seen a lot of different inference rules.
3670
03:06:49,640 --> 03:06:53,640
And the question now is, how can we use those inference rules to actually try
3671
03:06:53,640 --> 03:06:57,120
and draw some conclusions, to actually try and prove something about entailment,
3672
03:06:57,120 --> 03:06:59,400
proving that given some initial knowledge base,
3673
03:06:59,400 --> 03:07:04,480
we would like to find some way to prove that a query is true?
3674
03:07:04,480 --> 03:07:06,520
Well, one way to think about it is actually
3675
03:07:06,520 --> 03:07:08,600
to think back to what we talked about last time
3676
03:07:08,600 --> 03:07:10,480
when we talked about search problems.
3677
03:07:10,480 --> 03:07:13,400
Recall again that search problems have some sort of initial state.
3678
03:07:13,400 --> 03:07:16,200
They have actions that you can take from one state to another
3679
03:07:16,200 --> 03:07:18,360
as defined by a transition model that tells you
3680
03:07:18,360 --> 03:07:20,240
how to get from one state to another.
3681
03:07:20,240 --> 03:07:22,800
We talked about testing to see if you were at a goal.
3682
03:07:22,800 --> 03:07:26,280
And then some path cost function to see how many steps
3683
03:07:26,280 --> 03:07:31,040
did you have to take or how costly was the solution that you found.
3684
03:07:31,040 --> 03:07:33,080
Now that we have these inference rules that
3685
03:07:33,080 --> 03:07:36,720
take some set of sentences in propositional logic
3686
03:07:36,720 --> 03:07:40,400
and get us some new set of sentences in propositional logic,
3687
03:07:40,400 --> 03:07:44,760
we can actually treat those sentences or those sets of sentences
3688
03:07:44,760 --> 03:07:47,320
as states inside of a search problem.
3689
03:07:47,320 --> 03:07:49,760
So if we want to prove that some query is true,
3690
03:07:49,760 --> 03:07:52,160
prove that some logical theorem is true,
3691
03:07:52,160 --> 03:07:55,860
we can treat theorem proving as a form of a search problem.
3692
03:07:55,860 --> 03:07:59,240
I can say that we begin in some initial state, where
3693
03:07:59,240 --> 03:08:02,040
that initial state is the knowledge base that I begin with,
3694
03:08:02,040 --> 03:08:05,600
the set of all of the sentences that I know to be true.
3695
03:08:05,600 --> 03:08:07,280
What actions are available to me?
3696
03:08:07,280 --> 03:08:09,520
Well, the actions are any of the inference rules
3697
03:08:09,520 --> 03:08:12,080
that I can apply at any given time.
3698
03:08:12,080 --> 03:08:16,440
The transition model just tells me after I apply the inference rule,
3699
03:08:16,440 --> 03:08:18,360
here is the new set of all of the knowledge
3700
03:08:18,360 --> 03:08:20,560
that I have, which will be the old set of knowledge,
3701
03:08:20,560 --> 03:08:23,540
plus some additional inference that I've been able to draw,
3702
03:08:23,540 --> 03:08:26,600
much in the same way as we saw when we applied those inference
3703
03:08:26,600 --> 03:08:28,720
rules and got some sort of conclusion.
3704
03:08:28,720 --> 03:08:31,600
That conclusion gets added to our knowledge base,
3705
03:08:31,600 --> 03:08:34,240
and our transition model will encode that.
3706
03:08:34,240 --> 03:08:35,440
What is the goal test?
3707
03:08:35,440 --> 03:08:38,160
Well, our goal test is checking to see if we
3708
03:08:38,160 --> 03:08:40,480
have proved the statement we're trying to prove,
3709
03:08:40,480 --> 03:08:44,880
if the thing we're trying to prove is inside of our knowledge base.
3710
03:08:44,880 --> 03:08:47,920
And the path cost function, the thing we're trying to minimize,
3711
03:08:47,920 --> 03:08:50,960
is maybe the number of inference rules that we needed to use,
3712
03:08:50,960 --> 03:08:54,840
the number of steps, so to speak, inside of our proof.
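That framing of theorem proving as a search problem can be sketched in a few lines. This is a minimal illustration with made-up names, not CS50's actual code: states are frozensets of sentences, actions are inference-rule applications, and the goal test asks whether the query has been derived.

```python
from collections import deque

def prove(knowledge_base, query, inference_rules):
    """Breadth-first search from the initial knowledge base to the query."""
    start = frozenset(knowledge_base)              # initial state
    frontier = deque([start])
    explored = {start}
    while frontier:
        state = frontier.popleft()
        if query in state:                         # goal test
            return True
        for rule in inference_rules:               # actions
            for conclusion in rule(state):
                successor = state | {conclusion}   # transition model
                if successor not in explored:
                    explored.add(successor)
                    frontier.append(successor)
    return False

# Toy inference rule: modus ponens over implications encoded as tuples.
def modus_ponens(state):
    return [q for s in state
            if isinstance(s, tuple) and s[0] == "implies"
            for (_, p, q) in [s] if p in state]

print(prove({"P", ("implies", "P", "Q")}, "Q", [modus_ponens]))  # True
```

Breadth-first search here also minimizes the path cost mentioned above: the number of inference steps in the proof.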
3713
03:08:54,840 --> 03:08:57,760
And so here we've been able to apply the same types of ideas
3714
03:08:57,760 --> 03:08:59,840
that we saw last time with search problems
3715
03:08:59,840 --> 03:09:02,560
to something like trying to prove something about knowledge
3716
03:09:02,560 --> 03:09:05,640
by taking our knowledge and framing it in terms
3717
03:09:05,640 --> 03:09:08,560
that we can understand as a search problem with an initial state,
3718
03:09:08,560 --> 03:09:10,920
with actions, with a transition model.
3719
03:09:10,920 --> 03:09:14,680
So this shows a couple of things, one being how versatile search problems
3720
03:09:14,680 --> 03:09:16,960
are: the same types of algorithms
3721
03:09:16,960 --> 03:09:19,280
that we use to solve a maze or figure out
3722
03:09:19,280 --> 03:09:22,360
how to get from point A to point B inside of driving directions,
3723
03:09:22,360 --> 03:09:25,480
for example, can also be used as a theorem proving
3724
03:09:25,480 --> 03:09:28,320
method of taking some sort of starting knowledge base
3725
03:09:28,320 --> 03:09:31,920
and trying to prove something about that knowledge.
3726
03:09:31,920 --> 03:09:35,120
So this, yet again, is a second way, in addition to model checking,
3727
03:09:35,120 --> 03:09:38,720
to try and prove that certain statements are true.
3728
03:09:38,720 --> 03:09:42,160
But it turns out there's yet another way that we can try and apply inference.
3729
03:09:42,160 --> 03:09:45,120
And we'll talk about this now, which is not the only way, but certainly one
3730
03:09:45,120 --> 03:09:48,560
of the most common, which is known as resolution.
3731
03:09:48,560 --> 03:09:51,880
And resolution is based on another inference rule
3732
03:09:51,880 --> 03:09:54,700
that we'll take a look at now, quite a powerful inference rule that
3733
03:09:54,700 --> 03:09:58,800
will let us prove anything that can be proven about a knowledge base.
3734
03:09:58,800 --> 03:10:01,440
And it's based on this basic idea.
3735
03:10:01,440 --> 03:10:05,360
Let's say I know that either Ron is in the Great Hall
3736
03:10:05,360 --> 03:10:08,040
or Hermione is in the library.
3737
03:10:08,040 --> 03:10:12,480
And let's say I also know that Ron is not in the Great Hall.
3738
03:10:12,480 --> 03:10:16,160
Based on those two pieces of information, what can I conclude?
3739
03:10:16,160 --> 03:10:18,640
Well, I could pretty reasonably conclude that Hermione
3740
03:10:18,640 --> 03:10:20,160
must be in the library.
3741
03:10:20,160 --> 03:10:21,160
How do I know that?
3742
03:10:21,160 --> 03:10:24,440
Well, it's because these two statements, these two
3743
03:10:24,440 --> 03:10:28,640
what we'll call complementary literals, literals that complement each other,
3744
03:10:28,640 --> 03:10:32,600
they're opposites of each other, seem to conflict with each other.
3745
03:10:32,600 --> 03:10:35,480
This sentence tells us that either Ron is in the Great Hall
3746
03:10:35,480 --> 03:10:37,680
or Hermione is in the library.
3747
03:10:37,680 --> 03:10:40,120
So if we know that Ron is not in the Great Hall,
3748
03:10:40,120 --> 03:10:45,720
that conflicts with this one, which means Hermione must be in the library.
3749
03:10:45,720 --> 03:10:48,640
And this we can frame as a more general rule
3750
03:10:48,640 --> 03:10:54,320
known as the unit resolution rule, a rule that says that if we have p or q
3751
03:10:54,320 --> 03:11:00,400
and we also know not p, well then from that we can reasonably conclude q.
3752
03:11:00,400 --> 03:11:03,880
That if p or q are true and we know that p is not true,
3753
03:11:03,880 --> 03:11:07,880
the only possibility is for q to then be true.
3754
03:11:07,880 --> 03:11:10,360
And this, it turns out, is quite a powerful inference rule
3755
03:11:10,360 --> 03:11:13,160
in terms of what it can do, in part because we can quickly
3756
03:11:13,160 --> 03:11:14,960
start to generalize this rule.
3757
03:11:14,960 --> 03:11:19,040
This q right here doesn't need to just be a single propositional symbol.
3758
03:11:19,040 --> 03:11:22,400
It could be multiple, all chained together in a single clause,
3759
03:11:22,400 --> 03:11:23,400
as we'll call it.
3760
03:11:23,400 --> 03:11:29,640
So if I had something like p or q1 or q2 or q3, so on and so forth, up until qn,
3761
03:11:29,640 --> 03:11:34,320
so I had n different other variables, and I have not p,
3762
03:11:34,320 --> 03:11:37,400
well then what happens when these two complement each other
3763
03:11:37,400 --> 03:11:40,720
is that these two clauses resolve, so to speak,
3764
03:11:40,720 --> 03:11:46,280
to produce a new clause that is just q1 or q2 all the way up to qn.
3765
03:11:46,280 --> 03:11:49,600
And in an or, the order of the arguments in the or doesn't actually matter.
3766
03:11:49,600 --> 03:11:50,960
The p doesn't need to be the first thing.
3767
03:11:50,960 --> 03:11:52,240
It could have been in the middle.
3768
03:11:52,240 --> 03:11:56,160
But the idea here is that if I have p in one clause and not
3769
03:11:56,160 --> 03:11:59,920
p in the other clause, well then I know that one of these remaining things
3770
03:11:59,920 --> 03:12:00,800
must be true.
3771
03:12:00,800 --> 03:12:04,640
I've resolved them in order to produce a new clause.
3772
03:12:04,640 --> 03:12:08,520
But it turns out we can generalize this idea even further, in fact,
3773
03:12:08,520 --> 03:12:12,640
and display even more power that we can have with this resolution rule.
3774
03:12:12,640 --> 03:12:14,520
So let's take another example.
3775
03:12:14,520 --> 03:12:17,240
Let's say, for instance, that I know the same piece of information
3776
03:12:17,240 --> 03:12:21,400
that either Ron is in the Great Hall or Hermione is in the library.
3777
03:12:21,400 --> 03:12:23,680
And the second piece of information I know
3778
03:12:23,680 --> 03:12:29,360
is that Ron is not in the Great Hall or Harry is sleeping.
3779
03:12:29,360 --> 03:12:31,520
So it's not just a single piece of information.
3780
03:12:31,520 --> 03:12:33,800
I have two different clauses.
3781
03:12:33,800 --> 03:12:37,360
And we'll define clauses more precisely in just a moment.
3782
03:12:37,360 --> 03:12:38,600
What do I know here?
3783
03:12:38,600 --> 03:12:42,360
Well again, for any propositional symbol like Ron is in the Great Hall,
3784
03:12:42,360 --> 03:12:44,320
there are only two possibilities.
3785
03:12:44,320 --> 03:12:48,520
Either Ron is in the Great Hall, in which case, based on resolution,
3786
03:12:48,520 --> 03:12:53,840
we know that Harry must be sleeping, or Ron is not in the Great Hall,
3787
03:12:53,840 --> 03:12:56,160
in which case we know based on the same rule
3788
03:12:56,160 --> 03:12:59,320
that Hermione must be in the library.
3789
03:12:59,320 --> 03:13:01,320
Based on those two things in combination,
3790
03:13:01,320 --> 03:13:03,920
I can say based on these two premises that I
3791
03:13:03,920 --> 03:13:10,400
can conclude that either Hermione is in the library or Harry is sleeping.
3792
03:13:10,400 --> 03:13:13,200
So again, because these two conflict with each other,
3793
03:13:13,200 --> 03:13:15,600
I know that one of these two must be true.
3794
03:13:15,600 --> 03:13:18,560
And you can take a closer look and try and reason through that logic.
3795
03:13:18,560 --> 03:13:22,400
Make sure you convince yourself that you believe this conclusion.
3796
03:13:22,400 --> 03:13:25,320
Stated more generally, we can name this resolution rule
3797
03:13:25,320 --> 03:13:28,680
by saying that if we know p or q is true,
3798
03:13:28,680 --> 03:13:33,040
and we also know that not p or r is true,
3799
03:13:33,040 --> 03:13:37,760
we resolve these two clauses together to get a new clause, q or r,
3800
03:13:37,760 --> 03:13:41,320
that either q or r must be true.
3801
03:13:41,320 --> 03:13:43,920
And again, much as in the last case, q and r
3802
03:13:43,920 --> 03:13:46,720
don't need to just be single propositional symbols.
3803
03:13:46,720 --> 03:13:48,160
It could be multiple symbols.
3804
03:13:48,160 --> 03:13:52,720
So if I had a rule that had p or q1 or q2 or q3, so on and so forth,
3805
03:13:52,720 --> 03:13:55,680
up until qn, where n is just some number.
3806
03:13:55,680 --> 03:14:02,440
And likewise, I had not p or r1 or r2, so on and so forth, up until rm,
3807
03:14:02,440 --> 03:14:05,340
where m, again, is just some other number.
3808
03:14:05,340 --> 03:14:09,680
I can resolve these two clauses together to get one of these must be true,
3809
03:14:09,680 --> 03:14:14,680
q1 or q2 up until qn or r1 or r2 up until rm.
3810
03:14:14,680 --> 03:14:19,520
And this is just a generalization of that same rule we saw before.
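The generalized resolution rule can be sketched by treating each clause as a set of literals, where a literal is written as a string like "P" or its negation "-P". These helper names are illustrative, not from the course:

```python
def negate(literal):
    """Return the complementary literal: P <-> -P."""
    return literal[1:] if literal.startswith("-") else "-" + literal

def resolve(clause_a, clause_b):
    """Yield every clause obtainable by resolving on a complementary pair."""
    for lit in clause_a:
        if negate(lit) in clause_b:
            # Drop the complementary pair, keep everything else.
            yield (clause_a - {lit}) | (clause_b - {negate(lit)})

# (P or Q) resolved with (-P or R) produces (Q or R).
result = next(resolve({"P", "Q"}, {"-P", "R"}))
print(sorted(result))  # ['Q', 'R']
```

Because the clauses are sets, the order of the literals inside each clause does not matter, just as the order of an or's arguments does not matter.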
3811
03:14:19,520 --> 03:14:23,160
Each of these things here are what we're going to call a clause,
3812
03:14:23,160 --> 03:14:27,760
where a clause is formally defined as a disjunction of literals,
3813
03:14:27,760 --> 03:14:31,720
where a disjunction means it's a bunch of things that are connected with or.
3814
03:14:31,720 --> 03:14:34,120
Disjunction means things connected with or.
3815
03:14:34,120 --> 03:14:37,400
Conjunction, meanwhile, is things connected with and.
3816
03:14:37,400 --> 03:14:40,360
And a literal is either a propositional symbol
3817
03:14:40,360 --> 03:14:42,320
or the opposite of a propositional symbol.
3818
03:14:42,320 --> 03:14:46,160
So it's something like p or q or not p or not q.
3819
03:14:46,160 --> 03:14:50,360
Those are all propositional symbols or not of the propositional symbols.
3820
03:14:50,360 --> 03:14:52,920
And we call those literals.
3821
03:14:52,920 --> 03:14:57,920
And so a clause is just something like this, p or q or r, for example.
3822
03:14:57,920 --> 03:15:00,440
Meanwhile, what this gives us an ability to do
3823
03:15:00,440 --> 03:15:04,520
is it gives us an ability to turn logic, any logical sentence,
3824
03:15:04,520 --> 03:15:07,960
into something called conjunctive normal form.
3825
03:15:07,960 --> 03:15:11,480
A conjunctive normal form sentence is a logical sentence
3826
03:15:11,480 --> 03:15:14,240
that is a conjunction of clauses.
3827
03:15:14,240 --> 03:15:18,760
Recall, again, conjunction means things are connected to one another using and.
3828
03:15:18,760 --> 03:15:23,840
And so a conjunction of clauses means it is an and of individual clauses,
3829
03:15:23,840 --> 03:15:25,440
each of which has ors in it.
3830
03:15:25,440 --> 03:15:32,240
So something like this, a or b or c, and d or not e, and f or g.
3831
03:15:32,240 --> 03:15:35,440
Everything in parentheses is one clause.
3832
03:15:35,440 --> 03:15:38,960
All of the clauses are connected to each other using an and.
3833
03:15:38,960 --> 03:15:43,080
And everything in the clause is separated using an or.
3834
03:15:43,080 --> 03:15:46,680
And this is just a standard form that we can translate a logical sentence
3835
03:15:46,680 --> 03:15:50,440
into that just makes it easy to work with and easy to manipulate.
3836
03:15:50,440 --> 03:15:53,360
And it turns out that we can take any sentence in logic
3837
03:15:53,360 --> 03:15:56,400
and turn it into conjunctive normal form just
3838
03:15:56,400 --> 03:15:59,960
by applying some inference rules and transformations to it.
3839
03:15:59,960 --> 03:16:03,080
So we'll take a look at how we can actually do that.
3840
03:16:03,080 --> 03:16:06,000
So what is the process for taking a logical formula
3841
03:16:06,000 --> 03:16:10,480
and converting it into conjunctive normal form, otherwise known as CNF?
3842
03:16:10,480 --> 03:16:12,520
Well, the process looks a little something like this.
3843
03:16:12,520 --> 03:16:14,840
We need to take all of the symbols that are not
3844
03:16:14,840 --> 03:16:16,200
part of conjunctive normal form.
3845
03:16:16,200 --> 03:16:18,920
The bi-conditionals and the implications and so forth,
3846
03:16:18,920 --> 03:16:23,320
and turn them into something that is more closely like conjunctive normal
3847
03:16:23,320 --> 03:16:24,160
form.
3848
03:16:24,160 --> 03:16:26,760
So the first step will be to eliminate bi-conditionals,
3849
03:16:26,760 --> 03:16:29,160
those if and only if double arrows.
3850
03:16:29,160 --> 03:16:31,120
And we know how to eliminate bi-conditionals
3851
03:16:31,120 --> 03:16:34,200
because we saw there was an inference rule to do just that.
3852
03:16:34,200 --> 03:16:38,400
Any time I have an expression like alpha if and only if beta,
3853
03:16:38,400 --> 03:16:43,400
I can turn that into alpha implies beta and beta implies alpha
3854
03:16:43,400 --> 03:16:46,480
based on that inference rule we saw before.
3855
03:16:46,480 --> 03:16:48,880
Likewise, in addition to eliminating bi-conditionals,
3856
03:16:48,880 --> 03:16:52,680
I can eliminate implications as well, the if then arrows.
3857
03:16:52,680 --> 03:16:56,120
And I can do that using the same inference rule we saw before too,
3858
03:16:56,120 --> 03:17:01,480
taking alpha implies beta and turning that into not alpha or beta
3859
03:17:01,480 --> 03:17:06,440
because that is logically equivalent to this first thing here.
3860
03:17:06,440 --> 03:17:08,760
Then we can move nots inwards because we don't
3861
03:17:08,760 --> 03:17:10,800
want nots on the outsides of our expressions.
3862
03:17:10,800 --> 03:17:14,280
Conjunctive normal form requires that it's just clause and clause
3863
03:17:14,280 --> 03:17:15,800
and clause and clause.
3864
03:17:15,800 --> 03:17:19,560
Any nots need to be immediately next to propositional symbols.
3865
03:17:19,560 --> 03:17:22,520
But we can move those nots around using De Morgan's laws
3866
03:17:22,520 --> 03:17:29,000
by taking something like not (A and B) and turning it into not A or not B,
3867
03:17:29,000 --> 03:17:31,800
for example, using De Morgan's laws to manipulate that.
3868
03:17:31,800 --> 03:17:34,600
And after that, all we'll be left with are ands and ors.
3869
03:17:34,600 --> 03:17:35,920
And those are easy to deal with.
3870
03:17:35,920 --> 03:17:39,160
We can use the distributive law to distribute the ors
3871
03:17:39,160 --> 03:17:42,760
so that the ors end up on the inside of the expression, so to speak,
3872
03:17:42,760 --> 03:17:45,320
and the ands end up on the outside.
3873
03:17:45,320 --> 03:17:47,900
So this is the general pattern for how we'll take a formula
3874
03:17:47,900 --> 03:17:50,160
and convert it into conjunctive normal form.
3875
03:17:50,160 --> 03:17:53,400
And let's now take a look at an example of how we would do this
3876
03:17:53,400 --> 03:17:57,520
and explore then why it is that we would want to do something like this.
3877
03:17:57,520 --> 03:17:58,600
Here's how we can do it.
3878
03:17:58,600 --> 03:18:00,600
Let's take this formula, for example.
3879
03:18:00,600 --> 03:18:06,160
(P or Q) implies R. And I'd like to convert this into conjunctive normal form,
3880
03:18:06,160 --> 03:18:10,800
where it's all ands of clauses, and every clause is a disjunctive clause.
3881
03:18:10,800 --> 03:18:12,400
It's ors together.
3882
03:18:12,400 --> 03:18:14,120
So what's the first thing I need to do?
3883
03:18:14,120 --> 03:18:15,840
Well, this is an implication.
3884
03:18:15,840 --> 03:18:18,160
So let me go ahead and remove that implication.
3885
03:18:18,160 --> 03:18:25,220
Using the implication inference rule, I can turn (P or Q) implies R
3886
03:18:25,220 --> 03:18:29,880
into not (P or Q) or R. So that's the first step.
3887
03:18:29,880 --> 03:18:32,100
I've gotten rid of the implication.
3888
03:18:32,100 --> 03:18:36,080
And next, I can get rid of the not on the outside of this expression, too.
3889
03:18:36,080 --> 03:18:41,560
I can move the nots inwards so they're closer to the literals themselves
3890
03:18:41,560 --> 03:18:43,080
by using De Morgan's laws.
3891
03:18:43,080 --> 03:18:50,480
And De Morgan's law says that not (P or Q) is equivalent to not P and not Q.
3892
03:18:50,480 --> 03:18:52,920
Again, here, just applying the inference rules
3893
03:18:52,920 --> 03:18:57,120
that we've already seen in order to translate these statements.
3894
03:18:57,120 --> 03:19:00,920
And now, I have two things that are separated by an or,
3895
03:19:00,920 --> 03:19:03,080
where this thing on the inside is an and.
3896
03:19:03,080 --> 03:19:06,560
What I'd really like is to move the ors so the ors are on the inside,
3897
03:19:06,560 --> 03:19:10,040
because conjunctive normal form means I need clause and clause
3898
03:19:10,040 --> 03:19:11,680
and clause and clause.
3899
03:19:11,680 --> 03:19:14,260
And so to do that, I can use the distributive law.
3900
03:19:14,260 --> 03:19:21,080
If I have (not P and not Q) or R, I can distribute the or R to both of these
3901
03:19:21,080 --> 03:19:26,800
to get not P or R and not Q or R using the distributive law.
3902
03:19:26,800 --> 03:19:30,520
And this now here at the bottom is in conjunctive normal form.
3903
03:19:30,520 --> 03:19:35,840
It is a conjunction, an and, of clauses that are disjunctions,
3904
03:19:35,840 --> 03:19:38,200
with literals just separated by ors.
3905
03:19:38,200 --> 03:19:42,120
So this process can be used on any formula to take a logical sentence
3906
03:19:42,120 --> 03:19:44,920
and turn it into this conjunctive normal form, where
3907
03:19:44,920 --> 03:19:49,800
I have clause and clause and clause and clause and clause and so on.
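Each step of that conversion is an equivalence, so the final CNF must agree with the original formula on every truth assignment. A brute-force check of the worked example, written as a sketch rather than course code:

```python
from itertools import product

def implies(x, y):
    """Material implication: x implies y is (not x) or y."""
    return (not x) or y

# Original formula: (P or Q) implies R
original = lambda p, q, r: implies(p or q, r)

# Its conjunctive normal form: (not P or R) and (not Q or R)
cnf = lambda p, q, r: ((not p) or r) and ((not q) or r)

same = all(original(p, q, r) == cnf(p, q, r)
           for p, q, r in product([True, False], repeat=3))
print(same)  # True
```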
3908
03:19:49,800 --> 03:19:50,800
So why is this helpful?
3909
03:19:50,800 --> 03:19:52,960
Why do we even care about taking all these sentences
3910
03:19:52,960 --> 03:19:54,640
and converting them into this form?
3911
03:19:54,640 --> 03:19:58,560
It's because once they're in this form where we have these clauses,
3912
03:19:58,560 --> 03:20:02,360
these clauses are the inputs to the resolution inference rule
3913
03:20:02,360 --> 03:20:05,640
that we saw a moment ago, that if I have two clauses where there's
3914
03:20:05,640 --> 03:20:08,040
something that conflicts or something complementary
3915
03:20:08,040 --> 03:20:10,680
between those two clauses, I can resolve them
3916
03:20:10,680 --> 03:20:13,160
to get a new clause, to draw a new conclusion.
3917
03:20:13,160 --> 03:20:16,220
And we call this process inference by resolution,
3918
03:20:16,220 --> 03:20:19,640
using the resolution rule to draw some sort of inference.
3919
03:20:19,640 --> 03:20:23,720
And it's based on the same idea, that if I have P or Q, this clause,
3920
03:20:23,720 --> 03:20:28,380
and I have not P or R, that I can resolve these two clauses together
3921
03:20:28,380 --> 03:20:32,960
to get Q or R as the resulting clause, a new piece of information
3922
03:20:32,960 --> 03:20:35,000
that I didn't have before.
3923
03:20:35,000 --> 03:20:37,500
Now, a couple of key points that are worth noting about this
3924
03:20:37,500 --> 03:20:39,720
before we talk about the actual algorithm.
3925
03:20:39,720 --> 03:20:43,560
One thing is that, let's imagine we have P or Q or S,
3926
03:20:43,560 --> 03:20:48,200
and I also have not P or R or S. The resolution rule
3927
03:20:48,200 --> 03:20:51,680
says that because this P conflicts with this not P,
3928
03:20:51,680 --> 03:20:57,000
we would resolve to put everything else together to get Q or S or R or S.
3929
03:20:57,000 --> 03:21:01,480
But it turns out that this double S is redundant, or S here and or S there.
3930
03:21:01,480 --> 03:21:03,680
It doesn't change the meaning of the sentence.
3931
03:21:03,680 --> 03:21:06,240
So in resolution, when we do this resolution process,
3932
03:21:06,240 --> 03:21:08,880
we'll usually also do a process known as factoring,
3933
03:21:08,880 --> 03:21:11,360
where we take any duplicate variables that show up
3934
03:21:11,360 --> 03:21:12,480
and just eliminate them.
3935
03:21:12,480 --> 03:21:18,880
So Q or S or R or S just becomes Q or R or S. The S only needs to appear once,
3936
03:21:18,880 --> 03:21:22,000
no need to include it multiple times.
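Factoring falls out for free if clauses are represented as Python sets, since a set cannot hold the same literal twice. A one-line illustration, not from the course's own code:

```python
# Literals left over after resolving (P or Q or S) with (-P or R or S):
raw_resolvent = ["Q", "S", "R", "S"]   # S appears twice

# Representing the clause as a set performs the factoring automatically.
factored = set(raw_resolvent)
print(sorted(factored))  # ['Q', 'R', 'S']
```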
3937
03:21:22,000 --> 03:21:24,120
Now, one final question worth considering
3938
03:21:24,120 --> 03:21:28,960
is what happens if I try to resolve P and not P together?
3939
03:21:28,960 --> 03:21:32,440
If I know that P is true and I know that not P is true,
3940
03:21:32,440 --> 03:21:35,240
well, resolution says I can merge these clauses together
3941
03:21:35,240 --> 03:21:37,160
and look at everything else.
3942
03:21:37,160 --> 03:21:39,320
Well, in this case, there is nothing else,
3943
03:21:39,320 --> 03:21:42,280
so I'm left with what we might call the empty clause.
3944
03:21:42,280 --> 03:21:43,840
I'm left with nothing.
3945
03:21:43,840 --> 03:21:46,920
And the empty clause is always false.
3946
03:21:46,920 --> 03:21:49,920
The empty clause is equivalent to just being false.
3947
03:21:49,920 --> 03:21:55,720
And that's pretty reasonable because it's impossible for both P and not P
3948
03:21:55,720 --> 03:21:57,400
to hold at the same time.
3949
03:21:57,400 --> 03:21:59,800
P is either true or it's not true, which
3950
03:21:59,800 --> 03:22:02,960
means that if P is true, then this must be false.
3951
03:22:02,960 --> 03:22:05,000
And if this is true, then this must be false.
3952
03:22:05,000 --> 03:22:07,880
There is no way for both of these to hold at the same time.
3953
03:22:07,880 --> 03:22:11,320
So if ever I try and resolve these two, it's a contradiction,
3954
03:22:11,320 --> 03:22:14,600
and I'll end up getting this empty clause where the empty clause I
3955
03:22:14,600 --> 03:22:17,440
can call equivalent to false.
3956
03:22:17,440 --> 03:22:21,400
And this idea that if I resolve these two contradictory terms,
3957
03:22:21,400 --> 03:22:25,280
I get the empty clause, this is the basis for our inference
3958
03:22:25,280 --> 03:22:26,880
by resolution algorithm.
3959
03:22:26,880 --> 03:22:29,480
Here's how we're going to perform inference by resolution
3960
03:22:29,480 --> 03:22:31,040
at a very high level.
3961
03:22:31,040 --> 03:22:35,760
We want to prove that our knowledge base entails some query alpha,
3962
03:22:35,760 --> 03:22:39,040
that based on the knowledge we have, we can prove conclusively
3963
03:22:39,040 --> 03:22:41,600
that alpha is going to be true.
3964
03:22:41,600 --> 03:22:43,200
How are we going to do that?
3965
03:22:43,200 --> 03:22:45,160
Well, in order to do that, we're going to try
3966
03:22:45,160 --> 03:22:49,440
to prove that if we know the knowledge and not alpha,
3967
03:22:49,440 --> 03:22:51,560
that that would be a contradiction.
3968
03:22:51,560 --> 03:22:53,560
And this is a common technique in computer science
3969
03:22:53,560 --> 03:22:57,440
more generally, this idea of proving something by contradiction.
3970
03:22:57,440 --> 03:23:00,200
If I want to prove that something is true,
3971
03:23:00,200 --> 03:23:04,000
I can do so by first assuming that it is false
3972
03:23:04,000 --> 03:23:06,160
and showing that it would be contradictory,
3973
03:23:06,160 --> 03:23:08,360
showing that it leads to some contradiction.
3974
03:23:08,360 --> 03:23:11,800
And if the thing I'm trying to prove, when I assume it's false,
3975
03:23:11,800 --> 03:23:14,760
leads to a contradiction, then it must be true.
3976
03:23:14,760 --> 03:23:18,560
And that's the logical approach or the idea behind a proof by contradiction.
3977
03:23:18,560 --> 03:23:20,160
And that's what we're going to do here.
3978
03:23:20,160 --> 03:23:23,400
We want to prove that this query alpha is true.
3979
03:23:23,400 --> 03:23:26,040
So we're going to assume that it's not true.
3980
03:23:26,040 --> 03:23:28,120
We're going to assume not alpha.
3981
03:23:28,120 --> 03:23:30,680
And we're going to try and prove that it's a contradiction.
3982
03:23:30,680 --> 03:23:32,960
If we do get a contradiction, well, then we
3983
03:23:32,960 --> 03:23:36,440
know that our knowledge entails the query alpha.
3984
03:23:36,440 --> 03:23:39,040
If we don't get a contradiction, there is no entailment.
3985
03:23:39,040 --> 03:23:41,400
This is this idea of a proof by contradiction
3986
03:23:41,400 --> 03:23:44,000
of assuming the opposite of what you're trying to prove.
3987
03:23:44,000 --> 03:23:46,520
And if you can demonstrate that that's a contradiction,
3988
03:23:46,520 --> 03:23:49,840
then what you're proving must be true.
3989
03:23:49,840 --> 03:23:51,760
But more formally, how do we actually do this?
3990
03:23:51,760 --> 03:23:56,160
How do we check that knowledge base and not alpha
3991
03:23:56,160 --> 03:23:58,000
is going to lead to a contradiction?
3992
03:23:58,000 --> 03:24:01,320
Well, here is where resolution comes into play.
3993
03:24:01,320 --> 03:24:05,160
To determine if our knowledge base entails some query alpha,
3994
03:24:05,160 --> 03:24:08,400
we're going to convert knowledge base and not alpha
3995
03:24:08,400 --> 03:24:10,520
to conjunctive normal form, that form where
3996
03:24:10,520 --> 03:24:14,400
we have a whole bunch of clauses that are all anded together.
3997
03:24:14,400 --> 03:24:16,680
And when we have these individual clauses,
3998
03:24:16,680 --> 03:24:21,600
now we can keep checking to see if we can use resolution
3999
03:24:21,600 --> 03:24:23,640
to produce a new clause.
4000
03:24:23,640 --> 03:24:26,720
We can take any pair of clauses and check,
4001
03:24:26,720 --> 03:24:29,920
is there some literal in one of them
4002
03:24:29,920 --> 03:24:32,240
that is complementary to a literal in the other?
4003
03:24:32,240 --> 03:24:35,880
For example, I have a p in one clause and a not p in another clause.
4004
03:24:35,880 --> 03:24:39,480
Or an r in one clause and a not r in another clause.
4005
03:24:39,480 --> 03:24:41,640
If ever I have that situation where once I
4006
03:24:41,640 --> 03:24:44,920
convert to conjunctive normal form and I have a whole bunch of clauses,
4007
03:24:44,920 --> 03:24:49,720
I see two clauses that I can resolve to produce a new clause, then I'll do so.
4008
03:24:49,720 --> 03:24:50,960
This process occurs in a loop.
4009
03:24:50,960 --> 03:24:53,960
I'm going to keep checking to see if I can use resolution
4010
03:24:53,960 --> 03:24:56,760
to produce a new clause and keep using those new clauses
4011
03:24:56,760 --> 03:25:00,520
to try to generate more new clauses after that.
4012
03:25:00,520 --> 03:25:03,000
Now, it just so may happen that eventually we
4013
03:25:03,000 --> 03:25:06,880
may produce the empty clause, the clause we were talking about before.
4014
03:25:06,880 --> 03:25:11,720
If I resolve p and not p together, that produces the empty clause
4015
03:25:11,720 --> 03:25:14,620
and the empty clause we know to be false.
4016
03:25:14,620 --> 03:25:18,280
Because we know that there's no way for both p and not p
4017
03:25:18,280 --> 03:25:21,200
to both simultaneously be true.
4018
03:25:21,200 --> 03:25:25,120
So if ever we produce the empty clause, then we have a contradiction.
4019
03:25:25,120 --> 03:25:27,720
And if we have a contradiction, that's exactly what we were trying
4020
03:25:27,720 --> 03:25:29,720
to do in a proof by contradiction.
4021
03:25:29,720 --> 03:25:32,360
If we have a contradiction, then we know that our knowledge base
4022
03:25:32,360 --> 03:25:34,400
must entail this query alpha.
4023
03:25:34,400 --> 03:25:37,600
And we know that alpha must be true.
4024
03:25:37,600 --> 03:25:39,920
And it turns out, and we won't go into the proof here,
4025
03:25:39,920 --> 03:25:43,760
but you can show that otherwise, if you don't produce the empty clause,
4026
03:25:43,760 --> 03:25:45,400
then there is no entailment.
4027
03:25:45,400 --> 03:25:48,680
If we run into a situation where there are no more new clauses to add,
4028
03:25:48,680 --> 03:25:50,960
we've done all the resolution that we can do,
4029
03:25:50,960 --> 03:25:53,400
and yet we still haven't produced the empty clause,
4030
03:25:53,400 --> 03:25:56,480
then there is no entailment in this case.
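This loop can be sketched in Python. The representation here is one common choice, not anything prescribed by the lecture: a clause is a frozenset of string literals, with "~" marking negation, and the function names `resolve` and `entails` are illustrative.

```python
def negate(literal):
    """Return the complementary literal: A <-> ~A."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolve(c1, c2):
    """Yield every clause obtainable by resolving clause c1 with clause c2."""
    for lit in c1:
        if negate(lit) in c2:
            # Drop the complementary pair and union what remains.
            yield frozenset((c1 - {lit}) | (c2 - {negate(lit)}))

def entails(knowledge, alpha):
    """Check KB entails alpha by refutation: add not-alpha, look for the empty clause."""
    clauses = set(knowledge) | {frozenset({negate(alpha)})}
    while True:
        new = set()
        for c1 in clauses:
            for c2 in clauses:
                for resolvent in resolve(c1, c2):
                    if not resolvent:      # empty clause: contradiction found
                        return True
                    new.add(resolvent)
        if new <= clauses:                 # no new clauses: no entailment
            return False
        clauses |= new
```

On the example coming up next, `entails({frozenset({"A", "B"}), frozenset({"~B", "C"}), frozenset({"~C"})}, "A")` works through exactly the loop described above.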
4031
03:25:56,480 --> 03:25:58,720
And this now is the resolution algorithm.
4032
03:25:58,720 --> 03:26:01,240
And it's very abstract looking, especially this idea of like,
4033
03:26:01,240 --> 03:26:03,560
what does it even mean to have the empty clause?
4034
03:26:03,560 --> 03:26:05,440
So let's take a look at an example, actually
4035
03:26:05,440 --> 03:26:11,320
try and prove some entailment by using this inference by resolution process.
4036
03:26:11,320 --> 03:26:12,680
So here's our question.
4037
03:26:12,680 --> 03:26:14,200
We have this knowledge base.
4038
03:26:14,200 --> 03:26:21,200
Here is the knowledge that we know, A or B, and not B or C, and not C.
4039
03:26:21,200 --> 03:26:25,840
And we want to know if all of this entails A.
4040
03:26:25,840 --> 03:26:28,600
So this is our knowledge base here, this whole long thing.
4041
03:26:28,600 --> 03:26:33,160
And our query alpha is just this propositional symbol, A.
4042
03:26:33,160 --> 03:26:34,240
So what do we do?
4043
03:26:34,240 --> 03:26:36,480
Well, first, we want to prove by contradiction.
4044
03:26:36,480 --> 03:26:39,600
So we want to first assume that A is false,
4045
03:26:39,600 --> 03:26:42,200
and see if that leads to some sort of contradiction.
4046
03:26:42,200 --> 03:26:46,880
So here is what we're going to start with, A or B, and not B or C, and not C.
4047
03:26:46,880 --> 03:26:48,680
This is our knowledge base.
4048
03:26:48,680 --> 03:26:51,280
And we're going to assume not A. We're going
4049
03:26:51,280 --> 03:26:56,760
to assume that the thing we're trying to prove is, in fact, false.
4050
03:26:56,760 --> 03:26:59,520
And so this is now in conjunctive normal form,
4051
03:26:59,520 --> 03:27:01,400
and I have four different clauses.
4052
03:27:01,400 --> 03:27:08,880
I have A or B. I have not B or C. I have not C, and I have not A.
4053
03:27:08,880 --> 03:27:12,800
And now, I can begin to just pick two clauses that I can resolve,
4054
03:27:12,800 --> 03:27:15,880
and apply the resolution rule to them.
4055
03:27:15,880 --> 03:27:20,320
And so looking at these four clauses, I see, all right, these two clauses
4056
03:27:20,320 --> 03:27:21,440
are ones I can resolve.
4057
03:27:21,440 --> 03:27:25,160
I can resolve them because there are complementary literals
4058
03:27:25,160 --> 03:27:26,040
that show up in them.
4059
03:27:26,040 --> 03:27:28,600
There's a C here, and a not C here.
4060
03:27:28,600 --> 03:27:34,240
So just looking at these two clauses, if I know that not B or C is true,
4061
03:27:34,240 --> 03:27:36,960
and I know that C is not true, well, then I
4062
03:27:36,960 --> 03:27:41,280
can resolve these two clauses to say, all right, not B, that must be true.
4063
03:27:41,280 --> 03:27:45,040
I can generate this new clause as a new piece of information
4064
03:27:45,040 --> 03:27:47,800
that I now know to be true.
4065
03:27:47,800 --> 03:27:50,800
And all right, now I can repeat this process, do the process again.
4066
03:27:50,800 --> 03:27:54,160
Can I use resolution again to get some new conclusion?
4067
03:27:54,160 --> 03:27:55,160
Well, it turns out I can.
4068
03:27:55,160 --> 03:27:58,720
I can use that new clause I just generated, along with this one here.
4069
03:27:58,720 --> 03:28:00,600
There are complementary literals.
4070
03:28:00,600 --> 03:28:06,280
This B is complementary to, or conflicts with, this not B over here.
4071
03:28:06,280 --> 03:28:12,320
And so if I know that A or B is true, and I know that B is not true,
4072
03:28:12,320 --> 03:28:15,560
well, then the only remaining possibility is that A must be true.
4073
03:28:15,560 --> 03:28:19,640
So now we have A. That is a new clause that I've been able to generate.
4074
03:28:19,640 --> 03:28:21,240
And now, I can do this one more time.
4075
03:28:21,240 --> 03:28:23,360
I'm looking for two clauses that can be resolved,
4076
03:28:23,360 --> 03:28:25,480
and you might programmatically do this by just looping
4077
03:28:25,480 --> 03:28:28,320
over all possible pairs of clauses and checking
4078
03:28:28,320 --> 03:28:30,240
for complementary literals in each.
4079
03:28:30,240 --> 03:28:34,560
And here, I can say, all right, I found two clauses, not A and A,
4080
03:28:34,560 --> 03:28:36,360
that conflict with each other.
4081
03:28:36,360 --> 03:28:38,600
And when I resolve these two together, well,
4082
03:28:38,600 --> 03:28:42,040
this is the same as when we were resolving P and not P from before.
4083
03:28:42,040 --> 03:28:45,760
When I resolve these two clauses together, I get rid of the As,
4084
03:28:45,760 --> 03:28:48,240
and I'm left with the empty clause.
4085
03:28:48,240 --> 03:28:51,920
And the empty clause we know to be false, which means we have a contradiction,
4086
03:28:51,920 --> 03:28:56,320
which means we can safely say that this whole knowledge base does entail A.
4087
03:28:56,320 --> 03:29:02,080
That if this sentence is true, that we know that A for sure is also true.
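The three resolution steps just described can be traced by hand with Python sets, where "~" marks negation. This is only a walkthrough of the example, with each resolution written out as the set operation it performs:

```python
c1 = {"A", "B"}    # A or B
c2 = {"~B", "C"}   # not B or C
c3 = {"~C"}        # not C
c4 = {"~A"}        # assume not A (the negated query)

# Resolve on the complementary pair C / ~C:
step1 = (c2 - {"C"}) | (c3 - {"~C"})      # leaves {"~B"}
# Resolve on B / ~B:
step2 = (c1 - {"B"}) | (step1 - {"~B"})   # leaves {"A"}
# Resolve on A / ~A:
step3 = (step2 - {"A"}) | (c4 - {"~A"})   # leaves set(), the empty clause

print(step1, step2, step3)
```

Reaching the empty set is the contradiction, so the knowledge base entails A.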
4088
03:29:02,080 --> 03:29:04,720
So this now, using inference by resolution,
4089
03:29:04,720 --> 03:29:07,740
is an entirely different way to take some statement
4090
03:29:07,740 --> 03:29:10,240
and try and prove that it is, in fact, true.
4091
03:29:10,240 --> 03:29:12,560
Instead of enumerating all of the possible worlds
4092
03:29:12,560 --> 03:29:15,840
that we might be in in order to try to figure out in which cases
4093
03:29:15,840 --> 03:29:18,760
is the knowledge base true and in which cases is our query true,
4094
03:29:18,760 --> 03:29:22,000
instead we use this resolution algorithm to say,
4095
03:29:22,000 --> 03:29:25,080
let's keep trying to figure out what conclusions we can draw
4096
03:29:25,080 --> 03:29:27,240
and see if we reach a contradiction.
4097
03:29:27,240 --> 03:29:28,920
And if we reach a contradiction, then that
4098
03:29:28,920 --> 03:29:31,840
tells us something about whether our knowledge actually
4099
03:29:31,840 --> 03:29:33,540
entails the query or not.
4100
03:29:33,540 --> 03:29:35,840
And it turns out there are many different algorithms that
4101
03:29:35,840 --> 03:29:37,520
can be used for inference.
4102
03:29:37,520 --> 03:29:39,840
What we've just looked at here are just a couple of them.
4103
03:29:39,840 --> 03:29:44,080
And in fact, all of this is just based on one particular type of logic.
4104
03:29:44,080 --> 03:29:47,900
It's based on propositional logic, where we have these individual symbols
4105
03:29:47,900 --> 03:29:52,640
and we connect them using and, or, not, implies, and biconditionals.
4106
03:29:52,640 --> 03:29:56,760
But propositional logic is not the only kind of logic that exists.
4107
03:29:56,760 --> 03:29:58,880
And in fact, we see that there are limitations
4108
03:29:58,880 --> 03:30:01,680
that exist in propositional logic, especially
4109
03:30:01,680 --> 03:30:06,000
as we saw in examples like with the mastermind example
4110
03:30:06,000 --> 03:30:08,560
or with the example with the logic puzzle where
4111
03:30:08,560 --> 03:30:12,260
we had different Hogwarts house people that belong to different houses
4112
03:30:12,260 --> 03:30:15,080
and we were trying to figure out who belonged to which houses.
4113
03:30:15,080 --> 03:30:18,280
There were a lot of different propositional symbols that we needed
4114
03:30:18,280 --> 03:30:21,680
in order to represent some fairly basic ideas.
4115
03:30:21,680 --> 03:30:24,640
So the final topic that we'll take a look at just before we end class
4116
03:30:24,640 --> 03:30:28,560
today is one final type of logic different from propositional logic
4117
03:30:28,560 --> 03:30:32,080
known as first order logic, which is a little bit more powerful than
4118
03:30:32,080 --> 03:30:34,620
propositional logic and is going to make it easier for us
4119
03:30:34,620 --> 03:30:37,240
to express certain types of ideas.
4120
03:30:37,240 --> 03:30:39,800
In propositional logic, if we think back to that puzzle
4121
03:30:39,800 --> 03:30:43,680
with the people in the Hogwarts houses, we had a whole bunch of symbols.
4122
03:30:43,680 --> 03:30:46,200
And every symbol could only be true or false.
4123
03:30:46,200 --> 03:30:49,240
We had a symbol for Minerva Gryffindor, which was true if Minerva
4124
03:30:49,240 --> 03:30:51,840
was in Gryffindor and false otherwise, and likewise
4125
03:30:51,840 --> 03:30:55,120
for Minerva Hufflepuff and Minerva Ravenclaw and Minerva Slytherin
4126
03:30:55,120 --> 03:30:56,920
and so forth.
4127
03:30:56,920 --> 03:30:58,920
But this was starting to get quite redundant.
4128
03:30:58,920 --> 03:31:01,120
We wanted some way to be able to express that there
4129
03:31:01,120 --> 03:31:03,360
is a relationship between these propositional symbols,
4130
03:31:03,360 --> 03:31:05,720
that Minerva shows up in all of them.
4131
03:31:05,720 --> 03:31:09,360
And also, I would have liked to not have had so many different symbols
4132
03:31:09,360 --> 03:31:13,360
to represent what really was a fairly straightforward problem.
4133
03:31:13,360 --> 03:31:15,480
So first order logic will give us a different way
4134
03:31:15,480 --> 03:31:19,520
of trying to deal with this idea by giving us two different types of symbols.
4135
03:31:19,520 --> 03:31:23,040
We're going to have constant symbols that are going to represent objects
4136
03:31:23,040 --> 03:31:24,880
like people or houses.
4137
03:31:24,880 --> 03:31:29,640
And then predicate symbols, which you can think of as relations or functions
4138
03:31:29,640 --> 03:31:33,240
that take an input and evaluate to true or false, for example,
4139
03:31:33,240 --> 03:31:37,400
that tell us whether or not some property of some constant
4140
03:31:37,400 --> 03:31:41,120
or some pair of constants or multiple constants actually holds.
4141
03:31:41,120 --> 03:31:43,120
So we'll see an example of that in just a moment.
4142
03:31:43,120 --> 03:31:46,640
For now, in this same problem, our constant symbols
4143
03:31:46,640 --> 03:31:49,240
might be objects, things like people or houses.
4144
03:31:49,240 --> 03:31:53,440
So Minerva, Pomona, Horace, Gilderoy, those are all constant symbols,
4145
03:31:53,440 --> 03:31:58,040
as are my four houses, Gryffindor, Hufflepuff, Ravenclaw, and Slytherin.
4146
03:31:58,040 --> 03:32:00,360
Predicates, meanwhile, these predicate symbols
4147
03:32:00,360 --> 03:32:03,880
are going to be properties that might hold true or false
4148
03:32:03,880 --> 03:32:06,120
of these individual constants.
4149
03:32:06,120 --> 03:32:09,480
So person might hold true of Minerva, but it
4150
03:32:09,480 --> 03:32:12,320
would be false for Gryffindor because Gryffindor is not a person.
4151
03:32:12,320 --> 03:32:15,280
And house is going to hold true for Ravenclaw,
4152
03:32:15,280 --> 03:32:17,640
but it's not going to hold true for Horace, for example,
4153
03:32:17,640 --> 03:32:19,800
because Horace is a person.
4154
03:32:19,800 --> 03:32:23,320
And belongs to, meanwhile, is going to be some relation that
4155
03:32:23,320 --> 03:32:26,280
is going to relate people to their houses.
4156
03:32:26,280 --> 03:32:30,440
And it's going to only tell me when someone belongs to a house or does not.
4157
03:32:30,440 --> 03:32:35,080
So let's take a look at some examples of what a sentence in first order logic
4158
03:32:35,080 --> 03:32:36,480
might actually look like.
4159
03:32:36,480 --> 03:32:38,320
A sentence might look like something like this.
4160
03:32:38,320 --> 03:32:42,960
Person Minerva, with Minerva in parentheses, and person being a predicate
4161
03:32:42,960 --> 03:32:45,880
symbol, Minerva being a constant symbol.
4162
03:32:45,880 --> 03:32:48,600
This sentence in first order logic effectively
4163
03:32:48,600 --> 03:32:54,440
means Minerva is a person, or the person property applies to the Minerva object.
4164
03:32:54,440 --> 03:32:56,920
So if I want to say something like Minerva is a person,
4165
03:32:56,920 --> 03:33:00,800
here is how I express that idea using first order logic.
4166
03:33:00,800 --> 03:33:03,720
Meanwhile, I can say something like, house Gryffindor,
4167
03:33:03,720 --> 03:33:07,320
to likewise express the idea that Gryffindor is a house.
4168
03:33:07,320 --> 03:33:08,800
I can do that this way.
4169
03:33:08,800 --> 03:33:10,920
And all of the same logical connectives that we
4170
03:33:10,920 --> 03:33:13,920
saw in propositional logic, those are going to work here too.
4171
03:33:13,920 --> 03:33:16,760
And, or, implication, biconditional, not.
4172
03:33:16,760 --> 03:33:20,920
In fact, I can use not to say something like, not house Minerva.
4173
03:33:20,920 --> 03:33:24,240
And this sentence in first order logic means something like,
4174
03:33:24,240 --> 03:33:26,080
Minerva is not a house.
4175
03:33:26,080 --> 03:33:31,640
It is not true that the house property applies to Minerva.
4176
03:33:31,640 --> 03:33:34,080
Meanwhile, in addition to some of these predicate symbols
4177
03:33:34,080 --> 03:33:36,880
that just take a single argument, some of our predicate symbols
4178
03:33:36,880 --> 03:33:39,840
are going to express binary relations, relations
4179
03:33:39,840 --> 03:33:42,080
between two of its arguments.
4180
03:33:42,080 --> 03:33:46,600
So I could say something like, belongs to, and then two inputs, Minerva
4181
03:33:46,600 --> 03:33:51,920
and Gryffindor, to express the idea that Minerva belongs to Gryffindor.
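One way to sketch these first-order symbols in Python: constant symbols are plain strings, and each predicate symbol is a function from constants to True or False. The particular house assignment below is illustrative, not part of the puzzle:

```python
PEOPLE = {"Minerva", "Pomona", "Horace", "Gilderoy"}
HOUSES = {"Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"}
ASSIGNMENT = {"Minerva": "Gryffindor", "Pomona": "Hufflepuff",
              "Horace": "Slytherin", "Gilderoy": "Ravenclaw"}

def person(x):
    """Unary predicate: does the person property hold of x?"""
    return x in PEOPLE

def house(x):
    """Unary predicate: does the house property hold of x?"""
    return x in HOUSES

def belongs_to(x, y):
    """Binary predicate: does x belong to house y?"""
    return ASSIGNMENT.get(x) == y

print(person("Minerva"))                    # True: Minerva is a person
print(not house("Minerva"))                 # True: Minerva is not a house
print(belongs_to("Minerva", "Gryffindor"))  # True: Minerva belongs to Gryffindor
```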
4182
03:33:51,920 --> 03:33:54,920
And so now here's the key difference, or one of the key differences,
4183
03:33:54,920 --> 03:33:56,920
between this and propositional logic.
4184
03:33:56,920 --> 03:34:00,640
In propositional logic, I needed one symbol for Minerva Gryffindor,
4185
03:34:00,640 --> 03:34:02,960
and one symbol for Minerva Hufflepuff, and one
4186
03:34:02,960 --> 03:34:06,360
symbol for every other combination of a person and a house.
4187
03:34:06,360 --> 03:34:10,560
In this case, I just need one symbol for each of my people,
4188
03:34:10,560 --> 03:34:13,200
and one symbol for each of my houses.
4189
03:34:13,200 --> 03:34:16,920
And then I can express as a predicate something like, belongs to,
4190
03:34:16,920 --> 03:34:21,520
and say, belongs to Minerva Gryffindor, to express the idea that Minerva
4191
03:34:21,520 --> 03:34:23,440
belongs to Gryffindor House.
4192
03:34:23,440 --> 03:34:27,180
So already we can see that first order logic is quite expressive in being
4193
03:34:27,180 --> 03:34:32,480
able to express these sorts of sentences using the existing constant symbols
4194
03:34:32,480 --> 03:34:36,240
and predicates that already exist, while minimizing the number of new symbols
4195
03:34:36,240 --> 03:34:37,120
that I need to create.
4196
03:34:37,120 --> 03:34:40,920
I can just use eight symbols, four for people and four for houses,
4197
03:34:40,920 --> 03:34:46,080
instead of 16 symbols for every possible combination of each.
4198
03:34:46,080 --> 03:34:49,000
But first order logic gives us a couple of additional features
4199
03:34:49,000 --> 03:34:52,000
that we can use to express even more complex ideas.
4200
03:34:52,000 --> 03:34:56,160
And these additional features are generally known as quantifiers.
4201
03:34:56,160 --> 03:34:58,800
And there are two main quantifiers in first order logic,
4202
03:34:58,800 --> 03:35:01,640
the first of which is universal quantification.
4203
03:35:01,640 --> 03:35:04,800
Universal quantification lets me express an idea
4204
03:35:04,800 --> 03:35:09,040
like something is going to be true for all values of a variable.
4205
03:35:09,040 --> 03:35:13,560
Like for all values of x, some statement is going to hold true.
4206
03:35:13,560 --> 03:35:16,600
So what might a sentence in universal quantification look like?
4207
03:35:16,600 --> 03:35:21,080
Well, we're going to use this upside down a to mean for all.
4208
03:35:21,080 --> 03:35:26,680
So upside-down A x means for all values of x, where x is any object,
4209
03:35:26,680 --> 03:35:28,840
this is going to hold true.
4210
03:35:28,840 --> 03:35:36,800
Belongs to x Gryffindor implies not belongs to x Hufflepuff.
4211
03:35:36,800 --> 03:35:38,160
So let's try and parse this out.
4212
03:35:38,160 --> 03:35:42,440
This means that for all values of x, if this holds true,
4213
03:35:42,440 --> 03:35:46,880
if x belongs to Gryffindor, then this does not hold true.
4214
03:35:46,880 --> 03:35:50,160
x does not belong to Hufflepuff.
4215
03:35:50,160 --> 03:35:52,560
So translated into English, this sentence
4216
03:35:52,560 --> 03:35:57,280
is saying something like for all objects x, if x belongs to Gryffindor,
4217
03:35:57,280 --> 03:36:00,720
then x does not belong to Hufflepuff, for example.
4218
03:36:00,720 --> 03:36:03,720
Or a phrase even more simply, anyone in Gryffindor
4219
03:36:03,720 --> 03:36:07,920
is not in Hufflepuff, simplified way of saying the same thing.
4220
03:36:07,920 --> 03:36:10,560
So this universal quantification lets us express
4221
03:36:10,560 --> 03:36:14,240
an idea like something is going to hold true for all values
4222
03:36:14,240 --> 03:36:16,400
of a particular variable.
4223
03:36:16,400 --> 03:36:18,520
In addition to universal quantification though,
4224
03:36:18,520 --> 03:36:21,880
we also have existential quantification.
4225
03:36:21,880 --> 03:36:24,400
Whereas universal quantification said that something
4226
03:36:24,400 --> 03:36:27,320
is going to be true for all values of a variable,
4227
03:36:27,320 --> 03:36:30,680
existential quantification says that some expression is going
4228
03:36:30,680 --> 03:36:36,680
to be true for some value of a variable, at least one value of the variable.
4229
03:36:36,680 --> 03:36:40,560
So let's take a look at a sample sentence using existential quantification.
4230
03:36:40,560 --> 03:36:42,480
One such sentence looks like this.
4231
03:36:42,480 --> 03:36:43,680
There exists an x.
4232
03:36:43,680 --> 03:36:46,360
This backwards e stands for exists.
4233
03:36:46,360 --> 03:36:51,560
And here we're saying there exists an x such that house x and belongs
4234
03:36:51,560 --> 03:36:53,400
to Minerva x.
4235
03:36:53,400 --> 03:36:57,480
In other words, there exists some object x where x is a house
4236
03:36:57,480 --> 03:37:00,480
and Minerva belongs to x.
4237
03:37:00,480 --> 03:37:02,640
Or phrased a little more succinctly in English,
4238
03:37:02,640 --> 03:37:05,400
I'm here just saying Minerva belongs to a house.
4239
03:37:05,400 --> 03:37:10,280
There's some object that is a house and Minerva belongs to a house.
4240
03:37:10,280 --> 03:37:13,280
And combining this universal and existential quantification,
4241
03:37:13,280 --> 03:37:16,280
we can create far more sophisticated logical statements
4242
03:37:16,280 --> 03:37:19,320
than we were able to just using propositional logic.
4243
03:37:19,320 --> 03:37:21,840
I could combine these to say something like this.
4244
03:37:21,840 --> 03:37:26,000
For all x, person x implies there exists
4245
03:37:26,000 --> 03:37:30,920
a y such that house y and belongs to xy.
4246
03:37:30,920 --> 03:37:31,400
All right.
4247
03:37:31,400 --> 03:37:33,600
So a lot of stuff going on there, a lot of symbols.
4248
03:37:33,600 --> 03:37:36,320
Let's try and parse it out and just understand what it's saying.
4249
03:37:36,320 --> 03:37:41,560
Here we're saying that for all values of x, if x is a person,
4250
03:37:41,560 --> 03:37:43,080
then this is true.
4251
03:37:43,080 --> 03:37:45,680
So in other words, I'm saying for all people,
4252
03:37:45,680 --> 03:37:48,960
and we call that person x, this statement is going to be true.
4253
03:37:48,960 --> 03:37:50,800
What statement is true of all people?
4254
03:37:50,800 --> 03:37:55,760
Well, there exists a y that is a house, so there exists some house,
4255
03:37:55,760 --> 03:37:58,760
and x belongs to y.
4256
03:37:58,760 --> 03:38:01,560
In other words, I'm saying that for all people out there,
4257
03:38:01,560 --> 03:38:07,520
there exists some house such that x, the person, belongs to y, the house.
4258
03:38:07,520 --> 03:38:08,920
This is phrased more succinctly.
4259
03:38:08,920 --> 03:38:12,480
I'm saying that every person belongs to a house, that for all x,
4260
03:38:12,480 --> 03:38:17,200
if x is a person, then there exists a house that x belongs to.
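Over a finite domain like this one, both quantifiers can be sketched with Python's built-ins: "for all" becomes all(...) and "there exists" becomes any(...). The domain, predicates, and assignment below are illustrative stand-ins, and "p implies q" is written in its equivalent form "not p or q":

```python
PEOPLE = {"Minerva", "Pomona"}
HOUSES = {"Gryffindor", "Hufflepuff"}
DOMAIN = PEOPLE | HOUSES
ASSIGNMENT = {"Minerva": "Gryffindor", "Pomona": "Hufflepuff"}

def person(x):
    return x in PEOPLE

def house(x):
    return x in HOUSES

def belongs_to(x, y):
    return ASSIGNMENT.get(x) == y

# For all x: person(x) implies (there exists y: house(y) and belongs_to(x, y)).
everyone_housed = all(
    not person(x) or any(house(y) and belongs_to(x, y) for y in DOMAIN)
    for x in DOMAIN
)

# For all x: belongs_to(x, Gryffindor) implies not belongs_to(x, Hufflepuff).
no_double_house = all(
    not belongs_to(x, "Gryffindor") or not belongs_to(x, "Hufflepuff")
    for x in DOMAIN
)

print(everyone_housed, no_double_house)  # True True
```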
4261
03:38:17,200 --> 03:38:20,760
And so we can now express a lot more powerful ideas using this idea now
4262
03:38:20,760 --> 03:38:21,920
of first order logic.
4263
03:38:21,920 --> 03:38:24,480
And it turns out there are many other kinds of logic out there.
4264
03:38:24,480 --> 03:38:27,040
There's second order logic and other higher order logic,
4265
03:38:27,040 --> 03:38:30,720
each of which allows us to express more and more complex ideas.
4266
03:38:30,720 --> 03:38:33,160
But all of it, in this case, is really in pursuit
4267
03:38:33,160 --> 03:38:36,280
of the same goal, which is the representation of knowledge.
4268
03:38:36,280 --> 03:38:39,800
We want our AI agents to be able to know information,
4269
03:38:39,800 --> 03:38:41,880
to represent that information, whether that's
4270
03:38:41,880 --> 03:38:45,440
using propositional logic or first order logic or some other logic,
4271
03:38:45,440 --> 03:38:49,080
and then be able to reason based on that, to be able to draw conclusions,
4272
03:38:49,080 --> 03:38:50,840
make inferences, figure out whether there's
4273
03:38:50,840 --> 03:38:54,920
some sort of entailment relationship, by using some sort of inference
4274
03:38:54,920 --> 03:38:58,560
algorithm, something like inference by resolution or model checking
4275
03:38:58,560 --> 03:39:01,600
or any number of these other algorithms that we can use in order
4276
03:39:01,600 --> 03:39:06,200
to take information that we know and translate it to additional conclusions.
4277
03:39:06,200 --> 03:39:08,880
So all of this has helped us to create AI that
4278
03:39:08,880 --> 03:39:13,640
is able to represent information about what it knows and what it doesn't know.
4279
03:39:13,640 --> 03:39:16,560
Next time, though, we'll take a look at how we can make our AI even more
4280
03:39:16,560 --> 03:39:20,520
powerful by not just encoding information that we know for sure to be true
4281
03:39:20,520 --> 03:39:23,920
and not to be true, but also to take a look at uncertainty,
4282
03:39:23,920 --> 03:39:27,240
to look at what happens if AI thinks that something might be probable
4283
03:39:27,240 --> 03:39:31,520
or maybe not very probable or somewhere in between those two extremes,
4284
03:39:31,520 --> 03:39:34,760
all in the pursuit of trying to build our intelligent systems
4285
03:39:34,760 --> 03:39:36,880
to be even more intelligent.
4286
03:39:36,880 --> 03:39:39,320
We'll see you next time.
4287
03:39:39,320 --> 03:39:57,760
Thank you.
4288
03:39:57,760 --> 03:39:59,880
All right, welcome back, everyone, to an introduction
4289
03:39:59,880 --> 03:40:02,040
to artificial intelligence with Python.
4290
03:40:02,040 --> 03:40:05,720
And last time, we took a look at how it is that AI inside of our computers
4291
03:40:05,720 --> 03:40:07,040
can represent knowledge.
4292
03:40:07,040 --> 03:40:10,120
We represented that knowledge in the form of logical sentences
4293
03:40:10,120 --> 03:40:12,080
in a variety of different logical languages.
4294
03:40:12,080 --> 03:40:15,640
And the idea was we wanted our AI to be able to represent knowledge
4295
03:40:15,640 --> 03:40:19,080
or information and somehow use those pieces of information
4296
03:40:19,080 --> 03:40:22,200
to be able to derive new pieces of information by inference,
4297
03:40:22,200 --> 03:40:24,680
to be able to take some information and deduce
4298
03:40:24,680 --> 03:40:27,240
some additional conclusions based on the information
4299
03:40:27,240 --> 03:40:29,160
that it already knew for sure.
4300
03:40:29,160 --> 03:40:32,320
But in reality, when we think about computers and we think about AI,
4301
03:40:32,320 --> 03:40:35,920
very rarely are our machines going to be able to know things for sure.
4302
03:40:35,920 --> 03:40:38,440
Oftentimes, there's going to be some amount of uncertainty
4303
03:40:38,440 --> 03:40:41,480
in the information that our AIs or our computers are dealing with,
4304
03:40:41,480 --> 03:40:44,200
where it might believe something with some probability,
4305
03:40:44,200 --> 03:40:46,960
as we'll soon discuss what probability is all about and what it means,
4306
03:40:46,960 --> 03:40:48,840
but not entirely for certain.
4307
03:40:48,840 --> 03:40:51,920
And we want to use the information that it has some knowledge about,
4308
03:40:51,920 --> 03:40:53,720
even if it doesn't have perfect knowledge,
4309
03:40:53,720 --> 03:40:57,280
to still be able to make inferences, still be able to draw conclusions.
4310
03:40:57,280 --> 03:41:00,480
So you might imagine, for example, in the context of a robot that
4311
03:41:00,480 --> 03:41:02,920
has some sensors and is exploring some environment,
4312
03:41:02,920 --> 03:41:06,200
it might not know exactly where it is or exactly what's around it,
4313
03:41:06,200 --> 03:41:08,880
but it does have access to some data that can allow it
4314
03:41:08,880 --> 03:41:10,840
to draw inferences with some probability.
4315
03:41:10,840 --> 03:41:13,520
There's some likelihood that one thing is true or another.
4316
03:41:13,520 --> 03:41:15,960
Or you can imagine in context where there is a little bit more
4317
03:41:15,960 --> 03:41:18,840
randomness and uncertainty, something like predicting the weather,
4318
03:41:18,840 --> 03:41:21,640
where you might not be able to know for sure what tomorrow's weather is
4319
03:41:21,640 --> 03:41:26,000
with 100% certainty, but you can probably infer with some probability
4320
03:41:26,000 --> 03:41:29,440
what tomorrow's weather is going to be based on maybe today's weather
4321
03:41:29,440 --> 03:41:32,280
and yesterday's weather and other data that you might have access
4322
03:41:32,280 --> 03:41:33,600
to as well.
4323
03:41:33,600 --> 03:41:36,920
And so oftentimes, we can distill this in terms of just possible events
4324
03:41:36,920 --> 03:41:39,760
that might happen and what the likelihood of those events are.
4325
03:41:39,760 --> 03:41:43,040
This comes a lot in games, for example, where there is an element of chance
4326
03:41:43,040 --> 03:41:44,280
inside of those games.
4327
03:41:44,280 --> 03:41:45,760
So you imagine rolling a dice.
4328
03:41:45,760 --> 03:41:48,240
You're not sure exactly what the die roll is going to be,
4329
03:41:48,240 --> 03:41:52,160
but you know it's going to be one of these possibilities from 1 to 6,
4330
03:41:52,160 --> 03:41:53,520
for example.
4331
03:41:53,520 --> 03:41:56,760
And so here now, we introduce the idea of probability theory.
4332
03:41:56,760 --> 03:41:58,720
And what we'll take a look at today is beginning
4333
03:41:58,720 --> 03:42:01,840
by looking at the mathematical foundations of probability theory,
4334
03:42:01,840 --> 03:42:05,400
getting an understanding for some of the key concepts within probability,
4335
03:42:05,400 --> 03:42:08,680
and then diving into how we can use probability and the ideas
4336
03:42:08,680 --> 03:42:12,400
that we look at mathematically to represent some ideas in terms of models
4337
03:42:12,400 --> 03:42:15,960
that we can put into our computers in order to program an AI that
4338
03:42:15,960 --> 03:42:19,280
is able to use information about probability to draw inferences,
4339
03:42:19,280 --> 03:42:22,280
to make some judgments about the world with some probability
4340
03:42:22,280 --> 03:42:25,040
or likelihood of being true.
4341
03:42:25,040 --> 03:42:27,920
So probability ultimately boils down to this idea
4342
03:42:27,920 --> 03:42:30,880
that there are possible worlds that we're here representing
4343
03:42:30,880 --> 03:42:32,920
using this little Greek letter omega.
4344
03:42:32,920 --> 03:42:36,400
And the idea of a possible world is that when I roll a die,
4345
03:42:36,400 --> 03:42:38,920
there are six possible worlds that could result from it.
4346
03:42:38,920 --> 03:42:42,840
I could roll a 1, or a 2, or a 3, or a 4, or a 5, or a 6.
4347
03:42:42,840 --> 03:42:45,040
And each of those are a possible world.
4348
03:42:45,040 --> 03:42:49,000
And each of those possible worlds has some probability of being true,
4349
03:42:49,000 --> 03:42:53,400
the probability that I do roll a 1, or a 2, or a 3, or something else.
4350
03:42:53,400 --> 03:42:57,040
And we represent that probability like this, using the capital letter P.
4351
03:42:57,040 --> 03:43:00,560
And then in parentheses, what it is that we want the probability of.
4352
03:43:00,560 --> 03:43:04,240
So this right here would be the probability of some possible world
4353
03:43:04,240 --> 03:43:07,040
as represented by the little letter omega.
4354
03:43:07,040 --> 03:43:09,760
Now, there are a couple of basic axioms of probability
4355
03:43:09,760 --> 03:43:13,000
that become relevant as we consider how we deal with probability
4356
03:43:13,000 --> 03:43:14,200
and how we think about it.
4357
03:43:14,200 --> 03:43:16,960
First and foremost, every probability value
4358
03:43:16,960 --> 03:43:20,160
must range between 0 and 1 inclusive.
4359
03:43:20,160 --> 03:43:23,920
So the smallest value any probability can have is the number 0,
4360
03:43:23,920 --> 03:43:25,800
which is an impossible event.
4361
03:43:25,800 --> 03:43:28,960
Something like rolling a die and having it come up as a 7.
4362
03:43:28,960 --> 03:43:33,000
If the die only has numbers 1 through 6, the event that I roll a 7
4363
03:43:33,000 --> 03:43:36,240
is impossible, so it would have probability 0.
4364
03:43:36,240 --> 03:43:38,320
And on the other end of the spectrum, probability
4365
03:43:38,320 --> 03:43:40,920
can range all the way up to the positive number 1,
4366
03:43:40,920 --> 03:43:43,800
meaning an event is certain to happen, that I roll a die
4367
03:43:43,800 --> 03:43:46,200
and the number is less than 10, for example.
4368
03:43:46,200 --> 03:43:49,560
That is an event that is guaranteed to happen if the only sides on my die
4369
03:43:49,560 --> 03:43:51,800
are 1 through 6, for instance.
4370
03:43:51,800 --> 03:43:55,240
And then they can range through any real number in between these two values,
4371
03:43:55,240 --> 03:43:58,240
where, generally speaking, a higher value for the probability
4372
03:43:58,240 --> 03:44:00,560
means an event is more likely to take place,
4373
03:44:00,560 --> 03:44:03,600
and a lower value for the probability means the event is less
4374
03:44:03,600 --> 03:44:05,680
likely to take place.
4375
03:44:05,680 --> 03:44:08,920
And the other key rule for probability looks a little bit like this.
4376
03:44:08,920 --> 03:44:11,840
This sigma notation, if you haven't seen it before,
4377
03:44:11,840 --> 03:44:13,920
refers to summation, the idea that we're going
4378
03:44:13,920 --> 03:44:16,160
to be adding up a whole sequence of values.
4379
03:44:16,160 --> 03:44:19,160
And this sigma notation is going to come up a couple of times today,
4380
03:44:19,160 --> 03:44:21,480
because as we deal with probability, oftentimes we're
4381
03:44:21,480 --> 03:44:25,120
adding up a whole bunch of individual values or individual probabilities
4382
03:44:25,120 --> 03:44:26,240
to get some other value.
4383
03:44:26,240 --> 03:44:28,200
So we'll see this come up a couple of times.
4384
03:44:28,200 --> 03:44:31,120
But what this notation means is that if I sum up
4385
03:44:31,120 --> 03:44:35,600
all of the possible worlds omega that are in big omega, which
4386
03:44:35,600 --> 03:44:38,280
represents the set of all the possible worlds,
4387
03:44:38,280 --> 03:44:42,120
meaning I take for all of the worlds in the set of possible worlds
4388
03:44:42,120 --> 03:44:47,000
and add up all of their probabilities, what I ultimately get is the number 1.
4389
03:44:47,000 --> 03:44:48,880
So if I take all the possible worlds, add up
4390
03:44:48,880 --> 03:44:52,280
what each of their probabilities is, I should get the number 1 at the end,
4391
03:44:52,280 --> 03:44:55,220
meaning all probabilities just need to sum to 1.
4392
03:44:55,220 --> 03:44:57,640
So take dice, for example,
4393
03:44:57,640 --> 03:45:00,400
and if you imagine I have a fair die with numbers 1 through 6
4394
03:45:00,400 --> 03:45:02,480
and I roll the die, each one of these rolls
4395
03:45:02,480 --> 03:45:04,800
has an equal probability of taking place.
4396
03:45:04,800 --> 03:45:07,960
And the probability is 1 over 6, for example.
4397
03:45:07,960 --> 03:45:12,160
So each of these probabilities is between 0 and 1, 0 meaning impossible
4398
03:45:12,160 --> 03:45:13,600
and 1 meaning for certain.
4399
03:45:13,600 --> 03:45:15,640
And if you add up all of these probabilities
4400
03:45:15,640 --> 03:45:18,960
for all of the possible worlds, you get the number 1.
4401
03:45:18,960 --> 03:45:22,200
And we can represent any one of those probabilities like this.
4402
03:45:22,200 --> 03:45:25,640
The probability that we roll the number 2, for example,
4403
03:45:25,640 --> 03:45:27,440
is just 1 over 6.
4404
03:45:27,440 --> 03:45:31,680
For every six times we roll the die, we'd expect that about one time,
4405
03:45:31,680 --> 03:45:33,280
the die might come up as a 2.
4406
03:45:33,280 --> 03:45:36,520
Its probability is not certain, but it's a little more than nothing,
4407
03:45:36,520 --> 03:45:38,120
for instance.
4408
03:45:38,120 --> 03:45:40,920
And so this is all fairly straightforward for just a single die.
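Both axioms can be checked directly in Python. This is a small sketch, not course code; the variable name `die` is just illustrative:

```python
from fractions import Fraction

# Probability distribution for a fair six-sided die, where each of
# the six possible worlds is equally likely.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Axiom 1: every probability value ranges between 0 and 1 inclusive.
assert all(0 <= p <= 1 for p in die.values())

# Axiom 2: summing the probabilities of all possible worlds gives 1.
assert sum(die.values()) == 1

# An impossible event, like rolling a 7, has probability 0.
assert die.get(7, Fraction(0)) == 0
```

Using exact fractions rather than floats keeps the sum-to-1 check free of rounding error.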
4409
03:45:40,920 --> 03:45:43,260
But things get more interesting as our models of the world
4410
03:45:43,260 --> 03:45:44,840
get a little bit more complex.
4411
03:45:44,840 --> 03:45:47,520
Let's imagine now that we're not just dealing with a single die,
4412
03:45:47,520 --> 03:45:49,720
but we have two dice, for example.
4413
03:45:49,720 --> 03:45:51,880
I have a red die here and a blue die there,
4414
03:45:51,880 --> 03:45:54,920
and I care not just about what the individual roll is,
4415
03:45:54,920 --> 03:45:56,880
but I care about the sum of the two rolls.
4416
03:45:56,880 --> 03:46:00,280
In this case, the sum of the two rolls is the number 3.
4417
03:46:00,280 --> 03:46:04,160
How do I begin to now reason about what does the probability look like
4418
03:46:04,160 --> 03:46:07,560
if instead of having one die, I now have two dice?
4419
03:46:07,560 --> 03:46:09,920
Well, what we might imagine is that we could first consider
4420
03:46:09,920 --> 03:46:12,480
what are all of the possible worlds.
4421
03:46:12,480 --> 03:46:14,480
And in this case, all of the possible worlds
4422
03:46:14,480 --> 03:46:18,120
are just every combination of the red and blue die that I could come up with.
4423
03:46:18,120 --> 03:46:22,640
For the red die, it could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6.
4424
03:46:22,640 --> 03:46:25,320
And for each of those possibilities, the blue die, likewise,
4425
03:46:25,320 --> 03:46:30,320
could also be either 1 or 2 or 3 or 4 or 5 or 6.
4426
03:46:30,320 --> 03:46:33,000
And it just so happens that in this particular case,
4427
03:46:33,000 --> 03:46:36,200
each of these possible combinations is equally likely.
4428
03:46:36,200 --> 03:46:39,400
Equally likely are all of these various different possible worlds.
4429
03:46:39,400 --> 03:46:41,080
That's not always going to be the case.
4430
03:46:41,080 --> 03:46:44,160
If you imagine more complex models that we could try to build and things
4431
03:46:44,160 --> 03:46:46,400
that we could try to represent in the real world,
4432
03:46:46,400 --> 03:46:49,560
it's probably not going to be the case that every single possible world is
4433
03:46:49,560 --> 03:46:50,920
always equally likely.
4434
03:46:50,920 --> 03:46:53,600
But in the case of fair dice, where in any given die roll,
4435
03:46:53,600 --> 03:46:57,080
any one number has just as good a chance of coming up as any other number,
4436
03:46:57,080 --> 03:47:01,360
we can consider all of these possible worlds to be equally likely.
4437
03:47:01,360 --> 03:47:04,120
But even though all of the possible worlds are equally likely,
4438
03:47:04,120 --> 03:47:07,320
that doesn't necessarily mean that their sums are equally likely.
4439
03:47:07,320 --> 03:47:10,320
So if we consider what the sum is of all of these two, so 1 plus 1,
4440
03:47:10,320 --> 03:47:11,240
that's a 2.
4441
03:47:11,240 --> 03:47:12,600
2 plus 1 is a 3.
4442
03:47:12,600 --> 03:47:15,320
And if we consider, for each of these possible pairs of numbers,
4443
03:47:15,320 --> 03:47:18,720
what their sum ultimately is, we can notice that there are some patterns
4444
03:47:18,720 --> 03:47:22,000
here, where it's not entirely the case that every number comes up
4445
03:47:22,000 --> 03:47:23,240
equally likely.
4446
03:47:23,240 --> 03:47:26,880
If you consider 7, for example, what's the probability that when I roll two
4447
03:47:26,880 --> 03:47:28,720
dice, their sum is 7?
4448
03:47:28,720 --> 03:47:30,280
There are several ways this can happen.
4449
03:47:30,280 --> 03:47:33,080
There are six possible worlds where the sum is 7.
4450
03:47:33,080 --> 03:47:37,480
It could be a 1 and a 6, or a 2 and a 5, or a 3 and a 4, or a 4 and a 3,
4451
03:47:37,480 --> 03:47:39,040
and so forth.
4452
03:47:39,040 --> 03:47:42,720
But if you instead consider what's the probability that I roll two dice,
4453
03:47:42,720 --> 03:47:45,920
and the sum of those two die rolls is 12, for example,
4454
03:47:45,920 --> 03:47:49,880
we're looking at this diagram, there's only one possible world in which that
4455
03:47:49,880 --> 03:47:50,400
can happen.
4456
03:47:50,400 --> 03:47:54,200
And that's the possible world where both the red die and the blue die
4457
03:47:54,200 --> 03:47:58,400
come up as sixes to give us a sum total of 12.
4458
03:47:58,400 --> 03:48:00,520
So based on just taking a look at this diagram,
4459
03:48:00,520 --> 03:48:03,000
we see that some of these probabilities are likely different.
4460
03:48:03,000 --> 03:48:07,200
The probability that the sum is a 7 must be greater than the probability
4461
03:48:07,200 --> 03:48:08,440
that the sum is a 12.
4462
03:48:08,440 --> 03:48:11,680
And we can represent that even more formally by saying, OK, the probability
4463
03:48:11,680 --> 03:48:15,320
that we sum to 12 is 1 out of 36.
4464
03:48:15,320 --> 03:48:18,680
Out of the 36 equally likely possible worlds,
4465
03:48:18,680 --> 03:48:22,040
6 squared, because we have six options for the red die and six
4466
03:48:22,040 --> 03:48:24,960
options for the blue die, out of those 36 options,
4467
03:48:24,960 --> 03:48:27,840
only one of them sums to 12.
4468
03:48:27,840 --> 03:48:29,600
Whereas on the other hand, the probability
4469
03:48:29,600 --> 03:48:33,360
that if we take two dice rolls and they sum up to the number 7, well,
4470
03:48:33,360 --> 03:48:37,840
out of those 36 possible worlds, there were six worlds where the sum was 7.
4471
03:48:37,840 --> 03:48:42,280
And so we get 6 over 36, which we can simplify as a fraction to just 1
4472
03:48:42,280 --> 03:48:43,720
over 6.
4473
03:48:43,720 --> 03:48:46,360
So here now, we're able to represent these different ideas
4474
03:48:46,360 --> 03:48:49,400
of probability, representing some events that might be more likely
4475
03:48:49,400 --> 03:48:52,720
and then other events that are less likely as well.
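This counting argument can be sketched by enumerating all 36 possible worlds in Python; the helper name `p_sum` is illustrative, not from the course:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely possible worlds for a red die and a blue die.
worlds = list(product(range(1, 7), repeat=2))
assert len(worlds) == 36

def p_sum(target):
    """Unconditional probability that the two dice sum to `target`."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6: six of the 36 worlds sum to 7
print(p_sum(12))  # 1/36: only red = 6, blue = 6 sums to 12
```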
4476
03:48:52,720 --> 03:48:55,840
And these sorts of judgments, where we're figuring out just in the abstract
4477
03:48:55,840 --> 03:48:58,760
what is the probability that this thing takes place,
4478
03:48:58,760 --> 03:49:01,680
are generally known as unconditional probabilities.
4479
03:49:01,680 --> 03:49:04,000
Some degree of belief we have in some proposition,
4480
03:49:04,000 --> 03:49:07,840
some fact about the world, in the absence of any other evidence.
4481
03:49:07,840 --> 03:49:10,600
Without knowing any additional information, if I roll a die,
4482
03:49:10,600 --> 03:49:12,240
what's the chance it comes up as a 2?
4483
03:49:12,240 --> 03:49:15,240
Or if I roll two dice, what's the chance that the sum of those two die
4484
03:49:15,240 --> 03:49:17,080
rolls is a 7?
4485
03:49:17,080 --> 03:49:20,080
But usually when we're thinking about probability, especially when
4486
03:49:20,080 --> 03:49:22,400
we're thinking about training an AI to intelligently
4487
03:49:22,400 --> 03:49:24,320
be able to know something about the world
4488
03:49:24,320 --> 03:49:26,600
and make predictions based on that information,
4489
03:49:26,600 --> 03:49:30,120
it's not unconditional probability that our AI is dealing with,
4490
03:49:30,120 --> 03:49:32,680
but rather conditional probability, probability
4491
03:49:32,680 --> 03:49:35,360
where rather than having no original knowledge,
4492
03:49:35,360 --> 03:49:37,600
we have some initial knowledge about the world
4493
03:49:37,600 --> 03:49:39,320
and how the world actually works.
4494
03:49:39,320 --> 03:49:43,120
So conditional probability is the degree of belief in a proposition
4495
03:49:43,120 --> 03:49:47,840
given some evidence that has already been revealed to us.
4496
03:49:47,840 --> 03:49:49,000
So what does this look like?
4497
03:49:49,000 --> 03:49:51,720
Well, it looks like this in terms of notation.
4498
03:49:51,720 --> 03:49:56,240
We're going to represent conditional probability as probability of A
4499
03:49:56,240 --> 03:49:59,920
and then this vertical bar and then B. And the way to read this
4500
03:49:59,920 --> 03:50:02,720
is the thing on the left-hand side of the vertical bar
4501
03:50:02,720 --> 03:50:05,000
is what we want the probability of.
4502
03:50:05,000 --> 03:50:08,200
Here now, I want the probability that A is true,
4503
03:50:08,200 --> 03:50:12,000
that it is the real world, that it is the event that actually does take place.
4504
03:50:12,000 --> 03:50:14,920
And then on the right side of the vertical bar is our evidence,
4505
03:50:14,920 --> 03:50:18,520
the information that we already know for certain about the world.
4506
03:50:18,520 --> 03:50:21,200
For example, that B is true.
4507
03:50:21,200 --> 03:50:23,080
So the way to read this entire expression
4508
03:50:23,080 --> 03:50:28,480
is what is the probability of A given B, the probability that A is true,
4509
03:50:28,480 --> 03:50:31,480
given that we already know that B is true.
4510
03:50:31,480 --> 03:50:34,120
And this type of judgment, conditional probability,
4511
03:50:34,120 --> 03:50:37,160
the probability of one thing given some other fact,
4512
03:50:37,160 --> 03:50:40,200
comes up quite a lot when we think about the types of calculations
4513
03:50:40,200 --> 03:50:42,240
we might want our AI to be able to do.
4514
03:50:42,240 --> 03:50:45,640
For example, we might care about the probability of rain today
4515
03:50:45,640 --> 03:50:47,720
given that we know that it rained yesterday.
4516
03:50:47,720 --> 03:50:51,000
We could think about the probability of rain today just in the abstract.
4517
03:50:51,000 --> 03:50:52,960
What is the chance that today it rains?
4518
03:50:52,960 --> 03:50:54,960
But usually, we have some additional evidence.
4519
03:50:54,960 --> 03:50:57,520
I know for certain that it rained yesterday.
4520
03:50:57,520 --> 03:51:00,920
And so I would like to calculate the probability that it rains today
4521
03:51:00,920 --> 03:51:03,240
given that I know that it rained yesterday.
4522
03:51:03,240 --> 03:51:06,200
Or you might imagine that I want to know the probability that my optimal
4523
03:51:06,200 --> 03:51:09,920
route to my destination changes given the current traffic conditions.
4524
03:51:09,920 --> 03:51:12,120
So whether or not traffic conditions change,
4525
03:51:12,120 --> 03:51:16,200
that might change the probability that this route is actually the optimal route.
4526
03:51:16,200 --> 03:51:18,160
Or you might imagine in a medical context,
4527
03:51:18,160 --> 03:51:22,480
I want to know the probability that a patient has a particular disease given
4528
03:51:22,480 --> 03:51:25,600
some results of some tests that have been performed on that patient.
4529
03:51:25,600 --> 03:51:28,440
And I have some evidence, the results of that test,
4530
03:51:28,440 --> 03:51:31,760
and I would like to know the probability that a patient has
4531
03:51:31,760 --> 03:51:33,080
a particular disease.
4532
03:51:33,080 --> 03:51:35,840
So this notion of conditional probability comes up everywhere.
4533
03:51:35,840 --> 03:51:38,320
So we begin to think about what we would like to reason about,
4534
03:51:38,320 --> 03:51:40,800
but being able to reason a little more intelligently
4535
03:51:40,800 --> 03:51:43,760
by taking into account evidence that we already have.
4536
03:51:43,760 --> 03:51:46,920
We're more able to get an accurate result for what is the likelihood
4537
03:51:46,920 --> 03:51:50,960
that someone has this disease if we know this evidence, the results of the test,
4538
03:51:50,960 --> 03:51:55,240
as opposed to if we were just calculating the unconditional probability of saying,
4539
03:51:55,240 --> 03:51:58,600
what is the probability they have the disease without any evidence
4540
03:51:58,600 --> 03:52:03,360
to try and back up our result one way or the other.
4541
03:52:03,360 --> 03:52:06,400
So now that we've got this idea of what conditional probability is,
4542
03:52:06,400 --> 03:52:08,200
the next question we have to ask is, all right,
4543
03:52:08,200 --> 03:52:10,200
how do we calculate conditional probability?
4544
03:52:10,200 --> 03:52:13,880
How do we figure out mathematically, if I have an expression like this,
4545
03:52:13,880 --> 03:52:15,240
how do I get a number from that?
4546
03:52:15,240 --> 03:52:17,560
What does conditional probability actually mean?
4547
03:52:17,560 --> 03:52:19,560
Well, the formula for conditional probability
4548
03:52:19,560 --> 03:52:21,120
looks a little something like this.
4549
03:52:21,120 --> 03:52:25,640
The probability of a given b, the probability that a is true,
4550
03:52:25,640 --> 03:52:29,320
given that we know that b is true, is equal to this fraction,
4551
03:52:29,320 --> 03:52:34,520
the probability that a and b are true, divided by just the probability
4552
03:52:34,520 --> 03:52:35,520
that b is true.
4553
03:52:35,520 --> 03:52:37,800
And the way to intuitively try to think about this
4554
03:52:37,800 --> 03:52:40,960
is that if I want to know the probability that a is true, given
4555
03:52:40,960 --> 03:52:46,000
that b is true, well, I want to consider all the ways they could both be true,
4556
03:52:46,000 --> 03:52:50,040
given that the only worlds I care about are the worlds where b is already true.
4557
03:52:50,040 --> 03:52:52,840
I can sort of ignore all the cases where b isn't true,
4558
03:52:52,840 --> 03:52:55,640
because those aren't relevant to my ultimate computation.
4559
03:52:55,640 --> 03:52:59,720
They're not relevant to what it is that I want to get information about.
4560
03:52:59,720 --> 03:53:01,220
So let's take a look at an example.
4561
03:53:01,220 --> 03:53:04,160
Let's go back to that example of rolling two dice and the idea
4562
03:53:04,160 --> 03:53:06,920
that those two dice might sum up to the number 12.
4563
03:53:06,920 --> 03:53:09,680
We discussed earlier that the unconditional probability
4564
03:53:09,680 --> 03:53:13,160
that if I roll two dice and they sum to 12 is 1 out of 36,
4565
03:53:13,160 --> 03:53:16,280
because out of the 36 possible worlds that I might care about,
4566
03:53:16,280 --> 03:53:19,280
in only one of them is the sum of those two dice 12.
4567
03:53:19,280 --> 03:53:22,880
It's only when red is 6 and blue is also 6.
4568
03:53:22,880 --> 03:53:25,400
But let's say now that I have some additional information.
4569
03:53:25,400 --> 03:53:29,400
I now want to know what is the probability that the two dice sum to 12,
4570
03:53:29,400 --> 03:53:33,720
given that I know that the red die was a 6.
4571
03:53:33,720 --> 03:53:35,320
So I already have some evidence.
4572
03:53:35,320 --> 03:53:36,960
I already know the red die is a 6.
4573
03:53:36,960 --> 03:53:38,320
I don't know what the blue die is.
4574
03:53:38,320 --> 03:53:41,200
That information isn't given to me in this expression.
4575
03:53:41,200 --> 03:53:44,080
But given the fact that I know that the red die rolled a 6,
4576
03:53:44,080 --> 03:53:47,080
what is the probability that we sum to 12?
4577
03:53:47,080 --> 03:53:50,040
And so we can begin to do the math using that expression from before.
4578
03:53:50,040 --> 03:53:52,440
Here, again, are all of the possibilities,
4579
03:53:52,440 --> 03:53:55,800
all of the possible combinations of red die being 1 through 6
4580
03:53:55,800 --> 03:53:58,600
and blue die being 1 through 6.
4581
03:53:58,600 --> 03:54:00,320
And I might consider first, all right, what
4582
03:54:00,320 --> 03:54:04,320
is the probability of my evidence, my B variable, where I want to know,
4583
03:54:04,320 --> 03:54:07,400
what is the probability that the red die is a 6?
4584
03:54:07,400 --> 03:54:11,200
Well, the probability that the red die is a 6 is just 1 out of 6.
4585
03:54:11,200 --> 03:54:14,800
So these six worlds, where the red die is a 6, are really the only ones
4586
03:54:14,800 --> 03:54:16,200
that I care about here now.
4587
03:54:16,200 --> 03:54:19,320
All the rest of them are irrelevant to my calculation,
4588
03:54:19,320 --> 03:54:22,200
because I already have this evidence that the red die was a 6,
4589
03:54:22,200 --> 03:54:26,280
so I don't need to care about all of the other possibilities that could result.
4590
03:54:26,280 --> 03:54:29,760
So now, in addition to the fact that the red die rolled as a 6
4591
03:54:29,760 --> 03:54:32,280
and the probability of that, the other piece of information
4592
03:54:32,280 --> 03:54:35,560
I need to know in order to calculate this conditional probability
4593
03:54:35,560 --> 03:54:39,480
is the probability that both of my variables, A and B, are true.
4594
03:54:39,480 --> 03:54:44,360
The probability that the red die is a 6 and that the two dice sum to 12.
4595
03:54:44,360 --> 03:54:47,120
So what is the probability that both of these things happen?
4596
03:54:47,120 --> 03:54:51,640
Well, it only happens in one possible case in 1 out of these 36 cases,
4597
03:54:51,640 --> 03:54:55,520
and it's the case where both the red and the blue die are equal to 6.
4598
03:54:55,520 --> 03:54:57,800
This is a piece of information that we already knew.
4599
03:54:57,800 --> 03:55:01,880
And so this probability is equal to 1 over 36.
4600
03:55:01,880 --> 03:55:05,680
And so to get the conditional probability that the sum is 12,
4601
03:55:05,680 --> 03:55:08,560
given that I know that the red die is equal to 6,
4602
03:55:08,560 --> 03:55:10,640
well, I just divide these two values together,
4603
03:55:10,640 --> 03:55:16,600
and 1 over 36 divided by 1 over 6 gives us this probability of 1 over 6.
4604
03:55:16,600 --> 03:55:19,960
Given that I know that the red die rolled a value of 6,
4605
03:55:19,960 --> 03:55:25,320
the probability that the sum of the two dice is 12 is also 1 over 6.
4606
03:55:25,320 --> 03:55:27,480
And that probably makes intuitive sense to you, too,
4607
03:55:27,480 --> 03:55:30,880
because if the red die is a 6, the only way for me to get to a 12
4608
03:55:30,880 --> 03:55:33,240
is if the blue die also rolls a 6, and we
4609
03:55:33,240 --> 03:55:37,040
know that the probability of the blue die rolling a 6 is 1 over 6.
4610
03:55:37,040 --> 03:55:40,680
So in this case, the conditional probability seems fairly straightforward.
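The same calculation can be sketched in Python by enumerating the possible worlds and applying the formula P(A | B) = P(A and B) / P(B); the helper names here are illustrative, not from the course:

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely possible worlds for two dice (red, blue).
worlds = list(product(range(1, 7), repeat=2))

def p(event):
    """Probability that `event` is true across all possible worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def sum_is_12(w):   # A: the two dice sum to 12
    return w[0] + w[1] == 12

def red_is_6(w):    # B: the red die came up as a 6
    return w[0] == 6

# P(A | B) = P(A and B) / P(B) = (1/36) / (1/6)
p_a_given_b = p(lambda w: sum_is_12(w) and red_is_6(w)) / p(red_is_6)
print(p_a_given_b)  # 1/6
```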
4611
03:55:40,680 --> 03:55:44,040
But this idea of calculating a conditional probability
4612
03:55:44,040 --> 03:55:47,880
by looking at the probability that both of these events take place
4613
03:55:47,880 --> 03:55:49,920
is an idea that's going to come up again and again.
4614
03:55:49,920 --> 03:55:52,880
This is the definition now of conditional probability.
4615
03:55:52,880 --> 03:55:54,800
And we're going to use that definition as we
4616
03:55:54,800 --> 03:55:56,960
think about probability more generally to be
4617
03:55:56,960 --> 03:55:59,120
able to draw conclusions about the world.
4618
03:55:59,120 --> 03:56:00,760
This, again, is that formula.
4619
03:56:00,760 --> 03:56:04,440
The probability of A given B is equal to the probability
4620
03:56:04,440 --> 03:56:08,840
that A and B take place divided by the probability of B.
4621
03:56:08,840 --> 03:56:11,880
And you'll see this formula sometimes written in a couple of different ways.
4622
03:56:11,880 --> 03:56:15,520
You could imagine algebraically multiplying both sides of this equation
4623
03:56:15,520 --> 03:56:18,720
by probability of B to get rid of the fraction,
4624
03:56:18,720 --> 03:56:20,320
and you'll get an expression like this.
4625
03:56:20,320 --> 03:56:24,520
The probability of A and B, which is this expression over here,
4626
03:56:24,520 --> 03:56:28,520
is just the probability of B times the probability of A given B.
4627
03:56:28,520 --> 03:56:31,840
Or you could represent this equivalently since A and B in this expression
4628
03:56:31,840 --> 03:56:32,840
are interchangeable.
4629
03:56:32,840 --> 03:56:36,440
A and B is the same thing as B and A. You could imagine also
4630
03:56:36,440 --> 03:56:41,040
representing the probability of A and B as the probability of A
4631
03:56:41,040 --> 03:56:45,080
times the probability of B given A, just switching all of the A's and B's.
4632
03:56:45,080 --> 03:56:47,280
These three are all equivalent ways of trying
4633
03:56:47,280 --> 03:56:49,760
to represent what joint probability means.
4634
03:56:49,760 --> 03:56:52,120
And so you'll sometimes see all of these equations,
4635
03:56:52,120 --> 03:56:55,680
and they might be useful to you as you begin to reason about probability
4636
03:56:55,680 --> 03:57:00,080
and to think about what values might be taking place in the real world.
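The three equivalent forms of the joint probability can be verified numerically with the same dice example; this is again a sketch with illustrative names, not course code:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # 36 two-dice worlds

def p(event):
    """Probability of an event over all equally likely worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def p_given(a, b):
    """Conditional probability: P(a | b) = P(a and b) / P(b)."""
    return p(lambda w: a(w) and b(w)) / p(b)

a = lambda w: w[0] + w[1] == 12  # the dice sum to 12
b = lambda w: w[0] == 6          # the red die is a 6

joint = p(lambda w: a(w) and b(w))
assert joint == p(b) * p_given(a, b)  # P(A and B) = P(B) * P(A | B)
assert joint == p(a) * p_given(b, a)  # P(A and B) = P(A) * P(B | A)
print(joint)  # 1/36
```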
4637
03:57:00,080 --> 03:57:02,120
Now, sometimes when we deal with probability,
4638
03:57:02,120 --> 03:57:05,320
we don't just care about a Boolean event like did this happen
4639
03:57:05,320 --> 03:57:06,720
or did this not happen.
4640
03:57:06,720 --> 03:57:10,160
Sometimes we might want the ability to represent variable values
4641
03:57:10,160 --> 03:57:13,400
in a probability space where some variable might take
4642
03:57:13,400 --> 03:57:16,080
on multiple different possible values.
4643
03:57:16,080 --> 03:57:19,440
And in probability, we call a variable in probability theory
4644
03:57:19,440 --> 03:57:21,040
a random variable.
4645
03:57:21,040 --> 03:57:25,440
A random variable in probability is just some variable in probability theory
4646
03:57:25,440 --> 03:57:28,800
that has some domain of values that it can take on.
4647
03:57:28,800 --> 03:57:29,920
So what do I mean by this?
4648
03:57:29,920 --> 03:57:32,640
Well, what I mean is I might have a random variable that is just
4649
03:57:32,640 --> 03:57:36,120
called roll, for example, that has six possible values.
4650
03:57:36,120 --> 03:57:39,720
Roll is my variable, and the possible values, the domain of values
4651
03:57:39,720 --> 03:57:43,160
that it can take on are 1, 2, 3, 4, 5, and 6.
4652
03:57:43,160 --> 03:57:45,520
And I might like to know the probability of each.
4653
03:57:45,520 --> 03:57:47,440
In this case, they happen to all be the same.
4654
03:57:47,440 --> 03:57:50,360
But in other random variables, that might not be the case.
4655
03:57:50,360 --> 03:57:52,160
For example, I might have a random variable
4656
03:57:52,160 --> 03:57:55,200
to represent the weather, for example, where the domain of values
4657
03:57:55,200 --> 03:57:59,680
it could take on are things like sun or cloudy or rainy or windy or snowy.
4658
03:57:59,680 --> 03:58:02,120
And each of those might have a different probability.
4659
03:58:02,120 --> 03:58:05,560
And I care about knowing what is the probability that the weather equals
4660
03:58:05,560 --> 03:58:08,600
sun or that the weather equals clouds, for instance.
4661
03:58:08,600 --> 03:58:11,080
And I might like to do some mathematical calculations
4662
03:58:11,080 --> 03:58:12,760
based on that information.
4663
03:58:12,760 --> 03:58:15,320
Other random variables might be something like traffic.
4664
03:58:15,320 --> 03:58:18,840
What are the odds that there is no traffic or light traffic or heavy traffic?
4665
03:58:18,840 --> 03:58:21,200
Traffic, in this case, is my random variable.
4666
03:58:21,200 --> 03:58:24,560
And the values that that random variable can take on are here.
4667
03:58:24,560 --> 03:58:26,760
It's either none or light or heavy.
4668
03:58:26,760 --> 03:58:28,640
And I, the person doing these calculations,
4669
03:58:28,640 --> 03:58:32,280
I, the person encoding these random variables into my computer,
4670
03:58:32,280 --> 03:58:36,600
need to make the decision as to what these possible values actually are.
4671
03:58:36,600 --> 03:58:38,880
You might imagine, for example, for a flight.
4672
03:58:38,880 --> 03:58:41,320
If I care about whether or not my flight is on time,
4673
03:58:41,320 --> 03:58:43,880
my flight has a couple of possible values that it could take on.
4674
03:58:43,880 --> 03:58:45,280
My flight could be on time.
4675
03:58:45,280 --> 03:58:46,520
My flight could be delayed.
4676
03:58:46,520 --> 03:58:47,800
My flight could be canceled.
4677
03:58:47,800 --> 03:58:51,480
So flight, in this case, is my random variable.
4678
03:58:51,480 --> 03:58:54,120
And these are the values that it can take on.
4679
03:58:54,120 --> 03:58:57,360
And often, I want to know something about the probability
4680
03:58:57,360 --> 03:59:00,880
that my random variable takes on each of those possible values.
4681
03:59:00,880 --> 03:59:04,360
And this is what we then call a probability distribution.
4682
03:59:04,360 --> 03:59:07,320
A probability distribution takes a random variable
4683
03:59:07,320 --> 03:59:12,040
and gives me the probability for each of the possible values in its domain.
4684
03:59:12,040 --> 03:59:15,600
So in the case of this flight, for example, my probability distribution
4685
03:59:15,600 --> 03:59:16,960
might look something like this.
4686
03:59:16,960 --> 03:59:19,920
My probability distribution says the probability
4687
03:59:19,920 --> 03:59:25,880
that the random variable flight is equal to the value on time is 0.6.
4688
03:59:25,880 --> 03:59:28,480
Or, put into more human-friendly English terms,
4689
03:59:28,480 --> 03:59:32,080
the likelihood that my flight is on time is 60%, for example.
4690
03:59:32,080 --> 03:59:35,760
And in this case, the probability that my flight is delayed is 30%.
4691
03:59:35,760 --> 03:59:39,720
The probability that my flight is canceled is 10% or 0.1.
4692
03:59:39,720 --> 03:59:42,480
And if you sum up all of these possible values,
4693
03:59:42,480 --> 03:59:43,840
the sum is going to be 1, right?
4694
03:59:43,840 --> 03:59:46,360
If you take all of the possible worlds, here
4695
03:59:46,360 --> 03:59:49,800
are my three possible worlds for the value of the random variable flight,
4696
03:59:49,800 --> 03:59:52,160
add them all up together, the result needs
4697
03:59:52,160 --> 03:59:55,280
to be the number 1 per that axiom of probability theory
4698
03:59:55,280 --> 03:59:57,160
that we've discussed before.
4699
03:59:57,160 --> 04:00:00,440
So this now is one way of representing this probability
4700
04:00:00,440 --> 04:00:03,600
distribution for the random variable flight.
4701
04:00:03,600 --> 04:00:06,160
Sometimes you'll see it represented a little bit more concisely
4702
04:00:06,160 --> 04:00:08,440
since this is pretty verbose for really just trying
4703
04:00:08,440 --> 04:00:10,720
to express three possible values.
4704
04:00:10,720 --> 04:00:13,280
And so often, you'll instead see the same notation
4705
04:00:13,280 --> 04:00:15,120
represented using a vector.
4706
04:00:15,120 --> 04:00:17,880
And all a vector is is a sequence of values.
4707
04:00:17,880 --> 04:00:21,160
As opposed to just a single value, I might have multiple values.
4708
04:00:21,160 --> 04:00:25,200
And so I could instead represent this idea this way.
4709
04:00:25,200 --> 04:00:29,920
Bold P, a capital P, generally meaning the probability distribution
4710
04:00:29,920 --> 04:00:35,520
of this variable flight is equal to this vector represented in angle brackets.
4711
04:00:35,520 --> 04:00:39,880
The probability distribution is 0.6, 0.3, and 0.1.
4712
04:00:39,880 --> 04:00:42,840
And I would just have to know that this probability distribution is
4713
04:00:42,840 --> 04:00:46,600
in order of on time, delayed, and canceled
4714
04:00:46,600 --> 04:00:48,280
to know how to interpret this vector.
4715
04:00:48,280 --> 04:00:51,000
Meaning, the first value in the vector is the probability
4716
04:00:51,000 --> 04:00:52,520
that my flight is on time.
4717
04:00:52,520 --> 04:00:56,040
The second value in the vector is the probability that my flight is delayed.
4718
04:00:56,040 --> 04:00:58,480
And the third value in the vector is the probability
4719
04:00:58,480 --> 04:01:00,560
that my flight is canceled.
4720
04:01:00,560 --> 04:01:03,720
And so this is just an alternate way of representing this idea,
4721
04:01:03,720 --> 04:01:05,040
a little more concisely.
4722
04:01:05,040 --> 04:01:08,840
But oftentimes, you'll see us just talk about a probability distribution
4723
04:01:08,840 --> 04:01:10,360
over a random variable.
4724
04:01:10,360 --> 04:01:12,600
And whenever we talk about that, what we're really doing
4725
04:01:12,600 --> 04:01:16,040
is trying to figure out the probabilities of each of the possible values
4726
04:01:16,040 --> 04:01:17,840
that that random variable can take on.
4727
04:01:17,840 --> 04:01:20,640
But this notation is just a little bit more succinct,
4728
04:01:20,640 --> 04:01:22,760
even though it can sometimes be a little confusing,
4729
04:01:22,760 --> 04:01:24,480
depending on the context in which you see it.
4730
04:01:24,480 --> 04:01:27,720
So we'll start to look at examples where we use this sort of notation
4731
04:01:27,720 --> 04:01:33,480
to describe probability and to describe events that might take place.
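As a quick illustrative sketch (not part of the lecture itself), this distribution notation could be mirrored in Python, with a dictionary as the verbose form and a list as the vector form whose value order we have to remember:

```python
# Hypothetical sketch: two equivalent representations of the
# probability distribution for the random variable Flight.

# Verbose form: an explicit mapping from each value to its probability.
flight_dist = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}

# Concise "vector" form: just the probabilities, in an agreed-upon
# order (on time, delayed, canceled) that we must remember.
flight_vector = [0.6, 0.3, 0.1]

# Axiom of probability theory: the probabilities over all possible
# worlds must sum to 1.
assert abs(sum(flight_dist.values()) - 1.0) < 1e-9
assert abs(sum(flight_vector) - 1.0) < 1e-9

print(flight_dist["on time"])  # 0.6
```

The dictionary is self-describing; the vector is shorter but only interpretable if you already know the ordering, which mirrors the trade-off described above.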
4732
04:01:33,480 --> 04:01:37,080
A couple of other important ideas to know with regards to probability theory.
4733
04:01:37,080 --> 04:01:39,480
One is this idea of independence.
4734
04:01:39,480 --> 04:01:43,080
And independence refers to the idea that the knowledge of one event
4735
04:01:43,080 --> 04:01:46,480
doesn't influence the probability of another event.
4736
04:01:46,480 --> 04:01:48,760
So for example, in the context of my two dice rolls,
4737
04:01:48,760 --> 04:01:51,560
where I had the red die and the blue die, the probability
4738
04:01:51,560 --> 04:01:54,040
that I roll the red die and the blue die,
4739
04:01:54,040 --> 04:01:57,120
those two events, red die and blue die, are independent.
4740
04:01:57,120 --> 04:02:00,160
Knowing the result of the red die doesn't change
4741
04:02:00,160 --> 04:02:01,520
the probabilities for the blue die.
4742
04:02:01,520 --> 04:02:03,960
It doesn't give me any additional information
4743
04:02:03,960 --> 04:02:06,920
about what the value of the blue die is ultimately going to be.
4744
04:02:06,920 --> 04:02:08,760
But that's not always going to be the case.
4745
04:02:08,760 --> 04:02:11,480
You might imagine that in the case of weather, something
4746
04:02:11,480 --> 04:02:15,240
like clouds and rain, those are probably not independent.
4747
04:02:15,240 --> 04:02:18,720
If it is cloudy, that might increase the probability that later
4748
04:02:18,720 --> 04:02:20,240
in the day it's going to rain.
4749
04:02:20,240 --> 04:02:24,680
So some information informs some other event or some other random variable.
4750
04:02:24,680 --> 04:02:29,080
So independence refers to the idea that one event doesn't influence the other.
4751
04:02:29,080 --> 04:02:34,280
And if they're not independent, then there might be some relationship.
4752
04:02:34,280 --> 04:02:37,440
So mathematically, formally, what does independence actually mean?
4753
04:02:37,440 --> 04:02:42,200
Well, recall this formula from before, that the probability of A and B
4754
04:02:42,200 --> 04:02:46,320
is the probability of A times the probability of B given A.
4755
04:02:46,320 --> 04:02:48,160
And the more intuitive way to think about this
4756
04:02:48,160 --> 04:02:51,680
is that to know how likely it is that A and B happen,
4757
04:02:51,680 --> 04:02:54,520
well, let's first figure out the likelihood that A happens.
4758
04:02:54,520 --> 04:02:56,880
And then given that we know that A happens,
4759
04:02:56,880 --> 04:02:58,720
let's figure out the likelihood that B happens
4760
04:02:58,720 --> 04:03:01,560
and multiply those two things together.
4761
04:03:01,560 --> 04:03:05,680
But if A and B were independent, meaning knowing A
4762
04:03:05,680 --> 04:03:09,440
doesn't change anything about the likelihood that B is true,
4763
04:03:09,440 --> 04:03:14,680
well, then the probability of B given A, meaning the probability that B is true,
4764
04:03:14,680 --> 04:03:17,680
given that I know A is true, well, that I know A is true
4765
04:03:17,680 --> 04:03:20,400
shouldn't really make a difference if these two things are independent,
4766
04:03:20,400 --> 04:03:22,880
that A shouldn't influence B at all.
4767
04:03:22,880 --> 04:03:27,760
So the probability of B given A is really just the probability of B.
4768
04:03:27,760 --> 04:03:30,800
If it is true that A and B are independent.
4769
04:03:30,800 --> 04:03:33,840
And so this right here is one example of a definition
4770
04:03:33,840 --> 04:03:36,440
for what it means for A and B to be independent.
4771
04:03:36,440 --> 04:03:39,600
The probability of A and B is just the probability
4772
04:03:39,600 --> 04:03:44,320
of A times the probability of B. Anytime you find two events A and B
4773
04:03:44,320 --> 04:03:49,640
where this relationship holds, then you can say that A and B are independent.
4774
04:03:49,640 --> 04:03:53,640
So an example of that might be the dice that we were taking a look at before.
4775
04:03:53,640 --> 04:03:58,320
Here, if I wanted the probability of red being a 6 and blue being a 6,
4776
04:03:58,320 --> 04:04:01,680
well, that's just the probability that red is a 6 multiplied
4777
04:04:01,680 --> 04:04:03,480
by the probability that blue is a 6.
4778
04:04:03,480 --> 04:04:05,760
It's both equal to 1 over 36.
4779
04:04:05,760 --> 04:04:10,320
So I can say that these two events are independent.
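As a sketch (hypothetical code, not from the course), the independence of the two dice can be checked by enumerating all 36 equally likely worlds:

```python
from itertools import product

# All 36 equally likely (red, blue) outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

p_red_6 = sum(1 for r, b in outcomes if r == 6) / len(outcomes)
p_blue_6 = sum(1 for r, b in outcomes if b == 6) / len(outcomes)
p_both = sum(1 for r, b in outcomes if r == 6 and b == 6) / len(outcomes)

# Independence: P(red=6 and blue=6) equals P(red=6) * P(blue=6) = 1/36.
assert abs(p_both - p_red_6 * p_blue_6) < 1e-12
print(p_both)  # 1/36, about 0.0278
```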
4780
04:04:10,320 --> 04:04:13,920
So what, then, wouldn't be independent?
4781
04:04:13,920 --> 04:04:16,320
So this, for example, has a probability of 1 over 36,
4782
04:04:16,320 --> 04:04:17,640
as we talked about before.
4783
04:04:17,640 --> 04:04:20,560
But what wouldn't be independent would be a case like this,
4784
04:04:20,560 --> 04:04:26,360
the probability that the red die rolls a 6 and the red die rolls a 4.
4785
04:04:26,360 --> 04:04:29,600
If you just naively took, OK, red die 6, red die 4,
4786
04:04:29,600 --> 04:04:31,280
well, if I'm only rolling the die once, you
4787
04:04:31,280 --> 04:04:34,120
might imagine the naive approach is to say, well, each of these
4788
04:04:34,120 --> 04:04:35,800
has a probability of 1 over 6.
4789
04:04:35,800 --> 04:04:39,440
So multiply them together, and the probability is 1 over 36.
4790
04:04:39,440 --> 04:04:41,720
But of course, if you're only rolling the red die once,
4791
04:04:41,720 --> 04:04:45,360
there's no way you could get two different values for the red die.
4792
04:04:45,360 --> 04:04:48,000
It couldn't both be a 6 and a 4.
4793
04:04:48,000 --> 04:04:50,200
So the probability should be 0.
4794
04:04:50,200 --> 04:04:53,680
But if you were to multiply probability of red 6 times
4795
04:04:53,680 --> 04:04:57,440
probability of red 4, well, that would equal 1 over 36.
4796
04:04:57,440 --> 04:04:58,760
But of course, that's not true.
4797
04:04:58,760 --> 04:05:01,800
Because we know that there is no way, probability 0,
4798
04:05:01,800 --> 04:05:06,200
that when we roll the red die once, we get both a 6 and a 4,
4799
04:05:06,200 --> 04:05:10,760
because only one of those possibilities can actually be the result.
4800
04:05:10,760 --> 04:05:14,280
And so we can say that the event that red roll is 6
4801
04:05:14,280 --> 04:05:18,360
and the event that red roll is 4, those two events are not independent.
4802
04:05:18,360 --> 04:05:23,200
If I know that the red roll is a 6, I know that the red roll cannot possibly
4803
04:05:23,200 --> 04:05:25,880
be a 4, so these things are not independent.
4804
04:05:25,880 --> 04:05:28,240
And instead, if I wanted to calculate the probability,
4805
04:05:28,240 --> 04:05:31,480
I would need to use this conditional probability
4806
04:05:31,480 --> 04:05:36,160
per the regular definition of the probability of two events taking place.
4807
04:05:36,160 --> 04:05:38,560
And the probability of this now, well, the probability
4808
04:05:38,560 --> 04:05:41,320
of the red roll being a 6, that's 1 over 6.
4809
04:05:41,320 --> 04:05:45,960
But what's the probability that the roll is a 4 given that the roll is a 6?
4810
04:05:45,960 --> 04:05:50,680
Well, this is just 0, because there's no way for the red roll to be a 4,
4811
04:05:50,680 --> 04:05:53,560
given that we already know the red roll is a 6.
4812
04:05:53,560 --> 04:05:59,320
And so if we do all that multiplication, we get the number 0.
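Continuing the sketch (hypothetical code), the same single red die shows why the naive product fails for dependent events:

```python
from fractions import Fraction

p_red_6 = Fraction(1, 6)
p_red_4 = Fraction(1, 6)

# One roll can't be both a 6 and a 4, so P(red=4 | red=6) = 0,
# and the true joint probability is P(red=6) * P(red=4 | red=6) = 0.
p_both = p_red_6 * Fraction(0)

# The naive product wrongly assumes independence.
naive = p_red_6 * p_red_4

print(p_both)  # 0
print(naive)   # 1/36
```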
4813
04:05:59,320 --> 04:06:02,520
So this idea of conditional probability is going to come up again and again,
4814
04:06:02,520 --> 04:06:06,400
especially as we begin to reason about multiple different random variables
4815
04:06:06,400 --> 04:06:08,760
that might be interacting with each other in some way.
4816
04:06:08,760 --> 04:06:10,880
And this gets us to one of the most important rules
4817
04:06:10,880 --> 04:06:14,400
in probability theory, which is known as Bayes rule.
4818
04:06:14,400 --> 04:06:17,000
And it turns out that just using the information we've already
4819
04:06:17,000 --> 04:06:20,440
learned about probability and just applying a little bit of algebra,
4820
04:06:20,440 --> 04:06:23,480
we can actually derive Bayes rule for ourselves.
4821
04:06:23,480 --> 04:06:26,200
But it's a very important rule when it comes to inference
4822
04:06:26,200 --> 04:06:28,640
and thinking about probability in the context of what
4823
04:06:28,640 --> 04:06:31,200
it is that a computer can do or what a mathematician could
4824
04:06:31,200 --> 04:06:34,920
do by having access to information about probability.
4825
04:06:34,920 --> 04:06:39,400
So let's go back to these equations to be able to derive Bayes rule ourselves.
4826
04:06:39,400 --> 04:06:43,800
We know the probability of A and B, the likelihood that A and B take place,
4827
04:06:43,800 --> 04:06:47,240
is the likelihood of B, and then the likelihood of A,
4828
04:06:47,240 --> 04:06:49,680
given that we know that B is already true.
4829
04:06:49,680 --> 04:06:52,800
And likewise, the probability of A and B
4830
04:06:52,800 --> 04:06:56,240
is the probability of A times the probability of B,
4831
04:06:56,240 --> 04:06:58,280
given that we know that A is already true.
4832
04:06:58,280 --> 04:07:00,280
This is sort of a symmetric relationship where
4833
04:07:00,280 --> 04:07:04,000
it doesn't matter the order of A and B and B and A mean the same thing.
4834
04:07:04,000 --> 04:07:07,520
And so in these equations, we can just swap out A and B
4835
04:07:07,520 --> 04:07:09,720
to be able to represent the exact same idea.
4836
04:07:09,720 --> 04:07:12,200
So we know that these two equations are already true.
4837
04:07:12,200 --> 04:07:13,480
We've seen that already.
4838
04:07:13,480 --> 04:07:17,000
And now let's just do a little bit of algebraic manipulation of this stuff.
4839
04:07:17,000 --> 04:07:19,800
Both of these expressions on the right-hand side
4840
04:07:19,800 --> 04:07:24,040
are equal to the probability of A and B. So what I can do
4841
04:07:24,040 --> 04:07:26,600
is take these two expressions on the right-hand side
4842
04:07:26,600 --> 04:07:28,760
and just set them equal to each other.
4843
04:07:28,760 --> 04:07:32,480
If they're both equal to the probability of A and B,
4844
04:07:32,480 --> 04:07:34,600
then they both must be equal to each other.
4845
04:07:34,600 --> 04:07:38,400
So probability of A times probability of B given A
4846
04:07:38,400 --> 04:07:44,360
is equal to the probability of B times the probability of A given B.
4847
04:07:44,360 --> 04:07:47,480
And now all we're going to do is do a little bit of division.
4848
04:07:47,480 --> 04:07:53,480
I'm going to divide both sides by P of A. And now I get what is Bayes' rule.
4849
04:07:53,480 --> 04:07:59,000
The probability of B given A is equal to the probability of B
4850
04:07:59,000 --> 04:08:03,120
times the probability of A given B divided by the probability of A.
4851
04:08:03,120 --> 04:08:05,040
And sometimes in Bayes' rule, you'll see the order
4852
04:08:05,040 --> 04:08:06,320
of these two arguments switched.
4853
04:08:06,320 --> 04:08:10,520
So instead of B times A given B, it'll be A given B times B.
4854
04:08:10,520 --> 04:08:12,940
That ultimately doesn't matter because in multiplication,
4855
04:08:12,940 --> 04:08:15,600
you can switch the order of the two things you're multiplying,
4856
04:08:15,600 --> 04:08:18,480
and it doesn't change the result. But this here right now
4857
04:08:18,480 --> 04:08:21,120
is the most common formulation of Bayes' rule.
4858
04:08:21,120 --> 04:08:26,240
The probability of B given A is equal to the probability of A given
4859
04:08:26,240 --> 04:08:31,200
B times the probability of B divided by the probability of A.
4860
04:08:31,200 --> 04:08:33,640
And this rule, it turns out, is really important
4861
04:08:33,640 --> 04:08:36,280
when it comes to trying to infer things about the world,
4862
04:08:36,280 --> 04:08:39,720
because it means you can express one conditional probability,
4863
04:08:39,720 --> 04:08:44,000
the conditional probability of B given A, using knowledge
4864
04:08:44,000 --> 04:08:47,960
about the probability of A given B, using the reverse
4865
04:08:47,960 --> 04:08:49,680
of that conditional probability.
4866
04:08:49,680 --> 04:08:51,960
So let's first do a little bit of an example with this,
4867
04:08:51,960 --> 04:08:54,200
just to see how we might use it, and then explore
4868
04:08:54,200 --> 04:08:56,680
what this means a little bit more generally.
4869
04:08:56,680 --> 04:08:59,840
So we're going to construct a situation where I have some information.
4870
04:08:59,840 --> 04:09:02,400
There are two events that I care about, the idea
4871
04:09:02,400 --> 04:09:05,240
that it's cloudy in the morning and the idea
4872
04:09:05,240 --> 04:09:07,600
that it is rainy in the afternoon.
4873
04:09:07,600 --> 04:09:10,240
Those are two different possible events that could take place,
4874
04:09:10,240 --> 04:09:13,680
cloudy in the morning, or the AM, rainy in the PM.
4875
04:09:13,680 --> 04:09:17,160
And what I care about is, given clouds in the morning,
4876
04:09:17,160 --> 04:09:19,840
what is the probability of rain in the afternoon?
4877
04:09:19,840 --> 04:09:22,040
A reasonable question I might ask, in the morning,
4878
04:09:22,040 --> 04:09:24,840
I look outside, or an AI's camera looks outside
4879
04:09:24,840 --> 04:09:27,480
and sees that there are clouds in the morning.
4880
04:09:27,480 --> 04:09:30,880
And we want to conclude, we want to figure out what is the probability
4881
04:09:30,880 --> 04:09:34,000
that in the afternoon, there is going to be rain.
4882
04:09:34,000 --> 04:09:36,080
Of course, in the abstract, we don't have access
4883
04:09:36,080 --> 04:09:38,600
to this kind of information, but we can use data
4884
04:09:38,600 --> 04:09:40,400
to begin to try and figure this out.
4885
04:09:40,400 --> 04:09:44,680
So let's imagine now that I have access to some pieces of information.
4886
04:09:44,680 --> 04:09:48,440
I have access to the idea that 80% of rainy afternoons
4887
04:09:48,440 --> 04:09:50,400
start out with a cloudy morning.
4888
04:09:50,400 --> 04:09:52,920
And you might imagine that I could have gathered this data just
4889
04:09:52,920 --> 04:09:54,640
by looking at data over a sequence of time,
4890
04:09:54,640 --> 04:09:58,360
that I know that 80% of the time when it's raining in the afternoon,
4891
04:09:58,360 --> 04:10:01,360
it was cloudy that morning.
4892
04:10:01,360 --> 04:10:04,760
I also know that 40% of days have cloudy mornings.
4893
04:10:04,760 --> 04:10:08,680
And I also know that 10% of days have rainy afternoons.
4894
04:10:08,680 --> 04:10:12,280
And now using this information, I would like to figure out,
4895
04:10:12,280 --> 04:10:15,320
given clouds in the morning, what is the probability
4896
04:10:15,320 --> 04:10:16,720
that it rains in the afternoon?
4897
04:10:16,720 --> 04:10:21,200
I want to know the probability of afternoon rain given morning clouds.
4898
04:10:21,200 --> 04:10:26,160
And I can do that, in particular, using this fact, the probability of,
4899
04:10:26,160 --> 04:10:29,880
so if I know that 80% of rainy afternoons start with cloudy mornings,
4900
04:10:29,880 --> 04:10:34,040
then I know the probability of cloudy mornings given rainy afternoons.
4901
04:10:34,040 --> 04:10:36,760
So using sort of the reverse conditional probability,
4902
04:10:36,760 --> 04:10:38,080
I can figure that out.
4903
04:10:38,080 --> 04:10:41,160
Expressed in terms of Bayes rule, this is what that would look like.
4904
04:10:41,160 --> 04:10:46,520
Probability of rain given clouds is the probability of clouds given rain
4905
04:10:46,520 --> 04:10:50,000
times the probability of rain divided by the probability of clouds.
4906
04:10:50,000 --> 04:10:53,160
Here I'm just substituting in for the values of a and b
4907
04:10:53,160 --> 04:10:55,280
from that equation of Bayes rule from before.
4908
04:10:55,280 --> 04:10:56,320
And then I can just do the math.
4909
04:10:56,320 --> 04:10:57,400
I have this information.
4910
04:10:57,400 --> 04:11:00,880
I know that 80% of the time, if it was raining,
4911
04:11:00,880 --> 04:11:01,960
then there were clouds in the morning.
4912
04:11:01,960 --> 04:11:03,240
So 0.8 here.
4913
04:11:03,240 --> 04:11:06,640
Probability of rain is 0.1, because 10% of days were rainy,
4914
04:11:06,640 --> 04:11:08,480
and 40% of days were cloudy.
4915
04:11:08,480 --> 04:11:11,560
I do the math, and I can figure out the answer is 0.2.
4916
04:11:11,560 --> 04:11:14,440
So the probability that it rains in the afternoon,
4917
04:11:14,440 --> 04:11:19,720
given that it was cloudy in the morning, is 0.2 in this case.
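The arithmetic above can be sketched as a small Python helper (a hypothetical function, not from the course's own code):

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(B | A) = P(A | B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# From the example: P(clouds | rain) = 0.8, P(rain) = 0.1, P(clouds) = 0.4.
p_rain_given_clouds = bayes(p_a_given_b=0.8, p_b=0.1, p_a=0.4)
print(round(p_rain_given_clouds, 4))  # 0.2
```

The same helper would apply to the later examples, such as inferring the probability of a disease from a test result, by swapping in the appropriate probabilities.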
4918
04:11:19,720 --> 04:11:22,120
And this now is an application of Bayes rule,
4919
04:11:22,120 --> 04:11:24,760
the idea that using one conditional probability,
4920
04:11:24,760 --> 04:11:27,720
we can get the reverse conditional probability.
4921
04:11:27,720 --> 04:11:31,040
And this is often useful when one of the conditional probabilities
4922
04:11:31,040 --> 04:11:34,840
might be easier for us to know about or easier for us to have data about.
4923
04:11:34,840 --> 04:11:37,520
And using that information, we can calculate
4924
04:11:37,520 --> 04:11:39,360
the other conditional probability.
4925
04:11:39,360 --> 04:11:40,640
So what does this look like?
4926
04:11:40,640 --> 04:11:43,720
Well, it means that knowing the probability of cloudy mornings
4927
04:11:43,720 --> 04:11:47,200
given rainy afternoons, we can calculate the probability
4928
04:11:47,200 --> 04:11:50,120
of rainy afternoons given cloudy mornings.
4929
04:11:50,120 --> 04:11:54,320
Or, for example, more generally, if we know the probability
4930
04:11:54,320 --> 04:11:58,480
of some visible effect, some effect that we can see and observe,
4931
04:11:58,480 --> 04:12:02,040
given some unknown cause that we're not sure about,
4932
04:12:02,040 --> 04:12:05,760
well, then we can calculate the probability of that unknown cause
4933
04:12:05,760 --> 04:12:08,440
given the visible effect.
4934
04:12:08,440 --> 04:12:10,080
So what might that look like?
4935
04:12:10,080 --> 04:12:12,200
Well, in the context of medicine, for example,
4936
04:12:12,200 --> 04:12:17,080
I might know the probability of some medical test result given a disease.
4937
04:12:17,080 --> 04:12:19,400
Like, I know that if someone has a disease,
4938
04:12:19,400 --> 04:12:23,040
then x% of the time the medical test result will show up as this,
4939
04:12:23,040 --> 04:12:24,000
for instance.
4940
04:12:24,000 --> 04:12:26,760
And using that information, then I can calculate, all right,
4941
04:12:26,760 --> 04:12:31,040
what is the probability that given I know the medical test result, what
4942
04:12:31,040 --> 04:12:33,120
is the likelihood that someone has the disease?
4943
04:12:33,120 --> 04:12:36,280
This is the piece of information that is usually easier to know,
4944
04:12:36,280 --> 04:12:38,760
easier to immediately have access to data for.
4945
04:12:38,760 --> 04:12:42,320
And this is the information that I actually want to calculate.
4946
04:12:42,320 --> 04:12:44,080
Or I might want to know, for example, if I
4947
04:12:44,080 --> 04:12:48,040
know that some probability of counterfeit bills
4948
04:12:48,040 --> 04:12:51,440
have blurry text around the edges, because counterfeit printers aren't
4949
04:12:51,440 --> 04:12:53,560
nearly as good at printing text precisely.
4950
04:12:53,560 --> 04:12:56,000
So I have some information about, given that something
4951
04:12:56,000 --> 04:12:59,160
is a counterfeit bill, like x% of counterfeit bills
4952
04:12:59,160 --> 04:13:01,120
have blurry text, for example.
4953
04:13:01,120 --> 04:13:04,480
And using that information, then I can calculate some piece of information
4954
04:13:04,480 --> 04:13:08,160
that I might want to know, like, given that I know there's blurry text
4955
04:13:08,160 --> 04:13:12,200
on a bill, what is the probability that that bill is counterfeit?
4956
04:13:12,200 --> 04:13:14,600
So given one conditional probability, I can
4957
04:13:14,600 --> 04:13:19,320
calculate the other conditional probability as well.
4958
04:13:19,320 --> 04:13:22,640
And so now we've taken a look at a couple of different types of probability.
4959
04:13:22,640 --> 04:13:24,840
And we've looked at unconditional probability,
4960
04:13:24,840 --> 04:13:27,920
where I just look at what is the probability of this event occurring,
4961
04:13:27,920 --> 04:13:31,040
given no additional evidence that I might have access to.
4962
04:13:31,040 --> 04:13:33,560
And we've also looked at conditional probability,
4963
04:13:33,560 --> 04:13:35,400
where I have some sort of evidence, and I
4964
04:13:35,400 --> 04:13:38,760
would like to, using that evidence, be able to calculate some other
4965
04:13:38,760 --> 04:13:40,480
probability as well.
4966
04:13:40,480 --> 04:13:43,560
And the other kind of probability that will be important for us to think about
4967
04:13:43,560 --> 04:13:45,280
is joint probability.
4968
04:13:45,280 --> 04:13:47,440
And this is when we're considering the likelihood
4969
04:13:47,440 --> 04:13:50,800
of multiple different events simultaneously.
4970
04:13:50,800 --> 04:13:52,200
And so what do we mean by this?
4971
04:13:52,200 --> 04:13:55,320
For example, I might have probability distributions
4972
04:13:55,320 --> 04:13:56,880
that look a little something like this.
4973
04:13:56,880 --> 04:13:59,800
Like, oh, I want to know the probability distribution of clouds
4974
04:13:59,800 --> 04:14:00,640
in the morning.
4975
04:14:00,640 --> 04:14:02,400
And that distribution looks like this.
4976
04:14:02,400 --> 04:14:06,080
40% of the time, C, which is my random variable here,
4977
04:14:06,080 --> 04:14:07,680
is equal to cloudy.
4978
04:14:07,680 --> 04:14:10,560
And 60% of the time, it's not cloudy.
4979
04:14:10,560 --> 04:14:13,040
So here is just a simple probability distribution
4980
04:14:13,040 --> 04:14:17,320
that is effectively telling me that 40% of the time, it's cloudy.
4981
04:14:17,320 --> 04:14:20,800
I might also have a probability distribution for rain in the afternoon,
4982
04:14:20,800 --> 04:14:24,240
where 10% of the time, or with probability 0.1,
4983
04:14:24,240 --> 04:14:25,800
it is raining in the afternoon.
4984
04:14:25,800 --> 04:14:30,680
And with probability 0.9, it is not raining in the afternoon.
4985
04:14:30,680 --> 04:14:34,080
And using just these two pieces of information,
4986
04:14:34,080 --> 04:14:36,160
I don't actually have a whole lot of information
4987
04:14:36,160 --> 04:14:39,480
about how these two variables relate to each other.
4988
04:14:39,480 --> 04:14:42,520
But I could if I had access to their joint probability,
4989
04:14:42,520 --> 04:14:45,160
meaning for every combination of these two things,
4990
04:14:45,160 --> 04:14:49,200
meaning morning cloudy and afternoon rain, morning cloudy and afternoon not
4991
04:14:49,200 --> 04:14:52,120
rain, morning not cloudy and afternoon rain,
4992
04:14:52,120 --> 04:14:54,760
and morning not cloudy and afternoon not raining,
4993
04:14:54,760 --> 04:14:57,320
if I had access to values for each of those four,
4994
04:14:57,320 --> 04:14:58,800
I'd have more information.
4995
04:14:58,800 --> 04:15:02,040
So information that'd be organized in a table like this,
4996
04:15:02,040 --> 04:15:05,320
and this, rather than just a probability distribution,
4997
04:15:05,320 --> 04:15:07,600
is a joint probability distribution.
4998
04:15:07,600 --> 04:15:10,720
It tells me the probability distribution of each
4999
04:15:10,720 --> 04:15:15,800
of the possible combinations of values that these random variables can take on.
5000
04:15:15,800 --> 04:15:19,280
So if I want to know what is the probability that on any given day
5001
04:15:19,280 --> 04:15:22,440
it is both cloudy and rainy, well, I would say, all right,
5002
04:15:22,440 --> 04:15:26,520
we're looking at cases where it is cloudy and cases where it is raining.
5003
04:15:26,520 --> 04:15:30,960
And the intersection of those two, that row in that column, is 0.08.
5004
04:15:30,960 --> 04:15:35,160
So that is the probability that it is both cloudy and rainy using
5005
04:15:35,160 --> 04:15:36,720
that information.
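As a sketch, the joint table could be stored in Python. Only the 0.08 and 0.02 entries are quoted here; the other two values below (0.32 and 0.58) are assumptions filled in so the table stays consistent with the 40% cloudy and 10% rainy figures from before:

```python
# Hypothetical joint distribution over (clouds, rain). The 0.32 and
# 0.58 entries are inferred from P(cloud) = 0.4 and P(rain) = 0.1.
joint = {
    ("cloud", "rain"): 0.08,
    ("cloud", "no rain"): 0.32,
    ("no cloud", "rain"): 0.02,
    ("no cloud", "no rain"): 0.58,
}

# All four possible worlds together must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

print(joint[("cloud", "rain")])  # 0.08
```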
5006
04:15:36,720 --> 04:15:39,640
And using this joint probability table,
5007
04:15:39,640 --> 04:15:41,880
I can
5008
04:15:41,880 --> 04:15:46,200
begin to draw other pieces of information about things like conditional
5009
04:15:46,200 --> 04:15:47,000
probability.
5010
04:15:47,000 --> 04:15:51,520
So I might ask a question like, what is the probability distribution of clouds
5011
04:15:51,520 --> 04:15:53,800
given that I know that it is raining?
5012
04:15:53,800 --> 04:15:56,280
Meaning I know for sure that it's raining.
5013
04:15:56,280 --> 04:15:59,800
Tell me the probability distribution over whether it's cloudy or not,
5014
04:15:59,800 --> 04:16:02,320
given that I know already that it is, in fact, raining.
5015
04:16:02,320 --> 04:16:05,080
And here I'm using C to stand for that random variable.
5016
04:16:05,080 --> 04:16:07,640
I'm looking for a distribution, meaning the answer to this
5017
04:16:07,640 --> 04:16:09,480
is not going to be a single value.
5018
04:16:09,480 --> 04:16:12,080
It's going to be two values, a vector of two values,
5019
04:16:12,080 --> 04:16:14,800
where the first value is probability of clouds,
5020
04:16:14,800 --> 04:16:17,600
the second value is probability that it is not cloudy,
5021
04:16:17,600 --> 04:16:19,880
but the sum of those two values is going to be 1.
5022
04:16:19,880 --> 04:16:23,280
Because when you add up the probabilities of all of the possible worlds,
5023
04:16:23,280 --> 04:16:26,840
the result that you get must be the number 1.
5024
04:16:26,840 --> 04:16:30,360
And well, what do we know about how to calculate a conditional probability?
5025
04:16:30,360 --> 04:16:33,600
Well, we know that the probability of A given B
5026
04:16:33,600 --> 04:16:38,960
is the probability of A and B divided by the probability of B.
5027
04:16:38,960 --> 04:16:40,280
So what does this mean?
5028
04:16:40,280 --> 04:16:43,240
Well, it means that I can calculate the probability of clouds
5029
04:16:43,240 --> 04:16:49,080
given that it's raining as the probability of clouds and raining
5030
04:16:49,080 --> 04:16:50,880
divided by the probability of rain.
5031
04:16:50,880 --> 04:16:53,640
And this comma here for the probability distribution
5032
04:16:53,640 --> 04:16:57,320
of clouds and rain, this comma sort of stands in for the word and.
5033
04:16:57,320 --> 04:16:59,920
You'll sometimes see the logical operator "and" and the comma
5034
04:16:59,920 --> 04:17:01,120
used interchangeably.
5035
04:17:01,120 --> 04:17:04,200
This means the probability distribution over the clouds
5036
04:17:04,200 --> 04:17:06,680
and knowing the fact that it is raining divided
5037
04:17:06,680 --> 04:17:09,160
by the probability of rain.
5038
04:17:09,160 --> 04:17:11,080
And the interesting thing to note here and what
5039
04:17:11,080 --> 04:17:13,640
we'll often do in order to simplify our mathematics
5040
04:17:13,640 --> 04:17:16,760
is that dividing by the probability of rain,
5041
04:17:16,760 --> 04:17:19,760
the probability of rain here is just some numerical constant.
5042
04:17:19,760 --> 04:17:20,560
It is some number.
5043
04:17:20,560 --> 04:17:24,480
Dividing by probability of rain is just dividing by some constant,
5044
04:17:24,480 --> 04:17:27,760
or in other words, multiplying by the inverse of that constant.
5045
04:17:27,760 --> 04:17:30,480
And it turns out that oftentimes we can just not
5046
04:17:30,480 --> 04:17:32,880
worry about what the exact value of this is
5047
04:17:32,880 --> 04:17:36,040
and just know that it is, in fact, a constant value.
5048
04:17:36,040 --> 04:17:37,280
And we'll see why in a moment.
5049
04:17:37,280 --> 04:17:41,040
So instead of expressing this as this joint probability divided
5050
04:17:41,040 --> 04:17:43,040
by the probability of rain, sometimes we'll
5051
04:17:43,040 --> 04:17:47,240
just represent it as alpha times the numerator here,
5052
04:17:47,240 --> 04:17:50,480
the probability distribution of C, this variable,
5053
04:17:50,480 --> 04:17:53,000
and that we know that it is raining, for instance.
5054
04:17:53,000 --> 04:17:57,920
So all we've done here is noted that this value, 1 over the probability of rain,
5055
04:17:57,920 --> 04:18:00,720
is really just a constant we're going to
5056
04:18:00,720 --> 04:18:02,800
multiply by at the end.
5057
04:18:02,800 --> 04:18:06,400
We'll just call it alpha for now and deal with it a little bit later.
5058
04:18:06,400 --> 04:18:09,800
But the key idea here now, and this is an idea that's going to come up again,
5059
04:18:09,800 --> 04:18:14,040
is that the conditional distribution of C given rain
5060
04:18:14,040 --> 04:18:17,120
is proportional to, meaning just some factor multiplied
5061
04:18:17,120 --> 04:18:22,200
by the joint probability of C and rain being true.
5062
04:18:22,200 --> 04:18:23,560
And so how do we figure this out?
5063
04:18:23,560 --> 04:18:25,760
Well, this is going to be the probability that it
5064
04:18:25,760 --> 04:18:28,440
is both cloudy and raining, which is 0.08,
5065
04:18:28,440 --> 04:18:30,680
and the probability that it's not cloudy
5066
04:18:30,680 --> 04:18:32,960
and raining, which is 0.02.
5067
04:18:32,960 --> 04:18:37,680
And so we get alpha times that probability distribution here.
5068
04:18:37,680 --> 04:18:40,000
0.08 is clouds and rain.
5069
04:18:40,000 --> 04:18:43,840
0.02 is not cloudy and rain.
5070
04:18:43,840 --> 04:18:47,920
But of course, 0.08 and 0.02 don't sum up to the number 1.
5071
04:18:47,920 --> 04:18:50,400
And we know that in a probability distribution,
5072
04:18:50,400 --> 04:18:52,680
if you consider all of the possible values,
5073
04:18:52,680 --> 04:18:55,360
they must sum up to a probability of 1.
5074
04:18:55,360 --> 04:18:57,600
And so we know that we just need to figure out
5075
04:18:57,600 --> 04:19:01,720
some constant to normalize, so to speak, these values, something
5076
04:19:01,720 --> 04:19:05,480
we can multiply or divide by to get it so that all these probabilities sum up
5077
04:19:05,480 --> 04:19:08,920
to 1, and it turns out that if we multiply both numbers by 10,
5078
04:19:08,920 --> 04:19:11,920
then we can get that result of 0.8 and 0.2.
5079
04:19:11,920 --> 04:19:15,640
The proportions are still equivalent, but now 0.8 plus 0.2,
5080
04:19:15,640 --> 04:19:18,280
those sum up to the number 1.
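The normalization step just described can be sketched in Python. This is a minimal illustration using the 0.08 and 0.02 values from the example; the variable names are my own:

```python
# Values proportional to P(C | rain): the joint probabilities P(C, rain)
# for each value of the cloudy variable C.
unnormalized = {"cloudy": 0.08, "not cloudy": 0.02}

# Alpha is 1 divided by the sum of the unnormalized values, so that
# after multiplying by it, the probabilities sum to 1.
alpha = 1 / sum(unnormalized.values())

distribution = {value: alpha * p for value, p in unnormalized.items()}

print(distribution)  # approximately {'cloudy': 0.8, 'not cloudy': 0.2}
```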
5081
04:19:18,280 --> 04:19:21,400
So take a look at this and see if you can understand step by step
5082
04:19:21,400 --> 04:19:23,600
how it is we're getting from one point to another.
5083
04:19:23,600 --> 04:19:27,840
The key idea here is that by using the joint probabilities,
5084
04:19:27,840 --> 04:19:31,360
these probabilities that it is both cloudy and rainy
5085
04:19:31,360 --> 04:19:35,240
and that it is not cloudy and rainy, I can take that information
5086
04:19:35,240 --> 04:19:39,440
and figure out the conditional probability given that it's raining.
5087
04:19:39,440 --> 04:19:41,960
What is the chance that it's cloudy versus not cloudy?
5088
04:19:41,960 --> 04:19:46,320
Just by multiplying by some normalization constant, so to speak.
5089
04:19:46,320 --> 04:19:48,520
And this is what a computer can begin to use
5090
04:19:48,520 --> 04:19:52,880
to be able to interact with these various different types of probabilities.
5091
04:19:52,880 --> 04:19:55,420
And it turns out there are a number of other probability rules
5092
04:19:55,420 --> 04:19:57,440
that are going to be useful to us as we begin
5093
04:19:57,440 --> 04:20:01,200
to explore how we can actually use this information to encode
5094
04:20:01,200 --> 04:20:05,640
into our computers some more complex analysis that we might want to do
5095
04:20:05,640 --> 04:20:08,840
about probability and distributions and random variables
5096
04:20:08,840 --> 04:20:10,440
that we might be interacting with.
5097
04:20:10,440 --> 04:20:12,840
So here are a couple of those important probability rules.
5098
04:20:12,840 --> 04:20:15,480
One of the simplest rules is just this negation rule.
5099
04:20:15,480 --> 04:20:19,080
What is the probability of not event A?
5100
04:20:19,080 --> 04:20:21,600
So A is an event that has some probability,
5101
04:20:21,600 --> 04:20:25,480
and I would like to know what is the probability that A does not occur.
5102
04:20:25,480 --> 04:20:29,980
And it turns out it's just 1 minus P of A, which makes sense.
5103
04:20:29,980 --> 04:20:33,720
Because if those are the two possible cases, either A happens or A
5104
04:20:33,720 --> 04:20:37,600
doesn't happen, then when you add up those two cases, you must get 1,
5105
04:20:37,600 --> 04:20:42,600
which means that P of not A must just be 1 minus P of A.
5106
04:20:42,600 --> 04:20:46,560
Because P of A and P of not A must sum up to the number 1.
5107
04:20:46,560 --> 04:20:49,680
They must include all of the possible cases.
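As a quick sanity check of the negation rule, here is a sketch that verifies P(not A) = 1 - P(A) by enumerating the faces of a fair die (my own example, not from the lecture):

```python
from fractions import Fraction

# P(A): probability of rolling a 6 on a fair die.
p_a = Fraction(1, 6)

# Negation rule: P(not A) = 1 - P(A).
p_not_a = 1 - p_a

# Verify by enumeration: 5 of the 6 faces are not a 6.
enumerated = Fraction(sum(1 for face in range(1, 7) if face != 6), 6)

assert p_not_a == enumerated == Fraction(5, 6)
```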
5108
04:20:49,680 --> 04:20:53,640
We've seen an expression for calculating the probability of A and B.
5109
04:20:53,640 --> 04:20:57,840
We might also reasonably want to calculate the probability of A or B.
5110
04:20:57,840 --> 04:21:01,200
What is the probability that one thing happens or another thing happens?
5111
04:21:01,200 --> 04:21:04,520
So for example, I might want to calculate what is the probability
5112
04:21:04,520 --> 04:21:07,880
that if I roll two dice, a red die and a blue die, what is the likelihood
5113
04:21:07,880 --> 04:21:11,480
that A is a 6 or B is a 6, like one or the other?
5114
04:21:11,480 --> 04:21:14,480
And what you might imagine you could do, and the wrong way to approach it,
5115
04:21:14,480 --> 04:21:19,000
would be just to say, all right, well, A, the red die,
5116
04:21:19,000 --> 04:21:21,560
comes up as a 6 with probability 1 over 6.
5117
04:21:21,560 --> 04:21:23,720
The same for the blue die, it's also 1 over 6.
5118
04:21:23,720 --> 04:21:27,160
Add them together, and you get 2 over 6, otherwise known as 1 third.
5119
04:21:27,160 --> 04:21:30,480
But this suffers from a problem of overcounting,
5120
04:21:30,480 --> 04:21:34,560
that we've double counted the case, where both A and B, both the red die
5121
04:21:34,560 --> 04:21:37,320
and the blue die, come up as a 6 on the same roll.
5122
04:21:37,320 --> 04:21:39,440
And I've counted that instance twice.
5123
04:21:39,440 --> 04:21:43,880
So to resolve this, the actual expression for calculating the probability of A
5124
04:21:43,880 --> 04:21:47,720
or B uses what we call the inclusion-exclusion formula.
5125
04:21:47,720 --> 04:21:51,120
So I take the probability of A, add it to the probability of B.
5126
04:21:51,120 --> 04:21:52,520
That's all same as before.
5127
04:21:52,520 --> 04:21:56,080
But then I need to exclude the cases that I've double counted.
5128
04:21:56,080 --> 04:22:01,240
So I subtract from that the probability of A and B.
5129
04:22:01,240 --> 04:22:05,160
And that gets me the result for A or B. I consider all the cases where A is true
5130
04:22:05,160 --> 04:22:07,000
and all the cases where B is true.
5131
04:22:07,000 --> 04:22:09,920
And if you imagine this is like a Venn diagram of cases where A is true,
5132
04:22:09,920 --> 04:22:12,800
cases where B is true, I just need to subtract out the middle
5133
04:22:12,800 --> 04:22:16,720
to get rid of the cases that I have overcounted by double counting them
5134
04:22:16,720 --> 04:22:21,160
inside of both of these individual expressions.
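The inclusion-exclusion formula from the two-dice example can be checked directly by enumerating all 36 outcomes; a short sketch (the setup mirrors the red-die/blue-die example above):

```python
from fractions import Fraction
from itertools import product

# P(A): red die is a 6; P(B): blue die is a 6; P(A and B): both are 6s.
p_a = Fraction(1, 6)
p_b = Fraction(1, 6)
p_a_and_b = Fraction(1, 36)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B),
# subtracting out the double-counted case where both dice show a 6.
p_a_or_b = p_a + p_b - p_a_and_b

# Verify by enumerating every ordered (red, blue) outcome.
outcomes = list(product(range(1, 7), repeat=2))
favorable = sum(1 for red, blue in outcomes if red == 6 or blue == 6)

assert p_a_or_b == Fraction(favorable, len(outcomes)) == Fraction(11, 36)
```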
5135
04:22:21,160 --> 04:22:23,160
One other rule that's going to be quite helpful
5136
04:22:23,160 --> 04:22:25,400
is a rule called marginalization.
5137
04:22:25,400 --> 04:22:27,520
So marginalization is answering the question
5138
04:22:27,520 --> 04:22:31,760
of how do I figure out the probability of A using some other variable
5139
04:22:31,760 --> 04:22:33,600
that I might have access to, like B?
5140
04:22:33,600 --> 04:22:35,840
Even if I don't know additional information about it,
5141
04:22:35,840 --> 04:22:40,320
I know that B, some event, can have two possible states, either B
5142
04:22:40,320 --> 04:22:44,720
happens or B doesn't happen, assuming it's a Boolean, true or false.
5143
04:22:44,720 --> 04:22:47,160
And well, what that means is that for me to be
5144
04:22:47,160 --> 04:22:50,760
able to calculate the probability of A, there are only two cases.
5145
04:22:50,760 --> 04:22:55,560
Either A happens and B happens, or A happens and B doesn't happen.
5146
04:22:55,560 --> 04:22:58,840
And those are two disjoint cases, meaning they can't both happen together.
5147
04:22:58,840 --> 04:23:01,160
Either B happens or B doesn't happen.
5148
04:23:01,160 --> 04:23:03,280
They're disjoint or separate cases.
5149
04:23:03,280 --> 04:23:05,680
And so I can figure out the probability of A
5150
04:23:05,680 --> 04:23:07,800
just by adding up those two cases.
5151
04:23:07,800 --> 04:23:13,360
The probability that A is true is the probability that A and B is true,
5152
04:23:13,360 --> 04:23:16,520
plus the probability that A is true and B isn't true.
5153
04:23:16,520 --> 04:23:19,880
So by marginalizing, I've looked at the two possible cases
5154
04:23:19,880 --> 04:23:23,600
that might take place, either B happens or B doesn't happen.
5155
04:23:23,600 --> 04:23:25,560
And in either of those cases, I look at what's
5156
04:23:25,560 --> 04:23:27,240
the probability that A happens.
5157
04:23:27,240 --> 04:23:30,080
And if I add those together, well, then I get the probability
5158
04:23:30,080 --> 04:23:32,360
that A happens as a whole.
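The boolean form of marginalization is just a sum of two disjoint cases; a minimal sketch with made-up joint probabilities (the numbers here are hypothetical, chosen only to illustrate the rule):

```python
# Hypothetical joint probabilities (made-up numbers for illustration).
p_a_and_b = 0.25        # P(A, B)
p_a_and_not_b = 0.125   # P(A, not B)

# Marginalization: the two cases are disjoint -- either B happens or it
# doesn't -- so P(A) is just their sum.
p_a = p_a_and_b + p_a_and_not_b

print(p_a)  # 0.375
```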
5159
04:23:32,360 --> 04:23:33,640
So take a look at that rule.
5160
04:23:33,640 --> 04:23:36,760
It doesn't matter what B is or how it's related to A.
5161
04:23:36,760 --> 04:23:39,200
So long as I know these joint distributions,
5162
04:23:39,200 --> 04:23:42,120
I can figure out the overall probability of A.
5163
04:23:42,120 --> 04:23:44,760
And this can be a useful way if I have a joint distribution,
5164
04:23:44,760 --> 04:23:48,200
like the joint distribution of A and B, to just figure out
5165
04:23:48,200 --> 04:23:51,320
some unconditional probability, like the probability of A.
5166
04:23:51,320 --> 04:23:54,160
And we'll see examples of this soon as well.
5167
04:23:54,160 --> 04:23:55,920
Now, sometimes these might not just be random,
5168
04:23:55,920 --> 04:23:58,680
might not just be variables that are events that are like they happened
5169
04:23:58,680 --> 04:24:00,800
or they didn't happen, like B is here.
5170
04:24:00,800 --> 04:24:03,320
They might be some broader probability distribution
5171
04:24:03,320 --> 04:24:05,520
where there are multiple possible values.
5172
04:24:05,520 --> 04:24:08,360
And so here, in order to use this marginalization rule,
5173
04:24:08,360 --> 04:24:11,720
I need to sum up not just over B and not B,
5174
04:24:11,720 --> 04:24:15,760
but for all of the possible values that the other random variable could take
5175
04:24:15,760 --> 04:24:16,320
on.
5176
04:24:16,320 --> 04:24:19,000
And so here, we'll see a version of this rule for random variables.
5177
04:24:19,000 --> 04:24:21,280
And it's going to include that summation notation
5178
04:24:21,280 --> 04:24:25,800
to indicate that I'm summing up, adding up a whole bunch of individual values.
5179
04:24:25,800 --> 04:24:26,800
So here's the rule.
5180
04:24:26,800 --> 04:24:28,760
Looks a lot more complicated, but it's actually
5181
04:24:28,760 --> 04:24:30,960
exactly the same rule.
5182
04:24:30,960 --> 04:24:35,120
What I'm saying here is that if I have two random variables, one called x
5183
04:24:35,120 --> 04:24:41,000
and one called y, well, the probability that x is equal to some value x sub i,
5184
04:24:41,000 --> 04:24:43,800
this is just some value that this variable takes on.
5185
04:24:43,800 --> 04:24:45,120
How do I figure it out?
5186
04:24:45,120 --> 04:24:48,720
Well, I'm going to sum up over j, where j is going
5187
04:24:48,720 --> 04:24:53,000
to range over all of the possible values that y can take on.
5188
04:24:53,000 --> 04:24:58,240
Well, let's look at the probability that x equals xi and y equals yj.
5189
04:24:58,240 --> 04:25:00,240
So the exact same rule, the only difference here
5190
04:25:00,240 --> 04:25:03,000
is now I'm summing up over all of the possible values
5191
04:25:03,000 --> 04:25:06,960
that y can take on, saying let's add up all of those possible cases
5192
04:25:06,960 --> 04:25:10,760
and look at this joint distribution, this joint probability,
5193
04:25:10,760 --> 04:25:15,640
that x takes on the value I care about, given all of the possible values for y.
5194
04:25:15,640 --> 04:25:18,560
And if I add all those up, then I can get
5195
04:25:18,560 --> 04:25:22,360
this unconditional probability of what x is equal to,
5196
04:25:22,360 --> 04:25:26,080
the probability that x is equal to some value x sub i.
5197
04:25:26,080 --> 04:25:27,880
So let's take a look at this rule, because it
5198
04:25:27,880 --> 04:25:29,000
does look a little bit complicated.
5199
04:25:29,000 --> 04:25:31,280
Let's try and put a concrete example to it.
5200
04:25:31,280 --> 04:25:34,080
Here again is that same joint distribution from before.
5201
04:25:34,080 --> 04:25:38,120
I have cloudy, not cloudy, rainy, not rainy.
5202
04:25:38,120 --> 04:25:40,480
And maybe I want to access some variable.
5203
04:25:40,480 --> 04:25:44,520
I want to know what is the probability that it is cloudy.
5204
04:25:44,520 --> 04:25:48,120
Well, marginalization says that if I have this joint distribution
5205
04:25:48,120 --> 04:25:51,600
and I want to know what is the probability that it is cloudy,
5206
04:25:51,600 --> 04:25:55,320
well, I need to consider the other variable, the variable that's not here,
5207
04:25:55,320 --> 04:25:56,720
the idea that it's rainy.
5208
04:25:56,720 --> 04:26:00,440
And I consider the two cases, either it's raining or it's not raining.
5209
04:26:00,440 --> 04:26:04,000
And I just sum up the values for each of those possibilities.
5210
04:26:04,000 --> 04:26:07,040
In other words, the probability that it is cloudy
5211
04:26:07,040 --> 04:26:12,320
is equal to the sum of the probability that it's cloudy and it's rainy
5212
04:26:12,320 --> 04:26:17,720
and the probability that it's cloudy and it is not raining.
5213
04:26:17,720 --> 04:26:20,080
And so these now are values that I have access to.
5214
04:26:20,080 --> 04:26:24,480
These are values that are just inside of this joint probability table.
5215
04:26:24,480 --> 04:26:27,600
What is the probability that it is both cloudy and rainy?
5216
04:26:27,600 --> 04:26:31,000
Well, it's just the intersection of these two here, which is 0.08.
5217
04:26:31,000 --> 04:26:34,240
And the probability that it's cloudy and not raining is, all right,
5218
04:26:34,240 --> 04:26:36,120
here's cloudy, here's not raining.
5219
04:26:36,120 --> 04:26:37,640
It's 0.32.
5220
04:26:37,640 --> 04:26:42,240
So it's 0.08 plus 0.32, which just gives us equal to 0.4.
5221
04:26:42,240 --> 04:26:46,560
That is the unconditional probability that it is, in fact, cloudy.
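The cloudy calculation can be written as a loop over every value of the other variable, matching the summation form of the rule. The 0.08, 0.02, and 0.32 entries come from the table above; the not-cloudy/not-raining entry, 0.58, isn't quoted in this passage and is simply whatever makes the table sum to 1:

```python
# Joint distribution over (cloudy?, raining?).
joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,  # implied by the table summing to 1
}

def marginalize(joint, value):
    """P(X = value): sum the joint probability over every value of the other variable."""
    return sum(p for (x, _), p in joint.items() if x == value)

p_cloudy = marginalize(joint, "cloudy")
print(round(p_cloudy, 2))  # 0.4
```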
5222
04:26:46,560 --> 04:26:50,800
And so marginalization gives us a way to go from these joint distributions
5223
04:26:50,800 --> 04:26:53,960
to just some individual probability that I might care about.
5224
04:26:53,960 --> 04:26:56,680
And you'll see a little bit later why it is that we care about that
5225
04:26:56,680 --> 04:26:59,280
and why that's actually useful to us as we begin
5226
04:26:59,280 --> 04:27:01,860
doing some of these calculations.
5227
04:27:01,860 --> 04:27:04,020
Last rule we'll take a look at before transitioning
5228
04:27:04,020 --> 04:27:06,840
to something a little bit different is this rule of conditioning,
5229
04:27:06,840 --> 04:27:09,760
very similar to the marginalization rule.
5230
04:27:09,760 --> 04:27:12,240
But it says that, again, if I have two events, a and b,
5231
04:27:12,240 --> 04:27:15,440
but instead of having access to their joint probabilities,
5232
04:27:15,440 --> 04:27:17,820
I have access to their conditional probabilities,
5233
04:27:17,820 --> 04:27:19,520
how they relate to each other.
5234
04:27:19,520 --> 04:27:22,960
Well, again, if I want to know the probability that a happens,
5235
04:27:22,960 --> 04:27:26,480
and I know that there's some other variable b, either b happens or b
5236
04:27:26,480 --> 04:27:30,320
doesn't happen, and so I can say that the probability of a
5237
04:27:30,320 --> 04:27:35,840
is the probability of a given b times the probability of b, meaning b happened.
5238
04:27:35,840 --> 04:27:39,080
And given that I know b happened, what's the likelihood that a happened?
5239
04:27:39,080 --> 04:27:42,200
And then I consider the other case, that b didn't happen.
5240
04:27:42,200 --> 04:27:44,960
So here's the probability that b didn't happen.
5241
04:27:44,960 --> 04:27:47,160
And here's the probability that a happens,
5242
04:27:47,160 --> 04:27:49,520
given that I know that b didn't happen.
5243
04:27:49,520 --> 04:27:51,580
And this is really the equivalent rule just
5244
04:27:51,580 --> 04:27:55,280
using conditional probability instead of joint probability,
5245
04:27:55,280 --> 04:27:59,440
where I'm saying let's look at both of these two cases and condition on b.
5246
04:27:59,440 --> 04:28:03,120
Look at the case where b happens, and look at the case where b doesn't happen,
5247
04:28:03,120 --> 04:28:06,200
and look at what probabilities I get as a result.
5248
04:28:06,200 --> 04:28:08,320
And just as in the case of marginalization,
5249
04:28:08,320 --> 04:28:10,520
where there was an equivalent rule for random variables
5250
04:28:10,520 --> 04:28:14,480
that could take on multiple possible values in a domain of possible values,
5251
04:28:14,480 --> 04:28:17,160
here, too, conditioning has the same equivalent rule.
5252
04:28:17,160 --> 04:28:19,640
Again, there's a summation to mean I'm summing over
5253
04:28:19,640 --> 04:28:23,720
all of the possible values that some random variable y could take on.
5254
04:28:23,720 --> 04:28:27,760
But if I want to know what is the probability that x takes on this value,
5255
04:28:27,760 --> 04:28:31,840
then I'm going to sum up over all the values j that y could take on,
5256
04:28:31,840 --> 04:28:35,800
and say, all right, what's the chance that y takes on that value yj?
5257
04:28:35,800 --> 04:28:38,360
And multiply it by the conditional probability
5258
04:28:38,360 --> 04:28:42,840
that x takes on this value, given that y took on that value yj.
5259
04:28:42,840 --> 04:28:46,120
So equivalent rule just using conditional probabilities
5260
04:28:46,120 --> 04:28:47,760
instead of joint probabilities.
5261
04:28:47,760 --> 04:28:50,400
And using the equation we know about joint probabilities,
5262
04:28:50,400 --> 04:28:53,440
we can translate between these two.
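The conditioning rule has the same shape as marginalization, but sums conditional probabilities weighted by the probability of each case. A sketch with made-up numbers (all values here are hypothetical, chosen only to show the computation):

```python
# Hypothetical distributions (made-up numbers, just to illustrate the rule).
p_y = {"rain": 0.1, "no rain": 0.9}            # P(Y = y_j)
p_x_given_y = {"rain": 0.8, "no rain": 0.2}    # P(X = cloudy | Y = y_j)

# Conditioning: P(X = cloudy) = sum over j of P(X = cloudy | Y = y_j) * P(Y = y_j).
p_cloudy = sum(p_x_given_y[y] * p_y[y] for y in p_y)

print(round(p_cloudy, 2))  # 0.26
```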
5263
04:28:53,440 --> 04:28:55,440
So all right, we've seen a whole lot of mathematics,
5264
04:28:55,440 --> 04:28:57,760
and we've just laid the foundation for mathematics.
5265
04:28:57,760 --> 04:29:00,840
And no need to worry if you haven't seen probability in too much detail
5266
04:29:00,840 --> 04:29:02,000
up until this point.
5267
04:29:02,000 --> 04:29:05,080
These are the foundations of the ideas that are going to come up
5268
04:29:05,080 --> 04:29:09,600
as we begin to explore how we can now take these ideas from probability
5269
04:29:09,600 --> 04:29:12,720
and begin to apply them to represent something inside of our computer,
5270
04:29:12,720 --> 04:29:16,160
something inside of the AI agent we're trying to design that
5271
04:29:16,160 --> 04:29:18,920
is able to represent information and probabilities
5272
04:29:18,920 --> 04:29:22,240
and the likelihoods between various different events.
5273
04:29:22,240 --> 04:29:24,640
So there are a number of different probabilistic models
5274
04:29:24,640 --> 04:29:26,840
that we can generate, but the first of the models
5275
04:29:26,840 --> 04:29:30,160
we're going to talk about are what are known as Bayesian networks.
5276
04:29:30,160 --> 04:29:34,000
And a Bayesian network is just going to be some network of random variables,
5277
04:29:34,000 --> 04:29:37,160
connected to one another in a way that represents
5278
04:29:37,160 --> 04:29:39,880
the dependence between these random variables.
5279
04:29:39,880 --> 04:29:43,080
The odds are most random variables in this world
5280
04:29:43,080 --> 04:29:45,200
are not independent from each other, but there's
5281
04:29:45,200 --> 04:29:48,360
some relationship between things that are happening that we care about.
5282
04:29:48,360 --> 04:29:51,840
If it is rainy today, that might increase the likelihood
5283
04:29:51,840 --> 04:29:54,400
that my flight or my train gets delayed, for example.
5284
04:29:54,400 --> 04:29:57,240
There are some dependence between these random variables,
5285
04:29:57,240 --> 04:30:01,960
and a Bayesian network is going to be able to capture those dependencies.
5286
04:30:01,960 --> 04:30:03,280
So what is a Bayesian network?
5287
04:30:03,280 --> 04:30:06,040
What is its actual structure, and how does it work?
5288
04:30:06,040 --> 04:30:08,760
Well, a Bayesian network is going to be a directed graph.
5289
04:30:08,760 --> 04:30:10,800
And again, we've seen directed graphs before.
5290
04:30:10,800 --> 04:30:13,800
They are individual nodes with arrows or edges
5291
04:30:13,800 --> 04:30:18,520
that connect one node to another node pointing in a particular direction.
5292
04:30:18,520 --> 04:30:20,600
And so this directed graph is going to have nodes
5293
04:30:20,600 --> 04:30:23,480
as well, where each node in this directed graph
5294
04:30:23,480 --> 04:30:27,040
is going to represent a random variable, something like the weather,
5295
04:30:27,040 --> 04:30:30,880
or something like whether my train was on time or delayed.
5296
04:30:30,880 --> 04:30:34,440
And we're going to have an arrow from a node x to a node y
5297
04:30:34,440 --> 04:30:37,080
to mean that x is a parent of y.
5298
04:30:37,080 --> 04:30:38,200
So that'll be our notation.
5299
04:30:38,200 --> 04:30:42,600
If there's an arrow from x to y, x is going to be considered a parent of y.
5300
04:30:42,600 --> 04:30:46,000
And the reason that's important is because each of these nodes
5301
04:30:46,000 --> 04:30:48,840
is going to have a probability distribution that we're
5302
04:30:48,840 --> 04:30:52,280
going to store along with it, which is the distribution of x
5303
04:30:52,280 --> 04:30:56,160
given some evidence, given the parents of x.
5304
04:30:56,160 --> 04:30:58,120
So the way to more intuitively think about this
5305
04:30:58,120 --> 04:31:01,880
is that the parents can be thought of as causes for some effect
5306
04:31:01,880 --> 04:31:04,240
that we're going to observe.
5307
04:31:04,240 --> 04:31:07,400
And so let's take a look at an actual example of a Bayesian network
5308
04:31:07,400 --> 04:31:09,880
and think about the types of logic that might be involved
5309
04:31:09,880 --> 04:31:11,680
in reasoning about that network.
5310
04:31:11,680 --> 04:31:15,200
Let's imagine for a moment that I have an appointment out of town,
5311
04:31:15,200 --> 04:31:18,200
and I need to take a train in order to get to that appointment.
5312
04:31:18,200 --> 04:31:19,960
So what are the things I might care about?
5313
04:31:19,960 --> 04:31:22,240
Well, I care about getting to my appointment on time.
5314
04:31:22,240 --> 04:31:24,720
Whether I make it to my appointment and I'm able to attend it
5315
04:31:24,720 --> 04:31:26,360
or I miss the appointment.
5316
04:31:26,360 --> 04:31:29,120
And you might imagine that that's influenced by the train,
5317
04:31:29,120 --> 04:31:33,680
that the train is either on time or it's delayed, for example.
5318
04:31:33,680 --> 04:31:36,000
But that train itself is also influenced.
5319
04:31:36,000 --> 04:31:39,680
Whether the train is on time or not depends maybe on the rain.
5320
04:31:39,680 --> 04:31:40,520
Is there no rain?
5321
04:31:40,520 --> 04:31:41,180
Is it light rain?
5322
04:31:41,180 --> 04:31:42,480
Is there heavy rain?
5323
04:31:42,480 --> 04:31:44,720
And it might also be influenced by other variables too.
5324
04:31:44,720 --> 04:31:47,000
It might be influenced as well by whether or not
5325
04:31:47,000 --> 04:31:49,200
there's maintenance on the train track, for example.
5326
04:31:49,200 --> 04:31:51,080
If there is maintenance on the train track,
5327
04:31:51,080 --> 04:31:55,480
that probably increases the likelihood that my train is delayed.
5328
04:31:55,480 --> 04:31:57,640
And so we can represent all of these ideas
5329
04:31:57,640 --> 04:32:01,000
using a Bayesian network that looks a little something like this.
5330
04:32:01,000 --> 04:32:05,080
Here I have four nodes representing four random variables
5331
04:32:05,080 --> 04:32:06,600
that I would like to keep track of.
5332
04:32:06,600 --> 04:32:08,800
I have one random variable called rain that
5333
04:32:08,800 --> 04:32:12,840
can take on three possible values in its domain, either none or light
5334
04:32:12,840 --> 04:32:16,160
or heavy, for no rain, light rain, or heavy rain.
5335
04:32:16,160 --> 04:32:18,280
I have a variable called maintenance for whether or not
5336
04:32:18,280 --> 04:32:20,240
there is maintenance on the train track, which
5337
04:32:20,240 --> 04:32:22,600
it has two possible values, just either yes or no.
5338
04:32:22,600 --> 04:32:26,160
Either there is maintenance or there's no maintenance happening on the track.
5339
04:32:26,160 --> 04:32:28,840
Then I have a random variable for the train indicating whether or not
5340
04:32:28,840 --> 04:32:30,120
the train was on time.
5341
04:32:30,120 --> 04:32:33,480
That random variable has two possible values in its domain.
5342
04:32:33,480 --> 04:32:37,360
The train is either on time or the train is delayed.
5343
04:32:37,360 --> 04:32:39,480
And then finally, I have a random variable
5344
04:32:39,480 --> 04:32:41,120
for whether I make it to my appointment.
5345
04:32:41,120 --> 04:32:43,600
For my appointment down here, I have a random variable
5346
04:32:43,600 --> 04:32:49,120
called appointment that itself has two possible values, attend and miss.
5347
04:32:49,120 --> 04:32:50,560
And so here are the possible values.
5348
04:32:50,560 --> 04:32:54,040
Here are my four nodes, each of which represents a random variable, each
5349
04:32:54,040 --> 04:32:58,120
of which has a domain of possible values that it can take on.
5350
04:32:58,120 --> 04:33:01,600
And the arrows, the edges pointing from one node to another,
5351
04:33:01,600 --> 04:33:05,880
encode some notion of dependence inside of this graph,
5352
04:33:05,880 --> 04:33:08,440
that whether I make it to my appointment or not
5353
04:33:08,440 --> 04:33:12,200
is dependent upon whether the train is on time or delayed.
5354
04:33:12,200 --> 04:33:14,320
And whether the train is on time or delayed
5355
04:33:14,320 --> 04:33:18,520
is dependent on two things given by the two arrows pointing at this node.
5356
04:33:18,520 --> 04:33:22,000
It is dependent on whether or not there was maintenance on the train track.
5357
04:33:22,000 --> 04:33:25,640
And it is also dependent upon whether or not
5358
04:33:25,640 --> 04:33:27,360
it is raining.
5359
04:33:27,360 --> 04:33:29,360
And just to make things a little complicated,
5360
04:33:29,360 --> 04:33:32,920
let's say as well that whether or not there is maintenance on the track,
5361
04:33:32,920 --> 04:33:34,920
this too might be influenced by the rain.
5362
04:33:34,920 --> 04:33:37,360
That if there's heavier rain, well, maybe it's
5363
04:33:37,360 --> 04:33:40,320
less likely that there's going to be maintenance on the train track that day
5364
04:33:40,320 --> 04:33:43,360
because they're more likely to want to do maintenance on the track on days
5365
04:33:43,360 --> 04:33:45,000
when it's not raining, for example.
5366
04:33:45,000 --> 04:33:47,920
And so these nodes might have different relationships between them.
5367
04:33:47,920 --> 04:33:51,360
But the idea is that we can come up with a probability distribution
5368
04:33:51,360 --> 04:33:56,000
for each of these nodes based only upon its parents.
5369
04:33:56,000 --> 04:33:59,760
And so let's look node by node at what this probability distribution might
5370
04:33:59,760 --> 04:34:00,480
actually look like.
5371
04:34:00,480 --> 04:34:03,600
And we'll go ahead and begin with this root node, this rain node here,
5372
04:34:03,600 --> 04:34:07,440
which is at the top, and has no arrows pointing into it, which
5373
04:34:07,440 --> 04:34:10,160
means its probability distribution is not
5374
04:34:10,160 --> 04:34:11,920
going to be a conditional distribution.
5375
04:34:11,920 --> 04:34:13,520
It's not based on anything.
5376
04:34:13,520 --> 04:34:17,920
I just have some probability distribution over the possible values
5377
04:34:17,920 --> 04:34:20,280
for the rain random variable.
5378
04:34:20,280 --> 04:34:23,200
And that distribution might look a little something like this.
5379
04:34:23,200 --> 04:34:25,800
None, light, and heavy each have a probability.
5380
04:34:25,800 --> 04:34:31,120
Here I'm saying the likelihood of no rain is 0.7, of light rain is 0.2,
5381
04:34:31,120 --> 04:34:33,880
of heavy rain is 0.1, for example.
5382
04:34:33,880 --> 04:34:38,080
So here is a probability distribution for this root node in this Bayesian
5383
04:34:38,080 --> 04:34:39,360
network.
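The root node's distribution can be written down directly, using the values just given. Since rain has no parents, this is an unconditional distribution, and the only constraint is that it sums to 1:

```python
# Unconditional distribution for the root node Rain (values from the lecture).
rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# A root node has no parents, so this is not a conditional distribution;
# it must simply sum to 1 across its domain.
assert abs(sum(rain.values()) - 1.0) < 1e-9
```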
5384
04:34:39,360 --> 04:34:42,640
And let's now consider the next node in the network, maintenance.
5385
04:34:42,640 --> 04:34:44,680
Track maintenance is yes or no.
5386
04:34:44,680 --> 04:34:47,960
And the general idea of what this distribution is going to encode,
5387
04:34:47,960 --> 04:34:52,120
at least in this story, is the idea that the heavier the rain is,
5388
04:34:52,120 --> 04:34:55,240
the less likely it is that there's going to be maintenance on the track.
5389
04:34:55,240 --> 04:34:57,620
Because the people that are doing maintenance on the track probably
5390
04:34:57,620 --> 04:35:00,480
want to wait until a day when it's not as rainy in order
5391
04:35:00,480 --> 04:35:02,520
to do the track maintenance, for example.
5392
04:35:02,520 --> 04:35:05,120
And so what might that probability distribution look like?
5393
04:35:05,120 --> 04:35:08,720
Well, this now is going to be a conditional probability distribution,
5394
04:35:08,720 --> 04:35:12,400
that here are the three possible values for the rain random variable, which
5395
04:35:12,400 --> 04:35:15,680
I'm here just going to abbreviate to R, either no rain, light rain,
5396
04:35:15,680 --> 04:35:17,080
or heavy rain.
5397
04:35:17,080 --> 04:35:19,640
And for each of those possible values, either there
5398
04:35:19,640 --> 04:35:22,820
is yes track maintenance or no track maintenance.
5399
04:35:22,820 --> 04:35:25,760
And those have probabilities associated with them.
5400
04:35:25,760 --> 04:35:30,280
I see here that if it is not raining,
5401
04:35:30,280 --> 04:35:33,280
then there is a probability of 0.4 that there's track maintenance
5402
04:35:33,280 --> 04:35:36,000
and a probability of 0.6 that there isn't.
5403
04:35:36,000 --> 04:35:38,840
But if there's heavy rain, then here the chance
5404
04:35:38,840 --> 04:35:41,640
that there is track maintenance is 0.1 and the chance
5405
04:35:41,640 --> 04:35:44,200
that there is not track maintenance is 0.9.
5406
04:35:44,200 --> 04:35:47,160
Each of these rows is going to sum up to 1.
5407
04:35:47,160 --> 04:35:49,640
Because each of these represent different values
5408
04:35:49,640 --> 04:35:52,360
of whether or not it's raining, the three possible values
5409
04:35:52,360 --> 04:35:54,320
that that random variable can take on.
5410
04:35:54,320 --> 04:35:57,800
And each is associated with its own probability distribution
5411
04:35:57,800 --> 04:36:02,080
that is ultimately all going to add up to the number 1.
5412
04:36:02,080 --> 04:36:05,920
So that there is our distribution for this random variable called maintenance,
5413
04:36:05,920 --> 04:36:09,720
about whether or not there is maintenance on the train track.
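A conditional distribution like this one can be stored as a table keyed by the parent's value. The rows for no rain and heavy rain use the numbers quoted above; the light-rain row isn't given in this passage, so the 0.2/0.8 values below are placeholders I made up:

```python
# Conditional distribution P(Maintenance | Rain). The "none" and "heavy"
# rows come from the lecture; the "light" row is an assumed placeholder.
maintenance = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # assumed values, not quoted above
    "heavy": {"yes": 0.1, "no": 0.9},
}

# Each row conditions on one value of Rain, so each row must sum to 1 on its own.
for rain_value, row in maintenance.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```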
5414
04:36:09,720 --> 04:36:11,680
And now let's consider the next variable.
5415
04:36:11,680 --> 04:36:15,040
Here we have a node inside of our Bayesian network called train
5416
04:36:15,040 --> 04:36:18,080
that has two possible values, on time and delayed.
5417
04:36:18,080 --> 04:36:21,800
And this node is going to be dependent upon the two nodes that
5418
04:36:21,800 --> 04:36:23,800
are pointing towards it, that whether or not
5419
04:36:23,800 --> 04:36:27,200
the train is on time or delayed depends on whether or not
5420
04:36:27,200 --> 04:36:28,520
there is track maintenance.
5421
04:36:28,520 --> 04:36:30,480
And it depends on whether or not there is rain,
5422
04:36:30,480 --> 04:36:35,160
that heavier rain probably means more likely that my train is delayed.
5423
04:36:35,160 --> 04:36:38,200
And if there is track maintenance, that also probably
5424
04:36:38,200 --> 04:36:41,880
means it's more likely that my train is delayed as well.
5425
04:36:41,880 --> 04:36:45,000
And so you could construct a larger probability distribution,
5426
04:36:45,000 --> 04:36:47,760
a conditional probability distribution, that instead
5427
04:36:47,760 --> 04:36:51,160
of conditioning on just one variable, as was the case here,
5428
04:36:51,160 --> 04:36:54,000
is now conditioning on two variables, conditioning
5429
04:36:54,000 --> 04:36:58,920
both on rain, represented by r, and on maintenance, represented by m.
5430
04:36:58,920 --> 04:37:02,680
Again, each of these rows has two values that sum up to the number 1,
5431
04:37:02,680 --> 04:37:06,920
one for whether the train is on time, one for whether the train is delayed.
5432
04:37:06,920 --> 04:37:08,880
And here I can say something like, all right,
5433
04:37:08,880 --> 04:37:12,600
if I know there was light rain and track maintenance, well, OK,
5434
04:37:12,600 --> 04:37:16,120
that would be r is light and m is yes.
5435
04:37:16,120 --> 04:37:19,840
Well, then there is a probability of 0.6 that my train is on time,
5436
04:37:19,840 --> 04:37:23,200
and a probability of 0.4 the train is delayed.
5437
04:37:23,200 --> 04:37:25,480
And you can imagine gathering this data just
5438
04:37:25,480 --> 04:37:28,960
by looking at real world data, looking at data about, all right,
5439
04:37:28,960 --> 04:37:31,800
if I knew that it was light rain and there was track maintenance,
5440
04:37:31,800 --> 04:37:33,880
how often was a train delayed or not delayed?
5441
04:37:33,880 --> 04:37:35,680
And you could begin to construct this thing.
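That kind of two-parent conditional probability table can be sketched as a dictionary keyed by (rain, maintenance) pairs. The (light, yes) and (none, yes) rows use values stated in the lecture; the (light, no) row is an assumed placeholder:

```python
# P(Train | Rain, Maintenance): a CPT with two parents, keyed by
# (rain, maintenance) pairs. Only three of the six rows are shown.
p_train = {
    ("none",  "yes"): {"on time": 0.8, "delayed": 0.2},  # from the lecture
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # from the lecture
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed placeholder
}

# If I know there was light rain and track maintenance...
row = p_train[("light", "yes")]
print(row["on time"])  # 0.6
print(row["delayed"])  # 0.4
```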
5442
04:37:35,680 --> 04:37:37,920
The interesting thing is trying to intelligently
5443
04:37:37,920 --> 04:37:40,880
figure out how you might go about ordering these things,
5444
04:37:40,880 --> 04:37:46,320
what things might influence other nodes inside of this Bayesian network.
5445
04:37:46,320 --> 04:37:50,480
And the last thing I care about is whether or not I make it to my appointment.
5446
04:37:50,480 --> 04:37:52,800
So did I attend or miss the appointment?
5447
04:37:52,800 --> 04:37:55,760
And ultimately, whether I attend or miss the appointment,
5448
04:37:55,760 --> 04:37:59,520
it is influenced by track maintenance, because it's indirectly this idea that,
5449
04:37:59,520 --> 04:38:01,240
all right, if there is track maintenance,
5450
04:38:01,240 --> 04:38:02,940
well, then my train might more likely be delayed.
5451
04:38:02,940 --> 04:38:04,740
And if my train is more likely to be delayed,
5452
04:38:04,740 --> 04:38:06,880
then I'm more likely to miss my appointment.
5453
04:38:06,880 --> 04:38:09,240
But what we encode in this Bayesian network
5454
04:38:09,240 --> 04:38:12,440
are just what we might consider to be more direct relationships.
5455
04:38:12,440 --> 04:38:15,300
So the train has a direct influence on the appointment.
5456
04:38:15,300 --> 04:38:18,300
And given that I know whether the train is on time or delayed,
5457
04:38:18,300 --> 04:38:20,440
knowing whether there's track maintenance isn't
5458
04:38:20,440 --> 04:38:24,080
going to give me any additional information that I didn't already have.
5459
04:38:24,080 --> 04:38:27,680
That if I know train, these other nodes that are up above
5460
04:38:27,680 --> 04:38:30,840
aren't really going to influence the result.
5461
04:38:30,840 --> 04:38:34,500
And so here we might represent it using another conditional probability
5462
04:38:34,500 --> 04:38:36,900
distribution that looks a little something like this.
5463
04:38:36,900 --> 04:38:39,780
The train can take on two possible values.
5464
04:38:39,780 --> 04:38:42,360
Either my train is on time or my train is delayed.
5465
04:38:42,360 --> 04:38:44,120
And for each of those two possible values,
5466
04:38:44,120 --> 04:38:46,840
I have a distribution for what are the odds that I'm
5467
04:38:46,840 --> 04:38:49,720
able to attend the meeting and what are the odds that I missed the meeting.
5468
04:38:49,720 --> 04:38:51,640
And obviously, if my train is on time, I'm
5469
04:38:51,640 --> 04:38:53,760
much more likely to be able to attend the meeting
5470
04:38:53,760 --> 04:38:57,760
than if my train is delayed, in which case I'm more likely to miss that
5471
04:38:57,760 --> 04:38:59,000
meeting.
5472
04:38:59,000 --> 04:39:03,360
So all of these nodes put all together here represent this Bayesian network,
5473
04:39:03,360 --> 04:39:07,120
this network of random variables whose values I ultimately care about,
5474
04:39:07,120 --> 04:39:09,920
and that have some sort of relationship between them,
5475
04:39:09,920 --> 04:39:13,320
some sort of dependence where these arrows from one node to another
5476
04:39:13,320 --> 04:39:15,360
indicate some dependence, that I can calculate
5477
04:39:15,360 --> 04:39:21,400
the probability of some node given the parents that happen to exist there.
5478
04:39:21,400 --> 04:39:24,540
So now that we've been able to describe the structure of this Bayesian
5479
04:39:24,540 --> 04:39:27,320
network and the relationships between each of these nodes
5480
04:39:27,320 --> 04:39:30,720
by associating each of the nodes in the network with a probability
5481
04:39:30,720 --> 04:39:34,480
distribution, whether that's an unconditional probability distribution
5482
04:39:34,480 --> 04:39:36,720
in the case of this root node here, like rain,
5483
04:39:36,720 --> 04:39:39,560
or a conditional probability distribution in the case
5484
04:39:39,560 --> 04:39:42,000
of all of the other nodes whose probabilities are
5485
04:39:42,000 --> 04:39:44,560
dependent upon the values of their parents,
5486
04:39:44,560 --> 04:39:47,800
we can begin to do some computation and calculation using
5487
04:39:47,800 --> 04:39:50,120
the information inside of that table.
5488
04:39:50,120 --> 04:39:51,960
So let's imagine, for example, that I just
5489
04:39:51,960 --> 04:39:55,560
wanted to compute something simple like the probability of light rain.
5490
04:39:55,560 --> 04:39:57,760
How would I get the probability of light rain?
5491
04:39:57,760 --> 04:40:01,000
Well, light rain, rain here is a root node.
5492
04:40:01,000 --> 04:40:03,400
And so if I wanted to calculate that probability,
5493
04:40:03,400 --> 04:40:06,360
I could just look at the probability distribution for rain
5494
04:40:06,360 --> 04:40:10,680
and extract from it the probability of light rain, just a single value
5495
04:40:10,680 --> 04:40:12,840
that I already have access to.
5496
04:40:12,840 --> 04:40:16,160
But we could also imagine wanting to compute more complex joint
5497
04:40:16,160 --> 04:40:21,200
probabilities, like the probability that there is light rain and also
5498
04:40:21,200 --> 04:40:22,240
no track maintenance.
5499
04:40:22,240 --> 04:40:27,080
This is a joint probability of two values, light rain and no track
5500
04:40:27,080 --> 04:40:27,960
maintenance.
5501
04:40:27,960 --> 04:40:30,960
And the way I might do that is first by starting by saying, all right,
5502
04:40:30,960 --> 04:40:33,400
well, let me get the probability of light rain.
5503
04:40:33,400 --> 04:40:36,800
But now I also want the probability of no track maintenance.
5504
04:40:36,800 --> 04:40:41,360
But of course, this node is dependent upon the value of rain.
5505
04:40:41,360 --> 04:40:44,560
So what I really want is the probability of no track maintenance,
5506
04:40:44,560 --> 04:40:47,160
given that I know that there was light rain.
5507
04:40:47,160 --> 04:40:51,280
And so the expression for calculating this idea that the probability of light
5508
04:40:51,280 --> 04:40:56,040
rain and no track maintenance is really just the probability of light rain
5509
04:40:56,040 --> 04:40:58,840
and the probability that there is no track maintenance,
5510
04:40:58,840 --> 04:41:01,840
given that I know that there already is light rain.
5511
04:41:01,840 --> 04:41:05,160
So I take the unconditional probability of light rain,
5512
04:41:05,160 --> 04:41:09,800
multiply it by the conditional probability of no track maintenance,
5513
04:41:09,800 --> 04:41:12,320
given that I know there is light rain.
5514
04:41:12,320 --> 04:41:15,400
And you can continue to do this again and again for every variable
5515
04:41:15,400 --> 04:41:18,040
that you want to add into this joint probability
5516
04:41:18,040 --> 04:41:19,320
that I might want to calculate.
5517
04:41:19,320 --> 04:41:23,240
If I wanted to know the probability of light rain and no track maintenance
5518
04:41:23,240 --> 04:41:27,960
and a delayed train, well, that's going to be the probability of light rain,
5519
04:41:27,960 --> 04:41:31,880
multiplied by the probability of no track maintenance, given light rain,
5520
04:41:31,880 --> 04:41:36,400
multiplied by the probability of a delayed train, given light rain
5521
04:41:36,400 --> 04:41:37,400
and no track maintenance.
5522
04:41:37,400 --> 04:41:39,640
Because whether the train is on time or delayed
5523
04:41:39,640 --> 04:41:42,920
is dependent upon both of these other two variables.
5524
04:41:42,920 --> 04:41:45,200
And so I have two pieces of evidence that go
5525
04:41:45,200 --> 04:41:48,480
into the calculation of that conditional probability.
5526
04:41:48,480 --> 04:41:51,120
And each of these three values is just a value
5527
04:41:51,120 --> 04:41:55,280
that I can look up by looking at one of these individual probability
5528
04:41:55,280 --> 04:41:59,760
distributions that is encoded into my Bayesian network.
5529
04:41:59,760 --> 04:42:03,040
And if I wanted a joint probability over all four of the variables,
5530
04:42:03,040 --> 04:42:06,840
something like the probability of light rain and no track maintenance
5531
04:42:06,840 --> 04:42:09,760
and a delayed train and I miss my appointment,
5532
04:42:09,760 --> 04:42:12,520
well, that's going to be multiplying four different values, one
5533
04:42:12,520 --> 04:42:14,520
from each of these individual nodes.
5534
04:42:14,520 --> 04:42:16,600
It's going to be the probability of light rain,
5535
04:42:16,600 --> 04:42:20,600
then of no track maintenance given light rain, then of a delayed train,
5536
04:42:20,600 --> 04:42:22,720
given light rain and no track maintenance.
5537
04:42:22,720 --> 04:42:25,000
And then finally, for this node here, for whether I
5538
04:42:25,000 --> 04:42:26,840
make it to my appointment or not, it's not
5539
04:42:26,840 --> 04:42:29,360
dependent upon these two variables, given
5540
04:42:29,360 --> 04:42:31,880
that I know whether or not the train is on time.
5541
04:42:31,880 --> 04:42:34,680
I only need to care about the conditional probability
5542
04:42:34,680 --> 04:42:37,800
that I miss my train, or that I miss my appointment,
5543
04:42:37,800 --> 04:42:39,880
given that the train happens to be delayed.
5544
04:42:39,880 --> 04:42:43,720
And so that's represented here by four probabilities, each of which
5545
04:42:43,720 --> 04:42:47,040
is located inside of one of these probability distributions
5546
04:42:47,040 --> 04:42:50,760
for each of the nodes, all multiplied together.
5547
04:42:50,760 --> 04:42:52,920
And so I can take a variable like that and figure out
5548
04:42:52,920 --> 04:42:55,520
what the joint probability is by multiplying
5549
04:42:55,520 --> 04:42:59,640
a whole bunch of these individual probabilities from the Bayesian network.
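The chain-rule product just described can be sketched directly. Only the first factor (0.2 for light rain) is stated in the lecture; the other three factors are assumed placeholders, since the full tables aren't reproduced in this passage:

```python
# P(light, no maintenance, delayed, miss) as a product of one factor per
# node, each conditioned only on that node's parents.
p_light = 0.2                    # P(light rain), from the lecture
p_no_maint_given_light = 0.8     # P(no maintenance | light), assumed
p_delayed_given_light_no = 0.3   # P(delayed | light, no maintenance), assumed
p_miss_given_delayed = 0.6       # P(miss | delayed), assumed

joint = (p_light
         * p_no_maint_given_light
         * p_delayed_given_light_no
         * p_miss_given_delayed)
print(joint)  # ≈ 0.0288 with these numbers
```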
5550
04:42:59,640 --> 04:43:02,720
But of course, just as with last time, where what I really wanted to do
5551
04:43:02,720 --> 04:43:05,240
was to be able to get new pieces of information,
5552
04:43:05,240 --> 04:43:08,280
here, too, this is what we're going to want to do with our Bayesian network.
5553
04:43:08,280 --> 04:43:11,360
In the context of knowledge, we talked about the problem of inference.
5554
04:43:11,360 --> 04:43:14,900
Given things that I know to be true, can I draw conclusions,
5555
04:43:14,900 --> 04:43:19,880
make deductions about other facts about the world that I also know to be true?
5556
04:43:19,880 --> 04:43:23,800
And what we're going to do now is apply the same sort of idea to probability.
5557
04:43:23,800 --> 04:43:26,600
Using information about which I have some knowledge,
5558
04:43:26,600 --> 04:43:28,920
whether some evidence or some probabilities,
5559
04:43:28,920 --> 04:43:32,000
can I figure out not other variables for certain,
5560
04:43:32,000 --> 04:43:35,000
but can I figure out the probabilities of other variables
5561
04:43:35,000 --> 04:43:36,800
taking on particular values?
5562
04:43:36,800 --> 04:43:41,240
And so here, we introduce the problem of inference in a probabilistic setting,
5563
04:43:41,240 --> 04:43:44,920
in a case where variables might not necessarily be true for sure,
5564
04:43:44,920 --> 04:43:48,480
but they might be random variables that take on different values
5565
04:43:48,480 --> 04:43:50,160
with some probability.
5566
04:43:50,160 --> 04:43:53,400
So how do we formally define what exactly this inference problem actually
5567
04:43:53,400 --> 04:43:54,120
is?
5568
04:43:54,120 --> 04:43:57,000
Well, the inference problem has a couple of parts to it.
5569
04:43:57,000 --> 04:43:59,780
We have some query, some variable x that we
5570
04:43:59,780 --> 04:44:01,360
want to compute the distribution for.
5571
04:44:01,360 --> 04:44:04,520
Maybe I want the probability that I miss my train,
5572
04:44:04,520 --> 04:44:08,600
or I want the probability that there is track maintenance,
5573
04:44:08,600 --> 04:44:11,200
something that I want information about.
5574
04:44:11,200 --> 04:44:13,200
And then I have some evidence variables.
5575
04:44:13,200 --> 04:44:14,740
Maybe it's just one piece of evidence.
5576
04:44:14,740 --> 04:44:16,400
Maybe it's multiple pieces of evidence.
5577
04:44:16,400 --> 04:44:20,320
But I've observed certain variables for some sort of event.
5578
04:44:20,320 --> 04:44:23,440
So for example, I might have observed that it is raining.
5579
04:44:23,440 --> 04:44:24,600
This is evidence that I have.
5580
04:44:24,600 --> 04:44:27,680
I know that there is light rain, or I know that there is heavy rain.
5581
04:44:27,680 --> 04:44:28,760
And that is evidence I have.
5582
04:44:28,760 --> 04:44:32,400
And using that evidence, I want to know what is the probability
5583
04:44:32,400 --> 04:44:34,960
that my train is delayed, for example.
5584
04:44:34,960 --> 04:44:38,080
And that is a query that I might want to ask based on this evidence.
5585
04:44:38,080 --> 04:44:39,880
So I have a query, some variable.
5586
04:44:39,880 --> 04:44:41,800
Evidence, which are some other variables that I
5587
04:44:41,800 --> 04:44:44,240
have observed inside of my Bayesian network.
5588
04:44:44,240 --> 04:44:46,960
And of course, that does leave some hidden variables.
5589
04:44:46,960 --> 04:44:47,720
Y.
5590
04:44:47,720 --> 04:44:52,160
These are variables that are not evidence variables and not query variables.
5591
04:44:52,160 --> 04:44:55,720
So you might imagine in the case where I know whether or not it's raining,
5592
04:44:55,720 --> 04:44:59,560
and I want to know whether my train is going to be delayed or not,
5593
04:44:59,560 --> 04:45:02,200
the hidden variable, the thing I don't have access to,
5594
04:45:02,200 --> 04:45:04,520
is something like, is there maintenance on the track?
5595
04:45:04,520 --> 04:45:07,000
Or am I going to make or not make my appointment, for example?
5596
04:45:07,000 --> 04:45:09,040
These are variables that I don't have access to.
5597
04:45:09,040 --> 04:45:12,080
They're hidden because they're not things I observed,
5598
04:45:12,080 --> 04:45:14,720
and they're also not the query, the thing that I'm asking.
5599
04:45:14,720 --> 04:45:17,080
And so ultimately, what we want to calculate
5600
04:45:17,080 --> 04:45:21,240
is I want to know the probability distribution of x given
5601
04:45:21,240 --> 04:45:22,600
e, the event that I observed.
5602
04:45:22,600 --> 04:45:25,760
So given that I observed some event, I observed that it is raining,
5603
04:45:25,760 --> 04:45:29,600
I would like to know what is the distribution over the possible values
5604
04:45:29,600 --> 04:45:31,280
of the train random variable.
5605
04:45:31,280 --> 04:45:32,240
Is it on time?
5606
04:45:32,240 --> 04:45:33,080
Is it delayed?
5607
04:45:33,080 --> 04:45:35,400
What's the likelihood it's going to be there?
5608
04:45:35,400 --> 04:45:37,800
And it turns out we can do this calculation just
5609
04:45:37,800 --> 04:45:42,040
using a lot of the probability rules that we've already seen in action.
5610
04:45:42,040 --> 04:45:44,480
And ultimately, we're going to take a look at the math
5611
04:45:44,480 --> 04:45:46,800
at a little bit of a high level, at an abstract level.
5612
04:45:46,800 --> 04:45:49,520
But ultimately, we can allow computers and programming libraries
5613
04:45:49,520 --> 04:45:52,240
that already exist to begin to do some of this math for us.
5614
04:45:52,240 --> 04:45:55,280
But it's good to get a general sense for what's actually happening
5615
04:45:55,280 --> 04:45:57,640
when this inference process takes place.
5616
04:45:57,640 --> 04:46:00,820
Let's imagine, for example, that I want to compute the probability
5617
04:46:00,820 --> 04:46:05,000
distribution of the appointment random variable given some evidence,
5618
04:46:05,000 --> 04:46:07,040
given that I know that there was light rain
5619
04:46:07,040 --> 04:46:08,920
and no track maintenance.
5620
04:46:08,920 --> 04:46:12,440
So there's my evidence, these two variables that I observe the values of.
5621
04:46:12,440 --> 04:46:14,240
I observe the value of rain.
5622
04:46:14,240 --> 04:46:15,560
I know there's light rain.
5623
04:46:15,560 --> 04:46:18,480
And I know that there is no track maintenance going on today.
5624
04:46:18,480 --> 04:46:22,440
And what I care about knowing, my query, is this random variable appointment.
5625
04:46:22,440 --> 04:46:25,800
I want to know the distribution of this random variable appointment,
5626
04:46:25,800 --> 04:46:28,360
like what is the chance that I'm able to attend my appointment?
5627
04:46:28,360 --> 04:46:32,000
What is the chance that I miss my appointment given this evidence?
5628
04:46:32,000 --> 04:46:35,520
And the hidden variable, the information that I don't have access to,
5629
04:46:35,520 --> 04:46:36,800
is this variable train.
5630
04:46:36,800 --> 04:46:38,920
This is information that is not part of the evidence
5631
04:46:38,920 --> 04:46:41,280
that I see, not something that I observe.
5632
04:46:41,280 --> 04:46:44,600
But it is also not the query that I'm asking for.
5633
04:46:44,600 --> 04:46:47,000
And so what might this inference procedure look like?
5634
04:46:47,000 --> 04:46:50,440
Well, if you recall back from when we were defining conditional probability
5635
04:46:50,440 --> 04:46:52,880
and doing math with conditional probabilities,
5636
04:46:52,880 --> 04:46:57,720
we know that a conditional probability is proportional to the joint
5637
04:46:57,720 --> 04:46:58,680
probability.
5638
04:46:58,680 --> 04:47:01,920
And we remembered this by recalling that the probability of A given
5639
04:47:01,920 --> 04:47:06,680
B is just some constant factor alpha multiplied by the probability of A
5640
04:47:06,680 --> 04:47:08,800
and B. That constant factor alpha turns out
5641
04:47:08,800 --> 04:47:10,960
to be like dividing over the probability of B.
5642
04:47:10,960 --> 04:47:14,560
But the important thing is that it's just some constant multiplied
5643
04:47:14,560 --> 04:47:17,080
by the joint distribution, the probability
5644
04:47:17,080 --> 04:47:19,680
that all of these individual things happen.
5645
04:47:19,680 --> 04:47:23,280
So in this case, I can take the probability of the appointment random
5646
04:47:23,280 --> 04:47:27,000
variable given light rain and no track maintenance
5647
04:47:27,000 --> 04:47:30,720
and say that is just going to be proportional, some constant alpha,
5648
04:47:30,720 --> 04:47:33,400
multiplied by the joint probability, the probability
5649
04:47:33,400 --> 04:47:36,060
of a particular value for the appointment random variable
5650
04:47:36,060 --> 04:47:40,040
and light rain and no track maintenance.
5651
04:47:40,040 --> 04:47:43,160
Well, all right, how do I calculate this, probability of appointment
5652
04:47:43,160 --> 04:47:46,240
and light rain and no track maintenance, when what I really care about
5653
04:47:46,240 --> 04:47:48,760
is knowing I need all four of these values
5654
04:47:48,760 --> 04:47:52,200
to be able to calculate a joint distribution across everything
5655
04:47:52,200 --> 04:47:56,040
because a particular appointment depends upon the value of train?
5656
04:47:56,040 --> 04:47:59,400
Well, in order to do that, here I can begin to use that marginalization
5657
04:47:59,400 --> 04:48:02,240
trick, that there are only two ways I can get
5658
04:48:02,240 --> 04:48:05,520
any configuration of an appointment, light rain, and no track maintenance.
5659
04:48:05,520 --> 04:48:07,760
Either this particular setting of variables
5660
04:48:07,760 --> 04:48:12,000
happens and the train is on time, or this particular setting of variables
5661
04:48:12,000 --> 04:48:13,800
happens and the train is delayed.
5662
04:48:13,800 --> 04:48:17,160
Those are two possible cases that I would want to consider.
5663
04:48:17,160 --> 04:48:19,760
And if I add those two cases up, well, then I
5664
04:48:19,760 --> 04:48:23,360
get the result just by adding up all of the possibilities
5665
04:48:23,360 --> 04:48:26,600
for the hidden variable, or variables if there are multiple.
5666
04:48:26,600 --> 04:48:30,260
But since there's only one hidden variable here, train, all I need to do
5667
04:48:30,260 --> 04:48:34,040
is iterate over all the possible values for that hidden variable train
5668
04:48:34,040 --> 04:48:36,160
and add up their probabilities.
5669
04:48:36,160 --> 04:48:40,440
So this probability expression here becomes probability distribution
5670
04:48:40,440 --> 04:48:44,080
over appointment, light rain, no track maintenance, and train is on time,
5671
04:48:44,080 --> 04:48:47,560
and the probability distribution over the appointment, light rain,
5672
04:48:47,560 --> 04:48:51,360
no track maintenance, and that the train is delayed, for example.
5673
04:48:51,360 --> 04:48:55,280
So I take both of the possible values for train, go ahead and add them up.
5674
04:48:55,280 --> 04:48:57,520
These are just joint probabilities that we saw earlier,
5675
04:48:57,520 --> 04:48:59,920
how to calculate just by going parent, parent, parent, parent,
5676
04:48:59,920 --> 04:49:03,320
and calculating those probabilities and multiplying them together.
5677
04:49:03,320 --> 04:49:05,440
And then you'll need to normalize them at the end,
5678
04:49:05,440 --> 04:49:09,560
speaking at a high level, to make sure that everything adds up to the number 1.
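That sum-then-normalize step can be sketched with placeholder numbers; all four joint probabilities below are assumed, purely to show the mechanics:

```python
# Assumed placeholder joints standing in for
# P(appointment value, light rain, no maintenance, train value).
joint = {
    ("attend", "on time"): 0.09,
    ("attend", "delayed"): 0.03,
    ("miss",   "on time"): 0.05,
    ("miss",   "delayed"): 0.03,
}

# Marginalize: sum over the hidden train variable for each appointment value.
unnormalized = {
    a: joint[(a, "on time")] + joint[(a, "delayed")]
    for a in ("attend", "miss")
}

# Normalize so the resulting distribution over appointment sums to 1.
alpha = 1 / sum(unnormalized.values())
p_appointment = {a: alpha * v for a, v in unnormalized.items()}
print(p_appointment)  # the two values sum to 1
```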
5679
04:49:09,560 --> 04:49:13,480
So the formula for how you do this in a process known as inference by enumeration
5680
04:49:13,480 --> 04:49:16,280
looks a little bit complicated, but ultimately it looks like this.
5681
04:49:16,280 --> 04:49:20,040
And let's now try to distill what it is that all of these symbols actually mean.
5682
04:49:20,040 --> 04:49:21,040
Let's start here.
5683
04:49:21,040 --> 04:49:25,680
What I care about knowing is the probability of x, my query variable,
5684
04:49:25,680 --> 04:49:28,000
given some sort of evidence.
5685
04:49:28,000 --> 04:49:30,040
What do I know about conditional probabilities?
5686
04:49:30,040 --> 04:49:34,640
Well, a conditional probability is proportional to the joint probability.
5687
04:49:34,640 --> 04:49:37,480
So it is some alpha, some normalizing constant,
5688
04:49:37,480 --> 04:49:41,480
multiplied by this joint probability of x and evidence.
5689
04:49:41,480 --> 04:49:42,920
And how do I calculate that?
5690
04:49:42,920 --> 04:49:45,360
Well, to do that, I'm going to marginalize
5691
04:49:45,360 --> 04:49:47,760
over all of the hidden variables, all the variables
5692
04:49:47,760 --> 04:49:50,080
that I don't directly observe the values for.
5693
04:49:50,080 --> 04:49:53,020
I'm basically going to iterate over all of the possibilities
5694
04:49:53,020 --> 04:49:55,560
that it could happen and just sum them all up.
5695
04:49:55,560 --> 04:49:58,720
And so I can translate this into a sum over all y,
5696
04:49:58,720 --> 04:50:02,080
which ranges over all the possible hidden variables and the values
5697
04:50:02,080 --> 04:50:06,880
that they could take on, and adds up all of those possible individual
5698
04:50:06,880 --> 04:50:07,920
probabilities.
5699
04:50:07,920 --> 04:50:11,960
And that is going to allow me to do this process of inference by enumeration.
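Putting those pieces together, inference by enumeration can be sketched on a slice of this network, with Train as the query, light rain as the evidence, and Maintenance as the hidden variable. The rain distribution and the (light, yes) train row come from the lecture; the other table entries are assumed placeholders:

```python
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}       # from the lecture
p_maint_given_rain = {"light": {"yes": 0.2, "no": 0.8}}  # assumed row
p_train_given = {
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # from the lecture
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed row
}

def joint(train_value, maint_value):
    """P(train_value, light rain, maint_value), via the chain rule."""
    return (p_rain["light"]
            * p_maint_given_rain["light"][maint_value]
            * p_train_given[("light", maint_value)][train_value])

# P(Train | light rain) = alpha * sum over the hidden maintenance values.
unnormalized = {t: sum(joint(t, m) for m in ("yes", "no"))
                for t in ("on time", "delayed")}
alpha = 1 / sum(unnormalized.values())
p_train_given_light = {t: alpha * v for t, v in unnormalized.items()}
print(p_train_given_light)  # on time ≈ 0.68, delayed ≈ 0.32 with these numbers
```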
5700
04:50:11,960 --> 04:50:14,080
Now, ultimately, it's pretty annoying if we as humans
5701
04:50:14,080 --> 04:50:16,320
have to do all this math for ourselves.
5702
04:50:16,320 --> 04:50:19,480
But turns out this is where computers and AI can be particularly helpful,
5703
04:50:19,480 --> 04:50:22,800
that we can program a computer to understand a Bayesian network,
5704
04:50:22,800 --> 04:50:25,240
to be able to understand these inference procedures,
5705
04:50:25,240 --> 04:50:27,180
and to be able to do these calculations.
5706
04:50:27,180 --> 04:50:29,040
And using the information you've seen here,
5707
04:50:29,040 --> 04:50:31,760
you could implement a Bayesian network from scratch yourself.
5708
04:50:31,760 --> 04:50:34,920
But turns out there are a lot of libraries, especially written in Python,
5709
04:50:34,920 --> 04:50:38,400
that allow us to make it easier to do this sort of probabilistic inference,
5710
04:50:38,400 --> 04:50:41,480
to be able to take a Bayesian network and do these sorts of calculations,
5711
04:50:41,480 --> 04:50:44,480
so that you don't need to know and understand all of the underlying math,
5712
04:50:44,480 --> 04:50:46,920
though it's helpful to have a general sense for how it works.
5713
04:50:46,920 --> 04:50:49,980
But you just need to be able to describe the structure of the network
5714
04:50:49,980 --> 04:50:53,960
and make queries in order to be able to produce the result.
5715
04:50:53,960 --> 04:50:56,680
And so let's take a look at an example of that right now.
5716
04:50:56,680 --> 04:50:59,040
It turns out that there are a lot of possible libraries
5717
04:50:59,040 --> 04:51:01,600
that exist in Python for doing this sort of inference.
5718
04:51:01,600 --> 04:51:04,000
It doesn't matter too much which specific library you use.
5719
04:51:04,000 --> 04:51:05,880
They all behave in fairly similar ways.
5720
04:51:05,880 --> 04:51:08,800
But the library I'm going to use here is one known as pomegranate.
5721
04:51:08,800 --> 04:51:13,440
And here inside of model.py, I have defined a Bayesian network,
5722
04:51:13,440 --> 04:51:17,800
just using the structure and the syntax that the pomegranate library expects.
5723
04:51:17,800 --> 04:51:20,560
And what I'm effectively doing is just, in Python,
5724
04:51:20,560 --> 04:51:24,400
creating nodes to represent each of the nodes of the Bayesian network
5725
04:51:24,400 --> 04:51:26,600
that you saw me describe a moment ago.
5726
04:51:26,600 --> 04:51:29,400
So here on line four, after I've imported pomegranate,
5727
04:51:29,400 --> 04:51:31,520
I'm defining a variable called rain that is going
5728
04:51:31,520 --> 04:51:35,640
to represent a node inside of my Bayesian network.
5729
04:51:35,640 --> 04:51:39,160
It's going to be a node that follows this distribution, where
5730
04:51:39,160 --> 04:51:42,320
there are three possible values, none for no rain, light for light rain,
5731
04:51:42,320 --> 04:51:43,600
heavy for heavy rain.
5732
04:51:43,600 --> 04:51:46,840
And these are the probabilities of each of those taking place.
5733
04:51:46,840 --> 04:51:53,280
0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain.
5734
04:51:53,280 --> 04:51:55,400
Then after that, we go to the next variable,
5735
04:51:55,400 --> 04:51:57,920
the variable for track maintenance, for example,
5736
04:51:57,920 --> 04:52:00,520
which is dependent upon that rain variable.
5737
04:52:00,520 --> 04:52:03,520
And this, instead of being an unconditional distribution,
5738
04:52:03,520 --> 04:52:05,720
is a conditional distribution, as indicated
5739
04:52:05,720 --> 04:52:07,960
by a conditional probability table here.
5740
04:52:07,960 --> 04:52:11,720
And the idea is that the distribution I'm following is conditional
5741
04:52:11,720 --> 04:52:13,520
on the distribution of rain.
5742
04:52:13,520 --> 04:52:17,000
So if there is no rain, then the chance that there is, yes, track maintenance
5743
04:52:17,000 --> 04:52:17,840
is 0.4.
5744
04:52:17,840 --> 04:52:21,360
If there's no rain, the chance that there is no track maintenance is 0.6.
5745
04:52:21,360 --> 04:52:23,360
Likewise, for light rain, I have a distribution.
5746
04:52:23,360 --> 04:52:25,400
For heavy rain, I have a distribution as well.
5747
04:52:25,400 --> 04:52:27,760
But I'm effectively encoding the same information
5748
04:52:27,760 --> 04:52:29,720
you saw represented graphically a moment ago.
5749
04:52:29,720 --> 04:52:33,320
But I'm telling this Python program that the maintenance node
5750
04:52:33,320 --> 04:52:37,200
obeys this particular conditional probability distribution.
5751
04:52:37,200 --> 04:52:40,720
And we do the same thing for the other random variables as well.
5752
04:52:40,720 --> 04:52:44,480
Train was a node inside my network that
5753
04:52:44,480 --> 04:52:47,680
had a conditional probability table with two parents.
5754
04:52:47,680 --> 04:52:51,040
It was dependent not only on rain, but also on track maintenance.
5755
04:52:51,040 --> 04:52:53,080
And so here I'm saying something like, given
5756
04:52:53,080 --> 04:52:55,840
that there is no rain and, yes, track maintenance,
5757
04:52:55,840 --> 04:52:59,240
the probability that my train is on time is 0.8.
5758
04:52:59,240 --> 04:53:01,880
And the probability that it's delayed is 0.2.
5759
04:53:01,880 --> 04:53:03,840
And likewise, I can do the same thing for all
5760
04:53:03,840 --> 04:53:07,960
of the other possible values of the parents of the train node
5761
04:53:07,960 --> 04:53:12,440
inside of my Bayesian network by saying, for all of those possible values,
5762
04:53:12,440 --> 04:53:16,160
here is the distribution that the train node should follow.
5763
04:53:16,160 --> 04:53:18,360
Then I do the same thing for an appointment
5764
04:53:18,360 --> 04:53:21,440
based on the distribution of the variable train.
5765
04:53:21,440 --> 04:53:24,960
Then at the end, what I do is actually construct this network
5766
04:53:24,960 --> 04:53:27,480
by describing what the states of the network are
5767
04:53:27,480 --> 04:53:30,240
and by adding edges between the dependent nodes.
5768
04:53:30,240 --> 04:53:33,440
So I create a new Bayesian network, add states to it, one for rain,
5769
04:53:33,440 --> 04:53:36,280
one for maintenance, one for the train, one for the appointment.
5770
04:53:36,280 --> 04:53:40,120
And then I add edges connecting the related pieces.
5771
04:53:40,120 --> 04:53:44,200
Rain has an arrow to maintenance because rain influences track maintenance.
5772
04:53:44,200 --> 04:53:46,120
Rain also influences the train.
5773
04:53:46,120 --> 04:53:48,160
Maintenance also influences the train.
5774
04:53:48,160 --> 04:53:50,800
And train influences whether I make it to my appointment.
5775
04:53:50,800 --> 04:53:54,440
And bake just finalizes the model and does some additional computation.
5776
04:53:54,440 --> 04:53:57,880
So the specific syntax of this is not really the important part.
5777
04:53:57,880 --> 04:54:00,640
Pomegranate just happens to be one of several different libraries
5778
04:54:00,640 --> 04:54:02,640
that can all be used for similar purposes.
5779
04:54:02,640 --> 04:54:05,840
And you could describe and define a library for yourself
5780
04:54:05,840 --> 04:54:07,560
that implemented similar things.
5781
04:54:07,560 --> 04:54:11,160
But the key idea here is that someone can design a library
5782
04:54:11,160 --> 04:54:15,320
for a general Bayesian network that has nodes that depend upon their parents.
5783
04:54:15,320 --> 04:54:18,840
And then all a programmer needs to do using one of those libraries
5784
04:54:18,840 --> 04:54:23,040
is to define what those nodes and what those probability distributions are.
5785
04:54:23,040 --> 04:54:26,600
And we can begin to do some interesting logic based on it.
5786
04:54:26,600 --> 04:54:30,800
So let's try doing that conditional or joint probability calculation
5787
04:54:30,800 --> 04:54:36,600
that we did by hand before by going into likelihood.py, where
5788
04:54:36,600 --> 04:54:40,000
here I'm importing the model that I just defined a moment ago.
5789
04:54:40,000 --> 04:54:42,880
And here I'd just like to calculate model.probability, which
5790
04:54:42,880 --> 04:54:46,000
calculates the probability for a given observation.
5791
04:54:46,000 --> 04:54:51,480
And I'd like to calculate the probability of no rain, no track maintenance,
5792
04:54:51,480 --> 04:54:54,600
my train is on time, and I'm able to attend the meeting.
5793
04:54:54,600 --> 04:54:58,200
So sort of the optimal scenario that there is no rain and no maintenance
5794
04:54:58,200 --> 04:55:01,240
on the track, my train is on time, and I'm able to attend the meeting.
5795
04:55:01,240 --> 04:55:04,560
What is the probability that all of that actually happens?
5796
04:55:04,560 --> 04:55:08,840
And I can calculate that using the library and just print out its probability.
5797
04:55:08,840 --> 04:55:12,400
And so I'll go ahead and run python likelihood.py.
5798
04:55:12,400 --> 04:55:16,840
And I see that, OK, the probability is about 0.34.
5799
04:55:16,840 --> 04:55:20,480
So about a third of the time, everything goes right for me in this case.
5800
04:55:20,480 --> 04:55:22,840
No rain, no track maintenance, train is on time,
5801
04:55:22,840 --> 04:55:24,760
and I'm able to attend the meeting.
5802
04:55:24,760 --> 04:55:28,280
But I could experiment with this, try and calculate other probabilities as well.
5803
04:55:28,280 --> 04:55:31,480
What's the probability that everything goes right up until the train,
5804
04:55:31,480 --> 04:55:33,680
but I still miss my meeting?
5805
04:55:33,680 --> 04:55:37,520
So no rain, no track maintenance, train is on time,
5806
04:55:37,520 --> 04:55:39,320
but I miss the appointment.
5807
04:55:39,320 --> 04:55:41,280
Let's calculate that probability.
5808
04:55:41,280 --> 04:55:44,240
And all right, that has a probability of about 0.04.
5809
04:55:44,240 --> 04:55:47,400
So about 4% of the time, the train will be on time,
5810
04:55:47,400 --> 04:55:49,240
there won't be any rain, no track maintenance,
5811
04:55:49,240 --> 04:55:52,200
and yet I'll still miss the meeting.
5812
04:55:52,200 --> 04:55:54,440
And so this is really just an implementation
5813
04:55:54,440 --> 04:55:57,560
of the calculation of the joint probabilities that we did before.
5814
04:55:57,560 --> 04:56:00,320
What this library is likely doing is first figuring out
5815
04:56:00,320 --> 04:56:03,400
the probability of no rain, then figuring out
5816
04:56:03,400 --> 04:56:06,760
the probability of no track maintenance given no rain,
5817
04:56:06,760 --> 04:56:10,160
then the probability that my train is on time given both of these values,
5818
04:56:10,160 --> 04:56:13,600
and then the probability that I miss my appointment given that I
5819
04:56:13,600 --> 04:56:15,600
know that the train was on time.
5820
04:56:15,600 --> 04:56:18,800
So this, again, is the calculation of that joint probability.
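That chain-rule walkthrough can be sketched by hand like this. It is an illustration of what a library such as pomegranate is likely computing, not its actual implementation; only the CPT rows this one query needs are included, with the on-time row chosen so the result reproduces the 0.34 quoted above.

```python
# Joint probability via the chain rule: multiply each node's CPT entry,
# conditioning each factor on the values of the node's parents.

rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance = {"none": {"yes": 0.4, "no": 0.6}}             # P(maintenance | rain=none)
train = {("none", "no"): {"on time": 0.9, "delayed": 0.1}}  # P(train | rain=none, maintenance=no)
appointment = {"on time": {"attend": 0.9, "miss": 0.1}}     # P(appointment | train=on time)

def joint(r, m, t, a):
    """P(rain=r, maintenance=m, train=t, appointment=a)."""
    return (rain[r]
            * maintenance[r][m]
            * train[(r, m)][t]
            * appointment[t][a])

print(joint("none", "no", "on time", "attend"))  # about 0.34, as in the lecture
print(joint("none", "no", "on time", "miss"))    # about 0.04
```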
5821
04:56:18,800 --> 04:56:22,000
And turns out we can also begin to have our computer solve inference problems
5822
04:56:22,000 --> 04:56:26,560
as well, to begin to infer, based on information, evidence that we see,
5823
04:56:26,560 --> 04:56:30,640
what is the likelihood of other variables also being true.
5824
04:56:30,640 --> 04:56:33,720
So let's go into inference.py, for example.
5825
04:56:33,720 --> 04:56:36,760
Here, I'm again importing that exact same model from before,
5826
04:56:36,760 --> 04:56:38,920
importing all the nodes and all the edges
5827
04:56:38,920 --> 04:56:42,840
and the probability distribution that is encoded there as well.
5828
04:56:42,840 --> 04:56:45,960
And now there's a function for doing some sort of prediction.
5829
04:56:45,960 --> 04:56:50,400
And here, into this model, I pass in the evidence that I observe.
5830
04:56:50,400 --> 04:56:54,400
So here, I've encoded into this Python program the evidence
5831
04:56:54,400 --> 04:56:55,400
that I have observed.
5832
04:56:55,400 --> 04:56:58,600
I have observed the fact that the train is delayed.
5833
04:56:58,600 --> 04:57:01,840
And that is the value for one of the four random variables
5834
04:57:01,840 --> 04:57:03,800
inside of this Bayesian network.
5835
04:57:03,800 --> 04:57:07,320
And using that information, I would like to be able to draw inferences
5836
04:57:07,320 --> 04:57:09,680
and conclusions about the values
5837
04:57:09,680 --> 04:57:13,120
of the other random variables that are inside of my Bayesian network.
5838
04:57:13,120 --> 04:57:15,920
I would like to make predictions about everything else.
5839
04:57:15,920 --> 04:57:19,960
So all of the actual computational logic is happening in just these three lines,
5840
04:57:19,960 --> 04:57:21,920
where I'm making this call to this prediction.
5841
04:57:21,920 --> 04:57:25,720
Down below, I'm just iterating over all of the states and all the predictions
5842
04:57:25,720 --> 04:57:29,360
and just printing them out so that we can visually see what the results are.
5843
04:57:29,360 --> 04:57:31,640
But let's find out, given the train is delayed,
5844
04:57:31,640 --> 04:57:35,840
what can I predict about the values of the other random variables?
5845
04:57:35,840 --> 04:57:38,960
Let's go ahead and run python inference.py.
5846
04:57:38,960 --> 04:57:41,520
I run that, and all right, here is the result that I get.
5847
04:57:41,520 --> 04:57:44,280
Given the fact that I know that the train is delayed,
5848
04:57:44,280 --> 04:57:46,400
this is evidence that I have observed.
5849
04:57:46,400 --> 04:57:50,120
Well, given that, there is about a 46% chance
5850
04:57:50,120 --> 04:57:52,720
that there was no rain, a 31% chance there was light rain,
5851
04:57:52,720 --> 04:57:56,360
a 23% chance there was heavy rain, I can see a probability distribution
5852
04:57:56,360 --> 04:57:58,720
of a track maintenance and a probability distribution
5853
04:57:58,720 --> 04:58:01,760
over whether I'm able to attend or miss my appointment.
5854
04:58:01,760 --> 04:58:04,560
Now, we know that whether I attend or miss the appointment,
5855
04:58:04,560 --> 04:58:07,960
that is only dependent upon the train being delayed or not delayed.
5856
04:58:07,960 --> 04:58:10,160
It shouldn't depend on anything else.
5857
04:58:10,160 --> 04:58:14,240
So let's imagine, for example, that I knew that there was heavy rain.
5858
04:58:14,240 --> 04:58:18,240
That shouldn't affect the distribution for making the appointment.
5859
04:58:18,240 --> 04:58:21,000
And indeed, if I go up here and add some evidence,
5860
04:58:21,000 --> 04:58:23,680
say that I know that the value of rain is heavy.
5861
04:58:23,680 --> 04:58:25,520
That is evidence that I now have access to.
5862
04:58:25,520 --> 04:58:27,040
I now have two pieces of evidence.
5863
04:58:27,040 --> 04:58:31,600
I know that the rain is heavy, and I know that my train is delayed.
5864
04:58:31,600 --> 04:58:35,160
I can calculate the probability by running this inference procedure again
5865
04:58:35,160 --> 04:58:37,960
and seeing the result. I know that the rain is heavy.
5866
04:58:37,960 --> 04:58:39,480
I know my train is delayed.
5867
04:58:39,480 --> 04:58:42,680
The probability distribution for track maintenance changed.
5868
04:58:42,680 --> 04:58:44,680
Given that I know that there's heavy rain,
5869
04:58:44,680 --> 04:58:48,240
now it's more likely that there is no track maintenance, 88%,
5870
04:58:48,240 --> 04:58:51,880
as opposed to 64% from before.
5871
04:58:51,880 --> 04:58:55,680
And now, what is the probability that I make the appointment?
5872
04:58:55,680 --> 04:58:57,120
Well, that's the same as before.
5873
04:58:57,120 --> 04:59:00,720
It's still going to be attend the appointment with probability 0.6,
5874
04:59:00,720 --> 04:59:03,080
missed the appointment with probability 0.4,
5875
04:59:03,080 --> 04:59:05,440
because it was only dependent upon whether or not
5876
04:59:05,440 --> 04:59:07,760
my train was on time or delayed.
5877
04:59:07,760 --> 04:59:11,240
And so this here is implementing that idea of that inference algorithm
5878
04:59:11,240 --> 04:59:14,600
to be able to figure out, based on the evidence that I have,
5879
04:59:14,600 --> 04:59:18,800
what can we infer about the values of the other variables that exist as well.
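The enumeration idea behind that inference can be sketched by hand: to get the distribution of rain given that the train is delayed, sum the joint probability over every value of the hidden maintenance variable, then normalize. This is a sketch of the idea, not the library's predict function, and the light/heavy CPT rows are illustrative assumptions.

```python
# Inference by enumeration for P(rain | train = delayed):
# marginalize out the hidden maintenance variable, then normalize.

rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},   # assumed
    "heavy": {"yes": 0.1, "no": 0.9},   # assumed
}
train = {
    ("none", "yes"):  {"on time": 0.8, "delayed": 0.2},
    ("none", "no"):   {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # assumed
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
    ("heavy", "no"):  {"on time": 0.5, "delayed": 0.5},  # assumed
}

def rain_posterior(t):
    """P(rain | train = t), summing the joint over hidden maintenance."""
    unnormalized = {
        r: sum(rain[r] * maintenance[r][m] * train[(r, m)][t]
               for m in ("yes", "no"))
        for r in rain
    }
    total = sum(unnormalized.values())
    return {r: p / total for r, p in unnormalized.items()}

posterior = rain_posterior("delayed")
for r, p in posterior.items():
    print(f"{r}: {p:.2f}")   # no rain comes out most likely, roughly
                             # matching the percentages quoted above
```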
5880
04:59:18,800 --> 04:59:22,520
So inference by enumeration is one way of doing this inference procedure,
5881
04:59:22,520 --> 04:59:26,360
just looping over all of the values the hidden variables could take on
5882
04:59:26,360 --> 04:59:29,080
and figuring out what the probability is.
5883
04:59:29,080 --> 04:59:31,640
Now, it turns out this is not particularly efficient.
5884
04:59:31,640 --> 04:59:35,800
And there are definitely optimizations you can make by avoiding repeated work.
5885
04:59:35,800 --> 04:59:38,680
If you're calculating the same sort of probability multiple times,
5886
04:59:38,680 --> 04:59:40,840
there are ways of optimizing the program to avoid
5887
04:59:40,840 --> 04:59:44,280
having to recalculate the same probabilities again and again.
5888
04:59:44,280 --> 04:59:47,240
But even then, as the number of variables gets large,
5889
04:59:47,240 --> 04:59:50,640
and as the number of possible values those variables could take on gets large,
5890
04:59:50,640 --> 04:59:52,920
we're going to start to have to do a lot of computation,
5891
04:59:52,920 --> 04:59:55,800
a lot of calculation, to be able to do this inference.
5892
04:59:55,800 --> 04:59:58,560
And at that point, it might start to get unreasonable,
5893
04:59:58,560 --> 05:00:00,680
in terms of the amount of time that it would take
5894
05:00:00,680 --> 05:00:04,280
to be able to do this sort of exact inference.
5895
05:00:04,280 --> 05:00:06,080
And it's for that reason that oftentimes, when
5896
05:00:06,080 --> 05:00:09,560
it comes towards probability and things we're not entirely sure about,
5897
05:00:09,560 --> 05:00:11,880
we don't always care about doing exact inference
5898
05:00:11,880 --> 05:00:14,640
and knowing exactly what the probability is.
5899
05:00:14,640 --> 05:00:17,120
But if we can approximate the inference procedure,
5900
05:00:17,120 --> 05:00:21,160
do some sort of approximate inference, that can be pretty good as well.
5901
05:00:21,160 --> 05:00:23,160
That if I don't know the exact probability,
5902
05:00:23,160 --> 05:00:25,120
but I have a general sense for the probability
5903
05:00:25,120 --> 05:00:28,000
that I can get increasingly accurate with more time,
5904
05:00:28,000 --> 05:00:30,360
then that's probably pretty good, especially
5905
05:00:30,360 --> 05:00:33,200
if I can get that to happen even faster.
5906
05:00:33,200 --> 05:00:37,520
So how could I do approximate inference inside of a Bayesian network?
5907
05:00:37,520 --> 05:00:40,080
Well, one method is through a procedure known as sampling.
5908
05:00:40,080 --> 05:00:42,200
In the process of sampling, I'm going to take
5909
05:00:42,200 --> 05:00:46,440
a sample of all of the variables inside of this Bayesian network here.
5910
05:00:46,440 --> 05:00:47,840
And how am I going to sample?
5911
05:00:47,840 --> 05:00:51,840
Well, I'm going to sample one of the values from each of these nodes
5912
05:00:51,840 --> 05:00:54,160
according to their probability distribution.
5913
05:00:54,160 --> 05:00:56,120
So how might I take a sample of all these nodes?
5914
05:00:56,120 --> 05:00:57,040
Well, I'll start at the root.
5915
05:00:57,040 --> 05:00:58,080
I'll start with rain.
5916
05:00:58,080 --> 05:00:59,800
Here's the distribution for rain.
5917
05:00:59,800 --> 05:01:03,520
And I'll go ahead and, using a random number generator or something like it,
5918
05:01:03,520 --> 05:01:05,400
randomly pick one of these three values.
5919
05:01:05,400 --> 05:01:09,360
I'll pick none with probability 0.7, light with probability 0.2,
5920
05:01:09,360 --> 05:01:11,080
and heavy with probability 0.1.
5921
05:01:11,080 --> 05:01:14,400
So I'll randomly just pick one of them according to that distribution.
5922
05:01:14,400 --> 05:01:17,480
And maybe in this case, I pick none, for example.
5923
05:01:17,480 --> 05:01:19,440
Then I do the same thing for the other variable.
5924
05:01:19,440 --> 05:01:22,120
Maintenance also has a probability distribution.
5925
05:01:22,120 --> 05:01:23,680
And I'm going to sample.
5926
05:01:23,680 --> 05:01:26,120
Now, there are three probability distributions here.
5927
05:01:26,120 --> 05:01:29,360
But I'm only going to sample from this first row here,
5928
05:01:29,360 --> 05:01:33,880
because I've observed already in my sample that the value of rain is none.
5929
05:01:33,880 --> 05:01:37,960
So given that rain is none, I'm going to sample from this distribution to say,
5930
05:01:37,960 --> 05:01:40,040
all right, what should the value of maintenance be?
5931
05:01:40,040 --> 05:01:42,800
And in this case, maintenance is going to be, let's just say yes,
5932
05:01:42,800 --> 05:01:47,560
which happens 40% of the time in the event that there is no rain, for example.
5933
05:01:47,560 --> 05:01:50,360
And we'll sample all of the rest of the nodes in this way as well,
5934
05:01:50,360 --> 05:01:52,480
that I want to sample from the train distribution.
5935
05:01:52,480 --> 05:01:56,680
And I'll sample from this first row here, where there is no rain,
5936
05:01:56,680 --> 05:01:58,200
but there is track maintenance.
5937
05:01:58,200 --> 05:02:00,160
And I'll sample 80% of the time.
5938
05:02:00,160 --> 05:02:01,560
I'll say the train is on time.
5939
05:02:01,560 --> 05:02:04,320
20% of the time, I'll say the train is delayed.
5940
05:02:04,320 --> 05:02:07,280
And finally, we'll do the same thing for whether I make it to my appointment
5941
05:02:07,280 --> 05:02:07,560
or not.
5942
05:02:07,560 --> 05:02:09,120
Did I attend or miss the appointment?
5943
05:02:09,120 --> 05:02:11,640
We'll sample based on this distribution and maybe say
5944
05:02:11,640 --> 05:02:13,760
that in this case, I attend the appointment, which
5945
05:02:13,760 --> 05:02:18,480
happens 90% of the time when the train is actually on time.
5946
05:02:18,480 --> 05:02:22,560
So by going through these nodes, I can very quickly just do some sampling
5947
05:02:22,560 --> 05:02:26,200
and get a sample of the possible values that could come up
5948
05:02:26,200 --> 05:02:28,600
from going through this entire Bayesian network
5949
05:02:28,600 --> 05:02:31,160
according to those probability distributions.
5950
05:02:31,160 --> 05:02:34,040
And where this becomes powerful is if I do this not once,
5951
05:02:34,040 --> 05:02:36,640
but I do this thousands or tens of thousands of times
5952
05:02:36,640 --> 05:02:39,960
and generate a whole bunch of samples all using this distribution.
5953
05:02:39,960 --> 05:02:41,040
I get different samples.
5954
05:02:41,040 --> 05:02:42,480
Maybe some of them are the same.
5955
05:02:42,480 --> 05:02:47,320
But I get a value for each of the possible variables that could come up.
5956
05:02:47,320 --> 05:02:49,320
And so then if I'm ever faced with a question,
5957
05:02:49,320 --> 05:02:53,480
a question like, what is the probability that the train is on time,
5958
05:02:53,480 --> 05:02:55,520
you could do an exact inference procedure.
5959
05:02:55,520 --> 05:02:58,240
This is no different than the inference problem we had before
5960
05:02:58,240 --> 05:03:01,400
where I could just marginalize, look at all the possible other values
5961
05:03:01,400 --> 05:03:05,080
of the variables, and do the computation of inference by enumeration
5962
05:03:05,080 --> 05:03:07,840
to find out this probability exactly.
5963
05:03:07,840 --> 05:03:10,680
But I could also, if I don't care about the exact probability,
5964
05:03:10,680 --> 05:03:12,800
just sample it, approximate it to get close.
5965
05:03:12,800 --> 05:03:16,240
And this is a powerful tool in AI where we don't need to be right 100%
5966
05:03:16,240 --> 05:03:18,440
of the time or we don't need to be exactly right.
5967
05:03:18,440 --> 05:03:20,760
If we just need to be right with some probability,
5968
05:03:20,760 --> 05:03:23,800
we can often do so more effectively, more efficiently.
5969
05:03:23,800 --> 05:03:26,920
And so if here now are all of those possible samples,
5970
05:03:26,920 --> 05:03:30,000
I'll highlight the ones where the train is on time.
5971
05:03:30,000 --> 05:03:32,240
I'm ignoring the ones where the train is delayed.
5972
05:03:32,240 --> 05:03:35,640
And in this case, six out of eight of the samples
5973
05:03:35,640 --> 05:03:37,320
have the train arriving on time.
5974
05:03:37,320 --> 05:03:40,960
And so maybe in this case, I can say that in six out of eight cases,
5975
05:03:40,960 --> 05:03:43,200
that's the likelihood that the train is on time.
5976
05:03:43,200 --> 05:03:45,640
And with eight samples, that might not be a great prediction.
5977
05:03:45,640 --> 05:03:48,160
But if I had thousands upon thousands of samples,
5978
05:03:48,160 --> 05:03:51,240
then this could be a much better inference procedure
5979
05:03:51,240 --> 05:03:53,320
to be able to do these sorts of calculations.
5980
05:03:53,320 --> 05:03:56,960
So this is a direct sampling method to just do a bunch of samples
5981
05:03:56,960 --> 05:04:00,920
and then figure out what the probability of some event is.
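That top-down pass, sampling the root first and each child conditioned on its already-sampled parents, might look like this as a library-free sketch (the light/heavy CPT rows are assumptions made for illustration):

```python
# Direct (forward) sampling from the Bayesian network, then estimating
# P(train = on time) as the fraction of samples where it holds.
import random

random.seed(0)  # fixed seed so repeated runs are repeatable

rain_cpt = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance_cpt = {"none": {"yes": 0.4, "no": 0.6},
                   "light": {"yes": 0.2, "no": 0.8},    # assumed
                   "heavy": {"yes": 0.1, "no": 0.9}}    # assumed
train_cpt = {("none", "yes"): {"on time": 0.8, "delayed": 0.2},
             ("none", "no"): {"on time": 0.9, "delayed": 0.1},
             ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # assumed
             ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed
             ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
             ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed

def pick(dist):
    """Draw one value from a {value: probability} distribution."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights)[0]

def sample():
    """Sample nodes in topological order, parents before children."""
    r = pick(rain_cpt)               # root: unconditional distribution
    m = pick(maintenance_cpt[r])     # conditioned on sampled rain
    t = pick(train_cpt[(r, m)])      # conditioned on both parents
    return r, m, t

N = 10_000
on_time = sum(sample()[2] == "on time" for _ in range(N))
print(on_time / N)  # an estimate of P(train = on time); roughly 0.79
                    # under these tables
```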
5982
05:04:00,920 --> 05:04:03,960
Now, this from before was an unconditional probability.
5983
05:04:03,960 --> 05:04:07,080
What is the probability that the train is on time?
5984
05:04:07,080 --> 05:04:09,880
And I did that by looking at all the samples and figuring out, right,
5985
05:04:09,880 --> 05:04:12,120
here are the ones where the train is on time.
5986
05:04:12,120 --> 05:04:16,000
But sometimes what I want to calculate is not an unconditional probability,
5987
05:04:16,000 --> 05:04:18,360
but rather a conditional probability, something
5988
05:04:18,360 --> 05:04:21,240
like what is the probability that there is light rain,
5989
05:04:21,240 --> 05:04:24,600
given that the train is on time, something to that effect.
5990
05:04:24,600 --> 05:04:28,200
And to do that kind of calculation, well, what I might do
5991
05:04:28,200 --> 05:04:31,360
is here are all the samples that I have.
5992
05:04:31,360 --> 05:04:33,920
And I want to calculate a probability distribution,
5993
05:04:33,920 --> 05:04:36,920
given that I know that the train is on time.
5994
05:04:36,920 --> 05:04:38,800
So to be able to do that, I can kind of look
5995
05:04:38,800 --> 05:04:43,280
at the two cases where the train was delayed and ignore or reject them,
5996
05:04:43,280 --> 05:04:47,400
sort of exclude them from the possible samples that I'm considering.
5997
05:04:47,400 --> 05:04:50,760
And now I want to look at these remaining cases where the train is on time.
5998
05:04:50,760 --> 05:04:53,480
Here are the cases where there is light rain.
5999
05:04:53,480 --> 05:04:56,440
And I say, OK, these are two out of the six possible cases.
6000
05:04:56,440 --> 05:05:00,200
That can give me an approximation for the probability of light rain,
6001
05:05:00,200 --> 05:05:03,080
given the fact that I know the train was on time.
6002
05:05:03,080 --> 05:05:05,340
And I did that in almost exactly the same way,
6003
05:05:05,340 --> 05:05:08,600
just by adding an additional step, by saying that, all right,
6004
05:05:08,600 --> 05:05:12,080
when I take each sample, let me reject all of the samples that
6005
05:05:12,080 --> 05:05:14,960
don't match my evidence and only consider
6006
05:05:14,960 --> 05:05:19,200
the samples that do match what it is that I have in my evidence
6007
05:05:19,200 --> 05:05:21,640
that I want to make some sort of calculation about.
6008
05:05:21,640 --> 05:05:25,560
And it turns out, using the libraries that we've had for Bayesian networks,
6009
05:05:25,560 --> 05:05:28,180
we can begin to implement this same sort of idea,
6010
05:05:28,180 --> 05:05:31,520
like implement rejection sampling, which is what this method is called,
6011
05:05:31,520 --> 05:05:35,480
to be able to figure out some probability, not via direct inference,
6012
05:05:35,480 --> 05:05:37,600
but instead by sampling.
6013
05:05:37,600 --> 05:05:39,920
So what I have here is a program called sample.py.
6014
05:05:39,920 --> 05:05:41,840
Imports the exact same model.
6015
05:05:41,840 --> 05:05:45,000
And what I define first is a function to generate a sample.
6016
05:05:45,000 --> 05:05:48,720
And the way I generate a sample is just by looping over all of the states.
6017
05:05:48,720 --> 05:05:50,520
The states need to be in some sort of order
6018
05:05:50,520 --> 05:05:52,360
to make sure I'm looping in the correct order.
6019
05:05:52,360 --> 05:05:55,640
But effectively, if it is a conditional distribution,
6020
05:05:55,640 --> 05:05:58,040
I'm going to sample based on the parents.
6021
05:05:58,040 --> 05:06:00,240
And otherwise, I'm just going to directly sample
6022
05:06:00,240 --> 05:06:02,280
the variable, like rain, which has no parents.
6023
05:06:02,280 --> 05:06:05,000
It's just an unconditional distribution and keep
6024
05:06:05,000 --> 05:06:08,240
track of all those parent samples and return the final sample.
6025
05:06:08,240 --> 05:06:11,040
The exact syntax of this, again, not particularly important.
6026
05:06:11,040 --> 05:06:13,680
It just happens to be part of the implementation details
6027
05:06:13,680 --> 05:06:15,440
of this particular library.
6028
05:06:15,440 --> 05:06:17,920
The interesting logic is down below.
6029
05:06:17,920 --> 05:06:20,440
Now that I have the ability to generate a sample,
6030
05:06:20,440 --> 05:06:24,280
if I want to know the distribution of the appointment random variable,
6031
05:06:24,280 --> 05:06:26,520
given that the train is delayed, well, then I
6032
05:06:26,520 --> 05:06:28,400
can begin to do calculations like this.
6033
05:06:28,400 --> 05:06:32,080
Let me take 10,000 samples and assemble all my results
6034
05:06:32,080 --> 05:06:33,440
in this list called data.
6035
05:06:33,440 --> 05:06:36,760
I'll go ahead and loop n times, in this case, 10,000 times.
6036
05:06:36,760 --> 05:06:38,720
I'll generate a sample.
6037
05:06:38,720 --> 05:06:41,320
And I want to know the distribution of appointment,
6038
05:06:41,320 --> 05:06:43,040
given that the train is delayed.
6039
05:06:43,040 --> 05:06:45,520
So according to rejection sampling, I'm only
6040
05:06:45,520 --> 05:06:47,840
going to consider samples where the train is delayed.
6041
05:06:47,840 --> 05:06:51,400
If the train is not delayed, I'm not going to consider those values at all.
6042
05:06:51,400 --> 05:06:53,400
So I'm going to say, all right, if I take the sample,
6043
05:06:53,400 --> 05:06:57,560
look at the value of the train random variable, if the train is delayed,
6044
05:06:57,560 --> 05:06:59,320
well, let me go ahead and add to my data
6045
05:06:59,320 --> 05:07:02,640
that I'm collecting the value of the appointment random variable
6046
05:07:02,640 --> 05:07:05,400
that it took on in this particular sample.
6047
05:07:05,400 --> 05:07:08,240
So I'm only considering the samples where the train is delayed.
6048
05:07:08,240 --> 05:07:11,840
And for each of those samples, considering what the value of appointment
6049
05:07:11,840 --> 05:07:14,440
is, and then at the end, I'm using a Python class called
6050
05:07:14,440 --> 05:07:18,120
Counter, which quickly counts up all the values inside of a data set.
6051
05:07:18,120 --> 05:07:20,560
So I can take this list of data and figure out
6052
05:07:20,560 --> 05:07:25,680
how many times was my appointment made and how many times was my appointment
6053
05:07:25,680 --> 05:07:27,080
missed.
6054
05:07:27,080 --> 05:07:29,240
And so this here, with just a couple lines of code,
6055
05:07:29,240 --> 05:07:32,720
is an implementation of rejection sampling.
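A self-contained sketch of the same rejection-sampling loop as in sample.py might look like this. Sample generation is simplified to a fixed topological order, and the light/heavy CPT rows are assumptions; this is not the lecture's exact sample.py.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

rain_cpt = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance_cpt = {"none": {"yes": 0.4, "no": 0.6},
                   "light": {"yes": 0.2, "no": 0.8},    # assumed
                   "heavy": {"yes": 0.1, "no": 0.9}}    # assumed
train_cpt = {("none", "yes"): {"on time": 0.8, "delayed": 0.2},
             ("none", "no"): {"on time": 0.9, "delayed": 0.1},
             ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # assumed
             ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed
             ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
             ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed
appointment_cpt = {"on time": {"attend": 0.9, "miss": 0.1},
                   "delayed": {"attend": 0.6, "miss": 0.4}}

def pick(dist):
    """Draw one value from a {value: probability} distribution."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights)[0]

def generate_sample():
    """Sample every node in topological order, parents first."""
    r = pick(rain_cpt)
    m = pick(maintenance_cpt[r])
    t = pick(train_cpt[(r, m)])
    a = pick(appointment_cpt[t])
    return {"rain": r, "maintenance": m, "train": t, "appointment": a}

# Rejection sampling: keep only samples matching the evidence
# (train is delayed) and record each survivor's appointment value.
data = []
for _ in range(10_000):
    s = generate_sample()
    if s["train"] == "delayed":
        data.append(s["appointment"])

print(Counter(data))  # attend outnumbers miss, roughly 60/40
```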
6056
05:07:32,720 --> 05:07:37,800
And I can run it by going ahead and running Python sample.py.
6057
05:07:37,800 --> 05:07:39,840
And when I do that, here is the result I get.
6058
05:07:39,840 --> 05:07:41,760
This is the result of the counter.
6059
05:07:41,760 --> 05:07:45,400
1,251 times, I was able to attend the meeting.
6060
05:07:45,400 --> 05:07:48,520
And 856 times, I missed the meeting.
6061
05:07:48,520 --> 05:07:51,080
And you can imagine, by doing more and more samples,
6062
05:07:51,080 --> 05:07:54,120
I'll be able to get a better and better, more accurate result.
6063
05:07:54,120 --> 05:07:55,680
And this is a randomized process.
6064
05:07:55,680 --> 05:07:58,560
It's going to be an approximation of the probability.
6065
05:07:58,560 --> 05:08:01,760
If I run it a different time, you'll notice the numbers are similar:
6066
05:08:01,760 --> 05:08:03,760
1,272 and 905.
6067
05:08:03,760 --> 05:08:07,560
But they're not identical because there's some randomization, some likelihood
6068
05:08:07,560 --> 05:08:09,280
that things might be higher or lower.
6069
05:08:09,280 --> 05:08:12,800
And so this is why we generally want to try and use more samples so that we
6070
05:08:12,800 --> 05:08:15,520
can have a greater amount of confidence in our result,
6071
05:08:15,520 --> 05:08:18,840
be more sure about the result that we're getting of whether or not
6072
05:08:18,840 --> 05:08:23,520
it accurately reflects or represents the actual underlying probabilities that
6073
05:08:23,520 --> 05:08:26,680
are inherent inside of this distribution.
6074
05:08:26,680 --> 05:08:29,720
And so this, then, was an instance of rejection sampling.
6075
05:08:29,720 --> 05:08:32,280
And it turns out there are a number of other sampling methods
6076
05:08:32,280 --> 05:08:34,720
that you could use to begin to try to sample.
6077
05:08:34,720 --> 05:08:37,160
One problem that rejection sampling has is
6078
05:08:37,160 --> 05:08:41,800
that if the evidence you're looking for is a fairly unlikely event,
6079
05:08:41,800 --> 05:08:44,240
well, you're going to be rejecting a lot of samples.
6080
05:08:44,240 --> 05:08:48,160
Like if I'm looking for the probability of x given some evidence e,
6081
05:08:48,160 --> 05:08:52,320
if e is very unlikely to occur, like occurs maybe one every 1,000 times,
6082
05:08:52,320 --> 05:08:56,120
then I'm only going to be considering 1 out of every 1,000 samples that I do,
6083
05:08:56,120 --> 05:08:59,760
which is a pretty inefficient method for trying to do this sort of calculation.
6084
05:08:59,760 --> 05:09:01,720
I'm throwing away a lot of samples.
6085
05:09:01,720 --> 05:09:05,040
And it takes computational effort to be able to generate those samples.
6086
05:09:05,040 --> 05:09:07,320
So I'd like to not have to do something like that.
6087
05:09:07,320 --> 05:09:09,880
So there are other sampling methods that can try and address this.
6088
05:09:09,880 --> 05:09:13,320
One such sampling method is called likelihood weighting.
6089
05:09:13,320 --> 05:09:16,600
In likelihood weighting, we follow a slightly different procedure.
6090
05:09:16,600 --> 05:09:20,680
And the goal is to avoid needing to throw out samples
6091
05:09:20,680 --> 05:09:22,240
that didn't match the evidence.
6092
05:09:22,240 --> 05:09:26,400
And so what we'll do is we'll start by fixing the values for the evidence
6093
05:09:26,400 --> 05:09:26,920
variables.
6094
05:09:26,920 --> 05:09:29,080
Rather than sample everything, we're going
6095
05:09:29,080 --> 05:09:33,480
to fix the values of the evidence variables and not sample those.
6096
05:09:33,480 --> 05:09:36,640
Then we're going to sample all the other non-evidence variables
6097
05:09:36,640 --> 05:09:38,920
in the same way, just using the Bayesian network looking
6098
05:09:38,920 --> 05:09:43,640
at the probability distributions, sampling all the non-evidence variables.
6099
05:09:43,640 --> 05:09:48,080
But then what we need to do is weight each sample by its likelihood.
6100
05:09:48,080 --> 05:09:50,120
If our evidence is really unlikely, we want
6101
05:09:50,120 --> 05:09:53,840
to make sure that we've taken into account how likely was the evidence
6102
05:09:53,840 --> 05:09:55,920
to actually show up in the sample.
6103
05:09:55,920 --> 05:09:58,200
If I have a sample where the evidence was much more
6104
05:09:58,200 --> 05:10:00,360
likely to show up than another sample, then I
6105
05:10:00,360 --> 05:10:02,680
want to weight the more likely one higher.
6106
05:10:02,680 --> 05:10:06,080
So we're going to weight each sample by its likelihood, where likelihood is just
6107
05:10:06,080 --> 05:10:09,120
defined as the probability of all the evidence.
6108
05:10:09,120 --> 05:10:11,720
Given all the evidence we have, what is the probability
6109
05:10:11,720 --> 05:10:14,280
that it would happen in that particular sample?
6110
05:10:14,280 --> 05:10:16,860
So before, all of our samples were weighted equally.
6111
05:10:16,860 --> 05:10:19,360
They all had a weight of 1 when we were calculating
6112
05:10:19,360 --> 05:10:20,600
the overall average.
6113
05:10:20,600 --> 05:10:22,640
In this case, we're going to weight each sample,
6114
05:10:22,640 --> 05:10:25,840
multiply each sample by its likelihood in order
6115
05:10:25,840 --> 05:10:28,880
to get the more accurate distribution.
6116
05:10:28,880 --> 05:10:30,080
So what would this look like?
6117
05:10:30,080 --> 05:10:33,520
Well, if I ask the same question, what is the probability of light rain,
6118
05:10:33,520 --> 05:10:36,680
given that the train is on time, when I do the sampling procedure
6119
05:10:36,680 --> 05:10:40,720
and start by trying to sample, I'm going to start by fixing the evidence
6120
05:10:40,720 --> 05:10:41,280
variable.
6121
05:10:41,280 --> 05:10:44,280
I'm already going to have in my sample the train is on time.
6122
05:10:44,280 --> 05:10:46,480
That way, I don't have to throw out anything.
6123
05:10:46,480 --> 05:10:50,280
I'm only sampling things where I know the value of the variables that
6124
05:10:50,280 --> 05:10:53,440
are my evidence are what I expect them to be.
6125
05:10:53,440 --> 05:10:55,200
So I'll go ahead and sample from rain.
6126
05:10:55,200 --> 05:10:58,160
And maybe this time, I sample light rain instead of no rain.
6127
05:10:58,160 --> 05:11:00,000
Then I'll sample from track maintenance and say,
6128
05:11:00,000 --> 05:11:01,720
maybe, yes, there's track maintenance.
6129
05:11:01,720 --> 05:11:04,800
Then for train, well, I've already fixed it in place.
6130
05:11:04,800 --> 05:11:06,840
Train was an evidence variable.
6131
05:11:06,840 --> 05:11:09,000
So I'm not going to bother sampling again.
6132
05:11:09,000 --> 05:11:10,520
I'll just go ahead and move on.
6133
05:11:10,520 --> 05:11:14,880
I'll move on to appointment and go ahead and sample from appointment as well.
6134
05:11:14,880 --> 05:11:16,680
So now I've generated a sample.
6135
05:11:16,680 --> 05:11:19,840
I've generated a sample by fixing this evidence variable
6136
05:11:19,840 --> 05:11:22,000
and sampling the other three.
6137
05:11:22,000 --> 05:11:24,000
And the last step is now weighting the sample.
6138
05:11:24,000 --> 05:11:25,560
How much weight should it have?
6139
05:11:25,560 --> 05:11:28,520
And the weight is based on how probable is it
6140
05:11:28,520 --> 05:11:32,080
that the train was actually on time, this evidence actually happened,
6141
05:11:32,080 --> 05:11:35,080
given the values of these other variables, light rain and the fact
6142
05:11:35,080 --> 05:11:37,280
that, yes, there was track maintenance.
6143
05:11:37,280 --> 05:11:39,880
Well, to do that, I can just go back to the train variable
6144
05:11:39,880 --> 05:11:43,280
and say, all right, if there was light rain and track maintenance,
6145
05:11:43,280 --> 05:11:46,800
the likelihood of my evidence, the likelihood that my train was on time,
6146
05:11:46,800 --> 05:11:48,200
is 0.6.
6147
05:11:48,200 --> 05:11:52,880
And so this particular sample would have a weight of 0.6.
6148
05:11:52,880 --> 05:11:55,360
And I could repeat the sampling procedure again and again.
6149
05:11:55,360 --> 05:11:57,760
Each time every sample would be given a weight
6150
05:11:57,760 --> 05:12:02,560
according to the probability of the evidence that I see associated with it.
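The weighting procedure described here can be sketched in plain Python. This is a minimal illustration, not the course's code: the network structure mirrors the lecture's rain/maintenance/train example, but every probability below is an illustrative stand-in except the 0.6 chance of the train being on time given light rain and track maintenance, which matches the lecture.

```python
import random

# Likelihood weighting for the query P(Rain = light | Train = on time).
# All numbers are illustrative stand-ins except P(on time | light, yes) = 0.6.
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_maintenance = {  # P(Maintenance = yes/no | Rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
P_train_on_time = {  # P(Train = on time | Rain, Maintenance)
    ("none", "yes"): 0.8, ("none", "no"): 0.9,
    ("light", "yes"): 0.6, ("light", "no"): 0.7,
    ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5,
}

def weighted_sample():
    """Sample the non-evidence variables; fix Train = on time as evidence."""
    rain = random.choices(list(P_rain), weights=list(P_rain.values()))[0]
    m = P_maintenance[rain]
    maintenance = random.choices(list(m), weights=list(m.values()))[0]
    # The sample's weight is the probability of the evidence given its parents.
    return rain, P_train_on_time[(rain, maintenance)]

def likelihood_weighting(n=100_000):
    totals = {r: 0.0 for r in P_rain}
    for _ in range(n):
        rain, weight = weighted_sample()
        totals[rain] += weight
    z = sum(totals.values())  # normalize the accumulated weights
    return {r: w / z for r, w in totals.items()}

print(likelihood_weighting())
```

For these stand-in numbers, the normalized weight for light rain comes out around 0.17; the point is only that every sample counts toward the answer, scaled by how likely the fixed evidence was.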
6151
05:12:02,560 --> 05:12:04,920
And there are other sampling methods that exist as well,
6152
05:12:04,920 --> 05:12:07,320
but all of them are designed to get at the same idea,
6153
05:12:07,320 --> 05:12:13,160
to approximate the inference procedure of figuring out the value of a variable.
6154
05:12:13,160 --> 05:12:15,200
So we've now dealt with probability as it
6155
05:12:15,200 --> 05:12:18,480
pertains to particular variables that have these discrete values.
6156
05:12:18,480 --> 05:12:22,880
But what we haven't really considered is how values might change over time.
6157
05:12:22,880 --> 05:12:25,120
That we've considered something like a variable for rain,
6158
05:12:25,120 --> 05:12:28,920
where rain can take on values of none or light rain or heavy rain.
6159
05:12:28,920 --> 05:12:32,600
But in practice, usually when we consider values for variables like rain,
6160
05:12:32,600 --> 05:12:37,120
we like to consider it over time: how do the values of these variables
6161
05:12:37,120 --> 05:12:37,640
change?
6162
05:12:37,640 --> 05:12:40,320
What do we do when we're dealing with uncertainty
6163
05:12:40,320 --> 05:12:43,240
over a period of time, which can come up in the context of weather,
6164
05:12:43,240 --> 05:12:46,360
for example, if I have sunny days and I have rainy days.
6165
05:12:46,360 --> 05:12:51,080
And I'd like to know not just what is the probability that it's raining now,
6166
05:12:51,080 --> 05:12:53,480
but what is the probability that it rains tomorrow,
6167
05:12:53,480 --> 05:12:55,520
or the day after that, or the day after that.
6168
05:12:55,520 --> 05:12:57,280
And so to do this, we're going to introduce
6169
05:12:57,280 --> 05:12:58,960
a slightly different kind of model.
6170
05:12:58,960 --> 05:13:02,920
But here, we're going to have a random variable, not just one for the weather,
6171
05:13:02,920 --> 05:13:05,360
but for every possible time step.
6172
05:13:05,360 --> 05:13:07,200
And you can define time step however you like.
6173
05:13:07,200 --> 05:13:10,280
A simple way is just to use days as your time step.
6174
05:13:10,280 --> 05:13:13,840
And so we can define a variable called x sub t, which
6175
05:13:13,840 --> 05:13:16,280
is going to be the weather at time t.
6176
05:13:16,280 --> 05:13:19,200
So x sub 0 might be the weather on day 0.
6177
05:13:19,200 --> 05:13:22,040
x sub 1 might be the weather on day 1, so on and so forth.
6178
05:13:22,040 --> 05:13:24,720
x sub 2 is the weather on day 2.
6179
05:13:24,720 --> 05:13:26,560
But as you can imagine, if we start to do this
6180
05:13:26,560 --> 05:13:28,560
over longer and longer periods of time, there's
6181
05:13:28,560 --> 05:13:30,840
an incredible amount of data that might go into this.
6182
05:13:30,840 --> 05:13:33,600
If you're keeping track of data about the weather for a year,
6183
05:13:33,600 --> 05:13:36,400
now suddenly you might be trying to predict the weather tomorrow,
6184
05:13:36,400 --> 05:13:40,000
given 365 days of previous pieces of evidence.
6185
05:13:40,000 --> 05:13:43,200
And that's a lot of evidence to have to deal with and manipulate and calculate.
6186
05:13:43,200 --> 05:13:47,080
Probably nobody knows what the exact conditional probability distribution
6187
05:13:47,080 --> 05:13:49,880
is for all of those combinations of variables.
6188
05:13:49,880 --> 05:13:52,560
And so when we're trying to do this inference inside of a computer,
6189
05:13:52,560 --> 05:13:56,280
when we're trying to reasonably do this sort of analysis,
6190
05:13:56,280 --> 05:13:58,800
it's helpful to make some simplifying assumptions,
6191
05:13:58,800 --> 05:14:01,920
some assumptions about the problem that we can just assume are true,
6192
05:14:01,920 --> 05:14:03,600
to make our lives a little bit easier.
6193
05:14:03,600 --> 05:14:05,920
Even if they're not totally accurate assumptions,
6194
05:14:05,920 --> 05:14:09,520
if they're close to accurate or approximate, they're usually pretty good.
6195
05:14:09,520 --> 05:14:13,160
And the assumption we're going to make is called the Markov assumption, which
6196
05:14:13,160 --> 05:14:16,640
is the assumption that the current state depends only
6197
05:14:16,640 --> 05:14:19,880
on a finite fixed number of previous states.
6198
05:14:19,880 --> 05:14:23,880
So the current day's weather depends not on all the previous day's weather
6199
05:14:23,880 --> 05:14:26,720
for the rest of all of history, but the current day's weather
6200
05:14:26,720 --> 05:14:29,520
I can predict just based on yesterday's weather,
6201
05:14:29,520 --> 05:14:32,680
or just based on the last two days' weather, or the last three days' weather.
6202
05:14:32,680 --> 05:14:36,960
But oftentimes, we're going to deal with just the one previous state
6203
05:14:36,960 --> 05:14:39,720
that helps to predict this current state.
6204
05:14:39,720 --> 05:14:42,280
And by putting a whole bunch of these random variables together,
6205
05:14:42,280 --> 05:14:46,120
using this Markov assumption, we can create what's called a Markov chain,
6206
05:14:46,120 --> 05:14:49,560
where a Markov chain is just some sequence of random variables
6207
05:14:49,560 --> 05:14:53,120
where each variable's distribution follows that Markov assumption.
6208
05:14:53,120 --> 05:14:56,040
And so we'll do an example of this where the Markov assumption is,
6209
05:14:56,040 --> 05:14:57,200
I can predict the weather.
6210
05:14:57,200 --> 05:14:58,760
Is it sunny or rainy?
6211
05:14:58,760 --> 05:15:01,160
And we'll just consider those two possibilities for now,
6212
05:15:01,160 --> 05:15:02,920
even though there are other types of weather.
6213
05:15:02,920 --> 05:15:06,280
But I can predict each day's weather just from the prior day's weather:
6214
05:15:06,280 --> 05:15:10,040
using today's weather, I can come up with a probability distribution
6215
05:15:10,040 --> 05:15:11,480
for tomorrow's weather.
6216
05:15:11,480 --> 05:15:13,320
And here's what this weather might look like.
6217
05:15:13,320 --> 05:15:16,640
It's formatted in terms of a matrix, as you might describe it,
6218
05:15:16,640 --> 05:15:21,040
as rows and columns of values, where on the left-hand side,
6219
05:15:21,040 --> 05:15:25,480
I have today's weather, represented by the variable x sub t.
6220
05:15:25,480 --> 05:15:28,360
And over here in the columns, I have tomorrow's weather,
6221
05:15:28,360 --> 05:15:34,440
represented by the variable x sub t plus 1, day t plus 1's weather instead.
6222
05:15:34,440 --> 05:15:38,600
And what this matrix is saying is, if today is sunny,
6223
05:15:38,600 --> 05:15:42,040
well, then it's more likely than not that tomorrow is also sunny.
6224
05:15:42,040 --> 05:15:45,520
Oftentimes, the weather stays consistent for multiple days in a row.
6225
05:15:45,520 --> 05:15:47,840
And for example, let's say that if today is sunny,
6226
05:15:47,840 --> 05:15:52,440
our model says that tomorrow, with probability 0.8, it will also be sunny.
6227
05:15:52,440 --> 05:15:55,240
And with probability 0.2, it will be raining.
6228
05:15:55,240 --> 05:15:59,920
And likewise, if today is raining, then it's more likely than not
6229
05:15:59,920 --> 05:16:01,120
that tomorrow is also raining.
6230
05:16:01,120 --> 05:16:06,320
With probability 0.7, it'll be raining. With probability 0.3, it will be sunny.
6231
05:16:06,320 --> 05:16:10,760
So this matrix, this description of how it is we transition from one state
6232
05:16:10,760 --> 05:16:14,160
to the next state is what we're going to call the transition model.
6233
05:16:14,160 --> 05:16:16,680
And using the transition model, you can begin
6234
05:16:16,680 --> 05:16:20,360
to construct this Markov chain by just predicting,
6235
05:16:20,360 --> 05:16:23,300
given today's weather, what's the likelihood of tomorrow's weather
6236
05:16:23,300 --> 05:16:23,800
happening.
6237
05:16:23,800 --> 05:16:27,500
And you can imagine doing a similar sampling procedure,
6238
05:16:27,500 --> 05:16:30,880
where you take this information, you sample what tomorrow's weather is
6239
05:16:30,880 --> 05:16:31,600
going to be.
6240
05:16:31,600 --> 05:16:33,640
Using that, you sample the next day's weather.
6241
05:16:33,640 --> 05:16:38,040
And the result of that is you can form this Markov chain of x0,
6242
05:16:38,040 --> 05:16:40,760
x1, x2, and so on, where day zero is sunny, the next day is sunny,
6243
05:16:40,760 --> 05:16:43,880
maybe the next day it changes to raining, then raining, then raining.
6244
05:16:43,880 --> 05:16:46,600
And the pattern that this Markov chain follows,
6245
05:16:46,600 --> 05:16:50,320
given the distribution that we had access to, this transition model here,
6246
05:16:50,320 --> 05:16:53,280
is that when it's sunny, it tends to stay sunny for a little while.
6247
05:16:53,280 --> 05:16:55,760
The next couple of days tend to be sunny too.
6248
05:16:55,760 --> 05:16:59,360
And when it's raining, it tends to be raining as well.
6249
05:16:59,360 --> 05:17:01,400
And so you get a Markov chain that looks like this,
6250
05:17:01,400 --> 05:17:02,720
and you can do analysis on this.
6251
05:17:02,720 --> 05:17:06,380
You can say, given that today is raining, what is the probability
6252
05:17:06,380 --> 05:17:07,420
that tomorrow is raining?
6253
05:17:07,420 --> 05:17:09,400
Or you can begin to ask probability questions
6254
05:17:09,400 --> 05:17:13,600
like, what is the probability of this sequence of five values, sun, sun,
6255
05:17:13,600 --> 05:17:17,120
rain, rain, rain, and answer those sorts of questions too.
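A question like that last one falls straight out of the transition model: multiply the starting probability of the first state by one transition probability per consecutive pair of days. A short sketch, using the 0.8/0.2 and 0.7/0.3 transition probabilities from the lecture and assuming a 50-50 starting distribution:

```python
# Probability of the sequence sun, sun, rain, rain, rain under the
# lecture's transition model, with an assumed 50-50 starting distribution.
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sequence_probability(states):
    p = start[states[0]]
    # multiply in one transition probability per consecutive pair of days
    for today, tomorrow in zip(states, states[1:]):
        p *= transition[today][tomorrow]
    return p

# 0.5 * 0.8 * 0.2 * 0.7 * 0.7 = 0.0392
print(sequence_probability(["sun", "sun", "rain", "rain", "rain"]))
```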
6256
05:17:17,120 --> 05:17:19,640
And it turns out there are, again, many Python libraries
6257
05:17:19,640 --> 05:17:23,160
for interacting with models like this of probabilities
6258
05:17:23,160 --> 05:17:25,320
that have distributions and random variables that
6259
05:17:25,320 --> 05:17:29,340
are based on previous variables according to this Markov assumption.
6260
05:17:29,340 --> 05:17:32,720
And pomegranate, too, has ways of dealing with these sorts of variables.
6261
05:17:32,720 --> 05:17:39,440
So I'll go ahead and go into the chain directory,
6262
05:17:39,440 --> 05:17:42,200
where I have some information about Markov chains.
6263
05:17:42,200 --> 05:17:45,240
And here, I've defined a file called model.py,
6264
05:17:45,240 --> 05:17:47,960
where I've defined this model in a very similar syntax.
6265
05:17:47,960 --> 05:17:50,720
And again, the exact syntax doesn't matter so much as the idea
6266
05:17:50,720 --> 05:17:54,080
that I'm encoding this information into a Python program
6267
05:17:54,080 --> 05:17:56,940
so that the program has access to these distributions.
6268
05:17:56,940 --> 05:17:59,560
I've here defined some starting distribution.
6269
05:17:59,560 --> 05:18:02,640
So every Markov model begins at some point in time,
6270
05:18:02,640 --> 05:18:04,720
and I need to give it some starting distribution.
6271
05:18:04,720 --> 05:18:08,480
And so we'll just say, you know, at the start, you can pick 50-50 between sunny
6272
05:18:08,480 --> 05:18:09,120
and rainy.
6273
05:18:09,120 --> 05:18:13,000
We'll say it's sunny 50% of the time, rainy 50% of the time.
6274
05:18:13,000 --> 05:18:16,080
And then down below, I've here defined the transition model,
6275
05:18:16,080 --> 05:18:19,320
how it is that I transition from one day to the next.
6276
05:18:19,320 --> 05:18:22,160
And here, I've encoded that exact same matrix from before,
6277
05:18:22,160 --> 05:18:24,840
that if it was sunny today, then with probability 0.8,
6278
05:18:24,840 --> 05:18:26,280
it will be sunny tomorrow.
6279
05:18:26,280 --> 05:18:29,180
And it'll be rainy tomorrow with probability 0.2.
6280
05:18:29,180 --> 05:18:34,400
And I likewise have another distribution for if it was raining today instead.
6281
05:18:34,400 --> 05:18:36,640
And so that alone defines the Markov model.
6282
05:18:36,640 --> 05:18:39,040
You can begin to answer questions using that model.
6283
05:18:39,040 --> 05:18:42,320
But one thing I'll just do is sample from the Markov chain.
6284
05:18:42,320 --> 05:18:45,640
It turns out there is a method built into this Markov chain library
6285
05:18:45,640 --> 05:18:48,120
that allows me to sample 50 states from the chain,
6286
05:18:48,120 --> 05:18:52,640
basically just simulating like 50 instances of weather.
6287
05:18:52,640 --> 05:18:54,400
And so let me go ahead and run this.
6288
05:18:54,400 --> 05:18:57,840
Python model.py.
6289
05:18:57,840 --> 05:18:59,920
And when I run it, what I get is that it's
6290
05:18:59,920 --> 05:19:04,480
going to sample from this Markov chain 50 states, 50 days worth of weather
6291
05:19:04,480 --> 05:19:06,240
that it's just going to randomly sample.
6292
05:19:06,240 --> 05:19:09,040
And you can imagine sampling many times to be able to get more data,
6293
05:19:09,040 --> 05:19:10,480
to be able to do more analysis.
6294
05:19:10,480 --> 05:19:13,800
But here, for example, it's sunny two days in a row,
6295
05:19:13,800 --> 05:19:17,000
rainy a whole bunch of days in a row before it changes back to sun.
6296
05:19:17,000 --> 05:19:20,080
And so you get this model that follows the distribution
6297
05:19:20,080 --> 05:19:23,600
that we originally described, that follows the distribution of sunny days
6298
05:19:23,600 --> 05:19:25,240
tend to lead to more sunny days.
6299
05:19:25,240 --> 05:19:29,400
Rainy days tend to lead to more rainy days.
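The sampling loop that model.py runs through the library can also be sketched by hand, which makes the idea concrete without depending on any particular library API. This is an illustrative stand-in, not the course's pomegranate code:

```python
import random

# Sample 50 days of weather from the Markov chain by repeatedly
# applying the transition model, starting from a 50-50 distribution.
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample_chain(n_days, seed=None):
    rng = random.Random(seed)
    state = rng.choices(list(start), weights=list(start.values()))[0]
    states = [state]
    for _ in range(n_days - 1):
        nxt = transition[state]  # distribution over tomorrow given today
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        states.append(state)
    return states

print(sample_chain(50))
```

Running it produces exactly the behavior described above: runs of sunny days and runs of rainy days, because each state tends to persist under this transition model.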
6300
05:19:29,400 --> 05:19:31,680
And that then is a Markov model.
6301
05:19:31,680 --> 05:19:34,800
And Markov models rely on us knowing the values
6302
05:19:34,800 --> 05:19:35,880
of these individual states.
6303
05:19:35,880 --> 05:19:38,640
I know that today is sunny or that today is raining.
6304
05:19:38,640 --> 05:19:41,880
And using that information, I can draw some sort of inference
6305
05:19:41,880 --> 05:19:44,320
about what tomorrow is going to be like.
6306
05:19:44,320 --> 05:19:46,640
But in practice, this often isn't the case.
6307
05:19:46,640 --> 05:19:49,200
It often isn't the case that I know for certain what
6308
05:19:49,200 --> 05:19:51,280
the exact state of the world is.
6309
05:19:51,280 --> 05:19:54,320
Oftentimes, the state of the world is exactly unknown.
6310
05:19:54,320 --> 05:19:58,120
But I'm able to somehow sense some information about that state,
6311
05:19:58,120 --> 05:20:01,040
that a robot or an AI doesn't have exact knowledge
6312
05:20:01,040 --> 05:20:02,200
about the world around it.
6313
05:20:02,200 --> 05:20:05,120
But it has some sort of sensor, whether that sensor is a camera
6314
05:20:05,120 --> 05:20:09,040
or sensors that detect distance or just a microphone that is sensing audio,
6315
05:20:09,040 --> 05:20:09,920
for example.
6316
05:20:09,920 --> 05:20:11,400
It is sensing data.
6317
05:20:11,400 --> 05:20:14,240
And using that data, that data is somehow related
6318
05:20:14,240 --> 05:20:17,000
to the state of the world, even if it doesn't actually know,
6319
05:20:17,000 --> 05:20:20,720
our AI doesn't know, what the underlying true state of the world
6320
05:20:20,720 --> 05:20:22,200
actually is.
6321
05:20:22,200 --> 05:20:25,120
And for that, we need to get into the world of sensor models,
6322
05:20:25,120 --> 05:20:28,040
the way of describing how it is that we translate
6323
05:20:28,040 --> 05:20:31,200
what the hidden state, the underlying true state of the world,
6324
05:20:31,200 --> 05:20:36,120
is with what the observation, what it is that the AI knows or the AI has
6325
05:20:36,120 --> 05:20:38,360
access to, actually is.
6326
05:20:38,360 --> 05:20:42,520
And so for example, a hidden state might be a robot's position.
6327
05:20:42,520 --> 05:20:45,160
If a robot is exploring new uncharted territory,
6328
05:20:45,160 --> 05:20:48,240
the robot likely doesn't know exactly where it is.
6329
05:20:48,240 --> 05:20:49,640
But it does have an observation.
6330
05:20:49,640 --> 05:20:52,920
It has robot sensor data, where it can sense how far away
6331
05:20:52,920 --> 05:20:54,880
are possible obstacles around it.
6332
05:20:54,880 --> 05:20:58,880
And using that information, using the observed information that it has,
6333
05:20:58,880 --> 05:21:01,920
it can infer something about the hidden state.
6334
05:21:01,920 --> 05:21:05,880
Because what the true hidden state is influences those observations.
6335
05:21:05,880 --> 05:21:10,160
Whatever the robot's true position is affects, or has some effect
6336
05:21:10,160 --> 05:21:13,480
upon, what sensor data the robot is able to collect,
6337
05:21:13,480 --> 05:21:18,720
even if the robot doesn't actually know for certain what its true position is.
6338
05:21:18,720 --> 05:21:21,960
Likewise, if you think about a voice recognition or a speech recognition
6339
05:21:21,960 --> 05:21:25,280
program that listens to you and is able to respond to you, something
6340
05:21:25,280 --> 05:21:29,640
like Alexa or what Apple and Google are doing with their voice recognition
6341
05:21:29,640 --> 05:21:33,720
as well, that you might imagine that the hidden state, the underlying state,
6342
05:21:33,720 --> 05:21:35,360
is what words are actually spoken.
6343
05:21:35,360 --> 05:21:38,240
The true nature of the world contains you saying
6344
05:21:38,240 --> 05:21:42,920
a particular sequence of words, but your phone or your smart home device
6345
05:21:42,920 --> 05:21:45,560
doesn't know for sure exactly what words you said.
6346
05:21:45,560 --> 05:21:50,720
The only observation that the AI has access to is some audio waveforms.
6347
05:21:50,720 --> 05:21:54,800
And those audio waveforms are, of course, dependent upon this hidden state.
6348
05:21:54,800 --> 05:21:57,560
And you can infer, based on those audio waveforms,
6349
05:21:57,560 --> 05:22:00,160
what the words spoken likely were.
6350
05:22:00,160 --> 05:22:04,600
But you might not know with 100% certainty what that hidden state actually
6351
05:22:04,600 --> 05:22:05,100
is.
6352
05:22:05,100 --> 05:22:08,440
And it might be a task to try and predict, given this observation,
6353
05:22:08,440 --> 05:22:12,600
given these audio waveforms, can you figure out what the actual words spoken
6354
05:22:12,600 --> 05:22:13,760
are.
6355
05:22:13,760 --> 05:22:16,680
And likewise, you might imagine on a website, true user engagement
6356
05:22:16,680 --> 05:22:19,160
might be information you don't directly have access to.
6357
05:22:19,160 --> 05:22:22,060
But you can observe data, like website or app analytics,
6358
05:22:22,060 --> 05:22:25,280
about how often was this button clicked or how often are people interacting
6359
05:22:25,280 --> 05:22:26,840
with a page in a particular way.
6360
05:22:26,840 --> 05:22:30,840
And you can use that to infer things about your users as well.
6361
05:22:30,840 --> 05:22:33,440
So this type of problem comes up all the time
6362
05:22:33,440 --> 05:22:36,400
when we're dealing with AI and trying to infer things about the world.
6363
05:22:36,400 --> 05:22:40,400
That often AI doesn't really know the hidden true state of the world.
6364
05:22:40,400 --> 05:22:43,560
All the AI has access to is some observation
6365
05:22:43,560 --> 05:22:45,920
that is related to the hidden true state.
6366
05:22:45,920 --> 05:22:47,080
But it's not direct.
6367
05:22:47,080 --> 05:22:48,440
There might be some noise there.
6368
05:22:48,440 --> 05:22:50,720
The audio waveform might have some additional noise
6369
05:22:50,720 --> 05:22:52,000
that might be difficult to parse.
6370
05:22:52,000 --> 05:22:54,560
The sensor data might not be exactly correct.
6371
05:22:54,560 --> 05:22:57,760
There's some noise that might not allow you to conclude with certainty what
6372
05:22:57,760 --> 05:23:01,880
the hidden state is, but can allow you to infer what it might be.
6373
05:23:01,880 --> 05:23:04,040
And so the simple example we'll take a look at here
6374
05:23:04,040 --> 05:23:07,040
is imagining the hidden state as the weather, whether it's sunny or rainy
6375
05:23:07,040 --> 05:23:07,720
or not.
6376
05:23:07,720 --> 05:23:11,360
And imagine you are programming an AI inside of a building that maybe has
6377
05:23:11,360 --> 05:23:14,400
access to just a camera inside the building.
6378
05:23:14,400 --> 05:23:17,280
And all you have access to is an observation
6379
05:23:17,280 --> 05:23:19,600
as to whether or not employees are bringing
6380
05:23:19,600 --> 05:23:21,440
an umbrella into the building or not.
6381
05:23:21,440 --> 05:23:24,000
You can detect whether it's an umbrella or not.
6382
05:23:24,000 --> 05:23:26,640
And so you might have an observation as to whether or not
6383
05:23:26,640 --> 05:23:28,960
an umbrella is brought into the building or not.
6384
05:23:28,960 --> 05:23:32,840
And using that information, you want to predict whether it's sunny or rainy,
6385
05:23:32,840 --> 05:23:35,600
even if you don't know what the underlying weather is.
6386
05:23:35,600 --> 05:23:37,680
So the underlying weather might be sunny or rainy.
6387
05:23:37,680 --> 05:23:41,120
And if it's raining, obviously people are more likely to bring an umbrella.
6388
05:23:41,120 --> 05:23:44,320
And so whether or not people bring an umbrella, your observation,
6389
05:23:44,320 --> 05:23:46,560
tells you something about the hidden state.
6390
05:23:46,560 --> 05:23:48,600
And of course, this is a bit of a contrived example,
6391
05:23:48,600 --> 05:23:51,640
but the idea here is to think about this more
6392
05:23:51,640 --> 05:23:54,000
broadly: any time you observe something,
6393
05:23:54,000 --> 05:23:57,680
it has to do with some underlying hidden state.
6394
05:23:57,680 --> 05:23:59,720
And so to try and model this type of idea where
6395
05:23:59,720 --> 05:24:02,000
we have these hidden states and observations,
6396
05:24:02,000 --> 05:24:05,320
rather than just use a Markov model, which has state, state, state, state,
6397
05:24:05,320 --> 05:24:08,560
each of which is connected by that transition matrix that we described
6398
05:24:08,560 --> 05:24:12,280
before, we're going to use what we call a hidden Markov model.
6399
05:24:12,280 --> 05:24:14,600
Very similar to a Markov model, but this is going
6400
05:24:14,600 --> 05:24:17,560
to allow us to model a system that has hidden states
6401
05:24:17,560 --> 05:24:21,160
that we don't directly observe, along with some observed event
6402
05:24:21,160 --> 05:24:23,360
that we do actually see.
6403
05:24:23,360 --> 05:24:25,800
And so in addition to that transition model that we still
6404
05:24:25,800 --> 05:24:28,400
need of saying, given the underlying state of the world,
6405
05:24:28,400 --> 05:24:32,080
if it's sunny or rainy, what's the probability of tomorrow's weather?
6406
05:24:32,080 --> 05:24:35,800
We also need another model that, given some state,
6407
05:24:35,800 --> 05:24:38,920
is going to give us an observation of green, yes, someone brings
6408
05:24:38,920 --> 05:24:43,560
an umbrella into the office, or red, no, nobody brings umbrellas into the office.
6409
05:24:43,560 --> 05:24:46,840
And so the observation might be that if it's sunny,
6410
05:24:46,840 --> 05:24:49,400
then odds are nobody is going to bring an umbrella to the office.
6411
05:24:49,400 --> 05:24:51,400
But maybe some people are just being cautious,
6412
05:24:51,400 --> 05:24:54,120
and they do bring an umbrella to the office anyways.
6413
05:24:54,120 --> 05:24:57,400
And if it's raining, then with much higher probability,
6414
05:24:57,400 --> 05:24:59,720
people are going to bring umbrellas into the office.
6415
05:24:59,720 --> 05:25:02,900
But maybe if the rain was unexpected, people didn't bring an umbrella.
6416
05:25:02,900 --> 05:25:05,520
And so it might have some other probability as well.
6417
05:25:05,520 --> 05:25:07,560
And so using the observations, you can begin
6418
05:25:07,560 --> 05:25:11,680
to predict with reasonable likelihood what the underlying state is,
6419
05:25:11,680 --> 05:25:15,080
even if you don't actually get to observe the underlying state,
6420
05:25:15,080 --> 05:25:18,640
if you don't get to see what the hidden state is actually equal to.
6421
05:25:18,640 --> 05:25:21,040
This here we'll often call the sensor model.
6422
05:25:21,040 --> 05:25:23,920
It's also often called the emission probabilities,
6423
05:25:23,920 --> 05:25:27,760
because the state, the underlying state, emits some sort of emission
6424
05:25:27,760 --> 05:25:29,160
that you then observe.
6425
05:25:29,160 --> 05:25:32,840
And so that can be another way of describing that same idea.
6426
05:25:32,840 --> 05:25:35,480
And the sensor Markov assumption that we're going to use
6427
05:25:35,480 --> 05:25:38,960
is this assumption that the evidence variable, the thing we observe,
6428
05:25:38,960 --> 05:25:43,120
the emission that gets produced, depends only on the corresponding state,
6429
05:25:43,120 --> 05:25:46,600
meaning I can predict whether or not people will bring umbrellas
6430
05:25:46,600 --> 05:25:50,920
based entirely on whether it is sunny or rainy today.
6431
05:25:50,920 --> 05:25:53,560
Of course, again, this assumption might not hold in practice,
6432
05:25:53,560 --> 05:25:55,680
that in practice, whether or not
6433
05:25:55,680 --> 05:25:58,240
people bring umbrellas might depend not just on today's weather,
6434
05:25:58,240 --> 05:26:00,560
but also on yesterday's weather and the day before.
6435
05:26:00,560 --> 05:26:04,480
But for simplification purposes, it can be helpful to apply this sort
6436
05:26:04,480 --> 05:26:07,000
of assumption just to allow us to be able to reason
6437
05:26:07,000 --> 05:26:09,680
about these probabilities a little more easily.
6438
05:26:09,680 --> 05:26:14,440
And if we're able to approximate it, we can still often get a very good answer.
6439
05:26:14,440 --> 05:26:16,960
And so what these hidden Markov models end up looking like
6440
05:26:16,960 --> 05:26:20,000
is a little something like this, where now, rather than just have
6441
05:26:20,000 --> 05:26:23,520
one chain of states, like sun, sun, rain, rain, rain,
6442
05:26:23,520 --> 05:26:29,280
we instead have this upper level, which is the underlying state of the world.
6443
05:26:29,280 --> 05:26:30,560
Is it sunny or is it rainy?
6444
05:26:30,560 --> 05:26:34,360
And those are connected by that transition matrix we described before.
6445
05:26:34,360 --> 05:26:37,160
But each of these states produces an emission,
6446
05:26:37,160 --> 05:26:41,200
produces an observation that I see, that on this day, it was sunny
6447
05:26:41,200 --> 05:26:43,200
and people didn't bring umbrellas.
6448
05:26:43,200 --> 05:26:46,000
And on this day, it was sunny, but people did bring umbrellas.
6449
05:26:46,000 --> 05:26:48,160
And on this day, it was raining and people did bring umbrellas,
6450
05:26:48,160 --> 05:26:49,680
and so on and so forth.
6451
05:26:49,680 --> 05:26:52,560
And so each of these underlying states represented
6452
05:26:52,560 --> 05:26:56,400
by x sub t, for t equals 0, 1, 2, so on and so forth,
6453
05:26:56,400 --> 05:26:59,000
produces some sort of observation or emission,
6454
05:26:59,000 --> 05:27:04,320
which is what the e stands for, e sub 0, e sub 1, e sub 2, so on and so forth.
6455
05:27:04,320 --> 05:27:07,600
And so this, too, is a way of trying to represent this idea.
6456
05:27:07,600 --> 05:27:10,240
And what you want to think about is that these underlying states are
6457
05:27:10,240 --> 05:27:14,360
the true nature of the world, the robot's position as it moves over time,
6458
05:27:14,360 --> 05:27:17,720
and that produces some sort of sensor data that might be observed,
6459
05:27:17,720 --> 05:27:21,640
or what people are actually saying, using the emission data of what
6460
05:27:21,640 --> 05:27:24,880
audio waveforms you detect in order to process that data
6461
05:27:24,880 --> 05:27:26,200
and try and figure it out.
6462
05:27:26,200 --> 05:27:29,440
And there are a number of possible tasks that you might want to do
6463
05:27:29,440 --> 05:27:30,800
given this kind of information.
6464
05:27:30,800 --> 05:27:33,720
And one of the simplest is trying to infer something
6465
05:27:33,720 --> 05:27:37,520
about the future or the past or about these sort of hidden states that
6466
05:27:37,520 --> 05:27:38,560
might exist.
6467
05:27:38,560 --> 05:27:40,520
And so the tasks that you'll often see, and we're not
6468
05:27:40,520 --> 05:27:42,520
going to go into the mathematics of these tasks,
6469
05:27:42,520 --> 05:27:45,960
but they're all based on the same idea of conditional probabilities
6470
05:27:45,960 --> 05:27:48,440
and using the probability distributions we
6471
05:27:48,440 --> 05:27:51,200
have to draw these sorts of conclusions.
6472
05:27:51,200 --> 05:27:55,440
One task is called filtering, which is given observations from the start
6473
05:27:55,440 --> 05:27:59,320
until now, calculate the distribution for the current state,
6474
05:27:59,320 --> 05:28:03,360
meaning given information about from the beginning of time until now,
6475
05:28:03,360 --> 05:28:06,720
on which days do people bring an umbrella or not bring an umbrella,
6476
05:28:06,720 --> 05:28:10,280
can I calculate the probability of the current state that today,
6477
05:28:10,280 --> 05:28:12,440
is it sunny or is it raining?
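The filtering task is classically handled by the forward algorithm: alternate between pushing the current belief through the transition model and reweighting it by the likelihood of each new observation. A minimal sketch, using the lecture's transition probabilities; the emission (umbrella) probabilities here are assumed purely for illustration:

```python
# Forward-algorithm filtering: given umbrella observations from day 0
# until now, compute the distribution over today's hidden weather state.
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}
emission = {  # P(observation | hidden state); assumed numbers
    "sun":  {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

def filtering(observations):
    # initialize with P(state) * P(first observation | state)
    belief = {s: start[s] * emission[s][observations[0]] for s in start}
    for obs in observations[1:]:
        # predict: push the belief through the transition model,
        # then update: weight by the likelihood of the new observation
        belief = {
            s: emission[s][obs] * sum(belief[p] * transition[p][s] for p in belief)
            for s in start
        }
    z = sum(belief.values())  # normalize to a probability distribution
    return {s: b / z for s, b in belief.items()}

print(filtering(["umbrella", "umbrella"]))
```

With these assumed numbers, two umbrella days in a row leave the belief leaning heavily toward rain.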
6478
05:28:12,440 --> 05:28:14,640
Another task that might be possible is prediction,
6479
05:28:14,640 --> 05:28:16,240
which is looking towards the future.
6480
05:28:16,240 --> 05:28:18,640
Given observations about people bringing umbrellas
6481
05:28:18,640 --> 05:28:22,240
from the beginning of when we started counting time until now,
6482
05:28:22,240 --> 05:28:25,600
can I figure out the distribution for tomorrow: is it sunny or is it
6483
05:28:25,600 --> 05:28:26,680
raining?
6484
05:28:26,680 --> 05:28:29,520
And you can also go backwards as well via smoothing,
6485
05:28:29,520 --> 05:28:32,560
where I can say given observations from start until now,
6486
05:28:32,560 --> 05:28:35,360
calculate the distributions for some past state.
6487
05:28:35,360 --> 05:28:38,920
Like I know that today people brought umbrellas and tomorrow people
6488
05:28:38,920 --> 05:28:39,920
brought umbrellas.
6489
05:28:39,920 --> 05:28:42,760
And so given two days worth of data of people bringing umbrellas,
6490
05:28:42,760 --> 05:28:45,720
what's the probability that yesterday it was raining?
6491
05:28:45,720 --> 05:28:47,880
And that I know that people brought umbrellas today,
6492
05:28:47,880 --> 05:28:50,160
that might inform that decision as well.
6493
05:28:50,160 --> 05:28:52,680
It might influence those probabilities.
6494
05:28:52,680 --> 05:28:56,280
And there's also a most likely explanation task,
6495
05:28:56,280 --> 05:28:58,560
in addition to other tasks that might exist as well, which
6496
05:28:58,560 --> 05:29:01,720
is, combining some of these: given observations from the start up
6497
05:29:01,720 --> 05:29:04,960
until now, figuring out the most likely sequence of states.
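As a concrete illustration of the filtering task described above, here is a minimal self-contained sketch of the forward algorithm. The emission probabilities (0.2/0.8 for sun, 0.9/0.1 for rain) and the 50-50 starting distribution come from the model discussed in this lecture; the specific transition numbers are assumptions, chosen only to be consistent with "tomorrow is more likely to be the same as today."

```python
# Sketch of filtering for the umbrella hidden Markov model.
# Transition values below are assumed, not taken from the lecture.
start = {"sun": 0.5, "rain": 0.5}
transition = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.3, "rain": 0.7}}
emission = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
            "rain": {"umbrella": 0.9, "no umbrella": 0.1}}

def filter_current(observations):
    """Forward algorithm: distribution over the current hidden state,
    given all observations from the start until now."""
    # Initial update: prior belief times how well each state
    # explains the first observation.
    belief = {s: start[s] * emission[s][observations[0]] for s in start}
    for obs in observations[1:]:
        # Predict one step forward through the transition model,
        # then weight by the new observation.
        belief = {s: emission[s][obs] *
                     sum(belief[p] * transition[p][s] for p in belief)
                  for s in belief}
    # Normalize so the values form a probability distribution.
    total = sum(belief.values())
    return {s: v / total for s, v in belief.items()}

print(filter_current(["umbrella", "umbrella"]))
```

After two umbrella days, most of the probability mass ends up on rain, matching the intuition in the lecture.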
6498
05:29:04,960 --> 05:29:07,960
And this is what we're going to take a look at now, this idea that if I
6499
05:29:07,960 --> 05:29:11,560
have all these observations, umbrella, no umbrella, umbrella, no umbrella,
6500
05:29:11,560 --> 05:29:15,880
can I calculate the most likely states of sun, rain, sun, rain, and whatnot
6501
05:29:15,880 --> 05:29:18,520
that actually represented the true weather that
6502
05:29:18,520 --> 05:29:20,680
would produce these observations?
6503
05:29:20,680 --> 05:29:23,960
And this is quite common when you're trying to do something like voice
6504
05:29:23,960 --> 05:29:27,520
recognition, for example, that you have these emissions of the audio waveforms,
6505
05:29:27,520 --> 05:29:30,560
and you would like to calculate based on all of the observations
6506
05:29:30,560 --> 05:29:34,480
that you have, what is the most likely sequence of actual words, or syllables,
6507
05:29:34,480 --> 05:29:38,200
or sounds that the user actually made when they were speaking
6508
05:29:38,200 --> 05:29:41,760
to this particular device, or other tasks that might come up in that context
6509
05:29:41,760 --> 05:29:43,000
as well.
6510
05:29:43,000 --> 05:29:47,680
And so we can try this out by going ahead and going into the HMM directory,
6511
05:29:47,680 --> 05:29:50,800
HMM for Hidden Markov Model.
6512
05:29:50,800 --> 05:29:57,160
And here, what I've done is I've defined a model where this model first defines
6513
05:29:57,160 --> 05:30:02,200
my possible states, sun and rain, along with their emission probabilities,
6514
05:30:02,200 --> 05:30:06,240
the observation model, or the emission model, where here, given
6515
05:30:06,240 --> 05:30:09,040
that I know that it's sunny, the probability
6516
05:30:09,040 --> 05:30:11,680
that I see people bring an umbrella is 0.2,
6517
05:30:11,680 --> 05:30:14,560
the probability of no umbrella is 0.8.
6518
05:30:14,560 --> 05:30:16,600
And likewise, if it's raining, then people
6519
05:30:16,600 --> 05:30:18,000
are more likely to bring an umbrella.
6520
05:30:18,000 --> 05:30:21,720
Umbrella has probability 0.9, no umbrella has probability 0.1.
6521
05:30:21,720 --> 05:30:26,520
So the actual underlying hidden states, those states are sun and rain,
6522
05:30:26,520 --> 05:30:29,560
but the things that I observe, the observations that I can see,
6523
05:30:29,560 --> 05:30:35,320
are either umbrella or no umbrella as the things that I observe as a result.
6524
05:30:35,320 --> 05:30:39,840
So this then, I also need to add to it a transition matrix, same as before,
6525
05:30:39,840 --> 05:30:43,640
saying that if today is sunny, then tomorrow is more likely to be sunny.
6526
05:30:43,640 --> 05:30:47,000
And if today is rainy, then tomorrow is more likely to be raining.
6527
05:30:47,000 --> 05:30:49,320
As before, I give it some starting probabilities,
6528
05:30:49,320 --> 05:30:53,120
saying at first, 50-50 chance for whether it's sunny or rainy.
6529
05:30:53,120 --> 05:30:56,640
And then I can create the model based on that information.
6530
05:30:56,640 --> 05:30:59,160
Again, the exact syntax of this is not so important,
6531
05:30:59,160 --> 05:31:02,600
so much as it is the data that I am now encoding into a program,
6532
05:31:02,600 --> 05:31:06,400
such that now I can begin to do some inference.
6533
05:31:06,400 --> 05:31:10,160
So I can give my program, for example, a list of observations,
6534
05:31:10,160 --> 05:31:13,560
umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth,
6535
05:31:13,560 --> 05:31:14,960
no umbrella, no umbrella.
6536
05:31:14,960 --> 05:31:18,080
And I would like to figure out the most likely
6537
05:31:18,080 --> 05:31:20,360
explanation for these observations.
6538
05:31:20,360 --> 05:31:23,600
What is most likely: was this rain, rain, rain,
6539
05:31:23,600 --> 05:31:25,960
or is it more likely that this was actually sunny,
6540
05:31:25,960 --> 05:31:28,000
and then it switched back to being rainy?
6541
05:31:28,000 --> 05:31:29,440
And that's an interesting question.
6542
05:31:29,440 --> 05:31:31,640
We might not be sure, because it might just
6543
05:31:31,640 --> 05:31:34,640
be that it just so happened on this rainy day,
6544
05:31:34,640 --> 05:31:36,560
people decided not to bring an umbrella.
6545
05:31:36,560 --> 05:31:40,360
Or it could be that it switched from rainy to sunny back to rainy,
6546
05:31:40,360 --> 05:31:43,680
which doesn't seem too likely, but it certainly could happen.
6547
05:31:43,680 --> 05:31:46,280
And using the data we give to the hidden Markov model,
6548
05:31:46,280 --> 05:31:49,840
our model can begin to predict these answers, can begin to figure it out.
6549
05:31:49,840 --> 05:31:53,400
So we're going to go ahead and just predict these observations.
6550
05:31:53,400 --> 05:31:56,080
And then for each of those predictions, go ahead and print out
6551
05:31:56,080 --> 05:31:56,880
what the prediction is.
6552
05:31:56,880 --> 05:31:59,400
And this library just so happens to have a function called
6553
05:31:59,400 --> 05:32:03,040
predict that does this prediction process for me.
6554
05:32:03,040 --> 05:32:06,240
So I'll run python sequence.py.
6555
05:32:06,240 --> 05:32:07,880
And the result I get is this.
6556
05:32:07,880 --> 05:32:10,640
This is the prediction based on the observations
6557
05:32:10,640 --> 05:32:12,680
of what all of those states are likely to be.
6558
05:32:12,680 --> 05:32:14,400
And it's likely to be rain and rain.
6559
05:32:14,400 --> 05:32:16,640
In this case, it thinks that what most likely happened
6560
05:32:16,640 --> 05:32:19,560
is that it was sunny for a day and then went back to being rainy.
6561
05:32:19,560 --> 05:32:22,320
But in different situations, if it was rainy for longer maybe,
6562
05:32:22,320 --> 05:32:24,280
or if the probabilities were slightly different,
6563
05:32:24,280 --> 05:32:27,840
you might imagine that it's more likely that it was rainy all the way through.
6564
05:32:27,840 --> 05:32:32,960
And it just so happened on one rainy day, people decided not to bring umbrellas.
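The library's `predict` function computes this most likely explanation with the Viterbi algorithm. As a rough, self-contained sketch of what that computation does (not the library's actual API): the emission and starting probabilities below come from the model in the lecture, while the transition numbers are assumed values consistent with "tomorrow is more likely to be the same."

```python
# Minimal Viterbi sketch for the umbrella hidden Markov model.
# Transition values below are assumptions, not from the lecture.
start = {"sun": 0.5, "rain": 0.5}
transition = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.3, "rain": 0.7}}
emission = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
            "rain": {"umbrella": 0.9, "no umbrella": 0.1}}

def most_likely_states(observations):
    """Return the most likely hidden state sequence (Viterbi algorithm)."""
    states = list(start)
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start[s] * emission[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((best[prev][0] * transition[prev][s] * emission[s][obs],
                  best[prev][1] + [s])
                 for prev in states),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

print(most_likely_states(
    ["umbrella", "umbrella", "no umbrella", "umbrella", "umbrella"]))
```

With these assumed transition numbers, the model may well prefer "rainy all the way through" over a one-day switch to sun, which is exactly the sensitivity to the probabilities that the lecture points out.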
6565
05:32:32,960 --> 05:32:35,360
And so here, too, Python libraries can begin
6566
05:32:35,360 --> 05:32:38,240
to allow for this sort of inference procedure.
6567
05:32:38,240 --> 05:32:40,480
And by taking what we know and by putting it
6568
05:32:40,480 --> 05:32:43,080
in terms of these tasks that already exist,
6569
05:32:43,080 --> 05:32:45,960
these general tasks that work with hidden Markov models,
6570
05:32:45,960 --> 05:32:50,160
then any time we can take an idea and formulate it as a hidden Markov model,
6571
05:32:50,160 --> 05:32:52,800
formulate it as something that has hidden states
6572
05:32:52,800 --> 05:32:55,320
and observed emissions that result from those states,
6573
05:32:55,320 --> 05:32:57,360
then we can take advantage of these algorithms
6574
05:32:57,360 --> 05:33:01,360
that are known to exist for trying to do this sort of inference.
6575
05:33:01,360 --> 05:33:05,360
So now we've seen a couple of ways that AI can begin to deal with uncertainty.
6576
05:33:05,360 --> 05:33:08,460
We've taken a look at probability and how we can use probability
6577
05:33:08,460 --> 05:33:11,840
to describe numerically things that are likely or more likely or less
6578
05:33:11,840 --> 05:33:14,640
likely to happen than other events or other variables.
6579
05:33:14,640 --> 05:33:17,400
And using that information, we can begin to construct
6580
05:33:17,400 --> 05:33:20,960
these standard types of models, things like Bayesian networks and Markov
6581
05:33:20,960 --> 05:33:25,200
chains and hidden Markov models that all allow us to be able to describe
6582
05:33:25,200 --> 05:33:27,720
how particular events relate to other events
6583
05:33:27,720 --> 05:33:30,800
or how the values of particular variables relate to other variables,
6584
05:33:30,800 --> 05:33:34,200
not for certain, but with some sort of probability distribution.
6585
05:33:34,200 --> 05:33:37,600
And by formulating things in terms of these models that already exist,
6586
05:33:37,600 --> 05:33:39,920
we can take advantage of Python libraries that
6587
05:33:39,920 --> 05:33:42,560
implement these sort of models already and allow us just
6588
05:33:42,560 --> 05:33:46,520
to be able to use them to produce some sort of resulting effect.
6589
05:33:46,520 --> 05:33:48,520
So all of this then allows our AI to begin
6590
05:33:48,520 --> 05:33:50,920
to deal with these sort of uncertain problems
6591
05:33:50,920 --> 05:33:53,360
so that our AI doesn't need to know things for certain
6592
05:33:53,360 --> 05:33:56,720
but can infer based on information it doesn't know.
6593
05:33:56,720 --> 05:33:59,560
Next time, we'll take a look at additional types of problems
6594
05:33:59,560 --> 05:34:02,520
that we can solve by taking advantage of AI-related algorithms,
6595
05:34:02,520 --> 05:34:05,760
even beyond the world of the types of problems we've already explored.
6596
05:34:05,760 --> 05:34:08,480
We'll see you next time.
6597
05:34:08,480 --> 05:34:27,360
OK.
6598
05:34:27,360 --> 05:34:30,120
Welcome back, everyone, to an introduction to artificial intelligence
6599
05:34:30,120 --> 05:34:31,080
with Python.
6600
05:34:31,080 --> 05:34:32,880
And now, so far, we've taken a look at a couple
6601
05:34:32,880 --> 05:34:34,600
of different types of problems.
6602
05:34:34,600 --> 05:34:36,320
We've seen classical search problems where
6603
05:34:36,320 --> 05:34:38,680
we're trying to get from an initial state to a goal
6604
05:34:38,680 --> 05:34:40,560
by figuring out some optimal path.
6605
05:34:40,560 --> 05:34:42,360
We've taken a look at adversarial search where
6606
05:34:42,360 --> 05:34:45,360
we have a game-playing agent that is trying to make the best move.
6607
05:34:45,360 --> 05:34:48,080
We've seen knowledge-based problems where we're trying to use logic
6608
05:34:48,080 --> 05:34:50,320
and inference to be able to figure out and draw
6609
05:34:50,320 --> 05:34:51,800
some additional conclusions.
6610
05:34:51,800 --> 05:34:54,400
And we've seen some probabilistic models as well where we might not
6611
05:34:54,400 --> 05:34:56,280
have certain information about the world,
6612
05:34:56,280 --> 05:34:59,480
but we want to use the knowledge about probabilities that we do have
6613
05:34:59,480 --> 05:35:01,480
to be able to draw some conclusions.
6614
05:35:01,480 --> 05:35:04,400
Today, we're going to turn our attention to another category of problems
6615
05:35:04,400 --> 05:35:08,480
generally known as optimization problems, where optimization is really
6616
05:35:08,480 --> 05:35:12,400
all about choosing the best option from a set of possible options.
6617
05:35:12,400 --> 05:35:14,680
And we've already seen optimization in some contexts,
6618
05:35:14,680 --> 05:35:17,140
like game-playing, where we're trying to create an AI that
6619
05:35:17,140 --> 05:35:19,760
chooses the best move out of a set of possible moves.
6620
05:35:19,760 --> 05:35:23,120
But what we'll take a look at today is a category of types of problems
6621
05:35:23,120 --> 05:35:25,360
and algorithms to solve them that can be used
6622
05:35:25,360 --> 05:35:29,720
in order to deal with a broader range of potential optimization problems.
6623
05:35:29,720 --> 05:35:32,040
And the first of the algorithms that we'll take a look at
6624
05:35:32,040 --> 05:35:34,280
is known as a local search.
6625
05:35:34,280 --> 05:35:36,320
And local search differs from search algorithms
6626
05:35:36,320 --> 05:35:38,960
we've seen before in the sense that the search algorithms we've
6627
05:35:38,960 --> 05:35:42,400
looked at so far, which are things like breadth-first search or A-star search,
6628
05:35:42,400 --> 05:35:45,800
for example, generally maintain a whole bunch of different paths
6629
05:35:45,800 --> 05:35:47,920
that we're simultaneously exploring, and we're
6630
05:35:47,920 --> 05:35:50,240
looking at a bunch of different paths at once trying
6631
05:35:50,240 --> 05:35:51,920
to find our way to the solution.
6632
05:35:51,920 --> 05:35:53,920
On the other hand, in local search, this is going
6633
05:35:53,920 --> 05:35:57,520
to be a search algorithm that's really just going to maintain a single node,
6634
05:35:57,520 --> 05:35:59,240
looking at a single state.
6635
05:35:59,240 --> 05:36:02,860
And we'll generally run this algorithm by maintaining that single node
6636
05:36:02,860 --> 05:36:05,680
and then moving ourselves to one of the neighboring nodes
6637
05:36:05,680 --> 05:36:07,600
throughout this search process.
6638
05:36:07,600 --> 05:36:10,900
And this is generally useful in contexts unlike these problems, which
6639
05:36:10,900 --> 05:36:13,360
we've seen before, like a maze-solving situation where
6640
05:36:13,360 --> 05:36:16,060
we're trying to find our way from the initial state to the goal
6641
05:36:16,060 --> 05:36:17,680
by following some path.
6642
05:36:17,680 --> 05:36:20,200
But local search is most applicable when we really
6643
05:36:20,200 --> 05:36:23,320
don't care about the path at all, and all we care about
6644
05:36:23,320 --> 05:36:24,880
is what the solution is.
6645
05:36:24,880 --> 05:36:27,460
And in the case of solving a maze, the solution was always obvious.
6646
05:36:27,460 --> 05:36:28,840
You could point to the solution.
6647
05:36:28,840 --> 05:36:31,280
You know exactly what the goal is, and the real question
6648
05:36:31,280 --> 05:36:33,160
is, what is the path to get there?
6649
05:36:33,160 --> 05:36:35,120
But local search is going to come up in cases
6650
05:36:35,120 --> 05:36:37,440
where figuring out exactly what the solution is,
6651
05:36:37,440 --> 05:36:41,640
exactly what the goal looks like, is actually the heart of the challenge.
6652
05:36:41,640 --> 05:36:44,140
And to give an example of one of these kinds of problems,
6653
05:36:44,140 --> 05:36:46,800
we'll consider a scenario where we have two types of buildings,
6654
05:36:46,800 --> 05:36:47,440
for example.
6655
05:36:47,440 --> 05:36:49,520
We have houses and hospitals.
6656
05:36:49,520 --> 05:36:52,520
And our goal might be in a world that's formatted as this grid,
6657
05:36:52,520 --> 05:36:55,080
where we have a whole bunch of houses, a house here, house here,
6658
05:36:55,080 --> 05:36:58,360
two houses over there, maybe we want to try and find a way
6659
05:36:58,360 --> 05:37:01,240
to place two hospitals on this map.
6660
05:37:01,240 --> 05:37:04,120
So maybe a hospital here and a hospital there.
6661
05:37:04,120 --> 05:37:07,160
And the problem now is we want to place two hospitals on the map,
6662
05:37:07,160 --> 05:37:09,960
but we want to do so with some sort of objective.
6663
05:37:09,960 --> 05:37:12,880
And our objective in this case is to try and minimize
6664
05:37:12,880 --> 05:37:16,280
the distance of any of the houses from a hospital.
6665
05:37:16,280 --> 05:37:18,320
So you might imagine, all right, what's the distance
6666
05:37:18,320 --> 05:37:20,440
from each of the houses to their nearest hospital?
6667
05:37:20,440 --> 05:37:23,040
There are a number of ways we could calculate that distance.
6668
05:37:23,040 --> 05:37:25,440
But one way is using a heuristic we've looked at before,
6669
05:37:25,440 --> 05:37:28,320
which is the Manhattan distance, this idea of how many rows
6670
05:37:28,320 --> 05:37:32,000
and columns would you have to move inside of this grid layout in order
6671
05:37:32,000 --> 05:37:34,360
to get to a hospital, for example.
6672
05:37:34,360 --> 05:37:36,760
And it turns out, if you take each of these four houses
6673
05:37:36,760 --> 05:37:39,600
and figure out, all right, how close are they to their nearest hospital,
6674
05:37:39,600 --> 05:37:42,960
you get something like this, where this house is three away from a hospital,
6675
05:37:42,960 --> 05:37:46,040
this house is six away, and these two houses are each four away.
6676
05:37:46,040 --> 05:37:48,040
And if you add all those numbers up together,
6677
05:37:48,040 --> 05:37:51,840
you get a total cost of 17, for example.
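That cost calculation can be sketched in a few lines. The grid coordinates here are hypothetical, chosen only so that the per-house distances come out to 3 + 6 + 4 + 4 = 17 as in the example; the lecture does not specify exact positions.

```python
# Sketch of the cost function just described: the sum, over all houses,
# of each house's Manhattan distance to its nearest hospital.

def manhattan(a, b):
    """Manhattan distance: rows plus columns between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def cost(houses, hospitals):
    """Total distance from every house to its nearest hospital."""
    return sum(min(manhattan(house, h) for h in hospitals)
               for house in houses)

# Hypothetical layout reproducing the distances 3, 6, 4, and 4.
houses = [(1, 2), (0, 6), (5, 1), (2, 4)]
hospitals = [(0, 0), (5, 5)]
print(cost(houses, hospitals))  # → 17
```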
6678
05:37:51,840 --> 05:37:55,360
So for this particular configuration of hospitals, a hospital here
6679
05:37:55,360 --> 05:37:58,160
and a hospital there, that state, we might say,
6680
05:37:58,160 --> 05:37:59,920
has a cost of 17.
6681
05:37:59,920 --> 05:38:01,840
And the goal of this problem now that we would
6682
05:38:01,840 --> 05:38:04,160
like to apply a search algorithm to figure out
6683
05:38:04,160 --> 05:38:08,440
is, can you solve this problem to find a way to minimize that cost?
6684
05:38:08,440 --> 05:38:11,880
Minimize the total amount if you sum up all of the distances
6685
05:38:11,880 --> 05:38:14,040
from all the houses to the nearest hospital.
6686
05:38:14,040 --> 05:38:16,600
How can we minimize that final value?
6687
05:38:16,600 --> 05:38:19,320
And if we think about this problem a little bit more abstractly,
6688
05:38:19,320 --> 05:38:21,400
abstracting away from this specific problem
6689
05:38:21,400 --> 05:38:23,880
and thinking more generally about problems like it,
6690
05:38:23,880 --> 05:38:26,800
you can often formulate these problems by thinking about them
6691
05:38:26,800 --> 05:38:29,720
as a state-space landscape, as we'll soon call it.
6692
05:38:29,720 --> 05:38:32,120
Here in this diagram of a state-space landscape,
6693
05:38:32,120 --> 05:38:35,760
each of these vertical bars represents a particular state
6694
05:38:35,760 --> 05:38:37,040
that our world could be in.
6695
05:38:37,040 --> 05:38:39,320
So for example, each of these vertical bars
6696
05:38:39,320 --> 05:38:43,200
represents a particular configuration of two hospitals.
6697
05:38:43,200 --> 05:38:45,680
And the height of this vertical bar is generally
6698
05:38:45,680 --> 05:38:50,160
going to represent some function of that state, some value of that state.
6699
05:38:50,160 --> 05:38:52,560
So maybe in this case, the height of the vertical bar
6700
05:38:52,560 --> 05:38:56,160
represents what is the cost of this particular configuration
6701
05:38:56,160 --> 05:38:59,720
of hospitals: the sum total of all the distances
6702
05:38:59,720 --> 05:39:03,320
from all of the houses to their nearest hospital.
6703
05:39:03,320 --> 05:39:06,360
And generally speaking, when we have a state-space landscape,
6704
05:39:06,360 --> 05:39:08,640
we want to do one of two things.
6705
05:39:08,640 --> 05:39:12,080
We might be trying to maximize the value of this function,
6706
05:39:12,080 --> 05:39:16,280
trying to find a global maximum, so to speak, of this state-space landscape,
6707
05:39:16,280 --> 05:39:20,360
a single state whose value is higher than all of the other states
6708
05:39:20,360 --> 05:39:22,040
that we could possibly choose from.
6709
05:39:22,040 --> 05:39:25,040
And generally in this case, when we're trying to find a global maximum,
6710
05:39:25,040 --> 05:39:27,720
we'll call the function that we're trying to optimize
6711
05:39:27,720 --> 05:39:30,120
some objective function, some function that
6712
05:39:30,120 --> 05:39:34,040
measures for any given state how good is that state,
6713
05:39:34,040 --> 05:39:37,160
such that we can take any state, pass it into the objective function,
6714
05:39:37,160 --> 05:39:39,640
and get a value for how good that state is.
6715
05:39:39,640 --> 05:39:42,760
And ultimately, what our goal is is to find one of these states
6716
05:39:42,760 --> 05:39:46,840
that has the highest possible value for that objective function.
6717
05:39:46,840 --> 05:39:49,280
An equivalent but reversed problem is the problem
6718
05:39:49,280 --> 05:39:52,400
of finding a global minimum, some state that has a value
6719
05:39:52,400 --> 05:39:55,960
after you pass it into this function that is lower than all of the other
6720
05:39:55,960 --> 05:39:57,840
possible values that we might choose from.
6721
05:39:57,840 --> 05:40:00,560
And generally speaking, when we're trying to find a global minimum,
6722
05:40:00,560 --> 05:40:03,720
we call the function that we're calculating a cost function.
6723
05:40:03,720 --> 05:40:05,960
Generally, each state has some sort of cost,
6724
05:40:05,960 --> 05:40:08,840
whether that cost is a monetary cost, or a time cost,
6725
05:40:08,840 --> 05:40:10,720
or in the case of the houses and hospitals,
6726
05:40:10,720 --> 05:40:13,360
we've been looking at just now, a distance cost in terms
6727
05:40:13,360 --> 05:40:17,000
of how far away each of the houses is from a hospital.
6728
05:40:17,000 --> 05:40:19,080
And we're trying to minimize the cost, find
6729
05:40:19,080 --> 05:40:23,560
the state that has the lowest possible value of that cost.
6730
05:40:23,560 --> 05:40:25,520
So these are the general types of ideas we
6731
05:40:25,520 --> 05:40:28,160
might be trying to go for within a state-space landscape,
6732
05:40:28,160 --> 05:40:32,240
trying to find a global maximum, or trying to find a global minimum.
6733
05:40:32,240 --> 05:40:33,960
And how exactly do we do that?
6734
05:40:33,960 --> 05:40:36,160
We'll recall that in local search, we generally
6735
05:40:36,160 --> 05:40:39,160
operate this algorithm by maintaining just a single state,
6736
05:40:39,160 --> 05:40:41,960
just some current state represented inside of some node,
6737
05:40:41,960 --> 05:40:43,800
maybe inside of a data structure, where we're
6738
05:40:43,800 --> 05:40:46,280
keeping track of where we are currently.
6739
05:40:46,280 --> 05:40:49,320
And then ultimately, what we're going to do is from that state,
6740
05:40:49,320 --> 05:40:51,640
move to one of its neighbor states.
6741
05:40:51,640 --> 05:40:54,140
So in this case, represented in this one-dimensional space
6742
05:40:54,140 --> 05:40:57,000
by just the state immediately to the left or to the right of it.
6743
05:40:57,000 --> 05:40:58,960
But for any different problem, you might define
6744
05:40:58,960 --> 05:41:02,080
what it means for there to be a neighbor of a particular state.
6745
05:41:02,080 --> 05:41:05,000
In the case of a hospital, for example, that we were just looking at,
6746
05:41:05,000 --> 05:41:08,620
a neighbor might be moving one hospital one space to the left
6747
05:41:08,620 --> 05:41:10,280
or to the right or up or down.
6748
05:41:10,280 --> 05:41:14,560
Some state that is close to our current state, but slightly different,
6749
05:41:14,560 --> 05:41:17,040
and as a result, might have a slightly different value
6750
05:41:17,040 --> 05:41:21,600
in terms of its objective function or in terms of its cost function.
6751
05:41:21,600 --> 05:41:24,240
So this is going to be our general strategy in local search,
6752
05:41:24,240 --> 05:41:27,140
to be able to take a state, maintaining some current node,
6753
05:41:27,140 --> 05:41:29,960
and move where we're looking at in the state-space landscape
6754
05:41:29,960 --> 05:41:33,800
in order to try to find a global maximum or a global minimum somehow.
6755
05:41:33,800 --> 05:41:35,760
And perhaps the simplest of algorithms that we
6756
05:41:35,760 --> 05:41:38,720
could use to implement this idea of local search
6757
05:41:38,720 --> 05:41:41,120
is an algorithm known as hill climbing.
6758
05:41:41,120 --> 05:41:43,160
And the basic idea of hill climbing is, let's
6759
05:41:43,160 --> 05:41:46,720
say I'm trying to maximize the value of my state.
6760
05:41:46,720 --> 05:41:49,160
I'm trying to figure out where the global maximum is.
6761
05:41:49,160 --> 05:41:50,720
I'm going to start at a state.
6762
05:41:50,720 --> 05:41:53,120
And generally, what hill climbing is going to do
6763
05:41:53,120 --> 05:41:55,720
is it's going to consider the neighbors of that state,
6764
05:41:55,720 --> 05:41:58,720
that from this state, all right, I could go left or I could go right,
6765
05:41:58,720 --> 05:42:01,880
and this neighbor happens to be higher and this neighbor happens to be lower.
6766
05:42:01,880 --> 05:42:04,880
And in hill climbing, if I'm trying to maximize the value,
6767
05:42:04,880 --> 05:42:07,680
I'll generally pick the highest one I can between the state
6768
05:42:07,680 --> 05:42:08,920
to the left and right of me.
6769
05:42:08,920 --> 05:42:10,120
This one is higher.
6770
05:42:10,120 --> 05:42:13,600
So I'll go ahead and move myself to consider that state instead.
6771
05:42:13,600 --> 05:42:17,160
And then I'll repeat this process, continually looking at all of my neighbors
6772
05:42:17,160 --> 05:42:19,360
and picking the highest neighbor, doing the same thing,
6773
05:42:19,360 --> 05:42:21,880
looking at my neighbors, picking the highest of my neighbors,
6774
05:42:21,880 --> 05:42:25,960
until I get to a point like right here, where I consider both of my neighbors
6775
05:42:25,960 --> 05:42:29,040
and both of my neighbors have a lower value than I do.
6776
05:42:29,040 --> 05:42:32,840
This current state has a value that is higher than any of its neighbors.
6777
05:42:32,840 --> 05:42:34,640
And at that point, the algorithm terminates.
6778
05:42:34,640 --> 05:42:38,320
And I can say, all right, here I have now found the solution.
6779
05:42:38,320 --> 05:42:40,760
And the same thing works in exactly the opposite way
6780
05:42:40,760 --> 05:42:42,120
for trying to find a global minimum.
6781
05:42:42,120 --> 05:42:44,120
But the algorithm is fundamentally the same.
6782
05:42:44,120 --> 05:42:47,360
If I'm trying to find a global minimum and say my current state starts here,
6783
05:42:47,360 --> 05:42:50,240
I'll continually look at my neighbors, pick the lowest value
6784
05:42:50,240 --> 05:42:53,160
that I possibly can, until I eventually, hopefully,
6785
05:42:53,160 --> 05:42:55,680
find that global minimum, a point at which when
6786
05:42:55,680 --> 05:42:58,600
I look at both of my neighbors, they each have a higher value.
6787
05:42:58,600 --> 05:43:02,560
And I'm trying to minimize the total score or cost or value
6788
05:43:02,560 --> 05:43:06,840
that I get as a result of calculating some sort of cost function.
6789
05:43:06,840 --> 05:43:09,880
So we can formulate this graphical idea in terms of pseudocode.
6790
05:43:09,880 --> 05:43:12,480
And the pseudocode for hill climbing might look like this.
6791
05:43:12,480 --> 05:43:15,080
We define some function called hill climb that
6792
05:43:15,080 --> 05:43:17,760
takes as input the problem that we're trying to solve.
6793
05:43:17,760 --> 05:43:21,200
And generally, we're going to start in some sort of initial state.
6794
05:43:21,200 --> 05:43:23,160
So I'll start with a variable called current
6795
05:43:23,160 --> 05:43:26,920
that is keeping track of my initial state, like an initial configuration
6796
05:43:26,920 --> 05:43:27,960
of hospitals.
6797
05:43:27,960 --> 05:43:30,480
And maybe some problems lend themselves to an initial state,
6798
05:43:30,480 --> 05:43:31,840
some place where you begin.
6799
05:43:31,840 --> 05:43:34,920
In other cases, maybe not, in which case we might just randomly
6800
05:43:34,920 --> 05:43:38,640
generate some initial state, just by choosing two locations for hospitals
6801
05:43:38,640 --> 05:43:41,160
at random, for example, and figuring out from there
6802
05:43:41,160 --> 05:43:42,800
how we might be able to improve.
6803
05:43:42,800 --> 05:43:46,240
But that initial state, we're going to store inside of current.
6804
05:43:46,240 --> 05:43:48,840
And now, here comes our loop, some repetitive process
6805
05:43:48,840 --> 05:43:52,040
we're going to do again and again until the algorithm terminates.
6806
05:43:52,040 --> 05:43:55,040
And what we're going to do is first say, let's
6807
05:43:55,040 --> 05:43:57,920
figure out all of the neighbors of the current state.
6808
05:43:57,920 --> 05:43:59,920
From my state, what are all of the neighboring
6809
05:43:59,920 --> 05:44:02,800
states for some definition of what it means to be a neighbor?
6810
05:44:02,800 --> 05:44:06,560
And I'll go ahead and choose the highest value of all of those neighbors
6811
05:44:06,560 --> 05:44:09,080
and save it inside of this variable called neighbor.
6812
05:44:09,080 --> 05:44:11,160
So keep track of the highest-valued neighbor.
6813
05:44:11,160 --> 05:44:14,080
This is in the case where I'm trying to maximize the value.
6814
05:44:14,080 --> 05:44:15,880
In the case where I'm trying to minimize the value,
6815
05:44:15,880 --> 05:44:17,360
you might imagine here, you'll pick the neighbor
6816
05:44:17,360 --> 05:44:18,880
with the lowest possible value.
6817
05:44:18,880 --> 05:44:21,640
But these ideas are really fundamentally interchangeable.
6818
05:44:21,640 --> 05:44:24,720
And it's possible, in some cases, there might be multiple neighbors
6819
05:44:24,720 --> 05:44:28,200
that each have an equally high value or an equally low value
6820
05:44:28,200 --> 05:44:29,480
in the minimizing case.
6821
05:44:29,480 --> 05:44:31,920
And in that case, we can just choose randomly from among them.
6822
05:44:31,920 --> 05:44:35,480
Choose one of them and save it inside of this variable neighbor.
6823
05:44:35,480 --> 05:44:39,680
And then the key question to ask is, is this neighbor better
6824
05:44:39,680 --> 05:44:41,600
than my current state?
6825
05:44:41,600 --> 05:44:44,840
And if the neighbor, the best neighbor that I was able to find,
6826
05:44:44,840 --> 05:44:48,440
is not better than my current state, well, then the algorithm is over.
6827
05:44:48,440 --> 05:44:50,520
And I'll just go ahead and return the current state.
6828
05:44:50,520 --> 05:44:53,800
If none of my neighbors are better, then I may as well stay where I am,
6829
05:44:53,800 --> 05:44:56,520
is the general logic of the hill climbing algorithm.
6830
05:44:56,520 --> 05:44:59,200
But otherwise, if the neighbor is better, then I may as well
6831
05:44:59,200 --> 05:45:00,320
move to that neighbor.
6832
05:45:00,320 --> 05:45:04,160
So you might imagine setting current equal to neighbor, where the general idea
6833
05:45:04,160 --> 05:45:07,040
is if I'm at a current state and I see a neighbor that is better than me,
6834
05:45:07,040 --> 05:45:08,360
then I'll go ahead and move there.
6835
05:45:08,360 --> 05:45:11,840
And then I'll repeat the process, continually moving to a better neighbor
6836
05:45:11,840 --> 05:45:15,760
until I reach a point at which none of my neighbors are better than I am.
6837
05:45:15,760 --> 05:45:19,600
And at that point, we'd say the algorithm can just terminate there.
6838
05:45:19,600 --> 05:45:21,640
So let's take a look at a real example of this
6839
05:45:21,640 --> 05:45:23,240
with these houses and hospitals.
6840
05:45:23,240 --> 05:45:26,480
So we've seen now that if we put the hospitals in these two locations,
6841
05:45:26,480 --> 05:45:28,360
that has a total cost of 17.
6842
05:45:28,360 --> 05:45:31,280
And now we need to define, if we're going to implement this hill climbing
6843
05:45:31,280 --> 05:45:34,920
algorithm, what it means to take this particular configuration
6844
05:45:34,920 --> 05:45:39,760
of hospitals, this particular state, and get a neighbor of that state.
6845
05:45:39,760 --> 05:45:42,080
And a simple definition of neighbor might be just,
6846
05:45:42,080 --> 05:45:46,680
let's pick one of the hospitals and move it by one square, to the left or right
6847
05:45:46,680 --> 05:45:48,520
or up or down, for example.
6848
05:45:48,520 --> 05:45:50,960
And that would mean we have six possible neighbors
6849
05:45:50,960 --> 05:45:52,440
from this particular configuration.
6850
05:45:52,440 --> 05:45:56,200
We could take this hospital and move it to any of these three possible squares,
6851
05:45:56,200 --> 05:46:00,000
or we could take this hospital and move it to any of those three possible squares.
6852
05:46:00,000 --> 05:46:02,640
And each of those would generate a neighbor.
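The neighbor definition just described — move one hospital one square up, down, left, or right — might be sketched like this; the function name and signature are made up for illustration, not taken from the course's `hospitals.py`:

```python
def get_neighbors(hospitals, height, width, houses):
    """All configurations reachable by moving one hospital one square
    up, down, left, or right, staying on the grid and avoiding squares
    already occupied by a house or another hospital."""
    occupied = set(houses) | set(hospitals)
    neighbors = []
    for i, (row, col) in enumerate(hospitals):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width and (r, c) not in occupied:
                candidate = list(hospitals)
                candidate[i] = (r, c)  # move hospital i to the new square
                neighbors.append(candidate)
    return neighbors
```

A hospital in the middle of an empty grid yields four neighbors; one in a corner yields at most two, matching the "six possible neighbors" count for two hospitals with three legal moves each.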
6853
05:46:02,640 --> 05:46:04,800
And what I might do is say, all right, here's
6854
05:46:04,800 --> 05:46:07,720
the locations and the distances between each of the houses
6855
05:46:07,720 --> 05:46:09,240
and their nearest hospital.
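The cost being minimized — the sum of each house's distance to its nearest hospital — might be computed with Manhattan distance, roughly as follows (representing coordinates as `(row, column)` tuples is an assumption for this sketch):

```python
def total_cost(houses, hospitals):
    """Sum over all houses of the Manhattan distance to the nearest hospital."""
    return sum(
        min(abs(hr - r) + abs(hc - c) for (r, c) in hospitals)
        for (hr, hc) in houses
    )
```

With houses at `(0, 0)` and `(0, 4)` and a single hospital at `(0, 1)`, the cost is 1 + 3 = 4.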
6856
05:46:09,240 --> 05:46:12,280
Let me consider all of the neighbors and see if any of them
6857
05:46:12,280 --> 05:46:14,640
can do better than a cost of 17.
6858
05:46:14,640 --> 05:46:17,240
And it turns out there are a couple of ways that we could do that.
6859
05:46:17,240 --> 05:46:19,080
And it doesn't matter if we randomly choose
6860
05:46:19,080 --> 05:46:20,720
among all the ways that are the best.
6861
05:46:20,720 --> 05:46:24,880
But one such possible way is by taking a look at this hospital here
6862
05:46:24,880 --> 05:46:27,040
and considering the directions in which it might move.
6863
05:46:27,040 --> 05:46:30,360
If we hold this hospital constant, if we take this hospital
6864
05:46:30,360 --> 05:46:33,680
and move it one square up, for example, that doesn't really help us.
6865
05:46:33,680 --> 05:46:36,400
It gets closer to the house up here, but it gets further away
6866
05:46:36,400 --> 05:46:37,640
from the house down here.
6867
05:46:37,640 --> 05:46:40,080
And it doesn't really change anything for the two houses
6868
05:46:40,080 --> 05:46:41,800
along the left-hand side.
6869
05:46:41,800 --> 05:46:45,600
But if we take this hospital on the right and move it one square down,
6870
05:46:45,600 --> 05:46:46,600
it's the opposite problem.
6871
05:46:46,600 --> 05:46:49,000
It gets further away from the house up above,
6872
05:46:49,000 --> 05:46:51,160
and it gets closer to the house down below.
6873
05:46:51,160 --> 05:46:54,640
The real idea, the goal should be to be able to take this hospital
6874
05:46:54,640 --> 05:46:56,760
and move it one square to the left.
6875
05:46:56,760 --> 05:46:59,280
By moving it one square to the left, we move it closer
6876
05:46:59,280 --> 05:47:02,280
to both of these houses on the right without changing anything
6877
05:47:02,280 --> 05:47:03,360
about the houses on the left.
6878
05:47:03,360 --> 05:47:06,760
For them, this hospital is still the closer one, so they aren't affected.
6879
05:47:06,760 --> 05:47:10,480
So we're able to improve the situation by picking a neighbor that
6880
05:47:10,480 --> 05:47:13,000
results in a decrease in our total cost.
6881
05:47:13,000 --> 05:47:14,000
And so we might do that.
6882
05:47:14,000 --> 05:47:16,640
Move ourselves from this current state to a neighbor
6883
05:47:16,640 --> 05:47:19,440
by just taking that hospital and moving it.
6884
05:47:19,440 --> 05:47:21,160
And at this point, there's not a whole lot
6885
05:47:21,160 --> 05:47:22,640
that can be done with this hospital.
6886
05:47:22,640 --> 05:47:25,320
But there's still other optimizations we can make, other neighbors
6887
05:47:25,320 --> 05:47:27,920
we can move to that are going to have a better value.
6888
05:47:27,920 --> 05:47:29,960
If we consider this hospital, for example,
6889
05:47:29,960 --> 05:47:32,680
we might imagine that right now it's a bit far up,
6890
05:47:32,680 --> 05:47:34,840
that both of these houses are a little bit lower.
6891
05:47:34,840 --> 05:47:37,480
So we might be able to do better by taking this hospital
6892
05:47:37,480 --> 05:47:40,680
and moving it one square down, moving it down so that now instead
6893
05:47:40,680 --> 05:47:43,600
of a cost of 15, we're down to a cost of 13
6894
05:47:43,600 --> 05:47:45,320
for this particular configuration.
6895
05:47:45,320 --> 05:47:47,680
And we can do even better by taking the hospital
6896
05:47:47,680 --> 05:47:49,520
and moving it one square to the left.
6897
05:47:49,520 --> 05:47:52,360
Now instead of a cost of 13, we have a cost of 11,
6898
05:47:52,360 --> 05:47:54,880
because this house is one away from the hospital.
6899
05:47:54,880 --> 05:47:56,360
This one is four away.
6900
05:47:56,360 --> 05:47:57,520
This one is three away.
6901
05:47:57,520 --> 05:47:59,440
And this one is also three away.
6902
05:47:59,440 --> 05:48:02,160
So we've been able to do much better than that initial cost
6903
05:48:02,160 --> 05:48:04,680
that we had using the initial configuration.
6904
05:48:04,680 --> 05:48:07,720
Just by taking every state and asking ourselves the question,
6905
05:48:07,720 --> 05:48:11,120
can we do better by just making small incremental changes,
6906
05:48:11,120 --> 05:48:12,960
moving to a neighbor, moving to a neighbor,
6907
05:48:12,960 --> 05:48:15,360
and moving to a neighbor after that?
6908
05:48:15,360 --> 05:48:18,880
And now we can see that, at this point,
6909
05:48:18,880 --> 05:48:20,280
the algorithm is going to terminate.
6910
05:48:20,280 --> 05:48:22,680
There's actually no neighbor we can move to
6911
05:48:22,680 --> 05:48:27,120
that is going to improve the situation, get us a cost that is less than 11.
6912
05:48:27,120 --> 05:48:29,600
Because if we take this hospital and move it up or to the right,
6913
05:48:29,600 --> 05:48:31,320
well, that's going to make it further away.
6914
05:48:31,320 --> 05:48:34,480
If we take it and move it down, that doesn't really change the situation.
6915
05:48:34,480 --> 05:48:37,400
It gets further away from this house but closer to that house.
6916
05:48:37,400 --> 05:48:40,120
And likewise, the same story was true for this hospital.
6917
05:48:40,120 --> 05:48:42,880
Any neighbor we move it to, up, left, down, or right,
6918
05:48:42,880 --> 05:48:46,920
is either going to make it further away from the houses and increase the cost,
6919
05:48:46,920 --> 05:48:51,080
or it's going to have no effect on the cost whatsoever.
6920
05:48:51,080 --> 05:48:54,360
And so the question we might now ask is, is this the best we could do?
6921
05:48:54,360 --> 05:48:57,840
Is this the best placement of the hospitals we could possibly have?
6922
05:48:57,840 --> 05:49:00,560
And it turns out the answer is no, because there's a better way
6923
05:49:00,560 --> 05:49:02,720
that we could place these hospitals.
6924
05:49:02,720 --> 05:49:05,120
And in particular, there are a number of ways you could do this.
6925
05:49:05,120 --> 05:49:07,760
But one of the ways is by taking this hospital here
6926
05:49:07,760 --> 05:49:10,760
and moving it to this square, for example, moving it diagonally
6927
05:49:10,760 --> 05:49:13,520
by one square, which was not part of our definition of neighbor.
6928
05:49:13,520 --> 05:49:15,720
We could only move left, right, up, or down.
6929
05:49:15,720 --> 05:49:17,240
But this is, in fact, better.
6930
05:49:17,240 --> 05:49:18,760
It has a total cost of 9.
6931
05:49:18,760 --> 05:49:21,040
It is now closer to both of these houses.
6932
05:49:21,040 --> 05:49:24,240
And as a result, the total cost is less.
6933
05:49:24,240 --> 05:49:27,480
But we weren't able to find it, because in order to get there,
6934
05:49:27,480 --> 05:49:31,320
we had to go through a state that actually wasn't any better than the current
6935
05:49:31,320 --> 05:49:33,600
state that we had been on previously.
6936
05:49:33,600 --> 05:49:36,920
And so this appears to be a limitation, or a concern you might have
6937
05:49:36,920 --> 05:49:39,960
as you go about trying to implement a hill climbing algorithm,
6938
05:49:39,960 --> 05:49:43,320
is that it might not always give you the optimal solution.
6939
05:49:43,320 --> 05:49:46,400
If we're trying to maximize the value of any particular state,
6940
05:49:46,400 --> 05:49:49,000
we're trying to find the global maximum, a concern
6941
05:49:49,000 --> 05:49:53,040
might be that we could get stuck at one of the local maxima,
6942
05:49:53,040 --> 05:49:57,840
highlighted here in blue, where a local maximum is any state whose value is
6943
05:49:57,840 --> 05:49:59,360
higher than any of its neighbors.
6944
05:49:59,360 --> 05:50:02,040
If we ever find ourselves at one of these two states
6945
05:50:02,040 --> 05:50:04,320
when we're trying to maximize the value of the state,
6946
05:50:04,320 --> 05:50:05,820
we're not going to make any changes.
6947
05:50:05,820 --> 05:50:07,320
We're not going to move left or right.
6948
05:50:07,320 --> 05:50:10,760
We're not going to move left here, because those states are worse.
6949
05:50:10,760 --> 05:50:13,280
But yet, we haven't found the global optimum.
6950
05:50:13,280 --> 05:50:15,560
We haven't done the best we could do.
6951
05:50:15,560 --> 05:50:18,100
And likewise, in the case of the hospitals, what we're ultimately
6952
05:50:18,100 --> 05:50:20,960
trying to do is find a global minimum, find a value that
6953
05:50:20,960 --> 05:50:22,720
is lower than all of the others.
6954
05:50:22,720 --> 05:50:26,640
But we have the potential to get stuck at one of the local minima,
6955
05:50:26,640 --> 05:50:30,160
any of these states whose value is lower than all of its neighbors,
6956
05:50:30,160 --> 05:50:33,680
but still not as low as the global minimum.
6957
05:50:33,680 --> 05:50:36,800
And so the takeaway here is that it's not always
6958
05:50:36,800 --> 05:50:40,280
going to be the case that when we run this naive hill climbing algorithm,
6959
05:50:40,280 --> 05:50:42,280
that we're always going to find the optimal solution.
6960
05:50:42,280 --> 05:50:43,960
There are things that could go wrong.
6961
05:50:43,960 --> 05:50:47,640
If we started here, for example, and tried to maximize our value as much
6962
05:50:47,640 --> 05:50:50,800
as possible, we might move to the highest possible neighbor,
6963
05:50:50,800 --> 05:50:54,000
move to the highest possible neighbor, move to the highest possible neighbor,
6964
05:50:54,000 --> 05:50:57,960
and stop, and never realize that there's actually a better state way over there
6965
05:50:57,960 --> 05:51:00,280
that we could have gone to instead.
6966
05:51:00,280 --> 05:51:03,200
And other problems you might imagine just by taking a look at this state
6967
05:51:03,200 --> 05:51:06,800
space landscape are these various different types of plateaus,
6968
05:51:06,800 --> 05:51:09,200
something like this flat local maximum here,
6969
05:51:09,200 --> 05:51:12,840
where all six of these states each have the exact same value.
6970
05:51:12,840 --> 05:51:15,800
And so in the case of the algorithm we showed before,
6971
05:51:15,800 --> 05:51:17,800
none of the neighbors are better, so we might just
6972
05:51:17,800 --> 05:51:19,800
get stuck at this flat local maximum.
6973
05:51:19,800 --> 05:51:22,360
And even if you allowed yourself to move to one of the neighbors,
6974
05:51:22,360 --> 05:51:25,120
it wouldn't be clear which neighbor you would ultimately move to,
6975
05:51:25,120 --> 05:51:27,280
and you could get stuck here as well.
6976
05:51:27,280 --> 05:51:28,680
And there's another one over here.
6977
05:51:28,680 --> 05:51:30,040
This one is called a shoulder.
6978
05:51:30,040 --> 05:51:32,000
It's not really a local maximum, because there's still
6979
05:51:32,000 --> 05:51:35,240
places where we can go higher, not a local minimum, because we can go lower.
6980
05:51:35,240 --> 05:51:38,680
So we can still make progress, but it's still this flat area,
6981
05:51:38,680 --> 05:51:40,720
where if you have a local search algorithm,
6982
05:51:40,720 --> 05:51:44,560
there's potential to get lost here, unable to make some upward or downward
6983
05:51:44,560 --> 05:51:48,040
progress, depending on whether we're trying to maximize or minimize it,
6984
05:51:48,040 --> 05:51:50,200
and therefore another potential for us to be
6985
05:51:50,200 --> 05:51:54,960
able to find a solution that might not actually be the optimal solution.
6986
05:51:54,960 --> 05:51:57,880
And so because of this potential, the potential that hill climbing
6987
05:51:57,880 --> 05:52:00,500
has to not always find us the optimal result,
6988
05:52:00,500 --> 05:52:03,520
it turns out there are a number of different varieties and variations
6989
05:52:03,520 --> 05:52:07,360
on the hill climbing algorithm that help to solve the problem better
6990
05:52:07,360 --> 05:52:10,520
depending on the context, and depending on the specific type of problem,
6991
05:52:10,520 --> 05:52:13,240
some of these variants might be more applicable than others.
6992
05:52:13,240 --> 05:52:16,000
What we've taken a look at so far is a version of hill climbing
6993
05:52:16,000 --> 05:52:19,280
generally called steepest ascent hill climbing,
6994
05:52:19,280 --> 05:52:21,520
where the idea of steepest ascent hill climbing
6995
05:52:21,520 --> 05:52:24,440
is we are going to choose the highest valued neighbor,
6996
05:52:24,440 --> 05:52:27,320
in the case where we're trying to maximize or the lowest valued neighbor
6997
05:52:27,320 --> 05:52:28,860
in cases where we're trying to minimize.
6998
05:52:28,860 --> 05:52:31,160
But generally speaking, if I have five neighbors
6999
05:52:31,160 --> 05:52:33,240
and they're all better than my current state,
7000
05:52:33,240 --> 05:52:36,000
I will pick the best one of those five.
7001
05:52:36,000 --> 05:52:37,480
Now, sometimes that might work pretty well.
7002
05:52:37,480 --> 05:52:40,560
It's sort of a greedy approach of trying to take the best operation
7003
05:52:40,560 --> 05:52:43,360
at any particular time step, but it might not always work.
7004
05:52:43,360 --> 05:52:45,080
There might be cases where actually I want
7005
05:52:45,080 --> 05:52:47,320
to choose an option that is slightly better than me,
7006
05:52:47,320 --> 05:52:50,280
but maybe not the best one because that later on might
7007
05:52:50,280 --> 05:52:52,080
lead to a better outcome ultimately.
7008
05:52:52,080 --> 05:52:54,320
So there are other variants that we might consider
7009
05:52:54,320 --> 05:52:56,520
of this basic hill climbing algorithm.
7010
05:52:56,520 --> 05:52:58,560
One is known as stochastic hill climbing.
7011
05:52:58,560 --> 05:53:02,200
And in this case, we choose randomly from all of our higher value neighbors.
7012
05:53:02,200 --> 05:53:04,800
So if I'm at my current state and there are five neighbors that
7013
05:53:04,800 --> 05:53:07,680
are all better than I am, rather than choosing the best one,
7014
05:53:07,680 --> 05:53:10,320
as steepest-ascent would do, stochastic will just choose
7015
05:53:10,320 --> 05:53:13,880
randomly from one of them, thinking that if it's better, then it's better.
7016
05:53:13,880 --> 05:53:16,120
And maybe there's a potential to make forward progress,
7017
05:53:16,120 --> 05:53:20,680
even if it is not locally the best option I could possibly choose.
7018
05:53:20,680 --> 05:53:24,120
First choice hill climbing ends up just choosing the very first highest
7019
05:53:24,120 --> 05:53:27,040
valued neighbor that it finds, following a similar idea,
7020
05:53:27,040 --> 05:53:28,960
rather than consider all of the neighbors.
7021
05:53:28,960 --> 05:53:31,800
As soon as we find a neighbor that is better than our current state,
7022
05:53:31,800 --> 05:53:33,080
we'll go ahead and move there.
7023
05:53:33,080 --> 05:53:35,000
There may be some efficiency improvements there
7024
05:53:35,000 --> 05:53:37,200
and maybe has the potential to find a solution
7025
05:53:37,200 --> 05:53:39,920
that the other strategies weren't able to find.
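A single step of these two variants might be sketched as follows; both function names are hypothetical, chosen just to contrast the selection rules:

```python
import random

def stochastic_step(current, neighbors, cost):
    """Stochastic hill climbing: choose uniformly at random among all
    neighbors that improve on the current state (None = local optimum)."""
    better = [n for n in neighbors(current) if cost(n) < cost(current)]
    return random.choice(better) if better else None

def first_choice_step(current, neighbors, cost):
    """First-choice hill climbing: take the first improving neighbor
    encountered, without evaluating the rest."""
    for n in neighbors(current):
        if cost(n) < cost(current):
            return n  # better than current, so move there immediately
    return None
```

First-choice can be cheaper when there are many neighbors, since it stops evaluating as soon as any improvement appears, while stochastic still scores every neighbor before picking randomly among the improving ones.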
7026
05:53:39,920 --> 05:53:43,800
And with all of these variants, we still suffer from the same potential risk,
7027
05:53:43,800 --> 05:53:48,120
this risk that we might end up at a local minimum or a local maximum.
7028
05:53:48,120 --> 05:53:52,080
And we can reduce that risk by repeating the process multiple times.
7029
05:53:52,080 --> 05:53:55,760
So one variant of hill climbing is random restart hill climbing,
7030
05:53:55,760 --> 05:53:59,880
where the general idea is we'll conduct hill climbing multiple times.
7031
05:53:59,880 --> 05:54:02,680
If we apply steepest-ascent hill climbing, for example,
7032
05:54:02,680 --> 05:54:04,840
we'll start at some random state, try and figure out
7033
05:54:04,840 --> 05:54:06,560
how to solve the problem and figure out what
7034
05:54:06,560 --> 05:54:09,440
is the local maximum or local minimum we get to.
7035
05:54:09,440 --> 05:54:11,720
And then we'll just randomly restart and try again,
7036
05:54:11,720 --> 05:54:14,520
choose a new starting configuration, try and figure out
7037
05:54:14,520 --> 05:54:17,800
what the local maximum or minimum is, and do this some number of times.
7038
05:54:17,800 --> 05:54:19,920
And then after we've done it some number of times,
7039
05:54:19,920 --> 05:54:23,480
we can pick the best one out of all of the ones that we've taken a look at.
7040
05:54:23,480 --> 05:54:26,600
So there's another option we have access to as well.
7041
05:54:26,600 --> 05:54:29,360
And then, although I said that generally local search will usually
7042
05:54:29,360 --> 05:54:33,160
just keep track of a single node and then move to one of its neighbors,
7043
05:54:33,160 --> 05:54:36,880
there are variants of hill climbing that are known as local beam searches,
7044
05:54:36,880 --> 05:54:39,880
where rather than keep track of just one current best state,
7045
05:54:39,880 --> 05:54:43,880
we're keeping track of k highest valued neighbors, such that rather than
7046
05:54:43,880 --> 05:54:46,320
starting at one random initial configuration,
7047
05:54:46,320 --> 05:54:50,200
I might start with 3 or 4 or 5, randomly generate all the neighbors,
7048
05:54:50,200 --> 05:54:54,440
and then pick the 3 or 4 or 5 best of all of the neighbors that I find,
7049
05:54:54,440 --> 05:54:57,040
and continually repeat this process, with the idea
7050
05:54:57,040 --> 05:55:00,040
being that now I have more options that I'm considering,
7051
05:55:00,040 --> 05:55:02,680
more ways that I could potentially navigate myself
7052
05:55:02,680 --> 05:55:07,160
to the optimal solution that might exist for a particular problem.
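A rough sketch of local beam search under the same assumptions (generic `neighbors` and `cost` callables, a fixed iteration count; not the course's code):

```python
def local_beam_search(initial_states, neighbors, cost, k, iterations):
    """Track the k lowest-cost states each round instead of one current
    state, pooling all of their neighbors together before selecting."""
    states = list(initial_states)
    for _ in range(iterations):
        # Keep the current states in the pool so the beam never gets worse.
        candidates = states + [n for s in states for n in neighbors(s)]
        candidates.sort(key=cost)
        states = candidates[:k]
    return states[0]
```

Because the beam pools neighbors from all k states, a promising region found by one starting state can attract the whole beam, which is the extra flexibility described above.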
7053
05:55:07,160 --> 05:55:09,440
So let's now take a look at some actual code that
7054
05:55:09,440 --> 05:55:11,800
can implement some of these kinds of ideas, something
7055
05:55:11,800 --> 05:55:14,440
like steepest ascent hill climbing, for example,
7056
05:55:14,440 --> 05:55:17,280
for trying to solve this hospital problem.
7057
05:55:17,280 --> 05:55:20,280
So I'm going to go ahead and go into my hospitals directory, where
7058
05:55:20,280 --> 05:55:24,240
I've actually set up the basic framework for solving this type of problem.
7059
05:55:24,240 --> 05:55:26,400
I'll go ahead and go into hospitals.py, and we'll
7060
05:55:26,400 --> 05:55:28,200
take a look at the code we've created here.
7061
05:55:28,200 --> 05:55:32,720
I've defined a class that is going to represent the state space.
7062
05:55:32,720 --> 05:55:36,760
So the space has a height, and a width, and also some number of hospitals.
7063
05:55:36,760 --> 05:55:41,040
So you can configure how big is your map, how many hospitals should go here.
7064
05:55:41,040 --> 05:55:44,000
We have a function for adding a new house to the state space,
7065
05:55:44,000 --> 05:55:45,880
and then some functions that are going to get
7066
05:55:45,880 --> 05:55:49,280
me all of the available spaces for if I want to randomly place hospitals
7067
05:55:49,280 --> 05:55:50,880
in particular locations.
7068
05:55:50,880 --> 05:55:54,240
And here now is the hill climbing algorithm.
7069
05:55:54,240 --> 05:55:56,800
So what are we going to do in the hill climbing algorithm?
7070
05:55:56,800 --> 05:55:59,920
Well, we're going to start by randomly initializing
7071
05:55:59,920 --> 05:56:01,360
where the hospitals are going to go.
7072
05:56:01,360 --> 05:56:03,480
We don't know where the hospitals should actually be,
7073
05:56:03,480 --> 05:56:05,360
so let's just randomly place them.
7074
05:56:05,360 --> 05:56:08,760
So here I'm running a loop for each of the hospitals that I have.
7075
05:56:08,760 --> 05:56:13,520
I'm going to go ahead and add a new hospital at some random location.
7076
05:56:13,520 --> 05:56:15,800
So I basically get all of the available spaces,
7077
05:56:15,800 --> 05:56:17,960
and I randomly choose one of them as where
7078
05:56:17,960 --> 05:56:20,960
I would like to add this particular hospital.
7079
05:56:20,960 --> 05:56:23,220
I have some logging output and generating some images,
7080
05:56:23,220 --> 05:56:25,200
which we'll take a look at a little bit later.
7081
05:56:25,200 --> 05:56:27,160
But here is the key idea.
7082
05:56:27,160 --> 05:56:30,240
So I'm going to just keep repeating this algorithm.
7083
05:56:30,240 --> 05:56:33,240
I could specify a maximum of how many times I want it to run,
7084
05:56:33,240 --> 05:56:37,200
or I could just run it up until it hits a local maximum or local minimum.
7085
05:56:37,200 --> 05:56:40,280
And now we'll basically consider all of the hospitals
7086
05:56:40,280 --> 05:56:41,480
that could potentially move.
7087
05:56:41,480 --> 05:56:43,960
So consider each of the two hospitals or more hospitals
7088
05:56:43,960 --> 05:56:45,440
if they're more than that.
7089
05:56:45,440 --> 05:56:49,120
And consider all of the places where that hospital could move to,
7090
05:56:49,120 --> 05:56:53,080
some neighboring square that we could move the hospital to.
7091
05:56:53,080 --> 05:56:58,480
And then see, is this going to be better than where we were currently?
7092
05:56:58,480 --> 05:57:00,440
So if it is going to be better, then we'll
7093
05:57:00,440 --> 05:57:02,840
go ahead and update our best neighbor and keep
7094
05:57:02,840 --> 05:57:05,840
track of this new best neighbor that we found.
7095
05:57:05,840 --> 05:57:08,360
And then afterwards, we can ask ourselves the question,
7096
05:57:08,360 --> 05:57:10,440
if best neighbor cost is greater than or equal
7097
05:57:10,440 --> 05:57:13,040
to the cost of the current set of hospitals,
7098
05:57:13,040 --> 05:57:18,120
meaning if the cost of our best neighbor is at least as high as the current cost,
7099
05:57:18,120 --> 05:57:21,360
meaning our best neighbor is no better than our current state,
7100
05:57:21,360 --> 05:57:23,600
well, then we shouldn't make any changes at all.
7101
05:57:23,600 --> 05:57:27,200
And we should just go ahead and return the current set of hospitals.
7102
05:57:27,200 --> 05:57:29,800
But otherwise, we can update our hospitals
7103
05:57:29,800 --> 05:57:32,520
in order to change them to one of the best neighbors.
7104
05:57:32,520 --> 05:57:34,440
And if there are multiple that are all equivalent,
7105
05:57:34,440 --> 05:57:38,400
I'm here using random.choice to say go ahead and choose one randomly.
7106
05:57:38,400 --> 05:57:41,720
So this is really just a Python implementation of that same idea
7107
05:57:41,720 --> 05:57:44,640
that we were just talking about, this idea of taking a current state,
7108
05:57:44,640 --> 05:57:48,120
some current set of hospitals, generating all of the neighbors,
7109
05:57:48,120 --> 05:57:50,480
looking at all of the ways we could take one hospital
7110
05:57:50,480 --> 05:57:53,320
and move it one square to the left or right or up or down,
7111
05:57:53,320 --> 05:57:56,160
and then figuring out, based on all of that information, which
7112
05:57:56,160 --> 05:57:59,360
is the best neighbor or the set of all the best neighbors,
7113
05:57:59,360 --> 05:58:02,040
and then choosing from one of those.
7114
05:58:02,040 --> 05:58:05,920
And each time, we go ahead and generate an image in order to do that.
7115
05:58:05,920 --> 05:58:08,920
And so now what we're doing is if we look down at the bottom,
7116
05:58:08,920 --> 05:58:12,840
I'm going to randomly generate a space with height 10 and width 20.
7117
05:58:12,840 --> 05:58:16,000
And I'll say go ahead and put three hospitals somewhere in the space.
7118
05:58:16,000 --> 05:58:18,720
I'll randomly generate 15 houses that I just go ahead
7119
05:58:18,720 --> 05:58:20,680
and add in random locations.
7120
05:58:20,680 --> 05:58:23,640
And now I'm going to run this hill climbing algorithm in order
7121
05:58:23,640 --> 05:58:27,200
to try and figure out where we should place those hospitals.
7122
05:58:27,200 --> 05:58:31,400
So we'll go ahead and run this program by running Python hospitals.
7123
05:58:31,400 --> 05:58:32,440
And we see that we started.
7124
05:58:32,440 --> 05:58:35,360
Our initial state had a cost of 72, but we
7125
05:58:35,360 --> 05:58:38,560
were able to continually find neighbors that were able to decrease that cost,
7126
05:58:38,560 --> 05:58:43,440
decrease to 69, 66, 63, so on and so forth, all the way down to 53,
7127
05:58:43,440 --> 05:58:46,140
as the best neighbor we were able to ultimately find.
7128
05:58:46,140 --> 05:58:48,280
And we can take a look at what that looked like
7129
05:58:48,280 --> 05:58:50,200
by just opening up these files.
7130
05:58:50,200 --> 05:58:53,280
So here, for example, was the initial configuration.
7131
05:58:53,280 --> 05:58:57,280
We randomly selected a location for each of these 15 different houses
7132
05:58:57,280 --> 05:59:01,560
and then randomly selected locations for one, two, three hospitals
7133
05:59:01,560 --> 05:59:04,880
that were just located somewhere inside of the state space.
7134
05:59:04,880 --> 05:59:07,680
And if you add up all the distances from each of the houses
7135
05:59:07,680 --> 05:59:11,360
to their nearest hospital, you get a total cost of about 72.
7136
05:59:11,360 --> 05:59:14,280
And so now the question is, what neighbors can we move to
7137
05:59:14,280 --> 05:59:16,120
that improve the situation?
7138
05:59:16,120 --> 05:59:18,360
And it looks like the first one the algorithm found
7139
05:59:18,360 --> 05:59:21,680
was by taking this house that was over there on the right
7140
05:59:21,680 --> 05:59:23,880
and just moving it to the left.
7141
05:59:23,880 --> 05:59:25,640
And that probably makes sense because if you
7142
05:59:25,640 --> 05:59:29,760
look at the houses in that general area, really these five houses look like
7143
05:59:29,760 --> 05:59:33,240
they're probably the ones that are going to be closest to this hospital over here.
7144
05:59:33,240 --> 05:59:36,640
Moving it to the left decreases the total distance, at least
7145
05:59:36,640 --> 05:59:40,440
to most of these houses, though it does increase that distance for one of them.
7146
05:59:40,440 --> 05:59:43,160
And so we're able to make these improvements to the situation
7147
05:59:43,160 --> 05:59:47,280
by continually finding ways that we can move these hospitals around
7148
05:59:47,280 --> 05:59:50,800
until we eventually settle at this particular state that
7149
05:59:50,800 --> 05:59:54,760
has a cost of 53, where we figured out a position for each of the hospitals.
7150
05:59:54,760 --> 05:59:57,200
And now none of the neighbors that we could move to
7151
05:59:57,200 --> 05:59:59,600
are actually going to improve the situation.
7152
05:59:59,600 --> 06:00:02,280
We can take this hospital and this hospital and that hospital
7153
06:00:02,280 --> 06:00:03,760
and look at each of the neighbors.
7154
06:00:03,760 --> 06:00:07,400
And none of those are going to be better than this particular configuration.
7155
06:00:07,400 --> 06:00:10,040
And again, that's not to say that this is the best we could do.
7156
06:00:10,040 --> 06:00:12,520
There might be some other configuration of hospitals
7157
06:00:12,520 --> 06:00:14,240
that is a global minimum.
7158
06:00:14,240 --> 06:00:18,480
And this might just be a local minimum that is the best of all of its neighbors,
7159
06:00:18,480 --> 06:00:21,880
but maybe not the best in the entire possible state space.
7160
06:00:21,880 --> 06:00:24,000
And you could search through the entire state space
7161
06:00:24,000 --> 06:00:27,440
by considering all of the possible configurations for hospitals.
7162
06:00:27,440 --> 06:00:29,560
But ultimately, that's going to be very time intensive,
7163
06:00:29,560 --> 06:00:31,720
especially as our state space gets bigger and there
7164
06:00:31,720 --> 06:00:33,600
might be more and more possible states.
7165
06:00:33,600 --> 06:00:36,200
It's going to take quite a long time to look through all of them.
7166
06:00:36,200 --> 06:00:39,160
And so being able to use these sort of local search algorithms
7167
06:00:39,160 --> 06:00:42,600
can often be quite good for trying to find the best solution we can do.
7168
06:00:42,600 --> 06:00:45,440
And especially if we don't care about doing the best possible
7169
06:00:45,440 --> 06:00:47,800
and we just care about doing pretty good and finding
7170
06:00:47,800 --> 06:00:50,160
a pretty good placement of those hospitals,
7171
06:00:50,160 --> 06:00:53,240
then these methods can be particularly powerful.
7172
06:00:53,240 --> 06:00:56,080
But of course, we can try and mitigate some of this concern
7173
06:00:56,080 --> 06:00:59,520
by instead of using hill climbing to use random restart,
7174
06:00:59,520 --> 06:01:02,200
this idea of rather than just hill climb one time,
7175
06:01:02,200 --> 06:01:04,280
we can hill climb multiple times and say,
7176
06:01:04,280 --> 06:01:07,280
try hill climbing a whole bunch of times on the exact same map
7177
06:01:07,280 --> 06:01:10,320
and figure out what is the best one that we've been able to find.
7178
06:01:10,320 --> 06:01:14,440
And so I've here implemented a function for random restart
7179
06:01:14,440 --> 06:01:17,600
that restarts some maximum number of times.
7180
06:01:17,600 --> 06:01:22,280
And what we're going to do is repeat that number of times this process of just
7181
06:01:22,280 --> 06:01:24,380
go ahead and run the hill climbing algorithm,
7182
06:01:24,380 --> 06:01:28,000
figure out what the cost is of getting from all the houses to the hospitals,
7183
06:01:28,000 --> 06:01:31,640
and then figure out is this better than we've done so far.
7184
06:01:31,640 --> 06:01:35,120
So I can try this exact same idea where instead of running hill climbing,
7185
06:01:35,120 --> 06:01:37,400
I'll go ahead and run random restart.
7186
06:01:37,400 --> 06:01:41,240
And I'll randomly restart maybe 20 times, for example.
7187
06:01:41,240 --> 06:01:44,240
And we'll go ahead and now I'll remove all the images
7188
06:01:44,240 --> 06:01:46,280
and then rerun the program.
7189
06:01:46,280 --> 06:01:49,200
And now we started by finding an initial state.
7190
06:01:49,200 --> 06:01:51,280
When we initially ran hill climbing, the best cost
7191
06:01:51,280 --> 06:01:53,000
we were able to find was 56.
7192
06:01:53,000 --> 06:01:56,960
Each of these iterations is a different iteration of the hill climbing
7193
06:01:56,960 --> 06:01:57,460
algorithm.
7194
06:01:57,460 --> 06:02:00,400
We're running hill climbing not one time, but 20 times here,
7195
06:02:00,400 --> 06:02:04,400
each time going until we find a local minimum in this case.
7196
06:02:04,400 --> 06:02:06,840
And we look and see each time did we do better
7197
06:02:06,840 --> 06:02:09,080
than we did the best time we've done so far.
7198
06:02:09,080 --> 06:02:11,180
So we went from 56 to 46.
7199
06:02:11,180 --> 06:02:12,720
This one was greater, so we ignored it.
7200
06:02:12,720 --> 06:02:16,440
This one was 41, which was less, so we went ahead and kept that one.
7201
06:02:16,440 --> 06:02:18,800
And for all of the remaining 16 times that we
7202
06:02:18,800 --> 06:02:21,860
tried to implement hill climbing and we tried to run the hill climbing
7203
06:02:21,860 --> 06:02:25,120
algorithm, we couldn't do any better than that 41.
7204
06:02:25,120 --> 06:02:28,000
Again, maybe there is a way to do better that we just didn't find,
7205
06:02:28,000 --> 06:02:31,760
but it looks like that way ended up being a pretty good solution
7206
06:02:31,760 --> 06:02:32,440
to the problem.
7207
06:02:32,440 --> 06:02:36,880
That was attempt number three, counting from zero.
7208
06:02:36,880 --> 06:02:39,720
So we can take a look at that, open up number three.
7209
06:02:39,720 --> 06:02:42,880
And this was the state that happened to have a cost of 41,
7210
06:02:42,880 --> 06:02:45,360
that after running the hill climbing algorithm
7211
06:02:45,360 --> 06:02:48,720
on some particular random initial configuration of hospitals,
7212
06:02:48,720 --> 06:02:51,600
this is what we found was the local minimum in terms
7213
06:02:51,600 --> 06:02:53,040
of trying to minimize the cost.
7214
06:02:53,040 --> 06:02:54,800
And it looks like we did pretty well.
7215
06:02:54,800 --> 06:02:56,980
This hospital is pretty close to this region.
7216
06:02:56,980 --> 06:02:58,860
This one is pretty close to these houses here.
7217
06:02:58,860 --> 06:03:01,120
This hospital looks about as good as we can do
7218
06:03:01,120 --> 06:03:03,760
for trying to capture those houses over on that side.
7219
06:03:03,760 --> 06:03:06,400
And so these sorts of algorithms can be quite useful
7220
06:03:06,400 --> 06:03:09,200
for trying to solve these problems.
7221
06:03:09,200 --> 06:03:12,400
But the real problem with many of these different types of hill climbing,
7222
06:03:12,400 --> 06:03:15,200
steepest-ascent, stochastic, first-choice, and so forth,
7223
06:03:15,200 --> 06:03:18,720
is that they never make a move that makes our situation worse.
7224
06:03:18,720 --> 06:03:21,360
They're always going to take our current state,
7225
06:03:21,360 --> 06:03:24,600
look at the neighbors, and consider can we do better than our current state
7226
06:03:24,600 --> 06:03:26,080
and move to one of those neighbors.
7227
06:03:26,080 --> 06:03:29,080
Which of those neighbors we choose might vary among these various different
7228
06:03:29,080 --> 06:03:32,560
types of algorithms, but we never go from a current position
7229
06:03:32,560 --> 06:03:35,560
to a position that is worse than our current position.
7230
06:03:35,560 --> 06:03:37,800
And ultimately, that's what we're going to need to do
7231
06:03:37,800 --> 06:03:40,920
if we want to be able to find a global maximum or a global minimum.
7232
06:03:40,920 --> 06:03:42,800
Because sometimes if we get stuck, we want
7233
06:03:42,800 --> 06:03:46,000
to find some way of dislodging ourselves from our local maximum
7234
06:03:46,000 --> 06:03:50,000
or local minimum in order to find the global maximum or the global minimum
7235
06:03:50,000 --> 06:03:52,840
or increase the probability that we do find it.
7236
06:03:52,840 --> 06:03:54,640
And so the most popular technique for trying
7237
06:03:54,640 --> 06:03:57,400
to approach the problem from that angle is a technique known
7238
06:03:57,400 --> 06:04:00,120
as simulated annealing, simulated because it's modeled
7239
06:04:00,120 --> 06:04:03,800
after a real physical process of annealing, where you can think about this
7240
06:04:03,800 --> 06:04:06,480
in terms of physics, a physical situation where
7241
06:04:06,480 --> 06:04:08,320
you have some system of particles.
7242
06:04:08,320 --> 06:04:10,200
And you might imagine that when you heat up
7243
06:04:10,200 --> 06:04:12,760
a particular physical system, there's a lot of energy there.
7244
06:04:12,760 --> 06:04:14,680
Things are moving around quite randomly.
7245
06:04:14,680 --> 06:04:17,640
But over time, as the system cools down, it eventually
7246
06:04:17,640 --> 06:04:20,280
settles into some final position.
7247
06:04:20,280 --> 06:04:23,220
And that's going to be the general idea of simulated annealing.
7248
06:04:23,220 --> 06:04:27,240
We're going to simulate that process of some high temperature system where
7249
06:04:27,240 --> 06:04:29,680
things are moving around randomly quite frequently,
7250
06:04:29,680 --> 06:04:32,920
but over time decreasing that temperature until we eventually
7251
06:04:32,920 --> 06:04:35,040
settle at our ultimate solution.
7252
06:04:35,040 --> 06:04:38,240
And the idea is going to be if we have some state space landscape that
7253
06:04:38,240 --> 06:04:42,400
looks like this and we begin at its initial state here,
7254
06:04:42,400 --> 06:04:44,680
if we're looking for a global maximum and we're
7255
06:04:44,680 --> 06:04:46,960
trying to maximize the value of the state,
7256
06:04:46,960 --> 06:04:50,160
our traditional hill climbing algorithms would just take the state
7257
06:04:50,160 --> 06:04:52,160
and look at the two neighbors and always
7258
06:04:52,160 --> 06:04:55,960
pick the one that is going to increase the value of the state.
7259
06:04:55,960 --> 06:04:58,880
But if we want some chance of being able to find the global maximum,
7260
06:04:58,880 --> 06:05:01,540
we can't always make good moves.
7261
06:05:01,540 --> 06:05:04,720
We have to sometimes make bad moves and allow ourselves
7262
06:05:04,720 --> 06:05:08,320
to make a move in a direction that actually seems for now
7263
06:05:08,320 --> 06:05:11,000
to make our situation worse such that later we
7264
06:05:11,000 --> 06:05:14,560
can find our way up to that global maximum in terms
7265
06:05:14,560 --> 06:05:16,160
of trying to solve that problem.
7266
06:05:16,160 --> 06:05:18,200
Of course, once we get up to this global maximum,
7267
06:05:18,200 --> 06:05:20,000
once we've done a whole lot of the searching,
7268
06:05:20,000 --> 06:05:22,360
then we probably don't want to be moving to states
7269
06:05:22,360 --> 06:05:24,080
that are worse than our current state.
7270
06:05:24,080 --> 06:05:26,200
And so this is where this metaphor for annealing
7271
06:05:26,200 --> 06:05:30,120
starts to come in, where we want to start making more random moves
7272
06:05:30,120 --> 06:05:33,440
and over time start to make fewer of those random moves based
7273
06:05:33,440 --> 06:05:36,160
on a particular temperature schedule.
7274
06:05:36,160 --> 06:05:38,240
So the basic outline looks something like this.
7275
06:05:38,240 --> 06:05:42,520
Early on in simulated annealing, we have a higher temperature state.
7276
06:05:42,520 --> 06:05:44,920
And what we mean by a higher temperature state
7277
06:05:44,920 --> 06:05:47,520
is that we are more likely to accept neighbors that
7278
06:05:47,520 --> 06:05:49,200
are worse than our current state.
7279
06:05:49,200 --> 06:05:50,520
We might look at our neighbors.
7280
06:05:50,520 --> 06:05:53,000
And if one of our neighbors is worse than the current state,
7281
06:05:53,000 --> 06:05:54,920
especially if it's not all that much worse,
7282
06:05:54,920 --> 06:05:57,200
if it's pretty close but just slightly worse,
7283
06:05:57,200 --> 06:05:59,960
then we might be more likely to accept that and go ahead
7284
06:05:59,960 --> 06:06:02,120
and move to that neighbor anyways.
7285
06:06:02,120 --> 06:06:04,560
But later on as we run simulated annealing,
7286
06:06:04,560 --> 06:06:06,440
we're going to decrease that temperature.
7287
06:06:06,440 --> 06:06:10,280
And at a lower temperature, we're going to be less likely to accept neighbors
7288
06:06:10,280 --> 06:06:12,800
that are worse than our current state.
7289
06:06:12,800 --> 06:06:15,300
Now to formalize this and put a little bit of pseudocode to it,
7290
06:06:15,300 --> 06:06:17,120
here is what that algorithm might look like.
7291
06:06:17,120 --> 06:06:19,080
We have a function called simulated annealing
7292
06:06:19,080 --> 06:06:21,560
that takes as input the problem we're trying to solve
7293
06:06:21,560 --> 06:06:24,320
and also potentially some maximum number of times
7294
06:06:24,320 --> 06:06:27,600
we might want to run the simulated annealing process, how many different
7295
06:06:27,600 --> 06:06:29,300
neighbors we're going to try and look for.
7296
06:06:29,300 --> 06:06:33,200
And that value is going to vary based on the problem you're trying to solve.
7297
06:06:33,200 --> 06:06:34,880
We'll, again, start with some current state
7298
06:06:34,880 --> 06:06:37,320
that will be equal to the initial state of the problem.
7299
06:06:37,320 --> 06:06:40,760
But now we need to repeat this process over and over
7300
06:06:40,760 --> 06:06:42,600
for max number of times.
7301
06:06:42,600 --> 06:06:45,880
Repeat some process some number of times where we're first
7302
06:06:45,880 --> 06:06:48,120
going to calculate a temperature.
7303
06:06:48,120 --> 06:06:51,160
And this temperature function takes the current time t
7304
06:06:51,160 --> 06:06:53,440
starting at 1 going all the way up to max
7305
06:06:53,440 --> 06:06:57,360
and then gives us some temperature that we can use in our computation,
7306
06:06:57,360 --> 06:07:01,120
where the idea is that this temperature is going to be higher early on
7307
06:07:01,120 --> 06:07:02,840
and it's going to be lower later on.
7308
06:07:02,840 --> 06:07:05,760
So there are a number of ways this temperature function could work.
7309
06:07:05,760 --> 06:07:07,680
One of the simplest ways is just to say it
7310
06:07:07,680 --> 06:07:10,760
is like the proportion of time that we still have remaining.
7311
06:07:10,760 --> 06:07:14,040
Out of max units of time, how much time do we have remaining?
7312
06:07:14,040 --> 06:07:16,160
You start off with a lot of that time remaining.
7313
06:07:16,160 --> 06:07:18,580
And as time goes on, the temperature is going to decrease
7314
06:07:18,580 --> 06:07:22,440
because you have less and less of that remaining time still available to you.
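This linear cooling schedule, where the temperature is just the fraction of allotted time still remaining, can be written directly. It is one common choice among many; the function name is illustrative.

```python
def temperature(t, maximum):
    """Linear cooling: the fraction of the allotted iterations remaining.

    Near 1 early on (t small), falling toward 0 as t approaches maximum.
    """
    return (maximum - t) / maximum
```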
7315
06:07:22,440 --> 06:07:25,200
So we calculate a temperature for the current time.
7316
06:07:25,200 --> 06:07:28,240
And then we pick a random neighbor of the current state.
7317
06:07:28,240 --> 06:07:31,360
No longer are we going to be picking the best neighbor that we possibly can
7318
06:07:31,360 --> 06:07:33,440
or just one of the better neighbors that we can.
7319
06:07:33,440 --> 06:07:34,840
We're going to pick a random neighbor.
7320
06:07:34,840 --> 06:07:35,500
It might be better.
7321
06:07:35,500 --> 06:07:36,280
It might be worse.
7322
06:07:36,280 --> 06:07:37,240
But we're going to calculate that.
7323
06:07:37,240 --> 06:07:40,900
We're going to calculate delta E, E for energy in this case,
7324
06:07:40,900 --> 06:07:45,320
which is just how much better is the neighbor than the current state.
7325
06:07:45,320 --> 06:07:47,840
So if delta E is positive, that means the neighbor
7326
06:07:47,840 --> 06:07:49,360
is better than our current state.
7327
06:07:49,360 --> 06:07:51,760
If delta E is negative, that means the neighbor
7328
06:07:51,760 --> 06:07:53,840
is worse than our current state.
7329
06:07:53,840 --> 06:07:56,120
And so we can then have a condition that looks like this.
7330
06:07:56,120 --> 06:07:59,720
If delta E is greater than 0, that means the neighbor state
7331
06:07:59,720 --> 06:08:01,920
is better than our current state.
7332
06:08:01,920 --> 06:08:05,760
And if ever that situation arises, we'll just go ahead and update current
7333
06:08:05,760 --> 06:08:06,560
to be that neighbor.
7334
06:08:06,560 --> 06:08:09,720
Same as before, move where we are currently to be the neighbor
7335
06:08:09,720 --> 06:08:11,920
because the neighbor is better than our current state.
7336
06:08:11,920 --> 06:08:13,240
We'll go ahead and accept that.
7337
06:08:13,240 --> 06:08:16,240
But now the difference is that whereas before, we never,
7338
06:08:16,240 --> 06:08:19,160
ever wanted to take a move that made our situation worse,
7339
06:08:19,160 --> 06:08:22,360
now we sometimes want to make a move that is actually
7340
06:08:22,360 --> 06:08:24,560
going to make our situation worse because sometimes we're
7341
06:08:24,560 --> 06:08:27,920
going to need to dislodge ourselves from a local minimum or local maximum
7342
06:08:27,920 --> 06:08:31,360
to increase the probability that we're able to find the global minimum
7343
06:08:31,360 --> 06:08:34,120
or the global maximum a little bit later.
7344
06:08:34,120 --> 06:08:35,120
And so how do we do that?
7345
06:08:35,120 --> 06:08:39,520
How do we decide to sometimes accept some state that might actually be worse?
7346
06:08:39,520 --> 06:08:43,160
Well, we're going to accept a worse state with some probability.
7347
06:08:43,160 --> 06:08:46,000
And that probability needs to be based on a couple of factors.
7348
06:08:46,000 --> 06:08:49,080
It needs to be based in part on the temperature,
7349
06:08:49,080 --> 06:08:52,320
where if the temperature is higher, we're more likely to move to a worse
7350
06:08:52,320 --> 06:08:52,920
neighbor.
7351
06:08:52,920 --> 06:08:56,680
And if the temperature is lower, we're less likely to move to a worse neighbor.
7352
06:08:56,680 --> 06:09:00,560
But it also, to some degree, should be based on delta E.
7353
06:09:00,560 --> 06:09:03,480
If the neighbor is much worse than the current state,
7354
06:09:03,480 --> 06:09:05,920
we probably want to be less likely to choose that
7355
06:09:05,920 --> 06:09:09,640
than if the neighbor is just a little bit worse than the current state.
7356
06:09:09,640 --> 06:09:12,080
So again, there are a couple of ways you could calculate this.
7357
06:09:12,080 --> 06:09:14,320
But it turns out one of the most popular is just
7358
06:09:14,320 --> 06:09:19,360
to calculate e to the power of delta E over T, where e is just a constant.
7359
06:09:19,360 --> 06:09:22,960
Delta E and T here are the energy difference and the temperature.
7360
06:09:22,960 --> 06:09:24,560
We calculate that value.
7361
06:09:24,560 --> 06:09:26,760
And that'll be some value between 0 and 1.
7362
06:09:26,760 --> 06:09:29,720
And that is the probability with which we should just say, all right,
7363
06:09:29,720 --> 06:09:31,220
let's go ahead and move to that neighbor.
7364
06:09:31,220 --> 06:09:33,560
And it turns out that if you do the math for this value,
7365
06:09:33,560 --> 06:09:36,240
when delta E is such that the neighbor is not
7366
06:09:36,240 --> 06:09:38,240
that much worse than the current state, that's
7367
06:09:38,240 --> 06:09:41,120
going to be more likely that we're going to go ahead and move to that state.
7368
06:09:41,120 --> 06:09:43,040
And likewise, when the temperature is lower,
7369
06:09:43,040 --> 06:09:47,120
we're going to be less likely to move to that neighboring state as well.
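The pseudocode just described can be turned into a runnable sketch. This applies it to a toy one-dimensional maximization problem; the `value`, `neighbors`, and `temperature` helpers are illustrative assumptions, not the lecture's own code.

```python
import math
import random

def value(x):
    # Toy landscape to maximize, with many local maxima.
    return -abs(x - 60) + 10 * math.sin(x)

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 100]

def temperature(t, maximum):
    # Higher early on, lower later: a linear cooling schedule.
    return (maximum - t) / maximum

def simulated_annealing(start, maximum=1000):
    current = start
    for t in range(1, maximum + 1):
        T = temperature(t, maximum)
        if T <= 0:
            break
        # Pick a random neighbor -- it might be better, it might be worse.
        neighbor = random.choice(neighbors(current))
        delta_e = value(neighbor) - value(current)  # positive => neighbor better
        # Always accept a better neighbor; accept a worse one only with
        # probability e^(delta_e / T), which shrinks as T falls.
        if delta_e > 0 or random.random() < math.exp(delta_e / T):
            current = neighbor
    return current
```

Note how the two factors from the transcript show up in `math.exp(delta_e / T)`: a large negative `delta_e` (a much worse neighbor) or a small `T` (late in the run) both drive the acceptance probability toward 0.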
7370
06:09:47,120 --> 06:09:49,800
So now this is the big picture for simulated annealing,
7371
06:09:49,800 --> 06:09:53,040
this process of taking the problem and going ahead and generating
7372
06:09:53,040 --> 06:09:55,240
random neighbors. We'll always move to a neighbor
7373
06:09:55,240 --> 06:09:56,960
if it's better than our current state.
7374
06:09:56,960 --> 06:09:59,520
But even if the neighbor is worse than our current state,
7375
06:09:59,520 --> 06:10:03,040
we'll sometimes move there depending on how much worse it is
7376
06:10:03,040 --> 06:10:04,680
and also based on the temperature.
7377
06:10:04,680 --> 06:10:07,600
And as a result, the hope, the goal of this whole process
7378
06:10:07,600 --> 06:10:11,640
is that as we begin to try and find our way to the global maximum
7379
06:10:11,640 --> 06:10:14,160
or the global minimum, we can dislodge ourselves
7380
06:10:14,160 --> 06:10:17,040
if we ever get stuck at a local maximum or local minimum
7381
06:10:17,040 --> 06:10:19,660
in order to eventually make our way to exploring
7382
06:10:19,660 --> 06:10:22,080
the part of the state space that is going to be the best.
7383
06:10:22,080 --> 06:10:25,600
And then as the temperature decreases, eventually we settle there
7384
06:10:25,600 --> 06:10:27,840
without moving around too much from what we've
7385
06:10:27,840 --> 06:10:31,440
found to be the globally best thing that we can do thus far.
7386
06:10:31,440 --> 06:10:35,320
So at the very end, we just return whatever the current state happens to be.
7387
06:10:35,320 --> 06:10:37,520
And that is the conclusion of this algorithm.
7388
06:10:37,520 --> 06:10:40,600
We've been able to figure out what the solution is.
7389
06:10:40,600 --> 06:10:44,000
And these types of algorithms have a lot of different applications.
7390
06:10:44,000 --> 06:10:46,400
Any time you can take a problem and formulate it
7391
06:10:46,400 --> 06:10:49,760
as something where you can explore a particular configuration
7392
06:10:49,760 --> 06:10:51,940
and then ask, are any of the neighbors better
7393
06:10:51,940 --> 06:10:54,920
than this current configuration and have some way of measuring that,
7394
06:10:54,920 --> 06:10:58,440
then there is an applicable case for these hill climbing, simulated annealing
7395
06:10:58,440 --> 06:10:59,800
types of algorithms.
7396
06:10:59,800 --> 06:11:02,800
So sometimes it can be for facility location type problems,
7397
06:11:02,800 --> 06:11:05,080
like for when you're trying to plan a city and figure out
7398
06:11:05,080 --> 06:11:06,480
where the hospitals should be.
7399
06:11:06,480 --> 06:11:08,720
But there are definitely other applications as well.
7400
06:11:08,720 --> 06:11:11,200
And one of the most famous problems in computer science
7401
06:11:11,200 --> 06:11:13,240
is the traveling salesman problem.
7402
06:11:13,240 --> 06:11:16,240
Traveling salesman problem generally is formulated like this.
7403
06:11:16,240 --> 06:11:19,360
I have a whole bunch of cities here indicated by these dots.
7404
06:11:19,360 --> 06:11:22,000
And what I'd like to do is find some route that
7405
06:11:22,000 --> 06:11:25,600
takes me through all of the cities and ends up back where I started.
7406
06:11:25,600 --> 06:11:29,120
So some route that starts here, goes through all these cities,
7407
06:11:29,120 --> 06:11:32,080
and ends up back where I originally started.
7408
06:11:32,080 --> 06:11:35,800
And what I might like to do is minimize the total distance
7409
06:11:35,800 --> 06:11:40,040
that I have to travel or the total cost of taking this entire path.
7410
06:11:40,040 --> 06:11:43,720
And you can imagine this is a problem that's very applicable in situations
7411
06:11:43,720 --> 06:11:46,840
like when delivery companies are trying to deliver things
7412
06:11:46,840 --> 06:11:48,640
to a whole bunch of different houses, they
7413
06:11:48,640 --> 06:11:51,040
want to figure out, how do I get from the warehouse
7414
06:11:51,040 --> 06:11:53,980
to all these various different houses and get back again,
7415
06:11:53,980 --> 06:11:57,920
all while using as little time, distance, and energy as possible.
7416
06:11:57,920 --> 06:12:00,840
So you might want to try to solve these sorts of problems.
7417
06:12:00,840 --> 06:12:03,800
But it turns out that solving this particular kind of problem
7418
06:12:03,800 --> 06:12:05,680
is very computationally difficult.
7419
06:12:05,680 --> 06:12:09,320
It is a very computationally expensive task to be able to figure it out.
7420
06:12:09,320 --> 06:12:12,920
This falls under the category of what are known as NP-complete problems,
7421
06:12:12,920 --> 06:12:16,100
problems for which there is no known efficient way to solve
7422
06:12:16,100 --> 06:12:17,560
these sorts of problems.
7423
06:12:17,560 --> 06:12:21,400
And so what we ultimately have to do is come up with some approximation,
7424
06:12:21,400 --> 06:12:25,040
some ways of trying to find a good solution, even if we're not
7425
06:12:25,040 --> 06:12:27,960
going to find the globally best solution that we possibly can,
7426
06:12:27,960 --> 06:12:30,920
at least not in a feasible or tractable amount of time.
7427
06:12:30,920 --> 06:12:34,040
And so what we could do is take the traveling salesman problem
7428
06:12:34,040 --> 06:12:38,160
and try to formulate it using local search and ask a question like, all right,
7429
06:12:38,160 --> 06:12:41,680
I can pick some state, some configuration, some route between all
7430
06:12:41,680 --> 06:12:42,800
of these nodes.
7431
06:12:42,800 --> 06:12:46,040
And I can measure the cost of that state, figure out what the distance is.
7432
06:12:46,040 --> 06:12:49,960
And I might now want to try to minimize that cost as much as possible.
7433
06:12:49,960 --> 06:12:51,920
And then the only question now is, what does it
7434
06:12:51,920 --> 06:12:54,080
mean to have a neighbor of this state?
7435
06:12:54,080 --> 06:12:55,920
What does it mean to take this particular route
7436
06:12:55,920 --> 06:12:59,040
and have some neighboring route that is close to it but slightly different
7437
06:12:59,040 --> 06:13:01,480
and such that it might have a different total distance?
7438
06:13:01,480 --> 06:13:03,440
And there are a number of different definitions
7439
06:13:03,440 --> 06:13:07,000
for what a neighbor of a traveling salesman configuration might look like.
7440
06:13:07,000 --> 06:13:09,280
But one way is just to say, a neighbor is
7441
06:13:09,280 --> 06:13:13,760
what happens if we pick two of these edges between nodes
7442
06:13:13,760 --> 06:13:16,640
and switch them effectively.
7443
06:13:16,640 --> 06:13:19,120
So for example, I might pick these two edges here,
7444
06:13:19,120 --> 06:13:23,200
these two that just happen to cross; this node goes here, this node goes there,
7445
06:13:23,200 --> 06:13:24,880
and go ahead and switch them.
7446
06:13:24,880 --> 06:13:26,920
And what that process will generally look like
7447
06:13:26,920 --> 06:13:31,080
is removing both of these edges from the graph, taking this node,
7448
06:13:31,080 --> 06:13:33,640
and connecting it to the node it wasn't connected to.
7449
06:13:33,640 --> 06:13:35,720
So connecting it up here instead.
7450
06:13:35,720 --> 06:13:37,800
We'll need to take these arrows that were originally
7451
06:13:37,800 --> 06:13:40,880
going this way and reverse them, so move them going the other way,
7452
06:13:40,880 --> 06:13:42,960
and then just fill in that last remaining blank,
7453
06:13:42,960 --> 06:13:45,360
add an arrow that goes in that direction instead.
7454
06:13:45,360 --> 06:13:48,400
So by taking two edges and just switching them,
7455
06:13:48,400 --> 06:13:51,480
I have been able to consider one possible neighbor
7456
06:13:51,480 --> 06:13:53,080
of this particular configuration.
7457
06:13:53,080 --> 06:13:55,160
And it looks like this neighbor is actually better.
7458
06:13:55,160 --> 06:13:57,960
It looks like this probably travels a shorter distance in order
7459
06:13:57,960 --> 06:14:00,480
to get through all the cities through this route
7460
06:14:00,480 --> 06:14:02,080
than the current state did.
7461
06:14:02,080 --> 06:14:05,720
And so you could imagine implementing this idea inside of a hill climbing
7462
06:14:05,720 --> 06:14:08,640
or simulated annealing algorithm, where we repeat this process
7463
06:14:08,640 --> 06:14:11,720
to try and take a state of this traveling salesman problem,
7464
06:14:11,720 --> 06:14:14,720
look at all the neighbors, and then move to the neighbors if they're better,
7465
06:14:14,720 --> 06:14:16,920
or maybe even move to the neighbors if they're worse,
7466
06:14:16,920 --> 06:14:20,120
until we eventually settle upon some best solution
7467
06:14:20,120 --> 06:14:21,760
that we've been able to find.
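This edge-swapping neighbor move (often called 2-opt) can be sketched as reversing a segment of the tour: that removes two edges, reconnects the endpoints differently, and flips the arrows in between, exactly as described. The coordinates and helper names below are made-up illustration data, not the lecture's code.

```python
import math
import random

def tour_length(tour, coords):
    """Total round-trip distance visiting cities in `tour` order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def two_opt_neighbor(tour, i, j):
    """Reverse tour[i..j]: swaps two edges and reverses the path between them."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def hill_climb_tsp(coords, iterations=1000):
    """Repeatedly try random 2-opt moves, keeping any that shorten the tour."""
    tour = list(range(len(coords)))
    random.shuffle(tour)
    for _ in range(iterations):
        i, j = sorted(random.sample(range(len(coords)), 2))
        candidate = two_opt_neighbor(tour, i, j)
        if tour_length(candidate, coords) < tour_length(tour, coords):
            tour = candidate
    return tour
```

For example, on four cities at the corners of a unit square, a tour that crosses the diagonals is longer than the perimeter, and a single 2-opt reversal uncrosses it.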
7468
06:14:21,760 --> 06:14:24,280
And it turns out that these types of approximation algorithms,
7469
06:14:24,280 --> 06:14:26,840
even if they don't always find the very best solution,
7470
06:14:26,840 --> 06:14:32,120
can often do pretty well at trying to find solutions that are helpful too.
7471
06:14:32,120 --> 06:14:36,240
So that then was a look at local search, a particular category of algorithms
7472
06:14:36,240 --> 06:14:38,640
that can be used for solving a particular type of problem,
7473
06:14:38,640 --> 06:14:41,160
where we don't really care about the path to the solution.
7474
06:14:41,160 --> 06:14:43,280
I didn't care about the steps I took to decide
7475
06:14:43,280 --> 06:14:44,760
where the hospitals should go.
7476
06:14:44,760 --> 06:14:46,600
I just cared about the solution itself.
7477
06:14:46,600 --> 06:14:49,040
I just care about where the hospitals should be,
7478
06:14:49,040 --> 06:14:53,520
or what the route through the traveling salesman journey really ought to be.
7479
06:14:53,520 --> 06:14:55,720
Another type of algorithm that might come up
7480
06:14:55,720 --> 06:14:59,120
falls under the category of linear programming problems.
7481
06:14:59,120 --> 06:15:01,640
And linear programming often comes up in the context
7482
06:15:01,640 --> 06:15:04,960
where we're trying to optimize for some mathematical function.
7483
06:15:04,960 --> 06:15:07,640
But oftentimes, linear programming will come up
7484
06:15:07,640 --> 06:15:10,000
when we might have real numbered values.
7485
06:15:10,000 --> 06:15:13,000
So it's not just discrete fixed values that we might have,
7486
06:15:13,000 --> 06:15:16,240
but any decimal values that we might want to be able to calculate.
7487
06:15:16,240 --> 06:15:19,680
And so linear programming is a family of types of problems
7488
06:15:19,680 --> 06:15:22,400
where we might have a situation that looks like this, where
7489
06:15:22,400 --> 06:15:26,640
the goal of linear programming is to minimize a cost function.
7490
06:15:26,640 --> 06:15:29,080
And you can invert the numbers and say try and maximize it,
7491
06:15:29,080 --> 06:15:32,560
but often we'll frame it as trying to minimize a cost function that
7492
06:15:32,560 --> 06:15:36,920
has some number of variables, x1, x2, x3, all the way up to xn,
7493
06:15:36,920 --> 06:15:38,840
just some number of variables that are involved,
7494
06:15:38,840 --> 06:15:41,320
things that I want to know the values to.
7495
06:15:41,320 --> 06:15:43,520
And this cost function might have coefficients
7496
06:15:43,520 --> 06:15:45,000
in front of those variables.
7497
06:15:45,000 --> 06:15:47,520
And this is what we would call a linear equation,
7498
06:15:47,520 --> 06:15:50,240
where we just have all of these variables that might be multiplied
7499
06:15:50,240 --> 06:15:52,040
by a coefficient and then added together.
7500
06:15:52,040 --> 06:15:53,880
We're not going to square anything or cube anything,
7501
06:15:53,880 --> 06:15:56,040
because that'll give us different types of equations.
7502
06:15:56,040 --> 06:15:59,760
With linear programming, we're just dealing with linear equations
7503
06:15:59,760 --> 06:16:03,800
in addition to linear constraints, where a constraint is going
7504
06:16:03,800 --> 06:16:07,440
to look something like if we sum up this particular equation that
7505
06:16:07,440 --> 06:16:10,400
is just some linear combination of all of these variables,
7506
06:16:10,400 --> 06:16:13,280
it is less than or equal to some bound b.
7507
06:16:13,280 --> 06:16:16,400
And we might have a whole number of these various different constraints
7508
06:16:16,400 --> 06:16:21,280
that we might place onto our linear programming exercise.
7509
06:16:21,280 --> 06:16:24,400
And likewise, just as we can have constraints that are saying this linear
7510
06:16:24,400 --> 06:16:27,160
equation is less than or equal to some bound b,
7511
06:16:27,160 --> 06:16:28,760
it might also be equal to something.
7512
06:16:28,760 --> 06:16:31,400
That if you want some sum of some combination of variables
7513
06:16:31,400 --> 06:16:33,960
to be equal to a value, you can specify that.
7514
06:16:33,960 --> 06:16:37,840
And we can also maybe specify that each variable has lower and upper bounds,
7515
06:16:37,840 --> 06:16:39,960
that it needs to be a positive number, for example,
7516
06:16:39,960 --> 06:16:42,960
or it needs to be a number that is less than 50, for example.
7517
06:16:42,960 --> 06:16:44,800
And there are a number of other choices that we
7518
06:16:44,800 --> 06:16:47,920
can make there for defining what the bounds of a variable are.
7519
06:16:47,920 --> 06:16:50,200
But it turns out that if you can take a problem
7520
06:16:50,200 --> 06:16:54,560
and formulate it in these terms, formulate the problem as your goal
7521
06:16:54,560 --> 06:16:56,800
is to minimize a cost function, and you're
7522
06:16:56,800 --> 06:17:00,440
minimizing that cost function subject to particular constraints,
7523
06:17:00,440 --> 06:17:03,880
subject to equations of the form shown here, where some sequence
7524
06:17:03,880 --> 06:17:07,840
of variables is less than a bound or is equal to some particular value,
7525
06:17:07,840 --> 06:17:10,320
then there are a number of algorithms that already
7526
06:17:10,320 --> 06:17:13,960
exist for solving these sorts of problems.
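As a concrete sketch of handing such a formulation to an existing solver, here is the factory example from this lecture expressed for SciPy's linprog (using SciPy is an assumption; any LP solver would do). Minimize 50·x1 + 80·x2 subject to 5·x1 + 2·x2 ≤ 20 (labor) and 10·x1 + 12·x2 ≥ 90 (output); the ≥ constraint is negated to fit linprog's ≤ convention.

```python
from scipy.optimize import linprog

result = linprog(
    c=[50, 80],                     # cost function coefficients to minimize
    A_ub=[[5, 2],                   # labor: 5*x1 + 2*x2 <= 20
          [-10, -12]],              # output: 10*x1 + 12*x2 >= 90, negated
    b_ub=[20, -90],
    bounds=[(0, None), (0, None)],  # hours can't be negative
)

print(result.x)    # optimal hours to run each machine
print(result.fun)  # the minimized total cost
```

The solver explores the constraint boundaries for us, so once a problem is phrased as a linear cost plus linear constraints, no hand-written search is needed.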
7527
06:17:13,960 --> 06:17:16,480
So let's go ahead and take a look at an example.
7528
06:17:16,480 --> 06:17:18,400
Here's an example of a problem that might come up
7529
06:17:18,400 --> 06:17:19,880
in the world of linear programming.
7530
06:17:19,880 --> 06:17:21,600
Often, this is going to come up when we're
7531
06:17:21,600 --> 06:17:23,320
trying to optimize for something.
7532
06:17:23,320 --> 06:17:25,360
And we want to be able to do some calculations,
7533
06:17:25,360 --> 06:17:27,880
and we have constraints on what we're trying to optimize.
7534
06:17:27,880 --> 06:17:29,760
And so it might be something like this.
7535
06:17:29,760 --> 06:17:34,040
In the context of a factory, we have two machines, x1 and x2.
7536
06:17:34,040 --> 06:17:36,480
x1 costs $50 an hour to run.
7537
06:17:36,480 --> 06:17:38,880
x2 costs $80 an hour to run.
7538
06:17:38,880 --> 06:17:41,960
And our goal, what we're trying to do, our objective,
7539
06:17:41,960 --> 06:17:45,040
is to minimize the total cost.
7540
06:17:45,040 --> 06:17:46,560
So that's what we'd like to do.
7541
06:17:46,560 --> 06:17:49,560
But we need to do so subject to certain constraints.
7542
06:17:49,560 --> 06:17:51,640
So there might be a labor constraint that x1
7543
06:17:51,640 --> 06:17:56,720
requires five units of labor per hour, x2 requires two units of labor per hour,
7544
06:17:56,720 --> 06:18:00,080
and we have a total of 20 units of labor that we have to spend.
7545
06:18:00,080 --> 06:18:01,040
So this is a constraint.
7546
06:18:01,040 --> 06:18:04,800
We have no more than 20 units of labor that we can spend,
7547
06:18:04,800 --> 06:18:08,120
and we have to spend it across x1 and x2, each of which
7548
06:18:08,120 --> 06:18:10,840
requires a different amount of labor.
7549
06:18:10,840 --> 06:18:13,120
And we might also have a constraint like this
7550
06:18:13,120 --> 06:18:16,760
that tells us x1 is going to produce 10 units of output per hour,
7551
06:18:16,760 --> 06:18:19,800
x2 is going to produce 12 units of output per hour,
7552
06:18:19,800 --> 06:18:22,640
and the company needs 90 units of output.
7553
06:18:22,640 --> 06:18:24,760
So we have some goal, something we need to achieve.
7554
06:18:24,760 --> 06:18:28,240
We need to achieve 90 units of output, but there are some constraints
7555
06:18:28,240 --> 06:18:31,060
that x1 can only produce 10 units of output per hour,
7556
06:18:31,060 --> 06:18:34,040
x2 produces 12 units of output per hour.
7557
06:18:34,040 --> 06:18:36,560
These types of problems come up quite frequently,
7558
06:18:36,560 --> 06:18:39,360
and you can start to notice patterns in these types of problems,
7559
06:18:39,360 --> 06:18:43,280
problems where I am trying to optimize for some goal, minimizing cost,
7560
06:18:43,280 --> 06:18:46,800
maximizing output, maximizing profits, or something like that.
7561
06:18:46,800 --> 06:18:50,040
And there are constraints that are placed on that process.
7562
06:18:50,040 --> 06:18:52,520
And so now we just need to formulate this problem
7563
06:18:52,520 --> 06:18:55,120
in terms of linear equations.
7564
06:18:55,120 --> 06:18:56,560
So let's start with this first point.
7565
06:18:56,560 --> 06:19:01,760
Two machines, x1 and x2; x1 costs $50 an hour, x2 costs $80 an hour.
7566
06:19:01,760 --> 06:19:05,360
Here we can come up with an objective function that might look like this.
7567
06:19:05,360 --> 06:19:07,280
This is our cost function, rather.
7568
06:19:07,280 --> 06:19:11,160
50 times x1 plus 80 times x2, where x1 is going
7569
06:19:11,160 --> 06:19:15,680
to be a variable representing how many hours do we run machine x1 for,
7570
06:19:15,680 --> 06:19:18,280
x2 is going to be a variable representing how many hours
7571
06:19:18,280 --> 06:19:20,280
are we running machine x2 for.
7572
06:19:20,280 --> 06:19:23,760
And what we're trying to minimize is this cost function, which
7573
06:19:23,760 --> 06:19:27,640
is just how much it costs to run each of these machines per hour summed up.
7574
06:19:27,640 --> 06:19:31,080
This is an example of a linear equation, just some combination
7575
06:19:31,080 --> 06:19:34,360
of these variables plus coefficients that are placed in front of them.
7576
06:19:34,360 --> 06:19:37,040
And I would like to minimize that total value.
7577
06:19:37,040 --> 06:19:40,360
But I need to do so subject to these constraints.
7578
06:19:40,360 --> 06:19:44,200
x1 requires 5 units of labor per hour, x2 requires 2,
7579
06:19:44,200 --> 06:19:46,800
and we have a total of 20 units of labor to spend.
7580
06:19:46,800 --> 06:19:50,200
And so that gives us a constraint of this form.
7581
06:19:50,200 --> 06:19:54,600
5 times x1 plus 2 times x2 is less than or equal to 20.
7582
06:19:54,600 --> 06:19:57,680
20 is the total number of units of labor we have to spend.
7583
06:19:57,680 --> 06:20:00,600
And that's spent across x1 and x2, each of which
7584
06:20:00,600 --> 06:20:05,120
requires a different number of units of labor per hour, for example.
7585
06:20:05,120 --> 06:20:07,360
And finally, we have this constraint here.
7586
06:20:07,360 --> 06:20:10,840
x1 produces 10 units of output per hour, x2 produces 12,
7587
06:20:10,840 --> 06:20:13,640
and we need 90 units of output.
7588
06:20:13,640 --> 06:20:15,920
And so this might look something like this.
7589
06:20:15,920 --> 06:20:20,080
That 10x1 plus 12x2, the total amount of output,
7590
06:20:20,080 --> 06:20:21,920
it needs to be at least 90.
7591
06:20:21,920 --> 06:20:25,240
We can do better than 90, but it needs to be at least 90.
7592
06:20:25,240 --> 06:20:27,320
And if you recall from my formulation before,
7593
06:20:27,320 --> 06:20:29,760
I said that generally speaking in linear programming,
7594
06:20:29,760 --> 06:20:33,520
we deal with equals constraints or less than or equal to constraints.
7595
06:20:33,520 --> 06:20:35,520
So we have a greater than or equal to sign here.
7596
06:20:35,520 --> 06:20:36,320
That's not a problem.
7597
06:20:36,320 --> 06:20:38,240
Whenever we have a greater than or equal to sign,
7598
06:20:38,240 --> 06:20:40,640
we can just multiply the equation by negative 1,
7599
06:20:40,640 --> 06:20:44,040
and that'll flip it around to a less than or equal to negative 90,
7600
06:20:44,040 --> 06:20:47,440
for example, instead of a greater than or equal to 90.
7601
06:20:47,440 --> 06:20:49,440
And that's going to be an equivalent expression
7602
06:20:49,440 --> 06:20:51,920
that we can use to represent this problem.
7603
06:20:51,920 --> 06:20:55,840
So now that we have this cost function and these constraints
7604
06:20:55,840 --> 06:20:58,920
that it's subject to, it turns out there are a number of algorithms
7605
06:20:58,920 --> 06:21:02,080
that can be used in order to solve these types of problems.
7606
06:21:02,080 --> 06:21:05,400
And these problems go a little bit more into geometry and linear algebra
7607
06:21:05,400 --> 06:21:06,840
than we're really going to get into.
7608
06:21:06,840 --> 06:21:09,640
But the most popular of these types of algorithms
7609
06:21:09,640 --> 06:21:12,640
are simplex, which was one of the first algorithms discovered
7610
06:21:12,640 --> 06:21:14,720
for trying to solve linear programs.
7611
06:21:14,720 --> 06:21:17,680
And later on, a class of interior point algorithms
7612
06:21:17,680 --> 06:21:20,240
can be used to solve this type of problem as well.
7613
06:21:20,240 --> 06:21:23,120
The key is not to understand exactly how these algorithms work,
7614
06:21:23,120 --> 06:21:27,080
but to realize that these algorithms exist for efficiently finding solutions
7615
06:21:27,080 --> 06:21:30,400
any time we have a problem of this particular form.
7616
06:21:30,400 --> 06:21:39,760
And so we can take a look, for example, at the production directory here,
7617
06:21:39,760 --> 06:21:43,560
where here I have a file called production.py, where here I'm
7618
06:21:43,560 --> 06:21:47,920
using scipy, which is a library for a lot of science-related functions
7619
06:21:47,920 --> 06:21:49,000
within Python.
7620
06:21:49,000 --> 06:21:52,760
And I can go ahead and just run this optimization function
7621
06:21:52,760 --> 06:21:54,560
in order to run a linear program.
7622
06:21:54,560 --> 06:21:58,000
.linprog here is going to try and solve this linear program for me,
7623
06:21:58,000 --> 06:22:01,720
where I provide to this expression, to this function call,
7624
06:22:01,720 --> 06:22:03,520
all of the data about my linear program.
7625
06:22:03,520 --> 06:22:05,480
So it needs to be in a particular format, which
7626
06:22:05,480 --> 06:22:07,200
might be a little confusing at first.
7627
06:22:07,200 --> 06:22:11,000
But this first argument to scipy.optimize.linprog
7628
06:22:11,000 --> 06:22:15,040
is the cost function, which is in this case just an array or a list that
7629
06:22:15,040 --> 06:22:20,200
has 50 and 80, because my original cost function was 50 times x1 plus 80
7630
06:22:20,200 --> 06:22:21,280
times x2.
7631
06:22:21,280 --> 06:22:25,040
So I just tell Python, 50 and 80, those are the coefficients
7632
06:22:25,040 --> 06:22:27,960
that I am now trying to optimize for.
7633
06:22:27,960 --> 06:22:30,920
And then I provide all of the constraints.
7634
06:22:30,920 --> 06:22:33,560
So the constraints, and I wrote them up above in comments,
7635
06:22:33,560 --> 06:22:39,280
is the constraint 1 is 5x1 plus 2x2 is less than or equal to 20.
7636
06:22:39,280 --> 06:22:44,600
And constraint 2 is negative 10x1 plus negative 12x2
7637
06:22:44,600 --> 06:22:47,120
is less than or equal to negative 90.
7638
06:22:47,120 --> 06:22:51,440
And so scipy expects these constraints to be in a particular format.
7639
06:22:51,440 --> 06:22:54,680
It first expects me to provide all of the coefficients
7640
06:22:54,680 --> 06:22:58,440
for the upper bound equations, ub just for upper bound,
7641
06:22:58,440 --> 06:23:00,480
where the coefficients of the first equation
7642
06:23:00,480 --> 06:23:03,960
are 5 and 2, because we have 5x1 and 2x2.
7643
06:23:03,960 --> 06:23:06,120
And the coefficients for the second equation
7644
06:23:06,120 --> 06:23:12,560
are negative 10 and negative 12, because I have negative 10x1 plus negative 12x2.
7645
06:23:12,560 --> 06:23:14,880
And then here, we provide it as a separate argument,
7646
06:23:14,880 --> 06:23:17,520
just to keep things separate, what the actual bound is.
7647
06:23:17,520 --> 06:23:20,160
What is the upper bound for each of these constraints?
7648
06:23:20,160 --> 06:23:22,440
Well, for the first constraint, the upper bound is 20.
7649
06:23:22,440 --> 06:23:24,120
That was constraint number 1.
7650
06:23:24,120 --> 06:23:28,200
And then for constraint number 2, the upper bound is negative 90.
7651
06:23:28,200 --> 06:23:30,240
So a bit of a cryptic way of representing it.
7652
06:23:30,240 --> 06:23:33,680
It's not quite as simple as just writing the mathematical equations.
7653
06:23:33,680 --> 06:23:36,800
What really is being expected here are all of the coefficients
7654
06:23:36,800 --> 06:23:39,000
and all of the numbers that are in these equations
7655
06:23:39,000 --> 06:23:42,120
by first providing the coefficients for the cost function,
7656
06:23:42,120 --> 06:23:45,880
then providing all the coefficients for the inequality constraints,
7657
06:23:45,880 --> 06:23:50,560
and then providing all of the upper bounds for those inequality constraints.
7658
06:23:50,560 --> 06:23:52,880
And once all of that information is there,
7659
06:23:52,880 --> 06:23:57,080
then we can run any of these interior point algorithms or the simplex algorithm.
7660
06:23:57,080 --> 06:23:59,000
Even if you don't understand how it works,
7661
06:23:59,000 --> 06:24:02,640
you can just run the function and figure out what the result should be.
7662
06:24:02,640 --> 06:24:04,520
And here, I said if the result is a success,
7663
06:24:04,520 --> 06:24:06,520
we were able to solve this problem.
7664
06:24:06,520 --> 06:24:10,640
Go ahead and print out what the value of x1 and x2 should be.
7665
06:24:10,640 --> 06:24:13,440
Otherwise, go ahead and print out no solution.
7666
06:24:13,440 --> 06:24:19,760
And so if I run this program by running python production.py,
7667
06:24:19,760 --> 06:24:21,440
it takes a second to calculate.
7668
06:24:21,440 --> 06:24:24,520
But then we see here is what the optimal solution should be.
7669
06:24:24,520 --> 06:24:26,960
x1 should run for 1.5 hours.
7670
06:24:26,960 --> 06:24:30,080
x2 should run for 6.25 hours.
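A sketch of the production.py program described here, assuming scipy is installed; the exact file contents aren't shown in the transcript, so the structure below is a reconstruction from the description:

```python
# Minimize 50*x1 + 80*x2 subject to:
#   5*x1 + 2*x2 <= 20      (no more than 20 units of labor)
#   10*x1 + 12*x2 >= 90    (flipped to -10*x1 - 12*x2 <= -90)
import scipy.optimize

result = scipy.optimize.linprog(
    [50, 80],                   # cost function coefficients: 50x1 + 80x2
    A_ub=[[5, 2], [-10, -12]],  # coefficients of the inequality constraints
    b_ub=[20, -90],             # upper bounds for those constraints
)

if result.success:
    print(f"x1: {round(result.x[0], 2)} hours")
    print(f"x2: {round(result.x[1], 2)} hours")
else:
    print("No solution")
```

Running this prints the optimum described above: x1 runs for 1.5 hours and x2 for 6.25 hours.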
7671
06:24:30,080 --> 06:24:33,160
And we were able to do this by just formulating the problem
7672
06:24:33,160 --> 06:24:36,120
as a linear equation that we were trying to optimize,
7673
06:24:36,120 --> 06:24:38,200
some cost that we were trying to minimize,
7674
06:24:38,200 --> 06:24:40,440
and then some constraints that were placed on that.
7675
06:24:40,440 --> 06:24:43,600
And many, many problems fall into this category of problems
7676
06:24:43,600 --> 06:24:47,400
that you can solve if you can just figure out how to use equations
7677
06:24:47,400 --> 06:24:51,200
and use these constraints to represent that general idea.
7678
06:24:51,200 --> 06:24:53,400
And that's a theme that's going to come up a couple of times today,
7679
06:24:53,400 --> 06:24:55,600
where we want to be able to take some problem
7680
06:24:55,600 --> 06:24:57,640
and reduce it down to some problem we know
7681
06:24:57,640 --> 06:25:01,920
how to solve, so that we can then
7682
06:25:01,920 --> 06:25:04,400
use existing methods
7683
06:25:04,400 --> 06:25:08,120
to find a solution more effectively or more efficiently.
7684
06:25:08,120 --> 06:25:11,400
And it turns out that these types of problems, where we have constraints,
7685
06:25:11,400 --> 06:25:13,040
show up in other ways too.
7686
06:25:13,040 --> 06:25:16,320
And there's an entire class of problems that's more generally just known
7687
06:25:16,320 --> 06:25:18,880
as constraint satisfaction problems.
7688
06:25:18,880 --> 06:25:21,600
And we're going to now take a look at how you might formulate a constraint
7689
06:25:21,600 --> 06:25:24,640
satisfaction problem and how you might go about solving a constraint
7690
06:25:24,640 --> 06:25:26,000
satisfaction problem.
7691
06:25:26,000 --> 06:25:28,920
But the basic idea of a constraint satisfaction problem
7692
06:25:28,920 --> 06:25:32,400
is we have some number of variables that need to take on some values.
7693
06:25:32,400 --> 06:25:35,720
And we need to figure out what values each of those variables should take on.
7694
06:25:35,720 --> 06:25:39,240
But those variables are subject to particular constraints
7695
06:25:39,240 --> 06:25:43,440
that are going to limit what values those variables can actually take on.
7696
06:25:43,440 --> 06:25:46,560
So let's take a look at a real-world example.
7697
06:25:46,560 --> 06:25:48,520
Let's look at exam scheduling, that I have
7698
06:25:48,520 --> 06:25:51,440
four students here, students 1, 2, 3, and 4.
7699
06:25:51,440 --> 06:25:53,960
Each of them is taking some number of different classes.
7700
06:25:53,960 --> 06:25:56,440
Classes here are going to be represented by letters.
7701
06:25:56,440 --> 06:26:00,760
So student 1 is enrolled in courses A, B, and C. Student 2
7702
06:26:00,760 --> 06:26:04,480
is enrolled in courses B, D, and E, so on and so forth.
7703
06:26:04,480 --> 06:26:07,240
And now, say a university, for example, is trying
7704
06:26:07,240 --> 06:26:10,080
to schedule exams for all of these courses.
7705
06:26:10,080 --> 06:26:13,960
But there are only three exam slots on Monday, Tuesday, and Wednesday.
7706
06:26:13,960 --> 06:26:17,240
And we have to schedule an exam for each of these courses.
7707
06:26:17,240 --> 06:26:19,280
But the constraint now, the constraint we
7708
06:26:19,280 --> 06:26:21,160
have to deal with with the scheduling, is
7709
06:26:21,160 --> 06:26:25,160
that we don't want anyone to have to take two exams on the same day.
7710
06:26:25,160 --> 06:26:29,560
We would like to try and minimize that or eliminate it if at all possible.
7711
06:26:29,560 --> 06:26:31,720
So how do we begin to represent this idea?
7712
06:26:31,720 --> 06:26:35,920
How do we structure this in a way that a computer with an AI algorithm
7713
06:26:35,920 --> 06:26:37,760
can begin to try and solve the problem?
7714
06:26:37,760 --> 06:26:41,240
Well, let's in particular just look at these classes that we might take
7715
06:26:41,240 --> 06:26:45,920
and represent each of the courses as some node inside of a graph.
7716
06:26:45,920 --> 06:26:49,560
And what we'll do is we'll create an edge between two nodes in this graph
7717
06:26:49,560 --> 06:26:54,360
if there is a constraint between those two nodes.
7718
06:26:54,360 --> 06:26:55,440
So what does this mean?
7719
06:26:55,440 --> 06:26:59,840
Well, we can start with student 1, who's enrolled in courses A, B, and C.
7720
06:26:59,840 --> 06:27:03,880
What that means is that A and B can't have an exam at the same time.
7721
06:27:03,880 --> 06:27:06,280
A and C can't have an exam at the same time.
7722
06:27:06,280 --> 06:27:09,160
And B and C also can't have an exam at the same time.
7723
06:27:09,160 --> 06:27:12,200
And I can represent that in this graph by just drawing edges.
7724
06:27:12,200 --> 06:27:15,200
One edge between A and B, one between B and C,
7725
06:27:15,200 --> 06:27:18,960
and then one between C and A. And that encodes now the idea
7726
06:27:18,960 --> 06:27:21,680
that between those nodes, there is a constraint.
7727
06:27:21,680 --> 06:27:23,800
And in particular, the constraint happens to be
7728
06:27:23,800 --> 06:27:25,760
that these two can't be equal to each other,
7729
06:27:25,760 --> 06:27:28,080
though there are other types of constraints that are possible,
7730
06:27:28,080 --> 06:27:31,240
depending on the type of problem that you're trying to solve.
7731
06:27:31,240 --> 06:27:34,000
And then we can do the same thing for each of the other students.
7732
06:27:34,000 --> 06:27:36,920
So for student 2, who's enrolled in courses B, D, and E,
7733
06:27:36,920 --> 06:27:39,080
well, that means B, D, and E, those all need
7734
06:27:39,080 --> 06:27:41,240
to have edges that connect each other as well.
7735
06:27:41,240 --> 06:27:44,520
Student 3 is enrolled in courses C, E, and F. So we'll go ahead
7736
06:27:44,520 --> 06:27:48,640
and take C, E, and F and connect those by drawing edges between them too.
7737
06:27:48,640 --> 06:27:52,240
And then finally, student 4 is enrolled in courses E, F, and G.
7738
06:27:52,240 --> 06:27:55,400
And we can represent that by drawing edges between E, F, and G,
7739
06:27:55,400 --> 06:27:57,440
although E and F already had an edge between them.
7740
06:27:57,440 --> 06:27:59,520
We don't need another one, because this constraint
7741
06:27:59,520 --> 06:28:03,400
is just encoding the idea that course E and course F cannot have
7742
06:28:03,400 --> 06:28:05,640
an exam on the same day.
7743
06:28:05,640 --> 06:28:09,360
So this then is what we might call the constraint graph.
7744
06:28:09,360 --> 06:28:13,040
There's some graphical representation of all of my variables,
7745
06:28:13,040 --> 06:28:16,960
so to speak, and the constraints between those possible variables.
7746
06:28:16,960 --> 06:28:19,800
Where in this particular case, each of the constraints
7747
06:28:19,800 --> 06:28:23,560
represents an inequality constraint, that an edge between B and D
7748
06:28:23,560 --> 06:28:27,040
means whatever value the variable B takes on cannot be the value
7749
06:28:27,040 --> 06:28:30,440
that the variable D takes on as well.
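That constraint graph can be built mechanically from the enrollments; here's a minimal sketch (the data comes from the example above, but the variable names are my own):

```python
# Build the exam-scheduling constraint graph: one node per course,
# one edge per pair of courses that share at least one student.
from itertools import combinations

enrollments = {
    1: {"A", "B", "C"},
    2: {"B", "D", "E"},
    3: {"C", "E", "F"},
    4: {"E", "F", "G"},
}

edges = set()
for courses in enrollments.values():
    # Any two courses with a common student can't share an exam slot.
    for x, y in combinations(sorted(courses), 2):
        edges.add(frozenset((x, y)))

print(sorted(tuple(sorted(edge)) for edge in edges))
```

Note that the shared E-F edge appears only once in the set, matching the observation that a duplicate edge adds no new information.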
7750
06:28:30,440 --> 06:28:33,720
So what then actually is a constraint satisfaction problem?
7751
06:28:33,720 --> 06:28:38,000
Well, a constraint satisfaction problem is just some set of variables, x1
7752
06:28:38,000 --> 06:28:42,360
all the way through xn, some set of domains for each of those variables.
7753
06:28:42,360 --> 06:28:45,120
So every variable needs to take on some values.
7754
06:28:45,120 --> 06:28:47,040
Maybe every variable has the same domain,
7755
06:28:47,040 --> 06:28:49,640
but maybe each variable has a slightly different domain.
7756
06:28:49,640 --> 06:28:52,800
And then there's a set of constraints, and we'll just call a set C,
7757
06:28:52,800 --> 06:28:55,600
that is some constraints that are placed upon these variables,
7758
06:28:55,600 --> 06:28:58,000
like x1 is not equal to x2.
7759
06:28:58,000 --> 06:29:02,120
But there could be other forms too, like maybe x1 equals x2 plus 1
7760
06:29:02,120 --> 06:29:05,760
if these variables are taking on numerical values in their domain,
7761
06:29:05,760 --> 06:29:06,400
for example.
7762
06:29:06,400 --> 06:29:10,720
The types of constraints are going to vary based on the types of problems.
7763
06:29:10,720 --> 06:29:14,080
And constraint satisfaction shows up all over the place as well,
7764
06:29:14,080 --> 06:29:16,400
in any situation where we have variables that
7765
06:29:16,400 --> 06:29:19,200
are subject to particular constraints.
7766
06:29:19,200 --> 06:29:23,200
So one popular game is Sudoku, for example, this 9 by 9 grid
7767
06:29:23,200 --> 06:29:25,600
where you need to fill in numbers in each of these cells,
7768
06:29:25,600 --> 06:29:29,880
but you want to make sure there's never a duplicate number in any row,
7769
06:29:29,880 --> 06:29:34,240
or in any column, or in any grid of 3 by 3 cells, for example.
7770
06:29:34,240 --> 06:29:37,840
So what might this look like as a constraint satisfaction problem?
7771
06:29:37,840 --> 06:29:41,880
Well, my variables are all of the empty squares in the puzzle.
7772
06:29:41,880 --> 06:29:45,560
Each is represented here as an (x, y) coordinate, for example,
7773
06:29:45,560 --> 06:29:48,080
as all of the squares where I need to plug in a value,
7774
06:29:48,080 --> 06:29:50,600
where I don't know what value it should take on.
7775
06:29:50,600 --> 06:29:54,760
The domain is just going to be all of the numbers from 1 through 9,
7776
06:29:54,760 --> 06:29:57,360
any value that I could fill in to one of these cells.
7777
06:29:57,360 --> 06:30:00,200
So that is going to be the domain for each of these variables.
7778
06:30:00,200 --> 06:30:02,800
And then the constraints are going to be of the form,
7779
06:30:02,800 --> 06:30:05,760
like this cell can't be equal to this cell, can't be equal to this cell,
7780
06:30:05,760 --> 06:30:08,360
and so on, and all of these need to be different, for example,
7781
06:30:08,360 --> 06:30:12,760
and same for all of the rows, and the columns, and the 3 by 3 squares as well.
7782
06:30:12,760 --> 06:30:17,920
So those constraints are going to enforce what values are actually allowed.
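Enumerating those Sudoku not-equal constraints is mechanical; a sketch, assuming (row, column) coordinates for the cells:

```python
# Generate Sudoku's binary "not equal" constraints: every pair of cells
# sharing a row, a column, or a 3x3 box must take on different values.
from itertools import combinations

cells = [(row, col) for row in range(9) for col in range(9)]

constraints = set()
for a, b in combinations(cells, 2):
    same_row = a[0] == b[0]
    same_col = a[1] == b[1]
    same_box = (a[0] // 3, a[1] // 3) == (b[0] // 3, b[1] // 3)
    if same_row or same_col or same_box:
        constraints.add((a, b))

# Each cell conflicts with 20 others (8 in its row, 8 in its column,
# and 4 more in its box), giving 81 * 20 / 2 = 810 constraints.
print(len(constraints))
```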
7783
06:30:17,920 --> 06:30:21,240
And we can formulate the same idea in the case of this exam scheduling
7784
06:30:21,240 --> 06:30:25,800
problem, where the variables we have are the different courses, a up through g.
7785
06:30:25,800 --> 06:30:29,560
The domain for each of these variables is going to be Monday, Tuesday,
7786
06:30:29,560 --> 06:30:30,120
and Wednesday.
7787
06:30:30,120 --> 06:30:33,560
Those are the possible values each of the variables can take on,
7788
06:30:33,560 --> 06:30:38,120
that in this case just represent when is the exam for that class.
7789
06:30:38,120 --> 06:30:41,600
And then the constraints are of this form, a is not equal to b,
7790
06:30:41,600 --> 06:30:45,920
a is not equal to c, meaning a and b can't have an exam on the same day,
7791
06:30:45,920 --> 06:30:48,240
a and c can't have an exam on the same day.
7792
06:30:48,240 --> 06:30:53,040
Or more formally, these two variables cannot take on the same value
7793
06:30:53,040 --> 06:30:56,080
within their domain.
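With only seven variables and three values each, this particular scheduling problem is small enough to check by brute force; a sketch using the constraints from the constraint graph above (smarter algorithms come later, this is just to make the formulation concrete):

```python
# Brute-force the exam-scheduling CSP: 7 courses, 3 days,
# and a not-equal constraint for every edge in the constraint graph.
from itertools import product

variables = ["A", "B", "C", "D", "E", "F", "G"]
domain = ["Monday", "Tuesday", "Wednesday"]

constraints = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("B", "D"), ("B", "E"), ("D", "E"),
    ("C", "E"), ("C", "F"), ("E", "F"),
    ("E", "G"), ("F", "G"),
]

solution = None
for values in product(domain, repeat=len(variables)):
    assignment = dict(zip(variables, values))
    if all(assignment[x] != assignment[y] for x, y in constraints):
        solution = assignment
        break

print(solution)
```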
7794
06:30:56,080 --> 06:31:00,040
So that then is this formulation of a constraint satisfaction problem
7795
06:31:00,040 --> 06:31:03,160
that we can begin to use to try and solve this problem.
7796
06:31:03,160 --> 06:31:05,400
And constraints can come in a number of different forms.
7797
06:31:05,400 --> 06:31:07,800
There are hard constraints, which are constraints
7798
06:31:07,800 --> 06:31:10,280
that must be satisfied for a correct solution.
7799
06:31:10,280 --> 06:31:14,240
So something like in the Sudoku puzzle, you cannot have this cell
7800
06:31:14,240 --> 06:31:17,360
and this cell that are in the same row take on the same value.
7801
06:31:17,360 --> 06:31:18,960
That is a hard constraint.
7802
06:31:18,960 --> 06:31:21,080
But problems can also have soft constraints,
7803
06:31:21,080 --> 06:31:24,040
where these are constraints that express some notion of preference,
7804
06:31:24,040 --> 06:31:27,840
that maybe a and b can't have an exam on the same day,
7805
06:31:27,840 --> 06:31:32,200
but maybe someone has a preference that a's exam is earlier than b's exam.
7806
06:31:32,200 --> 06:31:34,520
It doesn't need to be the case; it's just some expression
7807
06:31:34,520 --> 06:31:37,560
that some solution is better than another solution.
7808
06:31:37,560 --> 06:31:39,680
And in that case, you might formulate the problem
7809
06:31:39,680 --> 06:31:43,000
as trying to optimize for maximizing people's preferences.
7810
06:31:43,000 --> 06:31:46,880
You want people's preferences to be satisfied as much as possible.
7811
06:31:46,880 --> 06:31:49,840
In this case, though, we'll mostly just deal with hard constraints,
7812
06:31:49,840 --> 06:31:54,280
constraints that must be met in order to have a correct solution to the problem.
7813
06:31:54,280 --> 06:31:57,600
So we want to figure out some assignment of these variables
7814
06:31:57,600 --> 06:32:00,080
to their particular values that is ultimately
7815
06:32:00,080 --> 06:32:02,360
going to give us a solution to the problem
7816
06:32:02,360 --> 06:32:05,760
by allowing us to assign some day to each of the classes
7817
06:32:05,760 --> 06:32:09,240
such that we don't have any conflicts between classes.
7818
06:32:09,240 --> 06:32:11,960
So it turns out that we can classify the constraints
7819
06:32:11,960 --> 06:32:16,200
in a constraint satisfaction problem into a number of different categories.
7820
06:32:16,200 --> 06:32:18,440
The first of those categories are perhaps the simplest
7821
06:32:18,440 --> 06:32:21,880
of the types of constraints, which are known as unary constraints,
7822
06:32:21,880 --> 06:32:26,200
where unary constraint is a constraint that just involves a single variable.
7823
06:32:26,200 --> 06:32:28,680
For example, a unary constraint might be something like,
7824
06:32:28,680 --> 06:32:33,360
a does not equal Monday, meaning Course A cannot have its exam on Monday.
7825
06:32:33,360 --> 06:32:35,320
If for some reason the instructor for the course
7826
06:32:35,320 --> 06:32:38,440
isn't available on Monday, you might have a constraint in your problem
7827
06:32:38,440 --> 06:32:41,640
that looks like this, something that just has a single variable a in it,
7828
06:32:41,640 --> 06:32:44,480
and maybe says a is not equal to Monday, or a is equal to something,
7829
06:32:44,480 --> 06:32:47,280
or in the case of numbers greater than or less than something,
7830
06:32:47,280 --> 06:32:51,920
a constraint that just has one variable, we consider to be a unary constraint.
7831
06:32:51,920 --> 06:32:55,280
And this is in contrast to something like a binary constraint, which
7832
06:32:55,280 --> 06:32:58,320
is a constraint that involves two variables, for example.
7833
06:32:58,320 --> 06:33:01,440
So this would be a constraint like the ones we were looking at before.
7834
06:33:01,440 --> 06:33:06,680
Something like a does not equal b is an example of a binary constraint,
7835
06:33:06,680 --> 06:33:10,560
because it is a constraint that has two variables involved in it, a and b.
7836
06:33:10,560 --> 06:33:14,880
And we represented that using some arc or some edge that
7837
06:33:14,880 --> 06:33:17,960
connects variable a to variable b.
7838
06:33:17,960 --> 06:33:20,440
And using this knowledge of, OK, what is a unary constraint?
7839
06:33:20,440 --> 06:33:21,880
What is a binary constraint?
7840
06:33:21,880 --> 06:33:23,640
There are different types of things we can
7841
06:33:23,640 --> 06:33:27,000
say about a particular constraint satisfaction problem.
7842
06:33:27,000 --> 06:33:31,600
And one thing we can say is we can try and make the problem node consistent.
7843
06:33:31,600 --> 06:33:33,360
So what does node consistency mean?
7844
06:33:33,360 --> 06:33:36,800
Node consistency means that we have all of the values
7845
06:33:36,800 --> 06:33:41,480
in a variable's domain satisfying that variable's unary constraints.
7846
06:33:41,480 --> 06:33:45,120
So for each of the variables inside of our constraint satisfaction problem,
7847
06:33:45,120 --> 06:33:48,840
if all of the values satisfy the unary constraints
7848
06:33:48,840 --> 06:33:53,040
for that particular variable, we can say that the entire problem is node
7849
06:33:53,040 --> 06:33:56,040
consistent, or we can even say that a particular variable is
7850
06:33:56,040 --> 06:34:00,680
node consistent if we just want to make one node consistent within itself.
7851
06:34:00,680 --> 06:34:02,320
So what does that actually look like?
7852
06:34:02,320 --> 06:34:04,480
Let's now look at a simplified example, where
7853
06:34:04,480 --> 06:34:06,520
instead of having a whole bunch of different classes,
7854
06:34:06,520 --> 06:34:09,640
we just have two classes, a and b, each of which
7855
06:34:09,640 --> 06:34:12,360
has an exam on either Monday or Tuesday or Wednesday.
7856
06:34:12,360 --> 06:34:14,640
So this is the domain for the variable a,
7857
06:34:14,640 --> 06:34:17,160
and this is the domain for the variable b.
7858
06:34:17,160 --> 06:34:21,120
And now let's imagine we have these constraints, a not equal to Monday,
7859
06:34:21,120 --> 06:34:24,920
b not equal to Tuesday, b not equal to Monday, a not equal to b.
7860
06:34:24,920 --> 06:34:28,600
So those are the constraints that we have on this particular problem.
7861
06:34:28,600 --> 06:34:32,560
And what we can now try to do is enforce node consistency.
7862
06:34:32,560 --> 06:34:35,480
And node consistency just means we make sure
7863
06:34:35,480 --> 06:34:41,280
that all of the values for any variable's domain satisfy its unary constraints.
7864
06:34:41,280 --> 06:34:45,760
And so we could start by trying to make node a node consistent.
7865
06:34:45,760 --> 06:34:46,560
Is it consistent?
7866
06:34:46,560 --> 06:34:51,120
Does every value inside of a's domain satisfy its unary constraints?
7867
06:34:51,120 --> 06:34:55,800
Well, initially, we'll see that Monday does not satisfy a's unary constraints,
7868
06:34:55,800 --> 06:34:58,520
because we have a constraint, a unary constraint here,
7869
06:34:58,520 --> 06:35:00,640
that a is not equal to Monday.
7870
06:35:00,640 --> 06:35:03,240
But Monday is still in a's domain.
7871
06:35:03,240 --> 06:35:06,120
And so this is something that is not node consistent,
7872
06:35:06,120 --> 06:35:07,640
because we have Monday in the domain.
7873
06:35:07,640 --> 06:35:11,160
But this is not a valid value for this particular node.
7874
06:35:11,160 --> 06:35:13,400
And so how do we make this node consistent?
7875
06:35:13,400 --> 06:35:15,520
Well, to make the node consistent, what we'll do
7876
06:35:15,520 --> 06:35:18,840
is we'll just go ahead and remove Monday from a's domain.
7877
06:35:18,840 --> 06:35:21,240
Now a can only be on Tuesday or Wednesday,
7878
06:35:21,240 --> 06:35:25,400
because we had this constraint that said a is not equal to Monday.
7879
06:35:25,400 --> 06:35:28,520
And at this point now, a is node consistent.
7880
06:35:28,520 --> 06:35:31,680
For each of the values that a can take on, Tuesday and Wednesday,
7881
06:35:31,680 --> 06:35:36,640
there is no unary constraint that conflicts with that value.
7882
06:35:36,640 --> 06:35:39,120
There is no constraint that says that a can't be Tuesday.
7883
06:35:39,120 --> 06:35:43,000
There is no unary constraint that says that a cannot be on Wednesday.
7884
06:35:43,000 --> 06:35:44,800
And so now we can turn our attention to b.
7885
06:35:44,800 --> 06:35:47,520
b also has a domain, Monday, Tuesday, and Wednesday.
7886
06:35:47,520 --> 06:35:51,440
And we can begin to see whether those values satisfy
7887
06:35:51,440 --> 06:35:53,120
the unary constraints as well.
7888
06:35:53,120 --> 06:35:56,600
Well, here is a unary constraint, b is not equal to Tuesday.
7889
06:35:56,600 --> 06:35:59,800
And that does not appear to be satisfied by this domain of Monday, Tuesday,
7890
06:35:59,800 --> 06:36:03,160
and Wednesday, because Tuesday, this possible value
7891
06:36:03,160 --> 06:36:07,680
that the variable b could take on is not consistent with this unary constraint,
7892
06:36:07,680 --> 06:36:09,520
that b is not equal to Tuesday.
7893
06:36:09,520 --> 06:36:13,560
So to solve that problem, we'll go ahead and remove Tuesday from b's domain.
7894
06:36:13,560 --> 06:36:16,320
Now b's domain only contains Monday and Wednesday.
7895
06:36:16,320 --> 06:36:18,920
But as it turns out, there's yet another unary constraint
7896
06:36:18,920 --> 06:36:21,600
that we placed on the variable b, which is here.
7897
06:36:21,600 --> 06:36:23,840
b is not equal to Monday.
7898
06:36:23,840 --> 06:36:27,280
And that means that this value, Monday, inside of b's domain,
7899
06:36:27,280 --> 06:36:30,040
is not consistent with b's unary constraints,
7900
06:36:30,040 --> 06:36:33,120
because we have a constraint that says that b cannot be Monday.
7901
06:36:33,120 --> 06:36:35,400
And so we can remove Monday from b's domain.
7902
06:36:35,400 --> 06:36:38,600
And now we've made it through all of the unary constraints.
7903
06:36:38,600 --> 06:36:41,920
We've not yet considered this constraint, which is a binary constraint.
7904
06:36:41,920 --> 06:36:44,080
But we've considered all of the unary constraints,
7905
06:36:44,080 --> 06:36:47,360
all of the constraints that involve just a single variable.
7906
06:36:47,360 --> 06:36:51,960
And we've made sure that every node is consistent with those unary constraints.
7907
06:36:51,960 --> 06:36:55,640
So we can say that now we have enforced node consistency,
7908
06:36:55,640 --> 06:36:59,280
that for each of these possible nodes, we can pick any of these values
7909
06:36:59,280 --> 06:37:00,160
in the domain.
7910
06:37:00,160 --> 06:37:05,560
And there won't be a unary constraint that is violated as a result of it.
7911
06:37:05,560 --> 06:37:07,760
So node consistency is fairly easy to enforce.
7912
06:37:07,760 --> 06:37:10,540
We just take each node, make sure the values in the domain
7913
06:37:10,540 --> 06:37:12,400
satisfy the unary constraints.
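Enforcing node consistency on the two-class example above can be sketched like this; representing unary constraints as predicate functions is my own choice, not something shown in the lecture:

```python
# Enforce node consistency: drop from each variable's domain any value
# that violates one of that variable's unary constraints.
domains = {
    "A": {"Monday", "Tuesday", "Wednesday"},
    "B": {"Monday", "Tuesday", "Wednesday"},
}

# Unary constraints from the example: A != Monday, B != Tuesday, B != Monday.
unary = {
    "A": [lambda day: day != "Monday"],
    "B": [lambda day: day != "Tuesday", lambda day: day != "Monday"],
}

for var in domains:
    domains[var] = {
        value for value in domains[var]
        if all(constraint(value) for constraint in unary[var])
    }

print(domains)
```

After this loop, A's domain is Tuesday and Wednesday, and B's domain is just Wednesday, matching the walkthrough above.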
7914
06:37:12,400 --> 06:37:14,560
Where things get a little bit more interesting
7915
06:37:14,560 --> 06:37:17,480
is when we consider different types of consistency,
7916
06:37:17,480 --> 06:37:20,760
something like arc consistency, for example.
7917
06:37:20,760 --> 06:37:25,320
And arc consistency refers to when all of the values in a variable's domain
7918
06:37:25,320 --> 06:37:28,400
satisfy the variable's binary constraints.
7919
06:37:28,400 --> 06:37:31,640
So when we're looking at trying to make a arc consistent,
7920
06:37:31,640 --> 06:37:35,080
we're no longer just considering the unary constraints that involve a.
7921
06:37:35,080 --> 06:37:38,280
We're trying to consider all of the binary constraints
7922
06:37:38,280 --> 06:37:39,880
that involve a as well.
7923
06:37:39,880 --> 06:37:43,280
So any edge that connects a to another variable
7924
06:37:43,280 --> 06:37:47,560
inside of that constraint graph that we were taking a look at before.
7925
06:37:47,560 --> 06:37:50,360
Put a little bit more formally, arc consistency.
7926
06:37:50,360 --> 06:37:52,600
And arc really is just another word for an edge
7927
06:37:52,600 --> 06:37:55,840
that connects two of these nodes inside of our constraint graph.
7928
06:37:55,840 --> 06:37:59,040
We can define arc consistency a little more precisely like this.
7929
06:37:59,040 --> 06:38:03,520
In order to make some variable x arc consistent with respect
7930
06:38:03,520 --> 06:38:09,840
to some other variable y, we need to remove any element from x's domain
7931
06:38:09,840 --> 06:38:14,040
to make sure that every choice for x, every choice in x's domain,
7932
06:38:14,040 --> 06:38:17,440
has a possible choice for y.
7933
06:38:17,440 --> 06:38:19,200
So put another way, if I have a variable x
7934
06:38:19,200 --> 06:38:21,920
and I want to make x arc consistent, then
7935
06:38:21,920 --> 06:38:25,560
I'm going to look at all of the possible values that x can take on
7936
06:38:25,560 --> 06:38:28,200
and make sure that for all of those possible values,
7937
06:38:28,200 --> 06:38:31,440
there is still some choice that I can make for y,
7938
06:38:31,440 --> 06:38:34,800
if there's some arc between x and y, to make sure
7939
06:38:34,800 --> 06:38:39,320
that y has a possible option that I can choose as well.
7940
06:38:39,320 --> 06:38:42,800
So let's look at an example of that going back to this example from before.
7941
06:38:42,800 --> 06:38:45,720
We enforced node consistency already by saying
7942
06:38:45,720 --> 06:38:47,640
that a can only be on Tuesday or Wednesday
7943
06:38:47,640 --> 06:38:49,640
because we knew that a could not be on Monday.
7944
06:38:49,640 --> 06:38:51,960
And we also said that b's domain only
7945
06:38:51,960 --> 06:38:55,480
consists of Wednesday because we know that b does not equal Tuesday
7946
06:38:55,480 --> 06:38:58,400
and also b does not equal Monday.
7947
06:38:58,400 --> 06:39:01,440
So now let's begin to consider arc consistency.
7948
06:39:01,440 --> 06:39:05,000
Let's try and make a arc consistent with b.
7949
06:39:05,000 --> 06:39:08,560
And what that means is to make a arc consistent with respect to b
7950
06:39:08,560 --> 06:39:11,880
means that for any choice we make in a's domain,
7951
06:39:11,880 --> 06:39:16,520
there is some choice we can make in b's domain that is going to be consistent.
7952
06:39:16,520 --> 06:39:17,400
And we can try that.
7953
06:39:17,400 --> 06:39:20,680
For a, we can choose Tuesday as a possible value for a.
7954
06:39:20,680 --> 06:39:23,440
If I choose Tuesday for a, is there a value
7955
06:39:23,440 --> 06:39:26,360
for b that satisfies the binary constraint?
7956
06:39:26,360 --> 06:39:29,360
Well, yes, b equals Wednesday would satisfy this constraint
7957
06:39:29,360 --> 06:39:33,600
that a does not equal b because Tuesday does not equal Wednesday.
7958
06:39:33,600 --> 06:39:37,880
However, if we chose Wednesday for a, well, then
7959
06:39:37,880 --> 06:39:42,640
there is no choice in b's domain that satisfies this binary constraint.
7960
06:39:42,640 --> 06:39:47,320
There is no way I can choose something for b that satisfies a does not equal b
7961
06:39:47,320 --> 06:39:49,800
because I know b must be Wednesday.
7962
06:39:49,800 --> 06:39:52,080
And so if ever I run into a situation like this
7963
06:39:52,080 --> 06:39:55,480
where I see that here is a possible value for a such
7964
06:39:55,480 --> 06:39:59,600
that there is no choice of value for b that satisfies the binary constraint,
7965
06:39:59,600 --> 06:40:02,240
well, then this is not arc consistent.
7966
06:40:02,240 --> 06:40:05,560
And to make it arc consistent, I would need to take Wednesday
7967
06:40:05,560 --> 06:40:07,640
and remove it from a's domain.
7968
06:40:07,640 --> 06:40:11,240
Because Wednesday was not going to be a possible choice I can make for a
7969
06:40:11,240 --> 06:40:14,600
because it wasn't consistent with this binary constraint for b.
7970
06:40:14,600 --> 06:40:17,360
There was no way I could choose Wednesday for a
7971
06:40:17,360 --> 06:40:22,680
and still have an available solution by choosing something for b as well.
7972
06:40:22,680 --> 06:40:25,920
So here now, I've been able to enforce arc consistency.
7973
06:40:25,920 --> 06:40:28,320
And in doing so, I've actually solved this entire problem,
7974
06:40:28,320 --> 06:40:32,520
that given these constraints where a and b can have exams on either Monday
7975
06:40:32,520 --> 06:40:35,880
or Tuesday or Wednesday, the only solution, as it would appear,
7976
06:40:35,880 --> 06:40:40,400
is that a's exam must be on Tuesday and b's exam must be on Wednesday.
7977
06:40:40,400 --> 06:40:43,800
And that is the only option available to me.
7978
06:40:43,800 --> 06:40:46,720
So if we want to apply arc consistency to a larger graph,
7979
06:40:46,720 --> 06:40:49,600
not just looking at one particular pair of variables,
7980
06:40:49,600 --> 06:40:51,040
there are ways we can do that too.
7981
06:40:51,040 --> 06:40:53,880
And we can begin to formalize what the pseudocode would look like
7982
06:40:53,880 --> 06:40:57,400
for trying to write an algorithm that enforces arc consistency.
7983
06:40:57,400 --> 06:41:01,000
And we'll start by defining a function called revise.
7984
06:41:01,000 --> 06:41:03,800
Revise is going to take as input a CSP, otherwise
7985
06:41:03,800 --> 06:41:06,320
known as a constraint satisfaction problem,
7986
06:41:06,320 --> 06:41:08,960
and also two variables, x and y.
7987
06:41:08,960 --> 06:41:11,160
And what revise is going to do is it is going
7988
06:41:11,160 --> 06:41:15,240
to make x arc consistent with respect to y,
7989
06:41:15,240 --> 06:41:18,120
meaning remove anything from x's domain that
7990
06:41:18,120 --> 06:41:21,720
doesn't allow for a possible option for y.
7991
06:41:21,720 --> 06:41:22,800
How does this work?
7992
06:41:22,800 --> 06:41:25,120
Well, we'll go ahead and first keep track of whether or not
7993
06:41:25,120 --> 06:41:26,040
we've made a revision.
7994
06:41:26,040 --> 06:41:29,240
Revise is ultimately going to return true or false.
7995
06:41:29,240 --> 06:41:33,560
It'll return true in the event that we did make a revision to x's domain.
7996
06:41:33,560 --> 06:41:37,000
It'll return false if we didn't make any change to x's domain.
7997
06:41:37,000 --> 06:41:39,880
And we'll see in a moment why that's going to be helpful.
7998
06:41:39,880 --> 06:41:41,720
But we start by saying revised equals false.
7999
06:41:41,720 --> 06:41:43,920
We haven't made any changes.
8000
06:41:43,920 --> 06:41:46,560
Then we'll say, all right, let's go ahead and loop over all
8001
06:41:46,560 --> 06:41:49,040
of the possible values in x's domain.
8002
06:41:49,040 --> 06:41:53,200
So loop over x's domain for each little x in x's domain.
8003
06:41:53,200 --> 06:41:55,520
I want to make sure that for each of those choices,
8004
06:41:55,520 --> 06:42:00,040
I have some available choice in y that satisfies the binary constraints that
8005
06:42:00,040 --> 06:42:03,480
are defined inside of my CSP, inside of my constraint
8006
06:42:03,480 --> 06:42:05,040
satisfaction problem.
8007
06:42:05,040 --> 06:42:11,200
So if ever it's the case that there is no value y in y's domain that
8008
06:42:11,200 --> 06:42:15,760
satisfies the constraint for x and y, well, if that's the case,
8009
06:42:15,760 --> 06:42:19,840
that means that this value x shouldn't be in x's domain.
8010
06:42:19,840 --> 06:42:22,440
So we'll go ahead and delete x from x's domain.
8011
06:42:22,440 --> 06:42:26,000
And I'll set revised equal to true because I did change x's domain.
8012
06:42:26,000 --> 06:42:29,320
I changed x's domain by removing little x.
8013
06:42:29,320 --> 06:42:33,040
And I removed little x because it wasn't arc consistent.
8014
06:42:33,040 --> 06:42:35,720
There was no way I could choose a value for y
8015
06:42:35,720 --> 06:42:38,960
that would satisfy this xy constraint.
8016
06:42:38,960 --> 06:42:41,680
So in this case, we'll go ahead and set revised equal true.
8017
06:42:41,680 --> 06:42:44,800
And we'll do this again and again for every value in x's domain.
8018
06:42:44,800 --> 06:42:46,400
Sometimes it might be fine.
8019
06:42:46,400 --> 06:42:49,880
In other cases, it might not allow for a possible choice for y,
8020
06:42:49,880 --> 06:42:53,240
in which case we need to remove this value from x's domain.
8021
06:42:53,240 --> 06:42:56,920
And at the end, we just return revised to indicate whether or not
8022
06:42:56,920 --> 06:42:59,000
we actually made a change.
8023
06:42:59,000 --> 06:43:01,000
So this function, then, this revise function
8024
06:43:01,000 --> 06:43:04,760
is effectively an implementation of what you saw me do graphically a moment ago.
8025
06:43:04,760 --> 06:43:09,200
And it makes one variable, x, arc consistent with another variable,
8026
06:43:09,200 --> 06:43:10,960
in this case, y.
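A minimal Python sketch of that revise function might look like this, assuming a hypothetical CSP object exposing `domains` (variable to set of values) and `constraint(x, y)` returning a predicate over a pair of values; these names are illustrative, not the course's actual interface:

```python
def revise(csp, x, y):
    """Make x arc consistent with y; return True if x's domain changed."""
    revised = False
    constraint = csp.constraint(x, y)
    # Loop over a copy, since we may remove values from x's domain mid-loop.
    for x_value in set(csp.domains[x]):
        # x_value survives only if some y_value satisfies the binary constraint.
        if not any(constraint(x_value, y_value) for y_value in csp.domains[y]):
            csp.domains[x].remove(x_value)
            revised = True
    return revised
```

Run on the example above (A in {Tuesday, Wednesday}, B in {Wednesday}, A ≠ B), it would remove Wednesday from A's domain and return True.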
8027
06:43:10,960 --> 06:43:14,160
But generally speaking, when we want to enforce arc consistency,
8028
06:43:14,160 --> 06:43:17,760
we'll often want to enforce arc consistency not just for a single arc,
8029
06:43:17,760 --> 06:43:20,360
but for the entire constraint satisfaction problem.
8030
06:43:20,360 --> 06:43:22,880
And it turns out there's an algorithm to do that as well.
8031
06:43:22,880 --> 06:43:25,200
And that algorithm is known as AC3.
8032
06:43:25,200 --> 06:43:27,920
AC3 takes a constraint satisfaction problem.
8033
06:43:27,920 --> 06:43:32,160
And it enforces arc consistency across the entire problem.
8034
06:43:32,160 --> 06:43:33,120
How does it do that?
8035
06:43:33,120 --> 06:43:36,600
Well, it's going to basically maintain a queue or basically just a line
8036
06:43:36,600 --> 06:43:39,800
of all of the arcs that it needs to make consistent.
8037
06:43:39,800 --> 06:43:42,360
And over time, we might remove things from that queue
8038
06:43:42,360 --> 06:43:44,560
as we begin dealing with our consistency.
8039
06:43:44,560 --> 06:43:47,000
And we might need to add things to that queue as well
8040
06:43:47,000 --> 06:43:50,680
if there are more things we need to make arc consistent.
8041
06:43:50,680 --> 06:43:52,560
So we'll go ahead and start with a queue that
8042
06:43:52,560 --> 06:43:56,480
contains all of the arcs in the constraint satisfaction problem,
8043
06:43:56,480 --> 06:43:58,840
all of the edges that connect two nodes that
8044
06:43:58,840 --> 06:44:02,200
have some sort of binary constraint between them.
8045
06:44:02,200 --> 06:44:06,320
And now, as long as the queue is non-empty, there is work to be done.
8046
06:44:06,320 --> 06:44:10,040
The queue is all of the things that we need to make arc consistent.
8047
06:44:10,040 --> 06:44:13,600
So as long as the queue is non-empty, there's still things we have to do.
8048
06:44:13,600 --> 06:44:15,200
What do we have to do?
8049
06:44:15,200 --> 06:44:17,960
Well, we'll start by de-queuing from the queue,
8050
06:44:17,960 --> 06:44:19,640
remove something from the queue.
8051
06:44:19,640 --> 06:44:21,400
And strictly speaking, it doesn't need to be a queue,
8052
06:44:21,400 --> 06:44:23,440
but a queue is a traditional way of doing this.
8053
06:44:23,440 --> 06:44:27,480
We'll de-queue from the queue, and that'll give us an arc, x and y,
8054
06:44:27,480 --> 06:44:32,920
these two variables where I would like to make x arc consistent with y.
8055
06:44:32,920 --> 06:44:35,840
So how do we make x arc consistent with y?
8056
06:44:35,840 --> 06:44:38,200
Well, we can go ahead and just use that revise function
8057
06:44:38,200 --> 06:44:39,640
that we talked about a moment ago.
8058
06:44:39,640 --> 06:44:43,560
We called the revise function, passing as input the constraint satisfaction
8059
06:44:43,560 --> 06:44:46,320
problem, and also these variables x and y,
8060
06:44:46,320 --> 06:44:49,240
because I want to make x arc consistent with y.
8061
06:44:49,240 --> 06:44:52,440
In other words, remove any values from x's domain
8062
06:44:52,440 --> 06:44:55,880
that don't leave an available option for y.
8063
06:44:55,880 --> 06:44:57,920
And recall, what does revise return?
8064
06:44:57,920 --> 06:45:00,840
Well, it returns true if we actually made a change,
8065
06:45:00,840 --> 06:45:04,000
if we removed something from x's domain, because there
8066
06:45:04,000 --> 06:45:06,680
wasn't an available option for y, for example.
8067
06:45:06,680 --> 06:45:10,760
And it returns false if we didn't make any change to x's domain at all.
8068
06:45:10,760 --> 06:45:14,360
And it turns out if revise returns false, if we didn't make any changes,
8069
06:45:14,360 --> 06:45:15,800
well, then there's not a whole lot more work
8070
06:45:15,800 --> 06:45:17,080
to be done here for this arc.
8071
06:45:17,080 --> 06:45:20,600
We can just move ahead to the next arc that's in the queue.
8072
06:45:20,600 --> 06:45:24,120
But if we did make a change, if we did reduce x's domain
8073
06:45:24,120 --> 06:45:28,680
by removing values from x's domain, well, then what we might realize
8074
06:45:28,680 --> 06:45:31,160
is that this creates potential problems later on,
8075
06:45:31,160 --> 06:45:35,800
that it might mean that some arc that was arc consistent with x,
8076
06:45:35,800 --> 06:45:38,680
that node might no longer be arc consistent with x,
8077
06:45:38,680 --> 06:45:41,640
because while there used to be an option that we could choose for x,
8078
06:45:41,640 --> 06:45:44,560
now there might not be, because now we might have removed something
8079
06:45:44,560 --> 06:45:49,240
from x that was necessary for some other arc to be arc consistent.
8080
06:45:49,240 --> 06:45:52,040
And so if ever we did revise x's domain,
8081
06:45:52,040 --> 06:45:55,800
we're going to need to add some things to the queue, some additional arcs
8082
06:45:55,800 --> 06:45:57,320
that we might want to check.
8083
06:45:57,320 --> 06:45:58,640
How do we do that?
8084
06:45:58,640 --> 06:46:02,960
Well, the first thing we want to check is to make sure that x's domain is not empty.
8085
06:46:02,960 --> 06:46:07,160
If x's domain is empty, that means there are no available options for x at all.
8086
06:46:07,160 --> 06:46:10,360
And that means that there's no way you can solve the constraint satisfaction
8087
06:46:10,360 --> 06:46:10,860
problem.
8088
06:46:10,860 --> 06:46:13,240
If we've removed everything from x's domain,
8089
06:46:13,240 --> 06:46:15,600
we'll go ahead and just return false here to indicate there's
8090
06:46:15,600 --> 06:46:19,640
no way to solve the problem, because there's nothing left in x's domain.
8091
06:46:19,640 --> 06:46:23,640
But otherwise, if there are things left in x's domain,
8092
06:46:23,640 --> 06:46:26,920
but fewer things than before, well, then what we'll do
8093
06:46:26,920 --> 06:46:31,800
is we'll loop over each variable z that is in all of x's neighbors,
8094
06:46:31,800 --> 06:46:33,680
except for y, y we already handled.
8095
06:46:33,680 --> 06:46:37,120
But we'll consider all of x's other neighbors and ask ourselves,
8096
06:46:37,120 --> 06:46:41,040
all right, will that arc from each of those z's to x,
8097
06:46:41,040 --> 06:46:43,400
that arc might no longer be arc consistent,
8098
06:46:43,400 --> 06:46:46,840
because while for each z, there might have been a possible option
8099
06:46:46,840 --> 06:46:50,400
we could choose for x to correspond with each of z's possible values,
8100
06:46:50,400 --> 06:46:54,680
now there might not be, because we removed some elements from x's domain.
8101
06:46:54,680 --> 06:46:57,400
And so what we'll do here is we'll go ahead and enqueue,
8102
06:46:57,400 --> 06:47:02,800
adding something to the queue, this arc zx for all of those neighbors z.
8103
06:47:02,800 --> 06:47:05,320
So we need to add back some arcs to the queue
8104
06:47:05,320 --> 06:47:08,960
in order to continue to enforce arc consistency.
8105
06:47:08,960 --> 06:47:11,400
At the very end, if we make it through all this process,
8106
06:47:11,400 --> 06:47:13,760
then we can return true.
8107
06:47:13,760 --> 06:47:18,360
But this now is AC3, this algorithm for enforcing arc consistency
8108
06:47:18,360 --> 06:47:20,200
on a constraint satisfaction problem.
8109
06:47:20,200 --> 06:47:23,360
And the big idea is really just keep track of all of the arcs
8110
06:47:23,360 --> 06:47:25,600
that we might need to make arc consistent,
8111
06:47:25,600 --> 06:47:28,560
make it arc consistent by calling the revise function.
8112
06:47:28,560 --> 06:47:31,400
And if we did revise it, then there are some new arcs
8113
06:47:31,400 --> 06:47:33,600
that might need to be added to the queue in order
8114
06:47:33,600 --> 06:47:36,840
to make sure that everything is still arc consistent, even
8115
06:47:36,840 --> 06:47:40,680
after we've removed some of the elements from a particular variable's
8116
06:47:40,680 --> 06:47:42,000
domain.
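A Python sketch of AC3 under those same assumptions might look as follows; the revise helper is repeated here so the sketch is self-contained, and the `arcs()` and `neighbors()` methods on the hypothetical CSP object are assumptions for illustration:

```python
from collections import deque

def revise(csp, x, y):
    """Make x arc consistent with y; return True if x's domain changed."""
    revised = False
    constraint = csp.constraint(x, y)
    for x_value in set(csp.domains[x]):
        if not any(constraint(x_value, y_value) for y_value in csp.domains[y]):
            csp.domains[x].remove(x_value)
            revised = True
    return revised

def ac3(csp):
    """Enforce arc consistency across the whole CSP; False if unsolvable."""
    queue = deque(csp.arcs())                 # start with every arc in the problem
    while queue:
        x, y = queue.popleft()                # dequeue one arc to process
        if revise(csp, x, y):                 # x's domain shrank
            if not csp.domains[x]:
                return False                  # empty domain: no way to solve
            for z in csp.neighbors(x):
                if z != y:
                    queue.append((z, x))      # re-check arcs pointing at x
    return True
```

On the two-variable example, this reduces A's domain to just Tuesday while leaving B's as Wednesday.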
8117
06:47:42,000 --> 06:47:46,080
So what then would happen if we tried to enforce arc consistency
8118
06:47:46,080 --> 06:47:48,680
on a graph like this, on a graph where each of these variables
8119
06:47:48,680 --> 06:47:51,680
has a domain of Monday, Tuesday, and Wednesday?
8120
06:47:51,680 --> 06:47:55,400
Well, it turns out that by enforcing arc consistency on this graph,
8121
06:47:55,400 --> 06:47:57,520
while it can solve some types of problems,
8122
06:47:57,520 --> 06:47:59,680
nothing actually changes here.
8123
06:47:59,680 --> 06:48:03,200
For any particular arc, just considering two variables,
8124
06:48:03,200 --> 06:48:05,960
there's always a way for me to just, for any of the choices
8125
06:48:05,960 --> 06:48:08,840
I make for one of them, make a choice for the other one,
8126
06:48:08,840 --> 06:48:11,440
because there are three options, and I just need the two
8127
06:48:11,440 --> 06:48:12,720
to be different from each other.
8128
06:48:12,720 --> 06:48:15,040
So this is actually quite easy to just take an arc
8129
06:48:15,040 --> 06:48:17,160
and just declare that it is arc consistent,
8130
06:48:17,160 --> 06:48:19,920
because if I pick Monday for D, then I just
8131
06:48:19,920 --> 06:48:23,640
pick something that isn't Monday for B. In arc consistency,
8132
06:48:23,640 --> 06:48:28,680
we only consider consistency with respect to a binary constraint between two nodes,
8133
06:48:28,680 --> 06:48:32,600
and we're not really considering all of the rest of the nodes yet.
8134
06:48:32,600 --> 06:48:36,600
So just using AC3, the enforcement of arc consistency,
8135
06:48:36,600 --> 06:48:39,240
that can sometimes have the effect of reducing domains
8136
06:48:39,240 --> 06:48:42,880
to make it easier to find solutions, but it will not always actually
8137
06:48:42,880 --> 06:48:44,200
solve the problem.
8138
06:48:44,200 --> 06:48:48,360
We might still need to somehow search to try and find a solution.
8139
06:48:48,360 --> 06:48:52,000
And we can use classical traditional search algorithms to try to do so.
8140
06:48:52,000 --> 06:48:55,280
You'll recall that a search problem generally consists of these parts.
8141
06:48:55,280 --> 06:48:59,000
We have some initial state, some actions, a transition model
8142
06:48:59,000 --> 06:49:01,280
that takes me from one state to another state,
8143
06:49:01,280 --> 06:49:05,640
a goal test to tell me have I satisfied my objective correctly,
8144
06:49:05,640 --> 06:49:09,200
and then some path cost function, because in the case of like maze solving,
8145
06:49:09,200 --> 06:49:12,240
I was trying to get to my goal as quickly as possible.
8146
06:49:12,240 --> 06:49:16,840
So you could formulate a CSP, or a constraint satisfaction problem,
8147
06:49:16,840 --> 06:49:18,800
as one of these types of search problems.
8148
06:49:18,800 --> 06:49:22,240
The initial state will just be an empty assignment,
8149
06:49:22,240 --> 06:49:26,120
where an assignment is just a way for me to assign any particular variable
8150
06:49:26,120 --> 06:49:27,760
to any particular value.
8151
06:49:27,760 --> 06:49:30,960
So an empty assignment means no variables are assigned to any values
8152
06:49:30,960 --> 06:49:37,240
yet. Then the action I can take is adding some new variable equals value
8153
06:49:37,240 --> 06:49:40,000
pair to that assignment, saying for this assignment,
8154
06:49:40,000 --> 06:49:43,040
let me add a new value for this variable.
8155
06:49:43,040 --> 06:49:46,360
And the transition model just defines what happens when you take that action.
8156
06:49:46,360 --> 06:49:50,200
You get a new assignment that has that variable equal to that value inside
8157
06:49:50,200 --> 06:49:51,080
of it.
8158
06:49:51,080 --> 06:49:54,840
The goal test is just checking to make sure all the variables have been assigned
8159
06:49:54,840 --> 06:49:57,720
and making sure all the constraints have been satisfied.
8160
06:49:57,720 --> 06:50:00,680
And the path cost function is sort of irrelevant.
8161
06:50:00,680 --> 06:50:02,840
I don't really care about what the path really is.
8162
06:50:02,840 --> 06:50:06,280
I just care about finding some assignment that actually satisfies
8163
06:50:06,280 --> 06:50:07,640
all of the constraints.
8164
06:50:07,640 --> 06:50:09,640
So really, all the paths have the same cost.
8165
06:50:09,640 --> 06:50:12,240
I don't really care about the path to the goal.
8166
06:50:12,240 --> 06:50:17,280
I just care about the solution itself, much as we've talked about now before.
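Framed in Python, that formulation might be sketched like this; the `variables` list and the `constraints_satisfied` predicate are hypothetical stand-ins, not code from the course:

```python
def result(assignment, variable, value):
    """Transition model: a new assignment with one more variable = value pair."""
    new_assignment = dict(assignment)  # leave the previous assignment untouched
    new_assignment[variable] = value
    return new_assignment

def goal_test(assignment, variables, constraints_satisfied):
    """Complete (every variable assigned) and correct (no constraint violated)."""
    return set(assignment) == set(variables) and constraints_satisfied(assignment)
```

The initial state is just the empty assignment `{}`, and since every complete assignment is reached by the same number of actions, no meaningful path cost function is needed.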
8167
06:50:17,280 --> 06:50:20,440
The problem here, though, is that if we just implement this naive search
8168
06:50:20,440 --> 06:50:23,280
algorithm just by implementing like breadth-first search or depth-first
8169
06:50:23,280 --> 06:50:25,920
search, this is going to be very, very inefficient.
8170
06:50:25,920 --> 06:50:28,600
And there are ways we can take advantage of efficiencies
8171
06:50:28,600 --> 06:50:31,960
in the structure of a constraint satisfaction problem itself.
8172
06:50:31,960 --> 06:50:37,200
And one of the key ideas is that we can really just order these variables.
8173
06:50:37,200 --> 06:50:39,840
And it doesn't matter what order we assign variables in.
8174
06:50:39,840 --> 06:50:43,480
The assignment a equals 2 and then b equals 8
8175
06:50:43,480 --> 06:50:47,480
is identical to the assignment of b equals 8 and then a equals 2.
8176
06:50:47,480 --> 06:50:50,240
Switching the order doesn't really change anything
8177
06:50:50,240 --> 06:50:53,360
about the fundamental nature of that assignment.
8178
06:50:53,360 --> 06:50:56,240
And so there are some ways that we can try and revise
8179
06:50:56,240 --> 06:50:59,400
this idea of a search algorithm to apply it specifically
8180
06:50:59,400 --> 06:51:02,000
for a problem like a constraint satisfaction problem.
8181
06:51:02,000 --> 06:51:04,160
And it turns out the search algorithm we'll generally
8182
06:51:04,160 --> 06:51:06,880
use when talking about constraint satisfaction problems
8183
06:51:06,880 --> 06:51:09,400
is something known as backtracking search.
8184
06:51:09,400 --> 06:51:11,760
And the big idea of backtracking search is we'll
8185
06:51:11,760 --> 06:51:14,920
go ahead and make assignments from variables to values.
8186
06:51:14,920 --> 06:51:17,640
And if ever we get stuck, we arrive at a place
8187
06:51:17,640 --> 06:51:20,800
where there is no way we can make any forward progress while still
8188
06:51:20,800 --> 06:51:23,640
preserving the constraints that we need to enforce,
8189
06:51:23,640 --> 06:51:27,720
we'll go ahead and backtrack and try something else instead.
8190
06:51:27,720 --> 06:51:30,760
So the very basic sketch of what backtracking search looks like
8191
06:51:30,760 --> 06:51:32,000
is this.
8192
06:51:32,000 --> 06:51:35,800
Function called backtrack that takes as input an assignment
8193
06:51:35,800 --> 06:51:37,840
and a constraint satisfaction problem.
8194
06:51:37,840 --> 06:51:40,400
So initially, we don't have any assigned variables.
8195
06:51:40,400 --> 06:51:42,880
So when we begin backtracking search, this assignment
8196
06:51:42,880 --> 06:51:46,120
is just going to be the empty assignment with no variables inside of it.
8197
06:51:46,120 --> 06:51:49,400
But we'll see later this is going to be a recursive function.
8198
06:51:49,400 --> 06:51:53,320
So backtrack takes as input the assignment and the problem.
8199
06:51:53,320 --> 06:51:57,760
If the assignment is complete, meaning all of the variables have been assigned,
8200
06:51:57,760 --> 06:51:59,280
we just return that assignment.
8201
06:51:59,280 --> 06:52:00,960
That, of course, won't be true initially,
8202
06:52:00,960 --> 06:52:02,800
because we start with an empty assignment.
8203
06:52:02,800 --> 06:52:05,080
But over time, we might add things to that assignment.
8204
06:52:05,080 --> 06:52:08,040
So if ever the assignment actually is complete, then we're done.
8205
06:52:08,040 --> 06:52:10,880
Then just go ahead and return that assignment.
8206
06:52:10,880 --> 06:52:13,400
But otherwise, there is some work to be done.
8207
06:52:13,400 --> 06:52:17,480
So what we'll need to do is select an unassigned variable
8208
06:52:17,480 --> 06:52:18,760
for this particular problem.
8209
06:52:18,760 --> 06:52:21,520
So we need to take the problem, look at the variables that have already
8210
06:52:21,520 --> 06:52:26,400
been assigned, and pick a variable that has not yet been assigned.
8211
06:52:26,400 --> 06:52:28,280
And I'll go ahead and take that variable.
8212
06:52:28,280 --> 06:52:32,440
And then I need to consider all of the values in that variable's domain.
8213
06:52:32,440 --> 06:52:34,720
So we'll go ahead and call this domain values function.
8214
06:52:34,720 --> 06:52:37,600
We'll talk a little more about that later, that takes a variable
8215
06:52:37,600 --> 06:52:42,000
and just gives me back an ordered list of all of the values in its domain.
8216
06:52:42,000 --> 06:52:44,480
So I've taken a random unselected variable.
8217
06:52:44,480 --> 06:52:47,200
I'm going to loop over all of the possible values.
8218
06:52:47,200 --> 06:52:50,400
And the idea is, let me just try all of these values
8219
06:52:50,400 --> 06:52:53,120
as possible values for the variable.
8220
06:52:53,120 --> 06:52:56,880
So if the value is consistent with the assignment so far,
8221
06:52:56,880 --> 06:52:59,360
it doesn't violate any of the constraints,
8222
06:52:59,360 --> 06:53:02,720
well then let's go ahead and add variable equals value to the assignment
8223
06:53:02,720 --> 06:53:04,680
because it's so far consistent.
8224
06:53:04,680 --> 06:53:08,080
And now let's recursively call backtrack to try and make
8225
06:53:08,080 --> 06:53:10,880
the rest of the assignments also consistent.
8226
06:53:10,880 --> 06:53:13,920
So I'll go ahead and call backtrack on this new assignment
8227
06:53:13,920 --> 06:53:17,400
that I've added the variable equals value to.
8228
06:53:17,400 --> 06:53:20,720
And now I recursively call backtrack and see what the result is.
8229
06:53:20,720 --> 06:53:27,000
And if the result isn't a failure, well then let me just return that result.
8230
06:53:27,000 --> 06:53:30,120
And otherwise, what else could happen?
8231
06:53:30,120 --> 06:53:32,680
Well, if it turns out the result was a failure, well then
8232
06:53:32,680 --> 06:53:35,200
that means this value was probably a bad choice
8233
06:53:35,200 --> 06:53:37,680
for this particular variable because when I assigned
8234
06:53:37,680 --> 06:53:41,120
this variable equal to that value, eventually down the road
8235
06:53:41,120 --> 06:53:43,720
I ran into a situation where I violated constraints.
8236
06:53:43,720 --> 06:53:45,160
There was nothing more I could do.
8237
06:53:45,160 --> 06:53:48,800
So now I'll remove variable equals value from the assignment,
8238
06:53:48,800 --> 06:53:52,080
effectively backtracking to say, all right, that value didn't work.
8239
06:53:52,080 --> 06:53:55,200
Let's try another value instead.
8240
06:53:55,200 --> 06:53:57,000
And then at the very end, if we were never
8241
06:53:57,000 --> 06:54:00,760
able to return a complete assignment, we'll just go ahead and return failure
8242
06:54:00,760 --> 06:54:04,000
because that means that none of the values worked for this particular
8243
06:54:04,000 --> 06:54:05,560
variable.
8244
06:54:05,560 --> 06:54:07,760
This now is the idea for backtracking search,
8245
06:54:07,760 --> 06:54:10,840
to take each of the variables, try values for them,
8246
06:54:10,840 --> 06:54:14,200
and recursively try backtracking search, see if we can make progress.
8247
06:54:14,200 --> 06:54:16,000
And if ever we run into a dead end, we run
8248
06:54:16,000 --> 06:54:19,160
into a situation where there is no possible value we can choose
8249
06:54:19,160 --> 06:54:22,280
that satisfies the constraints, we return failure.
8250
06:54:22,280 --> 06:54:24,400
And that propagates up, and eventually we
8251
06:54:24,400 --> 06:54:29,080
make a different choice by going back and trying something else instead.
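A minimal Python sketch of that backtrack function might look as follows, assuming a hypothetical `csp` object with `variables`, `domains`, and a `consistent(assignment)` check; these names are illustrative, not the course's actual starter code:

```python
def backtrack(assignment, csp):
    """Return a complete, consistent assignment, or None to signal failure."""
    # If every variable has a value, the assignment is complete: return it.
    if len(assignment) == len(csp.variables):
        return assignment
    # Select some variable that has not yet been assigned.
    var = next(v for v in csp.variables if v not in assignment)
    # Try each value in that variable's domain in turn.
    for value in csp.domains[var]:
        new_assignment = {**assignment, var: value}
        if csp.consistent(new_assignment):    # no constraints violated so far
            result = backtrack(new_assignment, csp)
            if result is not None:
                return result
            # Otherwise that value led to a dead end down the road; fall
            # through and effectively backtrack by trying the next value.
    return None  # no value worked for this variable
```

Calling `backtrack({}, csp)` with the empty assignment kicks off the search.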
8252
06:54:29,080 --> 06:54:31,120
So let's put this algorithm into practice.
8253
06:54:31,120 --> 06:54:35,000
Let's actually try and use backtracking search to solve this problem now,
8254
06:54:35,000 --> 06:54:37,520
where I need to figure out how to assign each of these courses
8255
06:54:37,520 --> 06:54:41,080
to an exam slot on Monday or Tuesday or Wednesday in such a way
8256
06:54:41,080 --> 06:54:44,120
that it satisfies these constraints, that each of these edges
8257
06:54:44,120 --> 06:54:47,880
mean those two classes cannot have an exam on the same day.
8258
06:54:47,880 --> 06:54:50,080
So I can start by just starting at a node.
8259
06:54:50,080 --> 06:54:51,800
It doesn't really matter which I start with,
8260
06:54:51,800 --> 06:54:54,120
but in this case, I'll just start with A.
8261
06:54:54,120 --> 06:54:57,840
And I'll ask the question, all right, let me loop over the values in the domain.
8262
06:54:57,840 --> 06:55:00,200
And maybe in this case, I'll just start with Monday and say, all right,
8263
06:55:00,200 --> 06:55:02,120
let's go ahead and assign A to Monday.
8264
06:55:02,120 --> 06:55:04,800
We'll just go in order: Monday, Tuesday, Wednesday.
8265
06:55:04,800 --> 06:55:08,320
And now let's consider node B. So I've made an assignment to A,
8266
06:55:08,320 --> 06:55:11,480
so I recursively call backtrack with this new part of the assignment.
8267
06:55:11,480 --> 06:55:14,320
And now I'm looking to pick another unassigned variable like B.
8268
06:55:14,320 --> 06:55:16,320
And I'll say, all right, maybe I'll start with Monday,
8269
06:55:16,320 --> 06:55:18,960
because that's the very first value in B's domain.
8270
06:55:18,960 --> 06:55:22,240
And I ask, all right, does Monday violate any constraints?
8271
06:55:22,240 --> 06:55:23,440
And it turns out, yes, it does.
8272
06:55:23,440 --> 06:55:26,240
It violates this constraint here between A and B,
8273
06:55:26,240 --> 06:55:29,200
because A and B are now both on Monday, and that doesn't work,
8274
06:55:29,200 --> 06:55:33,600
because B can't be on the same day as A. So that doesn't work.
8275
06:55:33,600 --> 06:55:37,200
So we might instead try Tuesday, try the next value in B's domain.
8276
06:55:37,200 --> 06:55:39,960
And is that consistent with the assignment so far?
8277
06:55:39,960 --> 06:55:43,160
Well, yeah, B, Tuesday, A, Monday, that is consistent so far,
8278
06:55:43,160 --> 06:55:44,800
because they're not on the same day.
8279
06:55:44,800 --> 06:55:45,400
So that's good.
8280
06:55:45,400 --> 06:55:47,440
Now we can recursively call backtrack.
8281
06:55:47,440 --> 06:55:48,280
Try again.
8282
06:55:48,280 --> 06:55:51,400
Pick another unassigned variable, something like D, and say, all right,
8283
06:55:51,400 --> 06:55:53,160
let's go through its possible values.
8284
06:55:53,160 --> 06:55:55,600
Is Monday consistent with this assignment?
8285
06:55:55,600 --> 06:55:56,520
Well, yes, it is.
8286
06:55:56,520 --> 06:55:59,480
B and D are on different days, Monday versus Tuesday.
8287
06:55:59,480 --> 06:56:02,520
And A and B are also on different days, Monday versus Tuesday.
8288
06:56:02,520 --> 06:56:04,200
So that's fine so far, too.
8289
06:56:04,200 --> 06:56:05,440
We'll go ahead and try again.
8290
06:56:05,440 --> 06:56:09,080
Maybe we'll go to this variable here, E. Say, can we make that consistent?
8291
06:56:09,080 --> 06:56:10,680
Let's go through the possible values.
8292
06:56:10,680 --> 06:56:12,560
We've recursively called backtrack.
8293
06:56:12,560 --> 06:56:15,800
We might start with Monday and say, all right, that's not consistent,
8294
06:56:15,800 --> 06:56:19,120
because D and E now have exams on the same day.
8295
06:56:19,120 --> 06:56:21,760
So we might try Tuesday instead, going to the next one.
8296
06:56:21,760 --> 06:56:23,440
Ask, is that consistent?
8297
06:56:23,440 --> 06:56:27,240
Well, no, it's not, because B and E, those have exams on the same day.
8298
06:56:27,240 --> 06:56:29,760
And so we try, all right, is Wednesday consistent?
8299
06:56:29,760 --> 06:56:31,120
And it turns out, all right, yes, it is.
8300
06:56:31,120 --> 06:56:33,080
Wednesday is consistent, because D and E now
8301
06:56:33,080 --> 06:56:34,680
have exams on different days.
8302
06:56:34,680 --> 06:56:37,240
B and E now have exams on different days.
8303
06:56:37,240 --> 06:56:38,760
All seems to be well so far.
8304
06:56:38,760 --> 06:56:43,440
I recursively call backtrack, select another unassigned variable,
8305
06:56:43,440 --> 06:56:45,960
we'll say maybe choose C this time, and say, all right,
8306
06:56:45,960 --> 06:56:48,240
let's try the values that C could take on.
8307
06:56:48,240 --> 06:56:49,600
Let's start with Monday.
8308
06:56:49,600 --> 06:56:53,320
And it turns out that's not consistent, because now A and C both
8309
06:56:53,320 --> 06:56:55,040
have exams on the same day.
8310
06:56:55,040 --> 06:56:57,560
So I try Tuesday and say, that's not consistent either,
8311
06:56:57,560 --> 06:57:00,760
because B and C now have exams on the same day.
8312
06:57:00,760 --> 06:57:04,280
And then I say, all right, let's go ahead and try Wednesday.
8313
06:57:04,280 --> 06:57:08,120
But that's not consistent either, because C and E would both have
8314
06:57:08,120 --> 06:57:09,880
exams on the same day too.
8315
06:57:09,880 --> 06:57:13,200
So now we've gone through all the possible values for C, Monday, Tuesday,
8316
06:57:13,200 --> 06:57:14,080
and Wednesday.
8317
06:57:14,080 --> 06:57:15,440
And none of them are consistent.
8318
06:57:15,440 --> 06:57:18,360
There is no way we can have a consistent assignment.
8319
06:57:18,360 --> 06:57:21,480
Backtrack, in this case, will return a failure.
8320
06:57:21,480 --> 06:57:24,920
And so then we'd say, all right, we have to backtrack back to here.
8321
06:57:24,920 --> 06:57:28,800
Well, now for E, we've tried all of Monday, Tuesday, and Wednesday.
8322
06:57:28,800 --> 06:57:31,400
And none of those work, because Wednesday, which seemed to work,
8323
06:57:31,400 --> 06:57:33,480
turned out to be a failure.
8324
06:57:33,480 --> 06:57:36,200
So that means there's no possible way we can assign E.
8325
06:57:36,200 --> 06:57:37,240
So that's a failure too.
8326
06:57:37,240 --> 06:57:41,000
We have to go back up to D, which means that Monday assignment to D,
8327
06:57:41,000 --> 06:57:41,920
that must be wrong.
8328
06:57:41,920 --> 06:57:43,320
We must try something else.
8329
06:57:43,320 --> 06:57:47,880
So we can try, all right, what if instead of Monday, we try Tuesday?
8330
06:57:47,880 --> 06:57:49,640
Tuesday, it turns out, is not consistent,
8331
06:57:49,640 --> 06:57:51,960
because B and D now have an exam on the same day.
8332
06:57:51,960 --> 06:57:55,360
But Wednesday, as it turns out, works.
8333
06:57:55,360 --> 06:57:57,560
And now we can begin to make forward progress again.
8334
06:57:57,560 --> 06:58:00,640
We go back to E and say, all right, which of these values works?
8335
06:58:00,640 --> 06:58:03,800
Monday turns out to work by not violating any constraints.
8336
06:58:03,800 --> 06:58:05,440
Then we go up to C now.
8337
06:58:05,440 --> 06:58:08,080
Monday doesn't work, because it violates a constraint.
8338
06:58:08,080 --> 06:58:09,600
Violates two, actually.
8339
06:58:09,600 --> 06:58:12,160
Tuesday doesn't work, because it violates a constraint as well.
8340
06:58:12,160 --> 06:58:13,600
But Wednesday does work.
8341
06:58:13,600 --> 06:58:16,520
Then we can go to the next variable, F, and say, all right, does Monday work?
8342
06:58:16,520 --> 06:58:17,020
Well, no.
8343
06:58:17,020 --> 06:58:18,320
It violates a constraint.
8344
06:58:18,320 --> 06:58:19,800
But Tuesday does work.
8345
06:58:19,800 --> 06:58:21,880
And then finally, we can look at the last variable, G,
8346
06:58:21,880 --> 06:58:24,280
recursively calling backtrack one more time.
8347
06:58:24,280 --> 06:58:25,640
Monday is inconsistent.
8348
06:58:25,640 --> 06:58:27,320
That violates a constraint.
8349
06:58:27,320 --> 06:58:29,680
Tuesday also violates a constraint.
8350
06:58:29,680 --> 06:58:33,120
But Wednesday, that doesn't violate a constraint.
8351
06:58:33,120 --> 06:58:36,840
And so now at this point, we recursively call backtrack one last time.
8352
06:58:36,840 --> 06:58:40,240
We now have a satisfactory assignment of all of the variables.
8353
06:58:40,240 --> 06:58:42,640
And at this point, we can say that we are now done.
8354
06:58:42,640 --> 06:58:47,240
We have now been able to successfully assign a value
8355
06:58:47,240 --> 06:58:49,080
to each one of these variables in such a way
8356
06:58:49,080 --> 06:58:51,480
that we're not violating any constraints.
8357
06:58:51,480 --> 06:58:55,520
We're going to go ahead and have classes A and E have their exams on Monday.
8358
06:58:55,520 --> 06:58:58,560
Classes B and F can have their exams on Tuesday.
8359
06:58:58,560 --> 06:59:02,440
And classes C, D, and G can have their exams on Wednesday.
8360
06:59:02,440 --> 06:59:06,280
And there's no violated constraints that might come up there.
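That final schedule can be checked mechanically. Here's a small sketch that verifies it against the constraint edges; note that the full edge list is reconstructed from the walkthrough (only some pairs, like A-B or D-E, are stated explicitly), so treat it as an assumption:

```python
# Final schedule from the walkthrough: A and E on Monday,
# B and F on Tuesday, C, D, and G on Wednesday.
schedule = {
    "A": "Monday", "E": "Monday",
    "B": "Tuesday", "F": "Tuesday",
    "C": "Wednesday", "D": "Wednesday", "G": "Wednesday",
}

# Constraint edges: pairs of classes that cannot share an exam day.
# Reconstructed from the walkthrough, so this exact list is an assumption.
CONSTRAINTS = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "E"), ("C", "F"), ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G"),
]

# Collect any constrained pair scheduled on the same day.
violations = [(x, y) for x, y in CONSTRAINTS if schedule[x] == schedule[y]]
print(violations)  # prints [] -- no constraint is violated
```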
8361
06:59:06,280 --> 06:59:08,840
So that then was a graphical look at how this might work.
8362
06:59:08,840 --> 06:59:11,640
Let's now take a look at some code we could use to actually try
8363
06:59:11,640 --> 06:59:14,640
and solve this problem as well.
8364
06:59:14,640 --> 06:59:20,160
So here I'll go ahead and go into the scheduling directory.
8365
06:59:20,160 --> 06:59:21,160
We're here now.
8366
06:59:21,160 --> 06:59:24,120
We'll start by looking at schedule0.py.
8367
06:59:24,120 --> 06:59:25,160
We're here.
8368
06:59:25,160 --> 06:59:28,560
I define a list of variables, A, B, C, D, E, F, G.
8369
06:59:28,560 --> 06:59:31,280
Those are all different classes.
8370
06:59:31,280 --> 06:59:34,480
Then underneath that, I define my list of constraints.
8371
06:59:34,480 --> 06:59:36,520
So constraint A and B. That is a constraint
8372
06:59:36,520 --> 06:59:38,160
because they can't be on the same day.
8373
06:59:38,160 --> 06:59:40,960
Likewise, A and C, B and C, so on and so forth,
8374
06:59:40,960 --> 06:59:43,760
enforcing those exact same constraints.
8375
06:59:43,760 --> 06:59:47,640
And here then is what the backtracking function might look like.
8376
06:59:47,640 --> 06:59:50,800
First, if the assignment is complete, if I've
8377
06:59:50,800 --> 06:59:54,000
made an assignment of every variable to a value,
8378
06:59:54,000 --> 06:59:56,760
go ahead and just return that assignment.
8379
06:59:56,760 --> 07:00:00,160
Then we'll select an unassigned variable from that assignment.
8380
07:00:00,160 --> 07:00:03,280
Then for each of the possible values in the domain, Monday, Tuesday,
8381
07:00:03,280 --> 07:00:06,640
Wednesday, let's go ahead and create a new assignment that
8382
07:00:06,640 --> 07:00:09,200
assigns the variable to that value.
8383
07:00:09,200 --> 07:00:11,700
I'll call this consistent function, which I'll show you in a moment,
8384
07:00:11,700 --> 07:00:14,640
that just checks to make sure this new assignment is consistent.
8385
07:00:14,640 --> 07:00:17,160
But if it is consistent, we'll go ahead and call backtrack
8386
07:00:17,160 --> 07:00:20,480
to go ahead and continue trying to run backtracking search.
8387
07:00:20,480 --> 07:00:24,200
And as long as the result is not none, meaning it wasn't a failure,
8388
07:00:24,200 --> 07:00:26,920
we can go ahead and return that result.
8389
07:00:26,920 --> 07:00:31,160
But if we make it through all the values and nothing works, then it is a failure.
8390
07:00:31,160 --> 07:00:32,400
There's no solution.
8391
07:00:32,400 --> 07:00:35,200
We go ahead and return none here.
8392
07:00:35,200 --> 07:00:36,400
What do these functions do?
8393
07:00:36,400 --> 07:00:40,440
Select unassigned variable is just going to choose a variable not yet assigned.
8394
07:00:40,440 --> 07:00:42,440
So it's going to loop over all the variables.
8395
07:00:42,440 --> 07:00:46,400
And if it's not already assigned, we'll go ahead and just return that variable.
8396
07:00:46,400 --> 07:00:48,440
And what does the consistent function do?
8397
07:00:48,440 --> 07:00:51,840
Well, the consistent function goes through all the constraints.
8398
07:00:51,840 --> 07:00:56,240
And if we have a situation where we've assigned both of those values
8399
07:00:56,240 --> 07:00:59,040
to variables, but they are the same, well,
8400
07:00:59,040 --> 07:01:03,120
then that is a violation of the constraint, in which case we'll return false.
8401
07:01:03,120 --> 07:01:06,360
But if nothing is inconsistent, then the assignment is consistent
8402
07:01:06,360 --> 07:01:08,760
and we'll return true.
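Putting those pieces together, here's a minimal sketch of what a schedule0.py-style program might look like. The variables and three-day domain match the lecture; the full constraint list is reconstructed from the walkthrough, so treat it as an assumption:

```python
VARIABLES = ["A", "B", "C", "D", "E", "F", "G"]
DOMAIN = ["Monday", "Tuesday", "Wednesday"]

# Pairs of classes that cannot have exams on the same day
# (edge list reconstructed from the walkthrough; an assumption).
CONSTRAINTS = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "E"), ("C", "F"), ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G"),
]


def backtrack(assignment):
    """Run backtracking search to find a complete, consistent assignment."""
    # If every variable is assigned, we're done.
    if len(assignment) == len(VARIABLES):
        return assignment
    var = select_unassigned_variable(assignment)
    for value in DOMAIN:
        # Tentatively extend the assignment with var = value.
        new_assignment = dict(assignment)
        new_assignment[var] = value
        if consistent(new_assignment):
            result = backtrack(new_assignment)
            if result is not None:
                return result
    return None  # no value worked for this variable: failure


def select_unassigned_variable(assignment):
    """Return any variable not yet in the assignment."""
    for var in VARIABLES:
        if var not in assignment:
            return var


def consistent(assignment):
    """Check that no constrained pair of assigned variables shares a value."""
    for x, y in CONSTRAINTS:
        if x in assignment and y in assignment and assignment[x] == assignment[y]:
            return False
    return True


solution = backtrack(dict())
print(solution)
```

With this variable and value ordering, the first solution found matches the lecture's: A and E on Monday, B and F on Tuesday, C, D, and G on Wednesday.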
8403
07:01:08,760 --> 07:01:12,000
And then all the program does is it calls backtrack
8404
07:01:12,000 --> 07:01:15,440
on an empty assignment, an empty dictionary that has no variable assigned
8405
07:01:15,440 --> 07:01:18,680
and no values yet, save that inside a solution,
8406
07:01:18,680 --> 07:01:21,120
and then print out that solution.
8407
07:01:21,120 --> 07:01:27,160
So by running this now, I can run Python schedule0.py.
8408
07:01:27,160 --> 07:01:29,960
And what I get as a result of that is an assignment
8409
07:01:29,960 --> 07:01:31,560
of all these variables to values.
8410
07:01:31,560 --> 07:01:35,080
And it turns out we assign a to Monday as we would expect, b to Tuesday,
8411
07:01:35,080 --> 07:01:37,280
c to Wednesday, exactly the same type of thing
8412
07:01:37,280 --> 07:01:40,280
we were talking about before, an assignment of each of these variables
8413
07:01:40,280 --> 07:01:43,960
to values that doesn't violate any constraints.
8414
07:01:43,960 --> 07:01:45,800
And I had to do a fair amount of work in order
8415
07:01:45,800 --> 07:01:47,360
to implement this idea myself.
8416
07:01:47,360 --> 07:01:49,520
I had to write the backtrack function that went ahead
8417
07:01:49,520 --> 07:01:51,880
and went through this process of recursively trying
8418
07:01:51,880 --> 07:01:53,600
to do this backtracking search.
8419
07:01:53,600 --> 07:01:56,840
But it turns out that constraint satisfaction problems are so popular
8420
07:01:56,840 --> 07:02:00,960
that there exist many libraries that already implement this type of idea.
8421
07:02:00,960 --> 07:02:03,280
Again, as with before, the specific library
8422
07:02:03,280 --> 07:02:06,360
is not as important as the fact that libraries do exist.
8423
07:02:06,360 --> 07:02:09,600
This is just one example of a Python constraint library,
8424
07:02:09,600 --> 07:02:13,320
where now, rather than having to do all the work from scratch
8425
07:02:13,320 --> 07:02:15,960
inside of schedule1.py, I'm just taking advantage
8426
07:02:15,960 --> 07:02:19,200
of a library that implements a lot of these ideas already.
8427
07:02:19,200 --> 07:02:22,520
So here, I create a new problem, add variables to it
8428
07:02:22,520 --> 07:02:24,160
with particular domains.
8429
07:02:24,160 --> 07:02:27,200
I add a whole bunch of these individual constraints,
8430
07:02:27,200 --> 07:02:30,860
where I call addConstraint and pass in a function describing
8431
07:02:30,860 --> 07:02:32,160
what the constraint is.
8432
07:02:32,160 --> 07:02:35,240
And the constraint is basically a function that takes two variables, x
8433
07:02:35,240 --> 07:02:38,480
and y, and makes sure that x is not equal to y,
8434
07:02:38,480 --> 07:02:43,480
enforcing the idea that these two classes cannot have exams on the same day.
8435
07:02:43,480 --> 07:02:46,760
And then, for any constraint satisfaction problem,
8436
07:02:46,760 --> 07:02:50,640
I can call getSolutions to get all the solutions to that problem.
8437
07:02:50,640 --> 07:02:53,160
And then, for each of those solutions, print out
8438
07:02:53,160 --> 07:02:55,520
what that solution happens to be.
8439
07:02:55,520 --> 07:02:59,320
And if I run python schedule1.py, I can now see that
8440
07:02:59,320 --> 07:03:01,880
there are actually a number of different solutions
8441
07:03:01,880 --> 07:03:03,640
that can be used to solve the problem.
8442
07:03:03,640 --> 07:03:06,720
There are, in fact, six different solutions, assignments of variables
8443
07:03:06,720 --> 07:03:10,920
to values that will give me a satisfactory answer to this constraint
8444
07:03:10,920 --> 07:03:13,080
satisfaction problem.
8445
07:03:13,080 --> 07:03:17,200
So this then was an implementation of a very basic backtracking search method,
8446
07:03:17,200 --> 07:03:19,560
where really we just went through each of the variables,
8447
07:03:19,560 --> 07:03:22,480
picked one that wasn't assigned, tried the possible values
8448
07:03:22,480 --> 07:03:23,880
the variable could take on.
8449
07:03:23,880 --> 07:03:27,240
And then, if it worked, if it didn't violate any constraints,
8450
07:03:27,240 --> 07:03:28,960
then we kept trying other variables.
8451
07:03:28,960 --> 07:03:31,480
And if ever we hit a dead end, we had to backtrack.
8452
07:03:31,480 --> 07:03:34,080
But ultimately, we might be able to be a little bit more
8453
07:03:34,080 --> 07:03:36,280
intelligent about how we do this in order
8454
07:03:36,280 --> 07:03:39,520
to improve the efficiency of how we solve these sorts of problems.
8455
07:03:39,520 --> 07:03:41,640
And one thing we might imagine trying to do
8456
07:03:41,640 --> 07:03:44,280
is going back to this idea of inference, using the knowledge we
8457
07:03:44,280 --> 07:03:47,200
know to be able to draw conclusions in order
8458
07:03:47,200 --> 07:03:51,200
to make the rest of the problem solving process a little bit easier.
8459
07:03:51,200 --> 07:03:55,320
And let's now go back to where we got stuck in this problem the first time.
8460
07:03:55,320 --> 07:03:59,320
When we were solving this constraint satisfaction problem, we dealt with B.
8461
07:03:59,320 --> 07:04:03,040
And then we went on to D. And we went ahead and just assigned D to Monday,
8462
07:04:03,040 --> 07:04:05,240
because that seemed to work with the assignment so far.
8463
07:04:05,240 --> 07:04:07,600
It didn't violate any constraints.
8464
07:04:07,600 --> 07:04:11,480
But it turned out that later on that choice turned out to be a bad one,
8465
07:04:11,480 --> 07:04:15,040
that that choice wasn't consistent with the rest of the values
8466
07:04:15,040 --> 07:04:16,920
that we could take on here.
8467
07:04:16,920 --> 07:04:18,640
And the question is, is there anything we
8468
07:04:18,640 --> 07:04:21,600
could do to avoid getting into a situation like this,
8469
07:04:21,600 --> 07:04:25,240
avoid trying to go down a path that's ultimately not going to lead anywhere
8470
07:04:25,240 --> 07:04:28,360
by taking advantage of knowledge that we have initially?
8471
07:04:28,360 --> 07:04:30,680
And it turns out we do have that kind of knowledge.
8472
07:04:30,680 --> 07:04:33,720
We can look at just the structure of this graph so far.
8473
07:04:33,720 --> 07:04:37,720
And we can say that right now C's domain, for example,
8474
07:04:37,720 --> 07:04:41,360
contains values Monday, Tuesday, and Wednesday.
8475
07:04:41,360 --> 07:04:46,160
And based on those values, we can say that this graph is not arc consistent.
8476
07:04:46,160 --> 07:04:49,140
Recall that arc consistency is all about making sure
8477
07:04:49,140 --> 07:04:52,480
that for every possible value for a particular node,
8478
07:04:52,480 --> 07:04:55,600
there is some value that the neighboring node is still able to choose.
8479
07:04:55,600 --> 07:04:58,200
And as we can see here, Monday and Tuesday
8480
07:04:58,200 --> 07:05:01,640
are not going to be possible values that we can choose for C.
8481
07:05:01,640 --> 07:05:06,120
They're not going to be consistent with a node like B, for example,
8482
07:05:06,120 --> 07:05:09,800
because B is equal to Tuesday, which means that C cannot be Tuesday.
8483
07:05:09,800 --> 07:05:13,560
And because A is equal to Monday, C also cannot be Monday.
8484
07:05:13,560 --> 07:05:18,400
So using that information, by making C arc consistent with A and B,
8485
07:05:18,400 --> 07:05:21,600
we could remove Monday and Tuesday from C's domain
8486
07:05:21,600 --> 07:05:25,440
and just leave C with Wednesday, for example.
8487
07:05:25,440 --> 07:05:28,800
And if we continued to try and enforce arc consistency,
8488
07:05:28,800 --> 07:05:31,400
we'd see there are some other conclusions we can draw as well.
8489
07:05:31,400 --> 07:05:35,360
We see that B's only option is Tuesday and C's only option is Wednesday.
8490
07:05:35,360 --> 07:05:38,800
And so if we want to make E arc consistent,
8491
07:05:38,800 --> 07:05:42,160
well, E can't be Tuesday, because that wouldn't be arc consistent with B.
8492
07:05:42,160 --> 07:05:45,440
And E can't be Wednesday, because that wouldn't be arc consistent with C.
8493
07:05:45,440 --> 07:05:49,120
So we can go ahead and say E and just set that equal to Monday, for example.
8494
07:05:49,120 --> 07:05:51,560
And then we can begin to do this process again and again,
8495
07:05:51,560 --> 07:05:54,640
that in order to make D arc consistent with B and E,
8496
07:05:54,640 --> 07:05:56,120
then D would have to be Wednesday.
8497
07:05:56,120 --> 07:05:57,880
That's the only possible option.
8498
07:05:57,880 --> 07:06:01,480
And likewise, we can make the same judgments for F and G as well.
8499
07:06:01,480 --> 07:06:04,680
And it turns out that without having to do any additional search,
8500
07:06:04,680 --> 07:06:07,920
just by enforcing arc consistency, we were
8501
07:06:07,920 --> 07:06:10,920
able to actually figure out what the assignment of all the variables
8502
07:06:10,920 --> 07:06:14,360
should be without needing to backtrack at all.
8503
07:06:14,360 --> 07:06:18,360
And the way we did that is by interleaving this search process
8504
07:06:18,360 --> 07:06:22,920
and the inference step, by this step of trying to enforce arc consistency.
8505
07:06:22,920 --> 07:06:26,120
And the algorithm to do this is often called just the maintaining arc
8506
07:06:26,120 --> 07:06:30,840
consistency algorithm, which just enforces arc consistency every time
8507
07:06:30,840 --> 07:06:34,880
we make a new assignment of a value to an existing variable.
8508
07:06:34,880 --> 07:06:38,760
So sometimes we can enforce arc consistency using that AC3 algorithm
8509
07:06:38,760 --> 07:06:41,920
at the very beginning of the problem before we even begin searching
8510
07:06:41,920 --> 07:06:43,880
in order to limit the domain of the variables
8511
07:06:43,880 --> 07:06:45,640
in order to make it easier to search.
8512
07:06:45,640 --> 07:06:48,720
But we can also take advantage of the interleaving
8513
07:06:48,720 --> 07:06:52,560
of enforcing arc consistency with search such that every time in the search
8514
07:06:52,560 --> 07:06:56,720
process we make a new assignment, we go ahead and enforce arc consistency
8515
07:06:56,720 --> 07:06:59,440
as well to make sure that we're just eliminating
8516
07:06:59,440 --> 07:07:02,680
possible values from domains whenever possible.
8517
07:07:02,680 --> 07:07:03,840
And how do we do this?
8518
07:07:03,840 --> 07:07:06,440
Well, this is really equivalent to just every time
8519
07:07:06,440 --> 07:07:09,160
we make a new assignment to a variable x.
8520
07:07:09,160 --> 07:07:12,280
We'll go ahead and call our AC3 algorithm,
8521
07:07:12,280 --> 07:07:15,680
this algorithm that enforces arc consistency on a constraint satisfaction
8522
07:07:15,680 --> 07:07:16,680
problem.
8523
07:07:16,680 --> 07:07:18,640
And we go ahead and call that, starting it
8524
07:07:18,640 --> 07:07:22,120
with a queue, not of all of the arcs, which we did originally,
8525
07:07:22,120 --> 07:07:26,600
but just of all of the arcs that we want to make arc consistent with x,
8526
07:07:26,600 --> 07:07:28,920
this thing that we have just made an assignment to.
8527
07:07:28,920 --> 07:07:33,280
So all arcs (y, x), where y is a neighbor of x, something
8528
07:07:33,280 --> 07:07:36,560
that shares a constraint with x, for example.
8529
07:07:36,560 --> 07:07:40,760
And by maintaining arc consistency in the backtracking search process,
8530
07:07:40,760 --> 07:07:44,040
we can ultimately make our search process a little bit more efficient.
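As a concrete sketch, here is one way that AC3 routine might look in Python. The helper name `revise` and the `domains`/`neighbors` dictionaries are my own bookkeeping, not from the lecture's code:

```python
from collections import deque


def revise(domains, x, y):
    """Make x arc consistent with y: drop any value in x's domain that has
    no compatible (different-day) value left in y's domain."""
    revised = False
    for value in set(domains[x]):
        if not any(value != other for other in domains[y]):
            domains[x].remove(value)
            revised = True
    return revised


def ac3(domains, neighbors, arcs=None):
    """Enforce arc consistency, starting from the given arcs (or all arcs).
    Returns False if some domain is emptied, meaning no solution is possible."""
    queue = deque(arcs if arcs is not None
                  else [(x, y) for x in domains for y in neighbors[x]])
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y):
            if not domains[x]:
                return False
            # x's domain shrank, so arcs pointing at x must be rechecked.
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))
    return True
```

After assigning x = value, the maintaining-arc-consistency step amounts to setting `domains[x] = {value}` and calling `ac3(domains, neighbors, arcs=[(y, x) for y in neighbors[x]])`. In the lecture's example, once A is fixed to Monday and B to Tuesday, this prunes C's domain down to just Wednesday.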
8531
07:07:44,040 --> 07:07:47,680
And so this is the revised version of this backtrack function.
8532
07:07:47,680 --> 07:07:50,480
Same as before, the changes here are highlighted in yellow.
8533
07:07:50,480 --> 07:07:54,320
Every time we add a new variable equals value to our assignment,
8534
07:07:54,320 --> 07:07:56,520
we'll go ahead and run this inference procedure, which
8535
07:07:56,520 --> 07:07:57,920
might do a number of different things.
8536
07:07:57,920 --> 07:08:00,880
But one thing it could do is call the maintaining arc consistency
8537
07:08:00,880 --> 07:08:05,240
algorithm to make sure we're able to enforce arc consistency on the problem.
8538
07:08:05,240 --> 07:08:09,360
And we might be able to draw new inferences as a result of that process.
8539
07:08:09,360 --> 07:08:13,600
Get new guarantees of this variable needs to be equal to that value,
8540
07:08:13,600 --> 07:08:14,360
for example.
8541
07:08:14,360 --> 07:08:15,360
That might happen one time.
8542
07:08:15,360 --> 07:08:16,720
It might happen many times.
8543
07:08:16,720 --> 07:08:19,320
And so long as those inferences are not a failure,
8544
07:08:19,320 --> 07:08:22,200
as long as they don't lead to a situation where there is no possible way
8545
07:08:22,200 --> 07:08:26,080
to make forward progress, well, then we can go ahead and add those inferences,
8546
07:08:26,080 --> 07:08:28,120
those new pieces of knowledge
8547
07:08:28,120 --> 07:08:31,200
I know about what variables should be assigned to what values,
8548
07:08:31,200 --> 07:08:35,040
I can add those to the assignment in order to more quickly make forward
8549
07:08:35,040 --> 07:08:38,720
progress by taking advantage of information that I can just deduce,
8550
07:08:38,720 --> 07:08:41,960
information I know based on the rest of the structure
8551
07:08:41,960 --> 07:08:44,240
of the constraint satisfaction problem.
8552
07:08:44,240 --> 07:08:46,040
And the only other change I'll need to make now
8553
07:08:46,040 --> 07:08:49,240
is if it turns out this value doesn't work, well, then down here,
8554
07:08:49,240 --> 07:08:52,320
I'll go ahead and need to remove not only variable equals value,
8555
07:08:52,320 --> 07:08:54,920
but also any of those inferences that I made,
8556
07:08:54,920 --> 07:08:57,480
remove that from the assignment as well.
8557
07:08:57,480 --> 07:09:01,480
So here, then, we're often able to solve the problem by backtracking less
8558
07:09:01,480 --> 07:09:03,560
than we might originally have needed to, just
8559
07:09:03,560 --> 07:09:05,880
by taking advantage of the fact that every time we
8560
07:09:05,880 --> 07:09:08,480
make a new assignment of one variable to one value,
8561
07:09:08,480 --> 07:09:12,000
that might reduce the domains of other variables as well.
8562
07:09:12,000 --> 07:09:15,520
And we can use that information to begin to more quickly draw conclusions
8563
07:09:15,520 --> 07:09:19,440
in order to try and solve the problem more efficiently as well.
8564
07:09:19,440 --> 07:09:21,560
And it turns out there are other heuristics
8565
07:09:21,560 --> 07:09:25,240
we can use to try and improve the efficiency of our search process
8566
07:09:25,240 --> 07:09:25,920
as well.
8567
07:09:25,920 --> 07:09:28,800
And it really boils down to a couple of these functions
8568
07:09:28,800 --> 07:09:30,800
that I've talked about, but we haven't really
8569
07:09:30,800 --> 07:09:32,280
talked about how they're working.
8570
07:09:32,280 --> 07:09:37,000
And one of them is this function here, select unassigned variable,
8571
07:09:37,000 --> 07:09:40,360
where we're selecting some variable in the constraint satisfaction problem
8572
07:09:40,360 --> 07:09:42,240
that has not yet been assigned.
8573
07:09:42,240 --> 07:09:45,080
So far, I've sort of just been selecting variables randomly,
8574
07:09:45,080 --> 07:09:48,320
just like picking one variable and one unassigned variable in order
8575
07:09:48,320 --> 07:09:50,240
to decide, all right, this is the variable
8576
07:09:50,240 --> 07:09:53,240
that we're going to assign next, and then going from there.
8577
07:09:53,240 --> 07:09:55,480
But it turns out that by being a little bit intelligent,
8578
07:09:55,480 --> 07:09:57,720
by following certain heuristics, we might be
8579
07:09:57,720 --> 07:10:00,400
able to make the search process much more efficient just
8580
07:10:00,400 --> 07:10:05,560
by choosing very carefully which variable we should explore next.
8581
07:10:05,560 --> 07:10:09,320
So some of those heuristics include the minimum remaining values,
8582
07:10:09,320 --> 07:10:12,240
or MRV heuristic, which generally says that if I
8583
07:10:12,240 --> 07:10:14,880
have a choice between which variable I should select,
8584
07:10:14,880 --> 07:10:18,000
I should select the variable with the smallest domain,
8585
07:10:18,000 --> 07:10:21,480
the variable that has the fewest number of remaining values left.
8586
07:10:21,480 --> 07:10:24,640
With the idea being, if there are only two remaining values left,
8587
07:10:24,640 --> 07:10:27,720
well, I may as well prune one of them very quickly in order
8588
07:10:27,720 --> 07:10:30,920
to get to the other, because one of those two has got to be the solution,
8589
07:10:30,920 --> 07:10:33,640
if a solution does exist.
8590
07:10:33,640 --> 07:10:37,600
Sometimes minimum remaining values might not give a conclusive result
8591
07:10:37,600 --> 07:10:40,920
if all the nodes have the same number of remaining values, for example.
8592
07:10:40,920 --> 07:10:43,800
And in that case, another heuristic that can be helpful to look at
8593
07:10:43,800 --> 07:10:45,680
is the degree heuristic.
8594
07:10:45,680 --> 07:10:49,440
The degree of a node is the number of nodes that are attached to that node,
8595
07:10:49,440 --> 07:10:52,880
the number of nodes that are constrained by that particular node.
8596
07:10:52,880 --> 07:10:54,960
And if you imagine which variable should I choose,
8597
07:10:54,960 --> 07:10:57,240
should I choose a variable that has a high degree that
8598
07:10:57,240 --> 07:10:59,240
is connected to a lot of different things,
8599
07:10:59,240 --> 07:11:01,120
or a variable with a low degree that is not
8600
07:11:01,120 --> 07:11:03,120
connected to a lot of different things, well,
8601
07:11:03,120 --> 07:11:06,240
it can often make sense to choose the variable that
8602
07:11:06,240 --> 07:11:09,800
has the highest degree that is connected to the most other nodes
8603
07:11:09,800 --> 07:11:11,760
as the thing you would search first.
8604
07:11:11,760 --> 07:11:12,920
Why is that the case?
8605
07:11:12,920 --> 07:11:16,320
Well, it's because by choosing a variable with a high degree,
8606
07:11:16,320 --> 07:11:20,040
that is immediately going to constrain the rest of the variables more,
8607
07:11:20,040 --> 07:11:23,760
and it's more likely to be able to eliminate large sections of the state
8608
07:11:23,760 --> 07:11:26,960
space that you don't need to search through at all.
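A sketch of a `select_unassigned_variable` that applies both heuristics: smallest domain first (minimum remaining values), then highest degree as the tie-breaker. The `domains` and `neighbors` dictionaries are assumed bookkeeping, not from the lecture's code:

```python
def select_unassigned_variable(assignment, domains, neighbors):
    """Choose the next variable: minimum remaining values first,
    then highest degree (most neighbors) as the tie-breaker."""
    unassigned = [v for v in domains if v not in assignment]
    # min() compares the tuples: fewer remaining values wins; on a tie,
    # the negated degree makes the most-connected variable win.
    return min(unassigned, key=lambda v: (len(domains[v]), -len(neighbors[v])))
```

With A and B already assigned and C's domain pruned to just Wednesday, MRV picks C; at the very start, when every domain still has size 3, the degree tie-break picks the most-connected node, which in the lecture's graph is E.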
8609
07:11:26,960 --> 07:11:29,440
So what could this actually look like?
8610
07:11:29,440 --> 07:11:31,440
Let's go back to this search problem here.
8611
07:11:31,440 --> 07:11:34,320
In this particular case, I've made an assignment here.
8612
07:11:34,320 --> 07:11:35,720
I've made an assignment here.
8613
07:11:35,720 --> 07:11:38,840
And the question is, what should I look at next?
8614
07:11:38,840 --> 07:11:41,600
And according to the minimum remaining values heuristic,
8615
07:11:41,600 --> 07:11:44,320
what I should choose is the variable that has the fewest
8616
07:11:44,320 --> 07:11:46,240
remaining possible values.
8617
07:11:46,240 --> 07:11:48,160
And in this case, that's this node here, node
8618
07:11:48,160 --> 07:11:51,720
C, that only has one value left in its domain, which in this case
8619
07:11:51,720 --> 07:11:55,360
is Wednesday, which is a very reasonable choice of a next assignment
8620
07:11:55,360 --> 07:11:58,240
to make, because I know it's the only option, for example.
8621
07:11:58,240 --> 07:12:01,480
I know that the only possible option for C is Wednesday,
8622
07:12:01,480 --> 07:12:04,760
so I may as well make that assignment and then potentially explore
8623
07:12:04,760 --> 07:12:07,440
the rest of the space after that.
8624
07:12:07,440 --> 07:12:09,520
But meanwhile, at the very start of the problem,
8625
07:12:09,520 --> 07:12:12,960
when I didn't have any knowledge of what nodes should have what values yet,
8626
07:12:12,960 --> 07:12:16,840
I still had to pick what node should be the first one that I try and assign
8627
07:12:16,840 --> 07:12:17,640
a value to.
8628
07:12:17,640 --> 07:12:20,960
And I arbitrarily just chose the one at the top, node A originally.
8629
07:12:20,960 --> 07:12:23,480
But we can be more intelligent about that.
8630
07:12:23,480 --> 07:12:25,480
We can look at this particular graph.
8631
07:12:25,480 --> 07:12:28,240
All of them have domains of the same size, domain of size 3.
8632
07:12:28,240 --> 07:12:31,240
So minimum remaining values doesn't really help us there.
8633
07:12:31,240 --> 07:12:34,760
But we might notice that node E has the highest degree.
8634
07:12:34,760 --> 07:12:37,040
It is connected to the most things.
8635
07:12:37,040 --> 07:12:39,800
And so perhaps it makes sense to begin our search,
8636
07:12:39,800 --> 07:12:41,880
rather than starting at node A at the very top,
8637
07:12:41,880 --> 07:12:43,720
start with the node with the highest degree.
8638
07:12:43,720 --> 07:12:46,760
Start by searching from node E, because from there,
8639
07:12:46,760 --> 07:12:49,160
that's going to much more easily allow us to enforce
8640
07:12:49,160 --> 07:12:51,600
the constraints that are nearby, eliminating
8641
07:12:51,600 --> 07:12:55,400
large portions of the search space that I might not need to search through.
8642
07:12:55,400 --> 07:12:59,480
And in fact, by starting with E, we can immediately then assign other variables.
8643
07:12:59,480 --> 07:13:02,160
And following that, we can actually assign the rest of the variables
8644
07:13:02,160 --> 07:13:04,600
without needing to do any backtracking at all,
8645
07:13:04,600 --> 07:13:06,880
even if I'm not using this inference procedure.
8646
07:13:06,880 --> 07:13:09,360
Just by starting with a node that has a high degree,
8647
07:13:09,360 --> 07:13:12,560
that is going to very quickly restrict the possible values
8648
07:13:12,560 --> 07:13:14,960
that other nodes can take on.
8649
07:13:14,960 --> 07:13:17,360
So that then is how we can go about selecting
8650
07:13:17,360 --> 07:13:19,840
an unassigned variable in a particular order.
8651
07:13:19,840 --> 07:13:22,160
Rather than randomly picking a variable, if we're
8652
07:13:22,160 --> 07:13:24,200
a little bit intelligent about how we choose it,
8653
07:13:24,200 --> 07:13:26,960
we can make our search process much, much more efficient
8654
07:13:26,960 --> 07:13:30,040
by making sure we don't have to search through portions of the search space
8655
07:13:30,040 --> 07:13:32,640
that ultimately aren't going to matter.
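The two variable-ordering ideas described here — minimum remaining values, with degree as a tie-breaker — can be sketched in Python. This is a minimal illustration, not code from the course; the `domains` and `neighbors` dictionaries below are a hypothetical stand-in for the exam-scheduling graph.

```python
def select_unassigned_variable(assignment, domains, neighbors):
    """Pick the unassigned variable with the fewest remaining values
    (minimum remaining values heuristic), breaking ties by choosing
    the variable with the highest degree (most neighbors)."""
    unassigned = [v for v in domains if v not in assignment]
    return min(
        unassigned,
        key=lambda v: (len(domains[v]), -len(neighbors[v])),
    )

# Hypothetical graph: every domain has size 3, so MRV alone can't
# decide, and the degree heuristic picks E, the most-connected node.
domains = {v: {"Mon", "Tue", "Wed"} for v in "ABCDEFG"}
neighbors = {
    "A": {"B", "C"}, "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "E", "F"}, "D": {"B", "E"},
    "E": {"B", "C", "D", "F", "G"}, "F": {"C", "E", "G"},
    "G": {"E", "F"},
}
print(select_unassigned_variable({}, domains, neighbors))  # E
```

Sorting by the tuple `(domain size, -degree)` applies both heuristics in one comparison.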
8656
07:13:32,640 --> 07:13:34,600
The other variable we haven't really talked about,
8657
07:13:34,600 --> 07:13:37,880
the other function here, is this domain values function.
8658
07:13:37,880 --> 07:13:40,520
This domain values function that takes a variable
8659
07:13:40,520 --> 07:13:43,040
and gives me back a sequence of all of the values
8660
07:13:43,040 --> 07:13:45,880
inside of that variable's domain.
8661
07:13:45,880 --> 07:13:47,960
The naive way to approach it is what we did before,
8662
07:13:47,960 --> 07:13:51,880
which is just go in order, go Monday, then Tuesday, then Wednesday.
8663
07:13:51,880 --> 07:13:53,560
But the problem is that going in that order
8664
07:13:53,560 --> 07:13:55,760
might not be the most efficient order to search in,
8665
07:13:55,760 --> 07:13:59,560
that sometimes it might be more efficient to choose values
8666
07:13:59,560 --> 07:14:04,320
that are likely to be solutions first and then go to other values.
8667
07:14:04,320 --> 07:14:06,320
Now, how do you assess whether a value is
8668
07:14:06,320 --> 07:14:10,200
likelier to lead to a solution or less likely to lead to a solution?
8669
07:14:10,200 --> 07:14:15,160
Well, one thing you can take a look at is how many constraints get added,
8670
07:14:15,160 --> 07:14:17,880
how many things get removed from domains as you
8671
07:14:17,880 --> 07:14:21,520
make this new assignment of a variable to this particular value.
8672
07:14:21,520 --> 07:14:26,080
And the heuristic we can use here is the least constraining value heuristic,
8673
07:14:26,080 --> 07:14:28,960
which is the idea that we should return values in order
8674
07:14:28,960 --> 07:14:32,840
based on the number of choices that are ruled out for neighboring variables.
8675
07:14:32,840 --> 07:14:36,440
And I want to start with the least constraining value, the value that
8676
07:14:36,440 --> 07:14:40,080
rules out the fewest possible options.
8677
07:14:40,080 --> 07:14:43,280
And the idea there is that if all I care about doing
8678
07:14:43,280 --> 07:14:47,520
is finding a solution, if I start with a value that
8679
07:14:47,520 --> 07:14:51,400
rules out a lot of other choices, I'm ruling out a lot of possibilities
8680
07:14:51,400 --> 07:14:55,160
that maybe is going to make it less likely that this particular choice
8681
07:14:55,160 --> 07:14:56,480
leads to a solution.
8682
07:14:56,480 --> 07:14:58,640
Whereas on the other hand, if I have a variable
8683
07:14:58,640 --> 07:15:02,160
and I start by choosing a value that doesn't rule out very much,
8684
07:15:02,160 --> 07:15:05,320
well, then I still have a lot of space where there might be a solution
8685
07:15:05,320 --> 07:15:06,640
that I could ultimately find.
8686
07:15:06,640 --> 07:15:09,680
And this might seem a little bit counterintuitive and a little bit at odds
8687
07:15:09,680 --> 07:15:12,080
with what we were talking about before, where I said,
8688
07:15:12,080 --> 07:15:14,000
when you're picking a variable, you should
8689
07:15:14,000 --> 07:15:18,360
pick the variable that is going to have the fewest possible values remaining.
8690
07:15:18,360 --> 07:15:20,480
But here, I want to pick the value for the variable
8691
07:15:20,480 --> 07:15:22,160
that is the least constraining.
8692
07:15:22,160 --> 07:15:25,040
But the general idea is that when I am picking a variable,
8693
07:15:25,040 --> 07:15:27,720
I would like to prune large portions of the search space
8694
07:15:27,720 --> 07:15:30,960
by just choosing a variable that is going to allow me to quickly eliminate
8695
07:15:30,960 --> 07:15:32,560
possible options.
8696
07:15:32,560 --> 07:15:34,880
Whereas here, within a particular variable,
8697
07:15:34,880 --> 07:15:37,880
as I'm considering values that that variable could take on,
8698
07:15:37,880 --> 07:15:40,360
I would like to just find a solution.
8699
07:15:40,360 --> 07:15:42,640
And so what I want to do is ultimately choose
8700
07:15:42,640 --> 07:15:46,680
a value that still leaves open the possibility of me finding a solution
8701
07:15:46,680 --> 07:15:48,040
to be as likely as possible.
8702
07:15:48,040 --> 07:15:51,800
By not ruling out many options, I leave open the possibility
8703
07:15:51,800 --> 07:15:54,120
that I can still find a solution without needing
8704
07:15:54,120 --> 07:15:56,080
to go back later and backtrack.
8705
07:15:56,080 --> 07:15:59,360
So an example of that might be in this particular situation here,
8706
07:15:59,360 --> 07:16:03,360
if I'm trying to choose a value for node C here,
8707
07:16:03,360 --> 07:16:06,080
where C can be either Tuesday or Wednesday.
8708
07:16:06,080 --> 07:16:09,280
We know it can't be Monday because it conflicts with this domain here,
8709
07:16:09,280 --> 07:16:13,360
where we already know that A is Monday, so C must be Tuesday or Wednesday.
8710
07:16:13,360 --> 07:16:16,120
And the question is, should I try Tuesday first,
8711
07:16:16,120 --> 07:16:18,120
or should I try Wednesday first?
8712
07:16:18,120 --> 07:16:21,280
And if I try Tuesday, what gets ruled out?
8713
07:16:21,280 --> 07:16:25,760
Well, one option gets ruled out here, a second option gets ruled out here,
8714
07:16:25,760 --> 07:16:27,720
and a third option gets ruled out here.
8715
07:16:27,720 --> 07:16:30,920
So choosing Tuesday would rule out three possible options.
8716
07:16:30,920 --> 07:16:32,600
And what about choosing Wednesday?
8717
07:16:32,600 --> 07:16:35,140
Well, choosing Wednesday would rule out one option here,
8718
07:16:35,140 --> 07:16:37,400
and it would rule out one option there.
8719
07:16:37,400 --> 07:16:38,740
And so I have two choices.
8720
07:16:38,740 --> 07:16:41,480
I can choose Tuesday that rules out three options,
8721
07:16:41,480 --> 07:16:43,760
or Wednesday that rules out two options.
8722
07:16:43,760 --> 07:16:46,600
And according to the least constraining value heuristic,
8723
07:16:46,600 --> 07:16:49,300
what I should probably do is go ahead and choose Wednesday,
8724
07:16:49,300 --> 07:16:52,240
the one that rules out the fewest number of possible options,
8725
07:16:52,240 --> 07:16:55,040
leaving open as many chances as possible for me
8726
07:16:55,040 --> 07:16:58,320
to eventually find the solution inside of the state space.
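The least constraining value heuristic just walked through can also be sketched in Python. Again a hedged illustration, not course code: the `domains` snapshot below is hypothetical, set up so that Tuesday rules out three neighboring options and Wednesday only two, as in the example.

```python
def order_domain_values(var, assignment, domains, neighbors):
    """Order var's candidate values by the least constraining value
    heuristic: try first the value that rules out the fewest options
    in the domains of var's unassigned neighbors."""
    def ruled_out(value):
        # Count how many neighbors would lose this value.
        return sum(
            value in domains[n]
            for n in neighbors[var]
            if n not in assignment
        )
    return sorted(domains[var], key=ruled_out)

# Hypothetical snapshot: C can be Tue or Wed. Tuesday appears in
# three neighboring domains (B, E, F); Wednesday in only two (B, E).
domains = {
    "C": {"Tue", "Wed"},
    "B": {"Tue", "Wed"},
    "E": {"Tue", "Wed"},
    "F": {"Tue"},
}
neighbors = {"C": {"B", "E", "F"}}
print(order_domain_values("C", {}, domains, neighbors))  # ['Wed', 'Tue']
```

Wednesday comes first because it is the least constraining choice, matching the reasoning above.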
8727
07:16:58,320 --> 07:17:00,280
And ultimately, if you continue this process,
8728
07:17:00,280 --> 07:17:05,520
we will find the solution, an assignment of variables to values,
8729
07:17:05,520 --> 07:17:09,520
that allows us to give each of these exams, each of these classes,
8730
07:17:09,520 --> 07:17:12,240
an exam date that doesn't conflict with anyone
8731
07:17:12,240 --> 07:17:16,320
that happens to be enrolled in two classes at the same time.
8732
07:17:16,320 --> 07:17:18,400
So the big takeaway now with all of this is
8733
07:17:18,400 --> 07:17:21,760
that there are a number of different ways we can formulate a problem.
8734
07:17:21,760 --> 07:17:24,520
The ways we've looked at today are we can formulate a problem
8735
07:17:24,520 --> 07:17:27,840
as a local search problem, a problem where we're looking at a current node
8736
07:17:27,840 --> 07:17:30,720
and moving to a neighbor based on whether that neighbor is better
8737
07:17:30,720 --> 07:17:33,200
or worse than the current node that we are looking at.
8738
07:17:33,200 --> 07:17:35,640
We looked at formulating problems as linear programs,
8739
07:17:35,640 --> 07:17:38,920
where just by putting things in terms of equations and constraints,
8740
07:17:38,920 --> 07:17:41,880
we're able to solve problems a little bit more efficiently.
8741
07:17:41,880 --> 07:17:45,600
And we saw formulating a problem as a constraint satisfaction problem,
8742
07:17:45,600 --> 07:17:48,200
creating this graph of all of the constraints
8743
07:17:48,200 --> 07:17:51,320
that connect two variables that have some constraint between them,
8744
07:17:51,320 --> 07:17:54,080
and using that information to be able to figure out
8745
07:17:54,080 --> 07:17:56,360
what the solution should be.
8746
07:17:56,360 --> 07:17:58,320
And so the takeaway of all of this now is
8747
07:17:58,320 --> 07:18:00,800
that if we have some problem in artificial intelligence
8748
07:18:00,800 --> 07:18:03,200
that we would like to use AI to be able to solve,
8749
07:18:03,200 --> 07:18:05,540
whether that's trying to figure out where hospitals should be
8750
07:18:05,540 --> 07:18:07,880
or trying to solve the traveling salesman problem,
8751
07:18:07,880 --> 07:18:10,560
trying to optimize productions and costs and whatnot,
8752
07:18:10,560 --> 07:18:13,200
or trying to figure out how to satisfy certain constraints,
8753
07:18:13,200 --> 07:18:15,440
whether that's in a Sudoku puzzle, or whether that's
8754
07:18:15,440 --> 07:18:18,440
in trying to figure out how to schedule exams for a university,
8755
07:18:18,440 --> 07:18:21,200
or any number of a wide variety of types of problems,
8756
07:18:21,200 --> 07:18:24,920
if we can formulate that problem as one of these sorts of problems,
8757
07:18:24,920 --> 07:18:27,640
then we can use these known algorithms, these algorithms
8758
07:18:27,640 --> 07:18:30,640
for enforcing arc consistency and backtracking search,
8759
07:18:30,640 --> 07:18:33,240
these hill climbing and simulated annealing algorithms,
8760
07:18:33,240 --> 07:18:36,240
these simplex algorithms and interior point algorithms that
8761
07:18:36,240 --> 07:18:38,220
can be used to solve linear programs, that we
8762
07:18:38,220 --> 07:18:42,400
can use those techniques to begin to solve a whole wide variety of problems
8763
07:18:42,400 --> 07:18:46,600
all in this world of optimization inside of artificial intelligence.
8764
07:18:46,600 --> 07:18:49,760
This was an introduction to artificial intelligence with Python for today.
8765
07:18:49,760 --> 07:18:52,320
We will see you next time.
8766
07:18:52,320 --> 07:18:53,320
[MUSIC PLAYING]
8767
07:19:11,120 --> 07:19:11,620
All right.
8768
07:19:11,620 --> 07:19:13,360
Welcome back, everyone, to an introduction
8769
07:19:13,360 --> 07:19:15,440
to artificial intelligence with Python.
8770
07:19:15,440 --> 07:19:17,920
Now, so far in this class, we've used AI to solve
8771
07:19:17,920 --> 07:19:20,400
a number of different problems, giving AI instructions
8772
07:19:20,400 --> 07:19:24,520
for how to search for a solution, or how to satisfy certain constraints in order
8773
07:19:24,520 --> 07:19:27,480
to find its way from some input point to some output point
8774
07:19:27,480 --> 07:19:29,720
in order to solve some sort of problem.
8775
07:19:29,720 --> 07:19:31,760
Today, we're going to turn to the world of learning,
8776
07:19:31,760 --> 07:19:34,880
in particular the idea of machine learning, which generally refers
8777
07:19:34,880 --> 07:19:38,620
to the idea where we are not going to give the computer explicit instructions
8778
07:19:38,620 --> 07:19:42,160
for how to perform a task, but rather we are going to give the computer access
8779
07:19:42,160 --> 07:19:45,560
to information in the form of data, or patterns that it can learn from,
8780
07:19:45,560 --> 07:19:48,740
and let the computer try and figure out what those patterns are,
8781
07:19:48,740 --> 07:19:52,520
try and understand that data to be able to perform a task on its own.
8782
07:19:52,520 --> 07:19:54,860
Now, machine learning comes in a number of different forms,
8783
07:19:54,860 --> 07:19:56,120
and it's a very wide field.
8784
07:19:56,120 --> 07:20:00,000
So today, we'll explore some of the foundational algorithms and ideas
8785
07:20:00,000 --> 07:20:03,320
that are behind a lot of the different areas within machine learning.
8786
07:20:03,320 --> 07:20:07,200
And one of the most popular is the idea of supervised machine learning,
8787
07:20:07,200 --> 07:20:08,780
or just supervised learning.
8788
07:20:08,780 --> 07:20:11,480
And supervised learning is a particular type of task.
8789
07:20:11,480 --> 07:20:14,480
It refers to the task where we give the computer access
8790
07:20:14,480 --> 07:20:19,120
to a data set, where that data set consists of input-output pairs.
8791
07:20:19,120 --> 07:20:21,000
And what we would like the computer to do
8792
07:20:21,000 --> 07:20:23,360
is we would like our AI to be able to figure out
8793
07:20:23,360 --> 07:20:27,200
some function that maps inputs to outputs.
8794
07:20:27,200 --> 07:20:29,580
So we have a whole bunch of data that generally consists
8795
07:20:29,580 --> 07:20:32,000
of some kind of input, some evidence, some information
8796
07:20:32,000 --> 07:20:33,800
that the computer will have access to.
8797
07:20:33,800 --> 07:20:36,720
And we would like the computer, based on that input information,
8798
07:20:36,720 --> 07:20:40,000
to predict what some output is going to be.
8799
07:20:40,000 --> 07:20:43,280
And we'll give it some data for the computer to train its model on
8800
07:20:43,280 --> 07:20:46,220
and begin to understand how it is that this information works
8801
07:20:46,220 --> 07:20:49,520
and how it is that the inputs and outputs relate to each other.
8802
07:20:49,520 --> 07:20:51,280
But ultimately, we hope that our computer
8803
07:20:51,280 --> 07:20:54,400
will be able to figure out some function that, given those inputs,
8804
07:20:54,400 --> 07:20:56,640
is able to get those outputs.
8805
07:20:56,640 --> 07:20:59,480
There are a couple of different tasks within supervised learning.
8806
07:20:59,480 --> 07:21:02,840
The one we'll focus on and start with is known as classification.
8807
07:21:02,840 --> 07:21:07,200
And classification is the problem where, if I give you a whole bunch of inputs,
8808
07:21:07,200 --> 07:21:11,560
you need to figure out some way to map those inputs into discrete categories,
8809
07:21:11,560 --> 07:21:13,600
where you can decide what those categories are,
8810
07:21:13,600 --> 07:21:16,800
and it's the job of the computer to predict what those categories are
8811
07:21:16,800 --> 07:21:17,360
going to be.
8812
07:21:17,360 --> 07:21:19,960
So that might be, for example, I give you information
8813
07:21:19,960 --> 07:21:23,560
about a bank note, like a US dollar, and I'm asking you to predict for me,
8814
07:21:23,560 --> 07:21:26,480
does it belong to the category of authentic bank notes,
8815
07:21:26,480 --> 07:21:29,400
or does it belong to the category of counterfeit bank notes?
8816
07:21:29,400 --> 07:21:31,520
You need to categorize the input, and we want
8817
07:21:31,520 --> 07:21:33,840
to train the computer to figure out some function
8818
07:21:33,840 --> 07:21:36,160
to be able to do that calculation.
8819
07:21:36,160 --> 07:21:38,280
Another example might be the case of weather,
8820
07:21:38,280 --> 07:21:40,840
something we've talked about a little bit so far in this class,
8821
07:21:40,840 --> 07:21:43,440
where we would like to predict on a given day,
8822
07:21:43,440 --> 07:21:44,960
is it going to rain on that day?
8823
07:21:44,960 --> 07:21:46,600
Is it going to be cloudy on that day?
8824
07:21:46,600 --> 07:21:49,800
And before we've seen how we could do this, if we really give the computer
8825
07:21:49,800 --> 07:21:53,200
all the exact probabilities for if these are the conditions,
8826
07:21:53,200 --> 07:21:54,800
what's the probability of rain?
8827
07:21:54,800 --> 07:21:57,520
Oftentimes, we don't have access to that information, though.
8828
07:21:57,520 --> 07:22:00,440
But what we do have access to is a whole bunch of data.
8829
07:22:00,440 --> 07:22:02,640
So if we wanted to be able to predict something like,
8830
07:22:02,640 --> 07:22:04,560
is it going to rain or is it not going to rain,
8831
07:22:04,560 --> 07:22:07,880
we would give the computer historical information about days
8832
07:22:07,880 --> 07:22:10,320
when it was raining and days when it was not raining
8833
07:22:10,320 --> 07:22:14,200
and ask the computer to look for patterns in that data.
8834
07:22:14,200 --> 07:22:15,800
So what might that data look like?
8835
07:22:15,800 --> 07:22:18,320
Well, we could structure that data in a table like this.
8836
07:22:18,320 --> 07:22:21,440
This might be what our table looks like, where for any particular day,
8837
07:22:21,440 --> 07:22:24,720
going back, we have information about that day's humidity,
8838
07:22:24,720 --> 07:22:28,120
that day's air pressure, and then importantly, we have a label,
8839
07:22:28,120 --> 07:22:31,000
something where the human has said that on this particular day,
8840
07:22:31,000 --> 07:22:33,040
it was raining or it was not raining.
8841
07:22:33,040 --> 07:22:35,680
So you could fill in this table with a whole bunch of data.
8842
07:22:35,680 --> 07:22:39,280
And what makes this what we would call a supervised learning exercise
8843
07:22:39,280 --> 07:22:42,360
is that a human has gone in and labeled each of these data points,
8844
07:22:42,360 --> 07:22:45,920
said that on this day, when these were the values for the humidity and pressure,
8845
07:22:45,920 --> 07:22:49,280
that day was a rainy day and this day was a not rainy day.
8846
07:22:49,280 --> 07:22:51,760
And what we would like the computer to be able to do then
8847
07:22:51,760 --> 07:22:55,320
is to be able to figure out, given these inputs, given the humidity
8848
07:22:55,320 --> 07:22:58,360
and the pressure, can the computer predict what label
8849
07:22:58,360 --> 07:22:59,840
should be associated with that day?
8850
07:22:59,840 --> 07:23:02,840
Does that day look more like it's going to be a day that rains
8851
07:23:02,840 --> 07:23:06,600
or does it look more like a day when it's not going to rain?
8852
07:23:06,600 --> 07:23:10,440
Put a little bit more mathematically, you can think of this as a function
8853
07:23:10,440 --> 07:23:13,280
that takes two inputs, the inputs being the data points
8854
07:23:13,280 --> 07:23:16,520
that our computer will have access to, things like humidity and pressure.
8855
07:23:16,520 --> 07:23:18,400
So we could write a function f that takes
8856
07:23:18,400 --> 07:23:20,560
as input both humidity and pressure.
8857
07:23:20,560 --> 07:23:24,080
And then the output is going to be what category
8858
07:23:24,080 --> 07:23:27,520
we would ascribe to these particular input points, what label
8859
07:23:27,520 --> 07:23:29,240
we would associate with that input.
8860
07:23:29,240 --> 07:23:31,480
So we've seen a couple of example data points here,
8861
07:23:31,480 --> 07:23:34,160
where given this value for humidity and this value for pressure,
8862
07:23:34,160 --> 07:23:37,560
we predict, is it going to rain or is it not going to rain?
8863
07:23:37,560 --> 07:23:40,520
And that's information that we just gathered from the world.
8864
07:23:40,520 --> 07:23:44,000
We measured on various different days what the humidity and pressure were.
8865
07:23:44,000 --> 07:23:48,120
We observed whether or not we saw rain or no rain on that particular day.
8866
07:23:48,120 --> 07:23:51,880
And this function f is what we would like to approximate.
8867
07:23:51,880 --> 07:23:53,840
Now, the computer and we humans don't really
8868
07:23:53,840 --> 07:23:55,640
know exactly how this function f works.
8869
07:23:55,640 --> 07:23:57,920
It's probably quite a complex function.
8870
07:23:57,920 --> 07:24:01,000
So what we're going to do instead is attempt to estimate it.
8871
07:24:01,000 --> 07:24:03,960
We would like to come up with a hypothesis function
8872
07:24:03,960 --> 07:24:08,240
h, which is going to try to approximate what f does.
8873
07:24:08,240 --> 07:24:12,200
We want to come up with some function h that will also take the same inputs
8874
07:24:12,200 --> 07:24:15,720
and will also produce an output, rain or no rain.
8875
07:24:15,720 --> 07:24:20,080
And ideally, we'd like these two functions to agree as much as possible.
8876
07:24:20,080 --> 07:24:23,720
So the goal then of the supervised learning classification tasks
8877
07:24:23,720 --> 07:24:26,880
is going to be to figure out, what does that function h look like?
8878
07:24:26,880 --> 07:24:30,880
How can we begin to estimate, given all of this information, all of this data,
8879
07:24:30,880 --> 07:24:35,280
what category or what label should be assigned to a particular data point?
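Put concretely, the unknown function f maps inputs to observed labels, and the hypothesis h is some function we construct to approximate it. The data values and the toy rule below are made up purely to fix the idea; h here is a hand-written guess, not a learned model.

```python
# Labeled observations gathered from the world:
# (humidity, pressure) -> the label a human recorded for that day.
data = [
    ((0.93, 999.7), "rain"),
    ((0.49, 1015.5), "no rain"),
    ((0.79, 1031.1), "no rain"),
]

# A hypothesis h(humidity, pressure) that tries to approximate f.
# This particular rule is illustrative only.
def h(humidity, pressure):
    return "rain" if humidity > 0.7 and pressure < 1010 else "no rain"

# Ideally h agrees with f on as many observed data points as possible.
for (humidity, pressure), label in data:
    print(h(humidity, pressure) == label)
```

The classification task is then to find an h that agrees with the observed labels as often as possible, rather than writing it by hand.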
8880
07:24:35,280 --> 07:24:37,400
So where could you begin doing this?
8881
07:24:37,400 --> 07:24:39,960
Well, a reasonable thing to do, especially in this situation,
8882
07:24:39,960 --> 07:24:42,240
I have two numerical values, is I could try
8883
07:24:42,240 --> 07:24:47,040
to plot this on a graph that has two axes, an x-axis and a y-axis.
8884
07:24:47,040 --> 07:24:50,440
And in this case, we're just going to be using two numerical values as input.
8885
07:24:50,440 --> 07:24:54,120
But these same types of ideas scale as you add more and more inputs as well.
8886
07:24:54,120 --> 07:24:56,000
We'll be plotting things in two dimensions.
8887
07:24:56,000 --> 07:24:58,440
But as we soon see, you could add more inputs
8888
07:24:58,440 --> 07:25:00,720
and just imagine things in multiple dimensions.
8889
07:25:00,720 --> 07:25:04,040
And while we humans have trouble conceptualizing anything really
8890
07:25:04,040 --> 07:25:06,320
beyond three dimensions, at least visually,
8891
07:25:06,320 --> 07:25:08,800
a computer has no problem with trying to imagine things
8892
07:25:08,800 --> 07:25:11,280
in many, many more dimensions, that for a computer,
8893
07:25:11,280 --> 07:25:14,440
each dimension is just some separate number that it is keeping track of.
8894
07:25:14,440 --> 07:25:17,600
So it wouldn't be unreasonable for a computer to think in 10 dimensions
8895
07:25:17,600 --> 07:25:20,840
or 100 dimensions to be able to try to solve a problem.
8896
07:25:20,840 --> 07:25:22,320
But for now, we've got two inputs.
8897
07:25:22,320 --> 07:25:25,440
So we'll graph things along two axes, an x-axis, which will here
8898
07:25:25,440 --> 07:25:29,400
represent humidity, and a y-axis, which here represents pressure.
8899
07:25:29,400 --> 07:25:32,280
And what we might do is say, let's take all of the days
8900
07:25:32,280 --> 07:25:35,200
that were raining and just try to plot them on this graph
8901
07:25:35,200 --> 07:25:37,080
and see where they fall on this graph.
8902
07:25:37,080 --> 07:25:40,540
And here might be all of the rainy days, where each rainy day is
8903
07:25:40,540 --> 07:25:42,800
one of these blue dots here that corresponds
8904
07:25:42,800 --> 07:25:46,440
to a particular value for humidity and a particular value for pressure.
8905
07:25:46,440 --> 07:25:49,280
And then I might do the same thing with the days that were not rainy.
8906
07:25:49,280 --> 07:25:51,320
So take all the not rainy days, figure out
8907
07:25:51,320 --> 07:25:53,960
what their values were for each of these two inputs,
8908
07:25:53,960 --> 07:25:56,680
and go ahead and plot them on this graph as well.
8909
07:25:56,680 --> 07:25:58,080
And I've here plotted them in red.
8910
07:25:58,080 --> 07:26:00,320
So blue here stands for a rainy day.
8911
07:26:00,320 --> 07:26:02,800
Red here stands for a not rainy day.
8912
07:26:02,800 --> 07:26:04,880
And this then is the input that my computer
8913
07:26:04,880 --> 07:26:07,080
has access to.
8914
07:26:07,080 --> 07:26:09,560
And what I would like the computer to be able to do
8915
07:26:09,560 --> 07:26:13,440
is to train a model such that if I'm ever presented with a new input that
8916
07:26:13,440 --> 07:26:18,080
doesn't have a label associated with it, something like this white dot here,
8917
07:26:18,080 --> 07:26:21,440
I would like to predict, given those values for each of the two inputs,
8918
07:26:21,440 --> 07:26:24,800
should we classify it as a blue dot, a rainy day,
8919
07:26:24,800 --> 07:26:28,120
or should we classify it as a red dot, a not rainy day?
8920
07:26:28,120 --> 07:26:30,800
And if you're just looking at this picture graphically, trying to say,
8921
07:26:30,800 --> 07:26:34,080
all right, this white dot, does it look like it belongs to the blue category,
8922
07:26:34,080 --> 07:26:36,480
or does it look like it belongs to the red category,
8923
07:26:36,480 --> 07:26:40,360
I think most people would agree that it probably belongs to the blue category.
8924
07:26:40,360 --> 07:26:41,120
And why is that?
8925
07:26:41,120 --> 07:26:45,280
Well, it looks like it's close to other blue dots.
8926
07:26:45,280 --> 07:26:47,280
And that's not a very formal notion, but it's a notion
8927
07:26:47,280 --> 07:26:49,120
that we'll formalize in just a moment.
8928
07:26:49,120 --> 07:26:52,120
That because it seems to be close to this blue dot here,
8929
07:26:52,120 --> 07:26:54,280
nothing else is closer to it, then we might
8930
07:26:54,280 --> 07:26:56,640
say that it should be categorized as blue.
8931
07:26:56,640 --> 07:26:58,960
It should fall into that category of, I think
8932
07:26:58,960 --> 07:27:01,680
that day is going to be a rainy day based on that input.
8933
07:27:01,680 --> 07:27:04,840
Might not be totally accurate, but it's a pretty good guess.
8934
07:27:04,840 --> 07:27:08,040
And this type of algorithm is actually a very popular and common machine
8935
07:27:08,040 --> 07:27:11,640
learning algorithm known as nearest neighbor classification.
8936
07:27:11,640 --> 07:27:14,720
It's an algorithm for solving these classification-type problems.
8937
07:27:14,720 --> 07:27:18,360
And in nearest neighbor classification, it's going to perform this algorithm.
8938
07:27:18,360 --> 07:27:20,360
What it will do is, given an input, it will
8939
07:27:20,360 --> 07:27:24,560
choose the class of the nearest data point to that input.
8940
07:27:24,560 --> 07:27:27,800
By class, we just here mean category, like rain or no rain,
8941
07:27:27,800 --> 07:27:29,760
counterfeit or not counterfeit.
8942
07:27:29,760 --> 07:27:34,600
And we choose the category or the class based on the nearest data point.
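Nearest neighbor classification as just described fits in a few lines of Python. A minimal sketch with made-up data points (the feature scales here are toy values, not real weather measurements):

```python
import math

def nearest_neighbor_classify(point, data):
    """Return the class of the single labeled data point that is
    nearest to `point` by Euclidean distance."""
    _, label = min(data, key=lambda pair: math.dist(pair[0], point))
    return label

# Hypothetical labeled points: "rain" plays the role of the blue
# dots, "no rain" the red dots.
data = [
    ((0.90, 1000.0), "rain"),
    ((0.85, 1005.0), "rain"),
    ((0.40, 1020.0), "no rain"),
]
print(nearest_neighbor_classify((0.88, 1002.0), data))  # rain
```

The unlabeled point is classified purely by whichever single labeled point it sits closest to.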
8943
07:27:34,600 --> 07:27:36,480
So given all that data we just looked at,
8944
07:27:36,480 --> 07:27:39,960
is the nearest data point a blue point or is it a red point?
8945
07:27:39,960 --> 07:27:42,600
And depending on the answer to that question,
8946
07:27:42,600 --> 07:27:44,600
we were able to make some sort of judgment.
8947
07:27:44,600 --> 07:27:47,320
We were able to say something like, we think it's going to be blue
8948
07:27:47,320 --> 07:27:49,360
or we think it's going to be red.
8949
07:27:49,360 --> 07:27:51,480
So likewise, we could apply this to other data points
8950
07:27:51,480 --> 07:27:52,800
that we encounter as well.
8951
07:27:52,800 --> 07:27:56,800
If suddenly this data point comes about, well, its nearest data point is red.
8952
07:27:56,800 --> 07:28:00,240
So we would go ahead and classify this as a red point, not raining.
8953
07:28:00,240 --> 07:28:03,480
Things get a little bit trickier, though, when you look at a point
8954
07:28:03,480 --> 07:28:07,160
like this white point over here and you ask the same sort of question.
8955
07:28:07,160 --> 07:28:10,640
Should it belong to the category of blue points, the rainy days?
8956
07:28:10,640 --> 07:28:14,800
Or should it belong to the category of red points, the not rainy days?
8957
07:28:14,800 --> 07:28:18,760
Now, nearest neighbor classification would say the way you solve this problem
8958
07:28:18,760 --> 07:28:21,000
is look at which point is nearest to that point.
8959
07:28:21,000 --> 07:28:23,000
You look at this nearest point and say it's red.
8960
07:28:23,000 --> 07:28:24,240
It's a not rainy day.
8961
07:28:24,240 --> 07:28:27,080
And therefore, according to nearest neighbor classification,
8962
07:28:27,080 --> 07:28:30,400
I would say that this unlabeled point, well, that should also be red.
8963
07:28:30,400 --> 07:28:33,720
It should also be classified as a not rainy day.
8964
07:28:33,720 --> 07:28:37,080
But your intuition might think that that's a reasonable judgment to make,
8965
07:28:37,080 --> 07:28:39,280
that the closest thing is a not rainy day.
8966
07:28:39,280 --> 07:28:41,480
So may as well guess that it's a not rainy day.
8967
07:28:41,480 --> 07:28:44,640
But it's probably also reasonable to look at the bigger picture of things
8968
07:28:44,640 --> 07:28:49,480
to say, yes, it is true that the nearest point to it was a red point.
8969
07:28:49,480 --> 07:28:52,920
But it's surrounded by a whole bunch of other blue points.
8970
07:28:52,920 --> 07:28:55,160
So looking at the bigger picture, there's potentially
8971
07:28:55,160 --> 07:28:59,160
an argument to be made that this point should actually be blue.
8972
07:28:59,160 --> 07:29:01,440
And with only this data, we actually don't know for sure.
8973
07:29:01,440 --> 07:29:04,080
We are given some input, something we're trying to predict.
8974
07:29:04,080 --> 07:29:07,240
And we don't necessarily know what the output is going to be.
8975
07:29:07,240 --> 07:29:10,320
So in this case, which one is correct is difficult to say.
8976
07:29:10,320 --> 07:29:13,560
But oftentimes, considering more than just a single neighbor,
8977
07:29:13,560 --> 07:29:18,080
considering multiple neighbors can sometimes give us a better result.
8978
07:29:18,080 --> 07:29:21,800
And so there's a variant on the nearest neighbor classification algorithm
8979
07:29:21,800 --> 07:29:25,400
that is known as the K nearest neighbor classification algorithm,
8980
07:29:25,400 --> 07:29:28,320
where K is some parameter, some number that we choose,
8981
07:29:28,320 --> 07:29:30,920
for how many neighbors are we going to look at.
8982
07:29:30,920 --> 07:29:34,280
So one nearest neighbor classification is what we saw before.
8983
07:29:34,280 --> 07:29:37,600
Just pick the one nearest neighbor and use that category.
8984
07:29:37,600 --> 07:29:39,640
But with K nearest neighbor classification,
8985
07:29:39,640 --> 07:29:44,840
where K might be 3, or 5, or 7, meaning we look at the 3, or 5, or 7 closest
8986
07:29:44,840 --> 07:29:48,760
neighbors, the closest data points to that point. It works a little bit differently.
8987
07:29:48,760 --> 07:29:50,600
This algorithm, given an input, will
8988
07:29:50,600 --> 07:29:55,520
choose the most common class out of the K nearest data points to that input.
8989
07:29:55,520 --> 07:29:59,560
So if we look at the five nearest points, and three of them say it's raining,
8990
07:29:59,560 --> 07:30:01,360
and two of them say it's not raining, we'll
8991
07:30:01,360 --> 07:30:05,320
go with the three instead of the two, because each one effectively
8992
07:30:05,320 --> 07:30:09,280
gets one vote towards what they believe the category ought to be.
8993
07:30:09,280 --> 07:30:12,760
And ultimately, you choose the category that has the most votes
8994
07:30:12,760 --> 07:30:14,680
as a consequence of that.
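The voting version just described, K nearest neighbors, extends the earlier sketch: sort the labeled points by distance, take the K closest, and let each one vote. The data below is hypothetical, arranged so that with k=5 three "rain" votes outweigh two "no rain" votes.

```python
import math
from collections import Counter

def knn_classify(point, data, k=3):
    """Choose the most common class among the k labeled data points
    nearest to `point` (Euclidean distance); each neighbor gets one vote."""
    by_distance = sorted(data, key=lambda pair: math.dist(pair[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical points: three nearby "rain" days, two nearby
# "no rain" days, and one far-away "no rain" day.
data = [
    ((1.0, 1.0), "rain"), ((1.1, 1.0), "rain"), ((0.9, 1.1), "rain"),
    ((1.2, 0.8), "no rain"), ((0.8, 0.9), "no rain"),
    ((5.0, 5.0), "no rain"),
]
print(knn_classify((1.0, 1.0), data, k=5))  # rain
```

With k=1 this reduces to plain nearest neighbor classification; larger k considers the bigger picture, as in the white-dot example above.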
8995
07:30:14,680 --> 07:30:17,640
So K nearest neighbor classification, fairly straightforward one
8996
07:30:17,640 --> 07:30:18,880
to understand intuitively.
8997
07:30:18,880 --> 07:30:21,800
You just look at the neighbors and figure out what the answer might be.
8998
07:30:21,800 --> 07:30:24,120
And it turns out this can work very, very well
8999
07:30:24,120 --> 07:30:28,360
for solving a whole variety of different types of classification problems.
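The voting procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the `knn_classify` helper and the sample `(humidity, pressure)` data points are invented for the example.

```python
import math
from collections import Counter

def knn_classify(data, point, k):
    """Predict a label for `point` by majority vote among its k nearest neighbors.

    data: list of ((x1, x2), label) pairs
    point: (x1, x2) tuple to classify
    """
    # Sort the labeled points by Euclidean distance to the query point.
    neighbors = sorted(data, key=lambda item: math.dist(item[0], point))
    # Each of the k closest points gets one vote; the most common label wins.
    labels = [label for _, label in neighbors[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical data: three nearby "rain" points outvote two "no rain" points.
data = [
    ((0.8, 1.0), "rain"), ((0.7, 1.1), "rain"), ((0.9, 0.9), "rain"),
    ((0.2, 1.4), "no rain"), ((0.3, 1.3), "no rain"),
]
print(knn_classify(data, (0.75, 1.0), k=5))  # -> rain
```

With k=5, three of the five neighbors say "rain" and two say "no rain", so "rain" gets the most votes, exactly as in the example above.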
9000
07:30:28,360 --> 07:30:31,020
But not every model is going to work under every situation.
9001
07:30:31,020 --> 07:30:33,520
And so one of the things we'll take a look at today, especially
9002
07:30:33,520 --> 07:30:35,600
in the context of supervised machine learning,
9003
07:30:35,600 --> 07:30:38,400
is that there are a number of different approaches to machine learning,
9004
07:30:38,400 --> 07:30:40,760
a number of different algorithms that we can apply,
9005
07:30:40,760 --> 07:30:44,880
all solving the same type of problem, all solving some kind of classification
9006
07:30:44,880 --> 07:30:47,820
problem where we want to take inputs and organize them
9007
07:30:47,820 --> 07:30:49,080
into different categories.
9008
07:30:49,080 --> 07:30:51,280
And no one algorithm is necessarily always
9009
07:30:51,280 --> 07:30:53,440
going to be better than some other algorithm.
9010
07:30:53,440 --> 07:30:54,640
They each have their trade-offs.
9011
07:30:54,640 --> 07:30:57,480
And maybe depending on the data, one type of algorithm
9012
07:30:57,480 --> 07:30:59,360
is going to be better suited to trying to model
9013
07:30:59,360 --> 07:31:01,320
that information than some other algorithm.
9014
07:31:01,320 --> 07:31:04,440
And so this is what a lot of machine learning research ends up being about,
9015
07:31:04,440 --> 07:31:06,740
that when you're trying to apply machine learning techniques,
9016
07:31:06,740 --> 07:31:09,400
you're often looking not just at one particular algorithm,
9017
07:31:09,400 --> 07:31:11,160
but trying multiple different algorithms,
9018
07:31:11,160 --> 07:31:14,520
trying to see what is going to give you the best results for trying
9019
07:31:14,520 --> 07:31:18,720
to predict some function that maps inputs to outputs.
9020
07:31:18,720 --> 07:31:22,320
So what then are the drawbacks of K nearest neighbor classification?
9021
07:31:22,320 --> 07:31:23,560
Well, there are a couple.
9022
07:31:23,560 --> 07:31:27,200
One might be that in a naive approach, at least, it could be fairly slow
9023
07:31:27,200 --> 07:31:30,000
to have to go through and measure the distance between a point
9024
07:31:30,000 --> 07:31:32,240
and every single one of these points that exist here.
9025
07:31:32,240 --> 07:31:33,740
Now, there are ways of trying to get around that.
9026
07:31:33,740 --> 07:31:36,320
There are data structures that can help us more quickly
9027
07:31:36,320 --> 07:31:38,080
find these neighbors.
9028
07:31:38,080 --> 07:31:41,440
There are also techniques you can use to try and prune some of this data,
9029
07:31:41,440 --> 07:31:43,600
remove some of the data points so that you're only
9030
07:31:43,600 --> 07:31:47,320
left with the relevant data points just to make it a little bit easier.
9031
07:31:47,320 --> 07:31:49,840
But ultimately, what we might like to do is come up
9032
07:31:49,840 --> 07:31:53,320
with another way of trying to do this classification.
9033
07:31:53,320 --> 07:31:55,500
And one way of trying to do the classification
9034
07:31:55,500 --> 07:31:57,680
was looking at what are the neighboring points.
9035
07:31:57,680 --> 07:32:01,120
But another way might be to try to look at all of the data
9036
07:32:01,120 --> 07:32:05,240
and see if we can come up with some decision boundary, some boundary that
9037
07:32:05,240 --> 07:32:08,760
will separate the rainy days from the not rainy days.
9038
07:32:08,760 --> 07:32:11,720
And in the case of two dimensions, we can do that by drawing a line,
9039
07:32:11,720 --> 07:32:12,560
for example.
9040
07:32:12,560 --> 07:32:15,840
So what we might want to try to do is just find some line,
9041
07:32:15,840 --> 07:32:20,440
find some separator that divides the rainy days, the blue points over here,
9042
07:32:20,440 --> 07:32:22,960
from the not rainy days, the red points over there.
9043
07:32:22,960 --> 07:32:25,600
We're now trying a different approach in contrast
9044
07:32:25,600 --> 07:32:27,840
with the nearest neighbor approach, which just
9045
07:32:27,840 --> 07:32:31,320
looked at local data around the input data point that we cared about.
9046
07:32:31,320 --> 07:32:35,080
Now what we're doing is trying to use a technique known as linear regression
9047
07:32:35,080 --> 07:32:39,800
to find some sort of line that will separate the two halves from each other.
9048
07:32:39,800 --> 07:32:42,080
Now sometimes it'll actually be possible to come up
9049
07:32:42,080 --> 07:32:45,120
with some line that perfectly separates all the rainy days
9050
07:32:45,120 --> 07:32:46,520
from the not rainy days.
9051
07:32:46,520 --> 07:32:49,040
Realistically, though, this is probably cleaner
9052
07:32:49,040 --> 07:32:50,960
than many data sets will actually be.
9053
07:32:50,960 --> 07:32:52,400
Oftentimes, data is messier.
9054
07:32:52,400 --> 07:32:53,280
There are outliers.
9055
07:32:53,280 --> 07:32:56,760
There's random noise that happens inside of a particular system.
9056
07:32:56,760 --> 07:32:59,160
And what we'd like to do is still be able to figure out
9057
07:32:59,160 --> 07:33:00,560
what a line might look like.
9058
07:33:00,560 --> 07:33:04,960
So in practice, the data will not always be linearly separable.
9059
07:33:04,960 --> 07:33:07,680
Linearly separable refers to a data set
9060
07:33:07,680 --> 07:33:11,680
where I could draw a line just to separate the two halves of it perfectly.
9061
07:33:11,680 --> 07:33:13,520
Instead, you might have a situation like this,
9062
07:33:13,520 --> 07:33:16,960
where there are some rainy points that are on this side of the line
9063
07:33:16,960 --> 07:33:19,480
and some not rainy points that are on that side of the line.
9064
07:33:19,480 --> 07:33:23,800
And there may not be a line that perfectly separates
9065
07:33:23,800 --> 07:33:25,920
one half of the inputs from the other half,
9066
07:33:25,920 --> 07:33:29,440
that perfectly separates all the rainy days from the not rainy days.
9067
07:33:29,440 --> 07:33:33,000
But we can still say that this line does a pretty good job.
9068
07:33:33,000 --> 07:33:34,880
And we'll try to formalize a little bit later
9069
07:33:34,880 --> 07:33:38,400
what we mean when we say something like this line does a pretty good job
9070
07:33:38,400 --> 07:33:40,000
of trying to make that prediction.
9071
07:33:40,000 --> 07:33:42,640
But for now, let's just say we're looking for a line that
9072
07:33:42,640 --> 07:33:47,680
does as good of a job as we can at trying to separate one category of things
9073
07:33:47,680 --> 07:33:49,560
from another category of things.
9074
07:33:49,560 --> 07:33:53,080
So let's now try to formalize this a little bit more mathematically.
9075
07:33:53,080 --> 07:33:56,400
We want to come up with some sort of function, some way we can define this
9076
07:33:56,400 --> 07:33:57,200
line.
9077
07:33:57,200 --> 07:34:01,840
And our inputs are things like humidity and pressure in this case.
9078
07:34:01,840 --> 07:34:05,760
So our inputs we might call x1 is going to represent humidity,
9079
07:34:05,760 --> 07:34:08,320
and x2 is going to represent pressure.
9080
07:34:08,320 --> 07:34:11,440
These are inputs that we are going to provide to our machine learning
9081
07:34:11,440 --> 07:34:12,160
algorithm.
9082
07:34:12,160 --> 07:34:14,800
And given those inputs, we would like for our model
9083
07:34:14,800 --> 07:34:17,160
to be able to predict some sort of output.
9084
07:34:17,160 --> 07:34:20,360
And we are going to predict that using our hypothesis function, which
9085
07:34:20,360 --> 07:34:21,520
we called h.
9086
07:34:21,520 --> 07:34:26,600
Our hypothesis function is going to take as input x1 and x2, humidity
9087
07:34:26,600 --> 07:34:27,720
and pressure in this case.
9088
07:34:27,720 --> 07:34:29,680
And you can imagine if we didn't just have two inputs,
9089
07:34:29,680 --> 07:34:31,760
we had three or four or five inputs or more,
9090
07:34:31,760 --> 07:34:35,200
we could have this hypothesis function take all of those as input.
9091
07:34:35,200 --> 07:34:38,440
And we'll see examples of that a little bit later as well.
9092
07:34:38,440 --> 07:34:42,280
And now the question is, what does this hypothesis function do?
9093
07:34:42,280 --> 07:34:46,880
Well, it really just needs to measure, is this data point
9094
07:34:46,880 --> 07:34:51,560
on one side of the boundary, or is it on the other side of the boundary?
9095
07:34:51,560 --> 07:34:53,520
And how do we formalize that boundary?
9096
07:34:53,520 --> 07:34:55,920
Well, the boundary is generally going to be
9097
07:34:55,920 --> 07:34:59,600
a linear combination of these input variables,
9098
07:34:59,600 --> 07:35:01,120
at least in this particular case.
9099
07:35:01,120 --> 07:35:03,840
So what we're trying to do when we say linear combination
9100
07:35:03,840 --> 07:35:06,440
is take each of these inputs and multiply them
9101
07:35:06,440 --> 07:35:08,520
by some number that we're going to have to figure out.
9102
07:35:08,520 --> 07:35:11,600
We'll generally call that number a weight for how important
9103
07:35:11,600 --> 07:35:14,760
should these variables be in trying to determine the answer.
9104
07:35:14,760 --> 07:35:17,400
So we'll weight each of these variables with some weight,
9105
07:35:17,400 --> 07:35:19,880
and we might add a constant to it just to try and make
9106
07:35:19,880 --> 07:35:21,560
the function a little bit different.
9107
07:35:21,560 --> 07:35:23,240
And the result, we just need to compare.
9108
07:35:23,240 --> 07:35:26,300
Is it greater than 0, or is it less than 0 to say,
9109
07:35:26,300 --> 07:35:30,120
does it belong on one side of the line or the other side of the line?
9110
07:35:30,120 --> 07:35:33,960
So what that mathematical expression might look like is this.
9111
07:35:33,960 --> 07:35:38,920
We would take each of our variables, x1 and x2, and multiply them by some weight.
9112
07:35:38,920 --> 07:35:40,600
I don't yet know what that weight is, but it's
9113
07:35:40,600 --> 07:35:43,600
going to be some number, weight 1 and weight 2.
9114
07:35:43,600 --> 07:35:46,440
And maybe we just want to add some other weight 0 to it,
9115
07:35:46,440 --> 07:35:50,080
because the function might require us to shift the entire value up or down
9116
07:35:50,080 --> 07:35:51,480
by a certain amount.
9117
07:35:51,480 --> 07:35:52,720
And then we just compare.
9118
07:35:52,720 --> 07:35:55,760
If we do all this math, is it greater than or equal to 0?
9119
07:35:55,760 --> 07:35:58,720
If so, we might categorize that data point as a rainy day.
9120
07:35:58,720 --> 07:36:02,360
And otherwise, we might say, no rain.
9121
07:36:02,360 --> 07:36:05,160
So the key here, then, is that this expression
9122
07:36:05,160 --> 07:36:08,540
is how we are going to calculate whether it's a rainy day or not.
9123
07:36:08,540 --> 07:36:11,520
We're going to do a bunch of math where we take each of the variables,
9124
07:36:11,520 --> 07:36:14,560
multiply them by a weight, maybe add an extra weight to it,
9125
07:36:14,560 --> 07:36:17,000
see if the result is greater than or equal to 0.
9126
07:36:17,000 --> 07:36:19,160
And using that result of that expression,
9127
07:36:19,160 --> 07:36:22,580
we're able to determine whether it's raining or not raining.
9128
07:36:22,580 --> 07:36:26,000
This expression here is in this case going to refer to just some line.
9129
07:36:26,000 --> 07:36:29,040
If you were to plot that graphically, it would just be some line.
9130
07:36:29,040 --> 07:36:33,240
And what the line actually looks like depends upon these weights.
9131
07:36:33,240 --> 07:36:35,640
x1 and x2 are the inputs, but these weights
9132
07:36:35,640 --> 07:36:39,160
are really what determine the shape of that line, the slope of that line,
9133
07:36:39,160 --> 07:36:42,040
and what that line actually looks like.
9134
07:36:42,040 --> 07:36:45,200
So we then would like to figure out what these weights should be.
9135
07:36:45,200 --> 07:36:47,460
We can choose whatever weights we want, but we
9136
07:36:47,460 --> 07:36:51,280
want to choose weights in such a way that if you pass in a rainy day's
9137
07:36:51,280 --> 07:36:53,800
humidity and pressure, then you end up with a result that
9138
07:36:53,800 --> 07:36:55,240
is greater than or equal to 0.
9139
07:36:55,240 --> 07:36:57,960
And we would like it such that if we passed into our hypothesis
9140
07:36:57,960 --> 07:37:01,880
function a not rainy day's inputs, then the output that we get
9141
07:37:01,880 --> 07:37:03,880
should be not raining.
9142
07:37:03,880 --> 07:37:06,880
So before we get there, let's try and formalize this a little bit more
9143
07:37:06,880 --> 07:37:10,280
mathematically just to get a sense for how it is that you'll often see this
9144
07:37:10,280 --> 07:37:12,960
if you ever go further into supervised machine learning
9145
07:37:12,960 --> 07:37:14,320
and explore this idea.
9146
07:37:14,320 --> 07:37:16,480
One thing is that generally for these categories,
9147
07:37:16,480 --> 07:37:20,240
we'll sometimes just use the names of the categories like rain and not rain.
9148
07:37:20,240 --> 07:37:23,520
Often mathematically, if we're trying to do comparisons between these things,
9149
07:37:23,520 --> 07:37:25,960
it's easier just to deal in the world of numbers.
9150
07:37:25,960 --> 07:37:30,600
So we could just say 1 and 0, 1 for raining, 0 for not raining.
9151
07:37:30,600 --> 07:37:31,880
So we do all this math.
9152
07:37:31,880 --> 07:37:34,360
And if the result is greater than or equal to 0,
9153
07:37:34,360 --> 07:37:37,960
we'll go ahead and say our hypothesis function outputs 1, meaning raining.
9154
07:37:37,960 --> 07:37:41,040
And otherwise, it outputs 0, meaning not raining.
9155
07:37:41,040 --> 07:37:45,000
And oftentimes, this type of expression will instead
9156
07:37:45,000 --> 07:37:47,840
be expressed using vector mathematics.
9157
07:37:47,840 --> 07:37:50,240
And all a vector is, if you're not familiar with the term,
9158
07:37:50,240 --> 07:37:53,240
is a sequence of numerical values.
9159
07:37:53,240 --> 07:37:56,480
You could represent that in Python using a list of numerical values
9160
07:37:56,480 --> 07:37:59,160
or a tuple with numerical values.
9161
07:37:59,160 --> 07:38:02,920
And here, we have a couple of sequences of numerical values.
9162
07:38:02,920 --> 07:38:06,160
One of our vectors, one of our sequences of numerical values,
9163
07:38:06,160 --> 07:38:11,200
are all of these individual weights, w0, w1, and w2.
9164
07:38:11,200 --> 07:38:14,240
So we could construct what we'll call a weight vector,
9165
07:38:14,240 --> 07:38:16,400
and we'll see why this is useful in a moment,
9166
07:38:16,400 --> 07:38:19,960
called w, generally represented using a boldface w, that
9167
07:38:19,960 --> 07:38:23,080
is just a sequence of these three weights, weight 0, weight 1,
9168
07:38:23,080 --> 07:38:24,480
and weight 2.
9169
07:38:24,480 --> 07:38:26,720
And to be able to calculate, based on those weights,
9170
07:38:26,720 --> 07:38:30,440
whether we think a day is raining or not raining,
9171
07:38:30,440 --> 07:38:35,320
we're going to multiply each of those weights by one of our input variables.
9172
07:38:35,320 --> 07:38:39,480
That w2, this weight, is going to be multiplied by input variable x2.
9173
07:38:39,480 --> 07:38:42,640
w1 is going to be multiplied by input variable x1.
9174
07:38:42,640 --> 07:38:46,120
And w0, well, it's not being multiplied by anything.
9175
07:38:46,120 --> 07:38:48,120
But to make sure the vectors are the same length,
9176
07:38:48,120 --> 07:38:50,200
and we'll see why that's useful in just a second,
9177
07:38:50,200 --> 07:38:54,080
we'll just go ahead and say w0 is being multiplied by 1.
9178
07:38:54,080 --> 07:38:55,840
Because you can multiply something by 1,
9179
07:38:55,840 --> 07:38:58,040
and you end up getting the exact same number.
9180
07:38:58,040 --> 07:39:00,480
So in addition to the weight vector w, we'll
9181
07:39:00,480 --> 07:39:05,480
also have an input vector that we'll call x that has three values, 1,
9182
07:39:05,480 --> 07:39:11,080
again, because we're just multiplying w0 by 1 eventually, and then x1 and x2.
9183
07:39:11,080 --> 07:39:14,800
So here, then, we've represented two distinct vectors, a vector of weights
9184
07:39:14,800 --> 07:39:16,520
that we need to somehow learn.
9185
07:39:16,520 --> 07:39:18,640
The goal of our machine learning algorithm
9186
07:39:18,640 --> 07:39:21,160
is to learn what this weight vector is supposed to be.
9187
07:39:21,160 --> 07:39:23,440
We could choose any arbitrary set of numbers,
9188
07:39:23,440 --> 07:39:26,400
and it would produce a function that tries to predict rain or not rain,
9189
07:39:26,400 --> 07:39:28,080
but it probably wouldn't be very good.
9190
07:39:28,080 --> 07:39:32,120
What we want to do is come up with a good choice of these weights
9191
07:39:32,120 --> 07:39:34,920
so that we're able to do the accurate predictions.
9192
07:39:34,920 --> 07:39:38,720
And then this input vector represents a particular input
9193
07:39:38,720 --> 07:39:41,880
to the function, a data point for which we would like to estimate,
9194
07:39:41,880 --> 07:39:45,200
is that day a rainy day, or is that day a not rainy day?
9195
07:39:45,200 --> 07:39:47,040
And so that's going to vary just depending
9196
07:39:47,040 --> 07:39:49,200
on what input is provided to our function, what
9197
07:39:49,200 --> 07:39:51,240
it is that we are trying to estimate.
9198
07:39:51,240 --> 07:39:55,240
And then to do the calculation, we want to calculate this expression here,
9199
07:39:55,240 --> 07:39:59,200
and it turns out that expression is what we would call the dot product
9200
07:39:59,200 --> 07:40:00,320
of these two vectors.
9201
07:40:00,320 --> 07:40:04,600
The dot product of two vectors just means taking each of the terms
9202
07:40:04,600 --> 07:40:08,120
in the vectors and multiplying them together: w0 multiplied by 1,
9203
07:40:08,120 --> 07:40:11,720
w1 multiplied by x1, w2 multiplied by x2,
9204
07:40:11,720 --> 07:40:14,360
and that's why these vectors need to be the same length.
9205
07:40:14,360 --> 07:40:17,400
And then we just add all of the results together.
9206
07:40:17,400 --> 07:40:22,960
So the dot product of w and x, our weight vector and our input vector,
9207
07:40:22,960 --> 07:40:26,640
that's just going to be w0 times 1, or just w0,
9208
07:40:26,640 --> 07:40:30,680
plus w1 times x1, multiplying these two terms together,
9209
07:40:30,680 --> 07:40:35,760
plus w2 times x2, multiplying those terms together.
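The dot product just described is simple enough to write out directly: multiply the two vectors term by term and sum the products. A minimal sketch, with weight values made up purely for illustration:

```python
def dot(w, x):
    """Dot product of two equal-length vectors: sum of term-by-term products."""
    assert len(w) == len(x)  # this is why the vectors must be the same length
    return sum(wi * xi for wi, xi in zip(w, x))

w = [1.0, 2.0, -3.0]  # w0, w1, w2 (hypothetical weights)
x = [1.0, 0.5, 0.5]   # 1 (paired with w0), then x1 and x2
print(dot(w, x))      # w0*1 + w1*x1 + w2*x2 = 1.0 + 1.0 - 1.5 = 0.5
```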
9210
07:40:35,760 --> 07:40:38,120
So we have our weight vector, which we need to figure out.
9211
07:40:38,120 --> 07:40:39,960
We need our machine learning algorithm to figure out
9212
07:40:39,960 --> 07:40:41,200
what the weights should be.
9213
07:40:41,200 --> 07:40:44,280
We have the input vector representing the data point
9214
07:40:44,280 --> 07:40:47,560
that we're trying to predict a category for, predict a label for.
9215
07:40:47,560 --> 07:40:51,080
And we're able to do that calculation by taking this dot product, which
9216
07:40:51,080 --> 07:40:53,120
you'll often see represented in vector form.
9217
07:40:53,120 --> 07:40:54,880
But if you haven't seen vectors before, you
9218
07:40:54,880 --> 07:40:57,760
can think of it as identical to just this mathematical expression,
9219
07:40:57,760 --> 07:41:01,120
just doing the multiplication, adding the results together,
9220
07:41:01,120 --> 07:41:04,400
and then seeing whether the result is greater than or equal to 0 or not.
9221
07:41:04,400 --> 07:41:07,480
This expression here is identical to the expression
9222
07:41:07,480 --> 07:41:09,760
that we're calculating to see whether or not
9223
07:41:09,760 --> 07:41:14,200
that answer is greater than or equal to 0 in this case.
9224
07:41:14,200 --> 07:41:17,280
And so for that reason, you'll often see the hypothesis function
9225
07:41:17,280 --> 07:41:20,520
written as something like this, a simpler representation where
9226
07:41:20,520 --> 07:41:25,360
the hypothesis takes as input some input vector x, some humidity
9227
07:41:25,360 --> 07:41:26,960
and pressure for some day.
9228
07:41:26,960 --> 07:41:30,720
And we want to predict an output like rain or no rain or 1 or 0
9229
07:41:30,720 --> 07:41:33,360
if we choose to represent things numerically.
9230
07:41:33,360 --> 07:41:37,520
And the way we do that is by taking the dot product of the weights
9231
07:41:37,520 --> 07:41:38,640
and our input.
9232
07:41:38,640 --> 07:41:42,080
If it's greater than or equal to 0, we'll go ahead and say the output is 1.
9233
07:41:42,080 --> 07:41:44,960
Otherwise, the output is going to be 0.
9234
07:41:44,960 --> 07:41:49,080
And this hypothesis, we say, is parameterized by the weights.
9235
07:41:49,080 --> 07:41:51,280
Depending on what weights we choose, we'll
9236
07:41:51,280 --> 07:41:53,400
end up getting a different hypothesis.
9237
07:41:53,400 --> 07:41:55,480
If we choose the weights randomly, we're probably
9238
07:41:55,480 --> 07:41:57,480
not going to get a very good hypothesis function.
9239
07:41:57,480 --> 07:41:58,840
We'll get a 1 or a 0.
9240
07:41:58,840 --> 07:42:01,120
But it's probably not accurately going to reflect
9241
07:42:01,120 --> 07:42:04,280
whether we think a day is going to be rainy or not rainy.
9242
07:42:04,280 --> 07:42:06,860
But if we choose the weights right, we can often
9243
07:42:06,860 --> 07:42:09,960
do a pretty good job of trying to estimate whether we think
9244
07:42:09,960 --> 07:42:13,800
the output of the function should be a 1 or a 0.
9245
07:42:13,800 --> 07:42:16,080
And so the question, then, is how to figure out
9246
07:42:16,080 --> 07:42:19,800
what these weights should be, how to be able to tune those parameters.
9247
07:42:19,800 --> 07:42:21,800
And there are a number of ways you can do that.
9248
07:42:21,800 --> 07:42:25,680
One of the most common is known as the perceptron learning rule.
9249
07:42:25,680 --> 07:42:27,160
And we'll see more of this later.
9250
07:42:27,160 --> 07:42:29,120
But the idea of the perceptron learning rule,
9251
07:42:29,120 --> 07:42:30,860
and we're not going to get too deep into the mathematics,
9252
07:42:30,860 --> 07:42:33,240
we'll mostly just introduce it more conceptually,
9253
07:42:33,240 --> 07:42:37,640
is to say that given some data point that we would like to learn from,
9254
07:42:37,640 --> 07:42:41,520
some data point that has an input x and an output y, where
9255
07:42:41,520 --> 07:42:46,180
y is like 1 for rain or 0 for not rain, then we're going to update the weights.
9256
07:42:46,180 --> 07:42:48,100
And we'll look at the formula in just a moment.
9257
07:42:48,100 --> 07:42:51,880
But the big picture idea is that we can start with random weights,
9258
07:42:51,880 --> 07:42:53,720
but then learn from the data.
9259
07:42:53,720 --> 07:42:55,640
Take the data points one at a time.
9260
07:42:55,640 --> 07:42:58,600
And for each one of the data points, figure out, all right,
9261
07:42:58,600 --> 07:43:02,200
what parameters do we need to change inside of the weights
9262
07:43:02,200 --> 07:43:05,120
in order to better match that input point.
9263
07:43:05,120 --> 07:43:07,840
And so that is the value of having access to a lot of data
9264
07:43:07,840 --> 07:43:09,800
in the supervised machine learning algorithm,
9265
07:43:09,800 --> 07:43:13,080
is that you take each of the data points and maybe look at them multiple times
9266
07:43:13,080 --> 07:43:15,600
and constantly try and figure out whether you
9267
07:43:15,600 --> 07:43:19,920
need to shift your weights in order to better create some weight vector that
9268
07:43:19,920 --> 07:43:24,000
is able to correctly or more accurately try to estimate what the output should
9269
07:43:24,000 --> 07:43:25,840
be, whether we think it's going to be raining
9270
07:43:25,840 --> 07:43:28,640
or whether we think it's not going to be raining.
9271
07:43:28,640 --> 07:43:30,360
So what does that weight update look like?
9272
07:43:30,360 --> 07:43:32,240
Without going into too much of the mathematics,
9273
07:43:32,240 --> 07:43:35,960
we're going to update each of the weights to be the result of the original
9274
07:43:35,960 --> 07:43:39,360
weight plus some additional expression.
9275
07:43:39,360 --> 07:43:41,920
And to understand this expression, y, well,
9276
07:43:41,920 --> 07:43:44,720
y is what the actual output is.
9277
07:43:44,720 --> 07:43:50,200
And hypothesis of x, the input, that's going to be what we thought the output
9278
07:43:50,200 --> 07:43:51,000
was.
9279
07:43:51,000 --> 07:43:55,040
And so I can replace this by saying what the actual value was minus what
9280
07:43:55,040 --> 07:43:56,720
our estimate was.
9281
07:43:56,720 --> 07:44:01,360
And based on the difference between the actual value and what our estimate was,
9282
07:44:01,360 --> 07:44:04,120
we might want to change our hypothesis, change the way
9283
07:44:04,120 --> 07:44:06,240
that we do that estimation.
9284
07:44:06,240 --> 07:44:08,800
If the actual value and the estimate were the same thing,
9285
07:44:08,800 --> 07:44:11,440
meaning we were correctly able to predict what category
9286
07:44:11,440 --> 07:44:14,920
this data point belonged to, well, then actual value minus estimate,
9287
07:44:14,920 --> 07:44:18,280
that's just going to be 0, which means this whole term on the right-hand side
9288
07:44:18,280 --> 07:44:20,720
goes to 0, and the weight doesn't change.
9289
07:44:20,720 --> 07:44:24,120
Weight i, where i is like weight 1 or weight 2 or weight 0,
9290
07:44:24,120 --> 07:44:26,440
weight i just stays at weight i.
9291
07:44:26,440 --> 07:44:29,840
And none of the weights change if we were able to correctly predict
9292
07:44:29,840 --> 07:44:32,240
what category the input belonged to.
9293
07:44:32,240 --> 07:44:36,040
But if our hypothesis didn't correctly predict what category the input
9294
07:44:36,040 --> 07:44:40,320
belonged to, well, then maybe we need to make some changes, adjust
9295
07:44:40,320 --> 07:44:43,280
the weights so that we're better able to predict this kind of data
9296
07:44:43,280 --> 07:44:45,040
point in the future.
9297
07:44:45,040 --> 07:44:47,000
And what is the way we might do that?
9298
07:44:47,000 --> 07:44:51,080
Well, if the actual value was bigger than the estimate, then,
9299
07:44:51,080 --> 07:44:54,520
and for now we'll go ahead and assume that these x's are positive values,
9300
07:44:54,520 --> 07:44:57,360
then if the actual value was bigger than the estimate,
9301
07:44:57,360 --> 07:45:00,280
well, that means we need to increase the weight in order
9302
07:45:00,280 --> 07:45:02,480
to make it such that the output is bigger,
9303
07:45:02,480 --> 07:45:06,040
and therefore we're more likely to get to the right actual value.
9304
07:45:06,040 --> 07:45:08,400
And so if the actual value is bigger than the estimate,
9305
07:45:08,400 --> 07:45:11,320
then actual value minus estimate, that'll be a positive number.
9306
07:45:11,320 --> 07:45:14,200
And so you imagine we're just adding some positive number to the weight
9307
07:45:14,200 --> 07:45:16,680
just to increase it ever so slightly.
9308
07:45:16,680 --> 07:45:19,640
And likewise, the inverse case is true, that if the actual value
9309
07:45:19,640 --> 07:45:23,400
was less than the estimate, the actual value was 0,
9310
07:45:23,400 --> 07:45:26,400
but we estimated 1, meaning it actually was not raining,
9311
07:45:26,400 --> 07:45:28,520
but we predicted it was going to be raining.
9312
07:45:28,520 --> 07:45:31,120
Well, then we want to decrease the value of the weight,
9313
07:45:31,120 --> 07:45:33,880
because then in that case, we want to try and lower
9314
07:45:33,880 --> 07:45:36,520
the total value of computing that dot product in order
9315
07:45:36,520 --> 07:45:39,640
to make it less likely that we would predict that it would actually
9316
07:45:39,640 --> 07:45:40,920
be raining.
9317
07:45:40,920 --> 07:45:43,680
So no need to get too deep into the mathematics of that,
9318
07:45:43,680 --> 07:45:46,840
but the general idea is that every time we encounter some data point,
9319
07:45:46,840 --> 07:45:49,600
we can adjust these weights accordingly to try and make
9320
07:45:49,600 --> 07:45:53,760
the weights better line up with the actual data that we have access to.
9321
07:45:53,760 --> 07:45:56,520
And you can repeat this process with data point after data point
9322
07:45:56,520 --> 07:45:58,600
until eventually, hopefully, your algorithm
9323
07:45:58,600 --> 07:46:02,360
converges to some set of weights that do a pretty good job of trying
9324
07:46:02,360 --> 07:46:05,960
to figure out whether a day is going to be rainy or not rainy.
9325
07:46:05,960 --> 07:46:08,640
And just as a final point about this particular equation,
9326
07:46:08,640 --> 07:46:12,400
this value alpha here is generally what we'll call the learning rate.
9327
07:46:12,400 --> 07:46:15,040
It's just some parameter, some number we choose
9328
07:46:15,040 --> 07:46:18,600
for how quickly we're actually going to be updating these weight values.
9329
07:46:18,600 --> 07:46:20,360
So that if alpha is bigger, then we're going
9330
07:46:20,360 --> 07:46:22,280
to update these weight values by a lot.
9331
07:46:22,280 --> 07:46:25,280
And if alpha is smaller, then we'll update the weight values by less.
9332
07:46:25,280 --> 07:46:26,840
And you can choose a value of alpha.
9333
07:46:26,840 --> 07:46:29,080
Depending on the problem, different values
9334
07:46:29,080 --> 07:46:32,880
might suit the situation better or worse than others.
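The update rule just described can be sketched as a single step of code: nudge each weight by alpha times (actual minus estimate) times the corresponding input, so a correct prediction changes nothing and a wrong one shifts the weights toward the right answer. The starting weights, data point, and alpha value below are illustrative.

```python
def perceptron_update(weights, inputs, actual, alpha=0.1):
    """One perceptron learning step: w_i <- w_i + alpha * (actual - estimate) * x_i."""
    x = [1.0] + list(inputs)  # pair w0 with a constant input of 1
    total = sum(wi * xi for wi, xi in zip(weights, x))
    estimate = 1 if total >= 0 else 0
    # If estimate == actual, (actual - estimate) is 0 and the weights are unchanged.
    return [wi + alpha * (actual - estimate) * xi for wi, xi in zip(weights, x)]

# Start with zero weights; the dot product is 0, so the estimate is 1 ("rain").
# The actual label is 0 ("no rain"), so every weight is nudged downward.
new_weights = perceptron_update([0.0, 0.0, 0.0], (0.5, 1.0), actual=0, alpha=0.1)
print(new_weights)  # each weight decreases, lowering future dot products
```

Repeating this step over the data points, possibly several passes, is the training loop that (hopefully) converges to a good weight vector.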
9335
07:46:32,880 --> 07:46:36,360
So after all of that, after we've done this training process of take
9336
07:46:36,360 --> 07:46:38,800
all this data and using this learning rule,
9337
07:46:38,800 --> 07:46:43,160
look at all the pieces of data and use each piece of data as an indication
9338
07:46:43,160 --> 07:46:45,960
to us of do the weights stay the same, do we increase the weights,
9339
07:46:45,960 --> 07:46:48,880
do we decrease the weights, and if so, by how much?
9340
07:46:48,880 --> 07:46:52,840
What you end up with is effectively a threshold function.
9341
07:46:52,840 --> 07:46:56,120
And we can look at what the threshold function looks like like this.
9342
07:46:56,120 --> 07:46:58,800
On the x-axis here, we have the output of that function,
9343
07:46:58,800 --> 07:47:03,080
taking the dot product of the weights with the input.
9344
07:47:03,080 --> 07:47:05,880
And on the y-axis, we have what the output is going to be,
9345
07:47:05,880 --> 07:47:08,880
0, which in this case represented not raining,
9346
07:47:08,880 --> 07:47:11,880
and 1, which in this case represented raining.
9347
07:47:11,880 --> 07:47:16,480
And the way that our hypothesis function works is it calculates this value.
9348
07:47:16,480 --> 07:47:20,320
And if it's greater than 0 or greater than some threshold value,
9349
07:47:20,320 --> 07:47:22,400
then we declare that it's a rainy day.
9350
07:47:22,400 --> 07:47:25,280
And otherwise, we declare that it's a not rainy day.
9351
07:47:25,280 --> 07:47:28,600
And this then graphically is what that function looks like,
9352
07:47:28,600 --> 07:47:32,280
that initially when the value of this dot product is small, it's not raining,
9353
07:47:32,280 --> 07:47:33,800
it's not raining, it's not raining.
9354
07:47:33,800 --> 07:47:36,220
But as soon as it crosses that threshold,
9355
07:47:36,220 --> 07:47:39,600
we suddenly say, OK, now it's raining, now it's raining, now it's raining.
9356
07:47:39,600 --> 07:47:42,160
And the way to interpret this kind of representation
9357
07:47:42,160 --> 07:47:44,600
is that anything on this side of the line, that
9358
07:47:44,600 --> 07:47:47,960
would be the category of data points where we say, yes, it's raining.
9359
07:47:47,960 --> 07:47:49,880
Anything that falls on this side of the line
9360
07:47:49,880 --> 07:47:52,440
are the data points where we would say, it's not raining.
9361
07:47:52,440 --> 07:47:54,920
And again, we want to choose some value for the weights
9362
07:47:54,920 --> 07:47:57,840
that results in a function that does a pretty good job of trying
9363
07:47:57,840 --> 07:48:00,200
to do this estimation.
9364
07:48:00,200 --> 07:48:04,040
But one tricky thing with this type of hard threshold
9365
07:48:04,040 --> 07:48:07,240
is that it only leaves two possible outcomes.
9366
07:48:07,240 --> 07:48:09,800
We plug in some data as input.
9367
07:48:09,800 --> 07:48:13,080
And the output we get is raining or not raining.
9368
07:48:13,080 --> 07:48:15,840
And there's no room for anywhere in between.
9369
07:48:15,840 --> 07:48:17,080
And maybe that's what you want.
9370
07:48:17,080 --> 07:48:19,440
Maybe all you want is given some data point,
9371
07:48:19,440 --> 07:48:22,520
you would like to be able to classify it into one of two or more
9372
07:48:22,520 --> 07:48:24,920
of these various different categories.
9373
07:48:24,920 --> 07:48:28,200
But it might also be the case that you care about knowing
9374
07:48:28,200 --> 07:48:31,040
how strong that prediction is, for example.
9375
07:48:31,040 --> 07:48:34,040
So if we go back to this instance here, where we have rainy days
9376
07:48:34,040 --> 07:48:38,040
on this side of the line, not rainy days on that side of the line,
9377
07:48:38,040 --> 07:48:41,900
you might imagine that let's look now at these two white data points.
9378
07:48:41,900 --> 07:48:46,040
This data point here that we would like to predict a label or a category for.
9379
07:48:46,040 --> 07:48:48,380
And this data point over here that we would also
9380
07:48:48,380 --> 07:48:51,440
like to predict a label or a category for.
9381
07:48:51,440 --> 07:48:53,560
It seems likely that you could pretty confidently
9382
07:48:53,560 --> 07:48:56,360
say that this data point, that should be a rainy day.
9383
07:48:56,360 --> 07:48:58,400
Seems close to the other rainy days if we're
9384
07:48:58,400 --> 07:49:00,240
going by the nearest neighbor strategy.
9385
07:49:00,240 --> 07:49:04,720
It's on this side of the line if we're going by the strategy of just saying,
9386
07:49:04,720 --> 07:49:07,040
which side of the line does it fall on by figuring out
9387
07:49:07,040 --> 07:49:08,600
what those weights should be.
9388
07:49:08,600 --> 07:49:11,520
And if we're using the line strategy of just which side of the line
9389
07:49:11,520 --> 07:49:14,400
does it fall on, which side of this decision boundary,
9390
07:49:14,400 --> 07:49:18,240
well, we'd also say that this point here is also a rainy day
9391
07:49:18,240 --> 07:49:23,560
because it falls on the side of the line that corresponds to rainy days.
9392
07:49:23,560 --> 07:49:25,920
But it's likely that even in this case, we
9393
07:49:25,920 --> 07:49:29,680
would know that we don't feel nearly as confident about this data
9394
07:49:29,680 --> 07:49:33,120
point on the left as compared to this data point on the right.
9395
07:49:33,120 --> 07:49:35,520
That for this one on the right, we can feel very confident
9396
07:49:35,520 --> 07:49:37,000
that yes, it's a rainy day.
9397
07:49:37,000 --> 07:49:41,360
This one, it's pretty close to the line if we're judging just by distance.
9398
07:49:41,360 --> 07:49:44,200
And so you might be less sure.
9399
07:49:44,200 --> 07:49:48,320
But our threshold function doesn't allow for a notion of less sure
9400
07:49:48,320 --> 07:49:50,000
or more sure about something.
9401
07:49:50,000 --> 07:49:51,920
It's what we would call a hard threshold.
9402
07:49:51,920 --> 07:49:55,000
It's once you've crossed this line, then immediately we say,
9403
07:49:55,000 --> 07:49:57,480
yes, this is going to be a rainy day.
9404
07:49:57,480 --> 07:50:00,520
Anywhere before it, we're going to say it's not a rainy day.
9405
07:50:00,520 --> 07:50:03,160
And that may not be helpful in a number of cases.
9406
07:50:03,160 --> 07:50:06,440
One, this is not a particularly easy function to deal with.
9407
07:50:06,440 --> 07:50:08,640
As you get deeper into the world of machine learning
9408
07:50:08,640 --> 07:50:11,280
and are trying to do things like taking derivatives of these curves,
9409
07:51:11,280 --> 07:51:14,160
this type of function makes things challenging.
9410
07:50:14,160 --> 07:50:16,120
But the other challenge is that we don't really
9411
07:50:16,120 --> 07:50:17,960
have any notion of gradation between things.
9412
07:50:17,960 --> 07:50:21,400
We don't have a notion of yes, this is a very strong belief
9413
07:50:21,400 --> 07:50:25,560
that it's going to be raining as opposed to it's probably more likely than not
9414
07:50:25,560 --> 07:50:30,040
that it's going to be raining, but maybe not totally sure about that either.
9415
07:50:30,040 --> 07:50:32,560
So what we can do by taking advantage of a technique known
9416
07:50:32,560 --> 07:50:36,160
as logistic regression is instead of using this hard threshold
9417
07:50:36,160 --> 07:50:39,920
type of function, we can use instead a logistic function, something
9418
07:50:39,920 --> 07:50:41,840
we might call a soft threshold.
9419
07:50:41,840 --> 07:50:45,000
And that's going to transform this into looking something
9420
07:50:45,000 --> 07:50:48,160
a little more like this, something that more nicely curves.
9421
07:50:48,160 --> 07:50:52,760
And as a result, the possible output values are no longer just 0 and 1,
9422
07:50:52,760 --> 07:50:55,000
0 for not raining, 1 for raining.
9423
07:50:55,000 --> 07:50:59,320
But you can actually get any real numbered value between 0 and 1.
9424
07:50:59,320 --> 07:51:03,080
But if you're way over on this side, then you get a value of 0.
9425
07:51:03,080 --> 07:51:05,600
OK, it's not going to be raining, and we're pretty sure about that.
9426
07:51:05,600 --> 07:51:07,680
And if you're over on this side, you get a value of 1.
9427
07:51:07,680 --> 07:51:10,280
And yes, we're very sure that it's going to be raining.
9428
07:51:10,280 --> 07:51:13,040
But in between, you could get some real numbered value,
9429
07:51:13,040 --> 07:51:17,200
where a value like 0.7 might mean we think it's going to rain.
9430
07:51:17,200 --> 07:51:20,680
It's more probable that it's going to rain than not based on the data.
9431
07:51:20,680 --> 07:51:25,080
But we're not as confident as some of the other data points might be.
9432
07:51:25,080 --> 07:51:27,520
So one of the advantages of the soft threshold
9433
07:51:27,520 --> 07:51:30,880
is that it allows us to have an output that could be some real number that
9434
07:51:30,880 --> 07:51:34,400
potentially reflects some sort of probability, the likelihood that we
9435
07:51:34,400 --> 07:51:39,480
think that this particular data point belongs to that particular category.
9436
07:51:39,480 --> 07:51:43,920
And there are some other nice mathematical properties of that as well.
9437
07:51:43,920 --> 07:51:46,100
So that then is two different approaches to trying
9438
07:51:46,100 --> 07:51:48,680
to solve this type of classification problem.
9439
07:51:48,680 --> 07:51:51,400
One is this nearest neighbor type of approach,
9440
07:51:51,400 --> 07:51:54,800
where you just take a data point and look at the data points that are nearby
9441
07:51:54,800 --> 07:51:58,440
to try and estimate what category we think it belongs to.
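The nearest neighbor idea fits in a few lines. This is a minimal 1-nearest-neighbor sketch with made-up points (it uses `math.dist`, available in Python 3.8+):

```python
import math

def nearest_neighbor(point, data):
    """data: list of ((x, y), label) pairs.
    Return the label of the closest labeled point."""
    return min(data, key=lambda item: math.dist(point, item[0]))[1]

examples = [((1, 1), "rain"), ((1, 2), "rain"), ((5, 5), "no rain")]
print(nearest_neighbor((2, 1), examples))  # rain
```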
9442
07:51:58,440 --> 07:52:01,160
And the other approach is the approach of saying, all right,
9443
07:52:01,160 --> 07:52:03,600
let's just try and use linear regression,
9444
07:52:03,600 --> 07:52:06,400
figure out what these weights should be, adjust the weights in order
9445
07:52:06,400 --> 07:52:09,920
to figure out what line or what decision boundary is going
9446
07:52:09,920 --> 07:52:12,720
to best separate these two categories.
9447
07:52:12,720 --> 07:52:15,480
It turns out that another popular approach, a very popular approach
9448
07:52:15,480 --> 07:52:17,440
if you just have a data set and you want to start
9449
07:52:17,440 --> 07:52:20,800
trying to do some learning on it, is what we call the support vector machine.
9450
07:52:20,800 --> 07:52:23,600
And we're not going to go too much into the mathematics of the support vector
9451
07:52:23,600 --> 07:52:26,480
machine, but we'll at least explore it graphically to see what it is
9452
07:52:26,480 --> 07:52:27,600
that it looks like.
9453
07:52:27,600 --> 07:52:31,160
And the idea or the motivation behind the support vector machine
9454
07:52:31,160 --> 07:52:34,320
is the idea that there are actually a lot of different lines
9455
07:52:34,320 --> 07:52:37,000
that we could draw, a lot of different decision boundaries
9456
07:52:37,000 --> 07:52:39,240
that we could draw to separate two groups.
9457
07:52:39,240 --> 07:52:41,960
So for example, I had the red data points over here
9458
07:52:41,960 --> 07:52:43,640
and the blue data points over here.
9459
07:52:43,640 --> 07:52:47,560
One possible line I could draw is a line like this,
9460
07:52:47,560 --> 07:52:50,520
that this line here would separate the red points from the blue points.
9461
07:52:50,520 --> 07:52:51,600
And it does so perfectly.
9462
07:52:51,600 --> 07:52:54,000
All the red points are on one side of the line.
9463
07:52:54,000 --> 07:52:56,760
All the blue points are on the other side of the line.
9464
07:52:56,760 --> 07:52:59,800
But this should probably make you a little bit nervous.
9465
07:52:59,800 --> 07:53:02,240
If you come up with a model and the model comes up
9466
07:53:02,240 --> 07:53:03,760
with a line that looks like this.
9467
07:53:03,760 --> 07:53:06,440
And the reason why is that you worry about how well
9468
07:53:06,440 --> 07:53:10,520
it's going to generalize to other data points that are not necessarily
9469
07:53:10,520 --> 07:53:12,720
in the data set that we have access to.
9470
07:53:12,720 --> 07:53:15,440
For example, if there was a point that fell like right here,
9471
07:53:15,440 --> 07:53:19,400
for example, on the right side of the line, well, then based on that,
9472
07:53:19,400 --> 07:53:23,120
we might want to guess that it is, in fact, a red point,
9473
07:53:23,120 --> 07:53:25,760
but it falls on the side of the line where instead we
9474
07:53:25,760 --> 07:53:29,160
would estimate that it's a blue point instead.
9475
07:53:29,160 --> 07:53:32,680
And so based on that, this line is probably not a great choice
9476
07:53:32,680 --> 07:53:36,600
just because it is so close to these various data points.
9477
07:53:36,600 --> 07:53:38,640
We might instead prefer like a diagonal line
9478
07:53:38,640 --> 07:53:41,680
that just goes diagonally through the data set like we've seen before.
9479
07:53:41,680 --> 07:53:44,800
But there too, there's a lot of diagonal lines that we could draw as well.
9480
07:53:44,800 --> 07:53:48,960
For example, I could draw this diagonal line here, which also successfully
9481
07:53:48,960 --> 07:53:51,720
separates all the red points from all of the blue points.
9482
07:53:51,720 --> 07:53:54,480
From the perspective of something like just trying
9483
07:53:54,480 --> 07:53:56,400
to figure out some setting of weights that allows
9484
07:53:56,400 --> 07:53:58,680
us to predict the correct output, this line
9485
07:53:58,680 --> 07:54:02,000
will predict the correct output for this particular set of data
9486
07:54:02,000 --> 07:54:04,800
every single time because the red points are on one side,
9487
07:54:04,800 --> 07:54:06,400
the blue points are on the other.
9488
07:54:06,400 --> 07:54:08,640
But yet again, you should probably be a little nervous
9489
07:54:08,640 --> 07:54:11,480
because this line is so close to these red points,
9490
07:54:11,480 --> 07:54:15,280
even though we're able to correctly predict on the input data,
9491
07:54:15,280 --> 07:54:18,840
if there was a point that fell somewhere in this general area,
9492
07:54:18,840 --> 07:54:22,720
our algorithm, this model, would say that, yeah, we think it's a blue point,
9493
07:54:22,720 --> 07:54:26,880
when in actuality, it might belong to the red category instead
9494
07:54:26,880 --> 07:54:29,760
just because it looks like it's close to the other red points.
9495
07:54:29,760 --> 07:54:33,600
What we really want, to be able to generalize as best as possible
9496
07:54:33,600 --> 07:54:37,160
given this data, is to come up with a line like this that
9497
07:54:37,160 --> 07:54:39,240
seems like the intuitive line to draw.
9498
07:54:39,240 --> 07:54:41,680
And the reason why it's intuitive is because it
9499
07:54:41,680 --> 07:54:47,240
seems to be as far apart as possible from the red data and the blue data.
9500
07:54:47,240 --> 07:54:49,280
So that if we generalize a little bit and assume
9501
07:54:49,280 --> 07:54:51,920
that maybe we have some points that are different from the input
9502
07:54:51,920 --> 07:54:54,120
but still slightly further away, we can still
9503
07:54:54,120 --> 07:54:58,000
say that something on this side probably red, something on that side
9504
07:54:58,000 --> 07:55:01,480
probably blue, and we can make those judgments that way.
9505
07:55:01,480 --> 07:55:04,040
And that is what support vector machines are designed to do.
9506
07:55:04,040 --> 07:55:08,520
They're designed to try and find what we call the maximum margin separator,
9507
07:55:08,520 --> 07:55:10,720
where the maximum margin separator is just
9508
07:55:10,720 --> 07:55:14,680
some boundary that maximizes the distance between the groups of points
9509
07:55:14,680 --> 07:55:16,600
rather than come up with some boundary that's
9510
07:55:16,600 --> 07:55:19,480
very close to one set or the other, where in the case
9511
07:55:19,480 --> 07:55:20,720
before, we wouldn't have cared.
9512
07:55:20,720 --> 07:55:24,000
As long as we're categorizing the input well, that seems to be all we need to do.
9513
07:55:24,000 --> 07:55:28,720
The support vector machine will try and find this maximum margin separator,
9514
07:55:28,720 --> 07:55:31,920
some way of trying to maximize that particular distance.
9515
07:55:31,920 --> 07:55:35,520
And it does so by finding what we call the support vectors, which
9516
07:55:35,520 --> 07:55:37,600
are the vectors that are closest to the line,
9517
07:55:37,600 --> 07:55:40,880
and trying to maximize the distance between the line
9518
07:55:40,880 --> 07:55:42,760
and those particular points.
9519
07:55:42,760 --> 07:55:44,520
And it works that way in two dimensions.
9520
07:55:44,520 --> 07:55:46,520
It also works in higher dimensions, where we're not
9521
07:55:46,520 --> 07:55:49,560
looking for some line that separates the two data points,
9522
07:55:49,560 --> 07:55:52,640
but instead looking for what we generally call a hyperplane,
9523
07:55:52,640 --> 07:55:57,760
some decision boundary, effectively, that separates one set of data
9524
07:55:57,760 --> 07:55:59,080
from the other set of data.
9525
07:55:59,080 --> 07:56:00,800
And this ability of support vector machines
9526
07:56:00,800 --> 07:56:04,000
to work in higher dimensions actually has a number of other applications
9527
07:56:04,000 --> 07:56:04,600
as well.
9528
07:56:04,600 --> 07:56:07,520
But one is that it helpfully deals with cases
9529
07:56:07,520 --> 07:56:10,560
where data may not be linearly separable.
9530
07:56:10,560 --> 07:56:12,720
So we talked about linear separability before,
9531
07:56:12,720 --> 07:56:16,880
this idea that you can take data and just draw a line or some linear
9532
07:56:16,880 --> 07:56:20,040
combination of the inputs that allows us to perfectly separate
9533
07:56:20,040 --> 07:56:21,560
the two sets from each other.
9534
07:56:21,560 --> 07:56:24,880
There are some data sets that are not linearly separable.
9535
07:56:24,880 --> 07:56:26,560
And for some, that's true even in two dimensions.
9536
07:56:26,560 --> 07:56:29,760
You would not be able to find a good line at all
9537
07:56:29,760 --> 07:56:32,200
that would try to do that kind of separation.
9538
07:56:32,200 --> 07:56:34,320
Something like this, for example.
9539
07:56:34,320 --> 07:56:37,440
Or if you imagine here are the red points and the blue points
9540
07:56:37,440 --> 07:56:38,720
around it.
9541
07:56:38,720 --> 07:56:43,480
If you try to find a line that divides the red points from the blue points,
9542
07:56:43,480 --> 07:56:45,920
it's actually going to be difficult, if not impossible,
9543
07:56:45,920 --> 07:56:49,480
to do. Whatever line you choose, well, if you draw a line here,
9544
07:56:49,480 --> 07:56:52,160
then you misclassify all of these blue points that should actually
9545
07:56:52,160 --> 07:56:53,360
be blue and not red.
9546
07:56:53,360 --> 07:56:56,160
Anywhere else you draw a line, there's going to be a lot of error,
9547
07:56:56,160 --> 07:56:58,200
a lot of mistakes, a lot of what we'll soon
9548
07:56:58,200 --> 07:57:02,360
call loss to that line that you draw, a lot of points
9549
07:57:02,360 --> 07:57:04,960
that you're going to categorize incorrectly.
9550
07:57:04,960 --> 07:57:08,080
What we really want is to be able to find a better decision boundary that
9551
07:57:08,080 --> 07:57:12,680
may not be just a straight line through this two dimensional space.
9552
07:57:12,680 --> 07:57:14,760
And what support vector machines can do is
9553
07:57:14,760 --> 07:57:16,860
they can begin to operate in higher dimensions
9554
07:57:16,860 --> 07:57:19,840
and be able to find some other decision boundary,
9555
07:57:19,840 --> 07:57:21,800
like the circle in this case, that actually
9556
07:57:21,800 --> 07:57:24,880
is able to separate one of these sets of data
9557
07:57:24,880 --> 07:57:26,800
from the other set of data a lot better.
9558
07:57:26,800 --> 07:57:30,400
So oftentimes in data sets where the data is not linearly separable,
9559
07:57:30,400 --> 07:57:33,080
support vector machines by working in higher dimensions
9560
07:57:33,080 --> 07:57:37,240
can actually figure out a way to solve that kind of problem effectively.
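A sketch of the higher-dimension idea for circle-like data: add a third feature, the squared distance from the origin, and a flat threshold in that new dimension separates the groups that no line in the original plane could. The points are made up, and this illustrates the feature-lifting idea rather than a full support vector machine:

```python
# "Red" points inside a circle, "blue" points outside it (illustrative data).
inner = [(0.5, 0.0), (-0.3, 0.4), (0.0, -0.6)]
outer = [(2.0, 0.1), (-1.8, 1.2), (0.3, -2.5)]

def lift(point):
    """Map (x, y) into three dimensions: (x, y, x**2 + y**2)."""
    x, y = point
    return (x, y, x * x + y * y)

# In the lifted space, the plane z = 1 separates the two groups:
# every inner point has z < 1, every outer point has z > 1.
print(all(lift(p)[2] < 1 for p in inner))  # True
print(all(lift(p)[2] > 1 for p in outer))  # True
```

Back in the original two dimensions, that flat plane corresponds to a circular decision boundary, like the one in the lecture's picture.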
9561
07:57:37,240 --> 07:57:39,600
So that then, three different approaches to trying
9562
07:57:39,600 --> 07:57:41,280
to solve these sorts of problems.
9563
07:57:41,280 --> 07:57:42,960
We've seen support vector machines.
9564
07:57:42,960 --> 07:57:46,640
We've seen trying to use linear regression and the perceptron learning
9565
07:57:46,640 --> 07:57:49,840
rule to be able to figure out how to categorize inputs and outputs.
9566
07:57:49,840 --> 07:57:51,520
We've seen the nearest neighbor approach.
9567
07:57:51,520 --> 07:57:54,160
No one approach is necessarily better than any other, again.
9568
07:57:54,160 --> 07:57:57,440
It's going to depend on the data set, the information you have access to.
9569
07:57:57,440 --> 07:58:00,560
It's going to depend on what the function looks like that you're ultimately
9570
07:58:00,560 --> 07:58:01,280
trying to predict.
9571
07:58:01,280 --> 07:58:04,080
And this is where a lot of research and experimentation
9572
07:58:04,080 --> 07:58:06,600
can be involved in trying to figure out how it
9573
07:58:06,600 --> 07:58:09,640
is to best perform that kind of estimation.
9574
07:58:09,640 --> 07:58:12,180
But classification is only one of the tasks
9575
07:58:12,180 --> 07:58:14,720
that you might encounter in supervised machine learning.
9576
07:58:14,720 --> 07:58:17,720
Because in classification, what we're trying to predict
9577
07:58:17,720 --> 07:58:19,520
is some discrete category.
9578
07:58:19,520 --> 07:58:22,800
We're trying to predict red or blue, rain or not rain,
9579
07:58:22,800 --> 07:58:24,920
authentic or counterfeit.
9580
07:58:24,920 --> 07:58:28,360
But sometimes what we want to predict is a real numbered value.
9581
07:58:28,360 --> 07:58:31,280
And for that, we have a related problem, not classification,
9582
07:58:31,280 --> 07:58:33,440
but instead known as regression.
9583
07:58:33,440 --> 07:58:35,880
And regression is the supervised learning problem
9584
07:58:35,880 --> 07:58:39,680
where we try and learn a function mapping inputs to outputs same as before.
9585
07:58:39,680 --> 07:58:43,000
But instead of the outputs being discrete categories, things
9586
07:58:43,000 --> 07:58:46,160
like rain or not rain, in a regression problem,
9587
07:58:46,160 --> 07:58:50,520
the output values are generally continuous values, some real number
9588
07:58:50,520 --> 07:58:51,960
that we would like to predict.
9589
07:58:51,960 --> 07:58:53,480
This happens all the time as well.
9590
07:58:53,480 --> 07:58:55,680
You might imagine that a company might take this approach
9591
07:58:55,680 --> 07:58:58,080
if it's trying to figure out, for instance, what
9592
07:58:58,080 --> 07:58:59,840
the effect of its advertising is.
9593
07:58:59,840 --> 07:59:02,800
How do advertising dollars spent translate
9594
07:59:02,800 --> 07:59:05,960
into sales for the company's product, for example?
9595
07:59:05,960 --> 07:59:08,960
And so they might like to try to predict some function that
9596
07:59:08,960 --> 07:59:11,680
takes as input the amount of money spent on advertising.
9597
07:59:11,680 --> 07:59:13,160
And here, we're just going to use one input.
9598
07:59:13,160 --> 07:59:15,900
But again, you could scale this up to many more inputs as well
9599
07:59:15,900 --> 07:59:18,720
if you have a lot of different kinds of data you have access to.
9600
07:59:18,720 --> 07:59:21,400
And the goal is to learn a function that given this amount of spending
9601
07:59:21,400 --> 07:59:23,800
on advertising, we're going to get this amount in sales.
9602
07:59:23,800 --> 07:59:27,040
And you might judge, based on having access to a whole bunch of data,
9603
07:59:27,040 --> 07:59:30,760
like for every past month, here is how much we spent on advertising,
9604
07:59:30,760 --> 07:59:32,320
and here is what sales were.
9605
07:59:32,320 --> 07:59:36,280
And we would like to predict some sort of hypothesis function
9606
07:59:36,280 --> 07:59:39,200
that, again, given the amount spent on advertising,
9607
07:59:39,200 --> 07:59:43,000
we can predict, in this case, some real number, some number estimate
9608
07:59:43,000 --> 07:59:47,800
of how much sales we expect that company to do in this month
9609
07:59:47,800 --> 07:59:49,880
or in this quarter or whatever unit of time
9610
07:59:49,880 --> 07:59:51,920
we're choosing to measure things in.
9611
07:59:51,920 --> 07:59:54,760
And so again, the approach to solving this type of problem,
9612
07:59:54,760 --> 07:59:58,760
we could try using a linear regression type approach where we take this data
9613
07:59:58,760 --> 07:59:59,960
and we just plot it.
9614
07:59:59,960 --> 08:00:02,680
On the x-axis, we have advertising dollars spent.
9615
08:00:02,680 --> 08:00:04,440
On the y-axis, we have sales.
9616
08:00:04,440 --> 08:00:07,080
And we might just want to try and draw a line that
9617
08:00:07,080 --> 08:00:09,600
does a pretty good job of trying to estimate
9618
08:00:09,600 --> 08:00:12,880
this relationship between advertising and sales.
9619
08:00:12,880 --> 08:00:14,760
And in this case, unlike before, we're not
9620
08:00:14,760 --> 08:00:17,960
trying to separate the data points into discrete categories.
9621
08:00:17,960 --> 08:00:19,760
But instead, in this case, we're just trying
9622
08:00:19,760 --> 08:00:24,360
to find a line that approximates this relationship between advertising
9623
08:00:24,360 --> 08:00:27,760
and sales so that if we want to figure out what the estimated sales are
9624
08:00:27,760 --> 08:00:31,440
for a particular advertising budget, you just look it up in this line,
9625
08:00:31,440 --> 08:00:33,360
figure out for this amount of advertising,
9626
08:00:33,360 --> 08:00:35,680
we would have this amount of sales and just try
9627
08:00:35,680 --> 08:00:37,440
and make the estimate that way.
9628
08:00:37,440 --> 08:00:39,720
And so you can try and come up with a line, again,
9629
08:00:39,720 --> 08:00:42,760
figuring out how to modify the weights using various different techniques
9630
08:00:42,760 --> 08:00:47,800
to try and make it so that this line fits as well as possible.
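For a single input, one standard way to fit that line is the closed-form least-squares solution, which minimizes the total squared error. A sketch; the advertising and sales numbers are invented for illustration:

```python
# Least-squares fit of a line: sales = slope * spending + intercept.

def fit_line(xs, ys):
    """Closed-form simple linear regression (minimizes squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

spending = [1000, 2000, 3000, 4000]   # advertising dollars per month
sales    = [2500, 4500, 6500, 8500]   # sales per month

w1, w0 = fit_line(spending, sales)
# Use the fitted line to estimate sales for a new advertising budget.
print(w1 * 2500 + w0)  # 5500.0
```

With more inputs, the same idea generalizes to adjusting a whole vector of weights, as in the earlier classification setup.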
9631
08:00:47,800 --> 08:00:51,040
So with all of these approaches, then, to trying to solve machine learning
9632
08:00:51,040 --> 08:00:54,840
style problems, the question becomes, how do we evaluate these approaches?
9633
08:00:54,840 --> 08:00:58,160
How do we evaluate the various different hypotheses
9634
08:00:58,160 --> 08:00:59,280
that we could come up with?
9635
08:00:59,280 --> 08:01:02,800
Because each of these algorithms will give us some sort of hypothesis,
9636
08:01:02,800 --> 08:01:05,520
some function that maps inputs to outputs,
9637
08:01:05,520 --> 08:01:09,640
and we want to know, how well does that function work?
9638
08:01:09,640 --> 08:01:11,920
And you can think of evaluating these hypotheses
9639
08:01:11,920 --> 08:01:16,400
and trying to get a better hypothesis as kind of like an optimization problem.
9640
08:01:16,400 --> 08:01:19,400
In an optimization problem, as you recall from before,
9641
08:01:19,400 --> 08:01:23,000
we were either trying to maximize some objective function
9642
08:01:23,000 --> 08:01:26,440
by trying to find a global maximum, or we
9643
08:01:26,440 --> 08:01:30,240
were trying to minimize some cost function by trying to find some global
9644
08:01:30,240 --> 08:01:31,040
minimum.
9645
08:01:31,040 --> 08:01:34,800
And in the case of evaluating these hypotheses, one thing we might say
9646
08:01:34,800 --> 08:01:38,200
is that this cost function, the thing we're trying to minimize,
9647
08:01:38,200 --> 08:01:42,120
we might be trying to minimize what we would call a loss function.
9648
08:01:42,120 --> 08:01:44,560
And what a loss function is, is it is a function
9649
08:01:44,560 --> 08:01:49,120
that is going to estimate for us how poorly our function performs.
9650
08:01:49,120 --> 08:01:51,160
More formally, it's like a loss of utility
9651
08:01:51,160 --> 08:01:55,680
by whenever we predict something that is wrong, that is a loss of utility.
9652
08:01:55,680 --> 08:01:59,360
That's going to add to the output of our loss function.
9653
08:01:59,360 --> 08:02:01,120
And you could come up with any loss function
9654
08:02:01,120 --> 08:02:03,960
that you want, just some mathematical way of estimating,
9655
08:02:03,960 --> 08:02:06,960
given each of these data points, given what the actual output is,
9656
08:02:06,960 --> 08:02:10,040
and given what our projected output is, our estimate,
9657
08:02:10,040 --> 08:02:12,800
you could calculate some sort of numerical loss for it.
9658
08:02:12,800 --> 08:02:14,920
But there are a couple of popular loss functions
9659
08:02:14,920 --> 08:02:18,160
that are worth discussing, just so that you've seen them before.
9660
08:02:18,160 --> 08:02:21,680
When it comes to discrete categories, things like rain or not rain,
9661
08:02:21,680 --> 08:02:26,520
counterfeit or not counterfeit, one approach is the 0, 1 loss function.
9662
08:02:26,520 --> 08:02:29,520
And the way that works is for each of the data points,
9663
08:02:29,520 --> 08:02:32,720
our loss function takes as input what the actual output is,
9664
08:02:32,720 --> 08:02:35,240
like whether it was actually raining or not raining,
9665
08:02:35,240 --> 08:02:37,560
and takes our prediction into account.
9666
08:02:37,560 --> 08:02:41,920
Did we predict, given this data point, that it was raining or not raining?
9667
08:02:41,920 --> 08:02:45,800
And if the actual value equals the prediction, well, then the 0, 1 loss
9668
08:02:45,800 --> 08:02:47,480
function will just say the loss is 0.
9669
08:02:47,480 --> 08:02:51,800
There was no loss of utility, because we were able to predict correctly.
9670
08:02:51,800 --> 08:02:54,760
And otherwise, if the actual value was not the same thing
9671
08:02:54,760 --> 08:02:58,160
as what we predicted, well, then in that case, our loss is 1.
9672
08:02:58,160 --> 08:03:01,800
We lost something, lost some utility, because what we predicted
9673
08:03:01,800 --> 08:03:05,480
was the output of the function, was not what it actually was.
9674
08:03:05,480 --> 08:03:07,360
And the goal, then, in a situation like this
9675
08:03:07,360 --> 08:03:11,160
would be to come up with some hypothesis that minimizes
9676
08:03:11,160 --> 08:03:14,520
the total empirical loss, the total amount that we've lost,
9677
08:03:14,520 --> 08:03:17,960
if you add up for all these data points what the actual output is
9678
08:03:17,960 --> 08:03:21,000
and what your hypothesis would have predicted.
9679
08:03:21,000 --> 08:03:24,520
So in this case, for example, if we go back to classifying days as raining
9680
08:03:24,520 --> 08:03:27,600
or not raining, and we came up with this decision boundary,
9681
08:03:27,600 --> 08:03:29,560
how would we evaluate this decision boundary?
9682
08:03:29,560 --> 08:03:33,360
How much better is it than drawing the line here or drawing the line there?
9683
08:03:33,360 --> 08:03:35,680
Well, we could take each of the input data points,
9684
08:03:35,680 --> 08:03:38,680
and each input data point has a label, whether it was raining
9685
08:03:38,680 --> 08:03:40,120
or whether it was not raining.
9686
08:03:40,120 --> 08:03:41,960
And we could compare it to the prediction,
9687
08:03:41,960 --> 08:03:44,440
whether we predicted it would be raining or not raining,
9688
08:03:44,440 --> 08:03:47,920
and assign it a numerical value as a result.
9689
08:03:47,920 --> 08:03:51,560
So for example, these points over here, they were all rainy days,
9690
08:03:51,560 --> 08:03:53,360
and we predicted they would be raining, because they
9691
08:03:53,360 --> 08:03:55,080
fall on the bottom side of the line.
9692
08:03:55,080 --> 08:03:58,400
So they have a loss of 0, nothing lost from those situations.
9693
08:03:58,400 --> 08:04:01,080
And likewise, same is true for some of these points over here,
9694
08:04:01,080 --> 08:04:05,160
where it was not raining and we predicted it would not be raining either.
9695
08:04:05,160 --> 08:04:09,760
Where we do have loss are points like this point here and that point there,
9696
08:04:09,760 --> 08:04:13,000
where we predicted that it would not be raining,
9697
08:04:13,000 --> 08:04:14,680
but in actuality, it's a blue point.
9698
08:04:14,680 --> 08:04:15,760
It was raining.
9699
08:04:15,760 --> 08:04:18,840
Or likewise here, we predicted that it would be raining,
9700
08:04:18,840 --> 08:04:20,760
but in actuality, it's a red point.
9701
08:04:20,760 --> 08:04:21,960
It was not raining.
9702
08:04:21,960 --> 08:04:25,160
And so as a result, we miscategorized these data points
9703
08:04:25,160 --> 08:04:27,120
that we were trying to train on.
9704
08:04:27,120 --> 08:04:29,240
And as a result, there is some loss here.
9705
08:04:29,240 --> 08:04:33,000
One loss here, there, here, and there, for a total loss of 4,
9706
08:04:33,000 --> 08:04:34,840
for example, in this case.
9707
08:04:34,840 --> 08:04:37,680
And that might be how we would estimate or how we would say
9708
08:04:37,680 --> 08:04:41,560
that this line is better than a line that goes somewhere else
9709
08:04:41,560 --> 08:04:45,680
or a line that's further down, because this line might minimize the loss.
9710
08:04:45,680 --> 08:04:50,040
So there is no way to do better than just these four points of loss
9711
08:04:50,040 --> 08:04:54,000
if you're just drawing a straight line through our space.
9712
08:04:54,000 --> 08:04:56,280
So the 0-1 loss function checks:
9713
08:04:56,280 --> 08:04:57,040
Did we get it right?
9714
08:04:57,040 --> 08:04:57,960
Did we get it wrong?
9715
08:04:57,960 --> 08:05:00,600
If we got it right, the loss is 0, nothing lost.
9716
08:05:00,600 --> 08:05:04,400
If we got it wrong, then our loss function for that data point says 1.
9717
08:05:04,400 --> 08:05:07,680
And we add up all of those losses across all of our data points
9718
08:05:07,680 --> 08:05:10,240
to get some sort of empirical loss, how much we
9719
08:05:10,240 --> 08:05:13,360
have lost across all of these original data points
9720
08:05:13,360 --> 08:05:16,360
that our algorithm had access to.
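The 0-1 loss just described can be sketched in a few lines of Python. This is a minimal illustration with made-up labels (the function name and data are mine, not from the course code):

```python
def zero_one_loss(actual, predicted):
    """0-1 loss: add 1 for each misclassified point, 0 for each correct one."""
    return sum(0 if a == p else 1 for a, p in zip(actual, predicted))

# Four of these ten labels are predicted incorrectly, matching the
# "total loss of 4" from the example above.
actual    = ["rain", "rain", "rain", "rain", "no", "no", "no", "no", "rain", "no"]
predicted = ["rain", "rain", "rain", "no",   "no", "no", "no", "rain", "no", "rain"]
print(zero_one_loss(actual, predicted))  # 4
```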
9721
08:05:16,360 --> 08:05:19,480
There are other forms of loss as well that work especially well when
9722
08:05:19,480 --> 08:05:21,920
we deal with more real-valued cases, cases
9723
08:05:21,920 --> 08:05:24,840
like the mapping between advertising budget and amount
9724
08:05:24,840 --> 08:05:26,680
that we do in sales, for example.
9725
08:05:26,680 --> 08:05:30,720
Because in that case, you care not just that you get the number exactly right,
9726
08:05:30,720 --> 08:05:33,600
but you care how close you were to the actual value.
9727
08:05:33,600 --> 08:05:37,640
If the actual value is you did like $2,800 in sales
9728
08:05:37,640 --> 08:05:40,880
and you predicted that you would do $2,900 in sales,
9729
08:05:40,880 --> 08:05:42,160
maybe that's pretty good.
9730
08:05:42,160 --> 08:05:45,320
That's much better than if you had predicted you'd do $1,000 in sales,
9731
08:05:45,320 --> 08:05:46,480
for example.
9732
08:05:46,480 --> 08:05:48,760
And so we would like our loss function to be
9733
08:05:48,760 --> 08:05:53,200
able to take that into account as well, take into account not just
9734
08:05:53,200 --> 08:05:57,640
whether the actual value and the expected value are exactly the same,
9735
08:05:57,640 --> 08:06:01,800
but also take into account how far apart they were.
9736
08:06:01,800 --> 08:06:05,360
And so for that, one approach is what we call L1 loss.
9737
08:06:05,360 --> 08:06:08,040
L1 loss doesn't just look at whether actual and predicted
9738
08:06:08,040 --> 08:06:11,980
are equal to each other, but we take the absolute value
9739
08:06:11,980 --> 08:06:15,000
of the actual value minus the predicted value.
9740
08:06:15,000 --> 08:06:19,280
In other words, we just ask how far apart were the actual and predicted
9741
08:06:19,280 --> 08:06:23,000
values, and we sum that up across all of the data points
9742
08:06:23,000 --> 08:06:26,800
to be able to get what our answer ultimately is.
9743
08:06:26,800 --> 08:06:29,600
So what might this actually look like for our data set?
9744
08:06:29,600 --> 08:06:31,520
Well, if we go back to this representation
9745
08:06:31,520 --> 08:06:35,640
where we had advertising along the x-axis, sales along the y-axis,
9746
08:06:35,640 --> 08:06:38,840
our line was our prediction, our estimate for any given
9747
08:06:38,840 --> 08:06:42,920
amount of advertising, what we predicted sales was going to be.
9748
08:06:42,920 --> 08:06:48,240
And our L1 loss is just how far apart vertically along the sales axis
9749
08:06:48,240 --> 08:06:51,000
our prediction was from each of the data points.
9750
08:06:51,000 --> 08:06:53,240
So we could figure out exactly how far apart
9751
08:06:53,240 --> 08:06:55,200
our prediction was from each of the data points
9752
08:06:55,200 --> 08:06:59,120
and figure out as a result of that what our loss is overall
9753
08:06:59,120 --> 08:07:02,160
for this particular hypothesis just by adding up
9754
08:07:02,160 --> 08:07:05,440
all of these various different individual losses for each of these data
9755
08:07:05,440 --> 08:07:06,040
points.
9756
08:07:06,040 --> 08:07:08,720
And our goal then is to try and minimize that loss,
9757
08:07:08,720 --> 08:07:13,480
to try and come up with some line that minimizes what the empirical loss is
9758
08:07:13,480 --> 08:07:16,200
by judging how far away our estimated amount of sales
9759
08:07:16,200 --> 08:07:18,920
is from the actual amount of sales.
9760
08:07:18,920 --> 08:07:21,080
And turns out there are other loss functions as well.
9761
08:07:21,080 --> 08:07:23,680
One that's quite popular is the L2 loss.
9762
08:07:23,680 --> 08:07:26,760
The L2 loss, instead of just using the absolute value,
9763
08:07:26,760 --> 08:07:30,280
like how far away the actual value is from the predicted value,
9764
08:07:30,280 --> 08:07:33,280
it uses the square of actual minus predicted.
9765
08:07:33,280 --> 08:07:36,160
So how far apart are the actual and predicted value?
9766
08:07:36,160 --> 08:07:41,520
And it squares that value, effectively penalizing much more harshly anything
9767
08:07:41,520 --> 08:07:43,120
that is a worse prediction.
9768
08:07:43,120 --> 08:07:45,560
So you imagine if you have two data points
9769
08:07:45,560 --> 08:07:50,080
that you predict as being one value away from their actual value,
9770
08:07:50,080 --> 08:07:53,760
as opposed to one data point that you predict as being two away
9771
08:07:53,760 --> 08:07:56,840
from its actual value, the L2 loss function
9772
08:07:56,840 --> 08:08:00,120
will more harshly penalize that one that is two away,
9773
08:08:00,120 --> 08:08:03,040
because it's going to square however much the difference is
9774
08:08:03,040 --> 08:08:05,360
between the actual value and the predicted value.
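The L1 and L2 losses described above can be sketched directly; the dollar figures below are illustrative, echoing the sales example (these helper functions are mine, not the course's):

```python
def l1_loss(actual, predicted):
    """L1 loss: sum of absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def l2_loss(actual, predicted):
    """L2 loss: sum of squared differences, penalizing bigger misses more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Two predictions each off by 1 vs. one prediction off by 2:
print(l1_loss([2800, 2800], [2799, 2801]))  # 2
print(l1_loss([2800], [2802]))              # 2  (L1 treats these the same)
print(l2_loss([2800, 2800], [2799, 2801]))  # 2
print(l2_loss([2800], [2802]))              # 4  (L2 penalizes the one big miss)
```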
9775
08:08:05,360 --> 08:08:07,280
And depending on the situation, you might
9776
08:08:07,280 --> 08:08:10,440
want to choose a loss function depending on what you care about minimizing.
9777
08:08:10,440 --> 08:08:14,040
If you really care about minimizing the error on more outlier cases,
9778
08:08:14,040 --> 08:08:15,880
then you might want to consider something like this.
9779
08:08:15,880 --> 08:08:18,040
But if you've got a lot of outliers, and you don't necessarily
9780
08:08:18,040 --> 08:08:21,560
care about modeling them, then maybe an L1 loss function is preferable.
9781
08:08:21,560 --> 08:08:23,720
But there are trade-offs here that you need to decide,
9782
08:08:23,720 --> 08:08:26,560
based on a particular set of data.
9783
08:08:26,560 --> 08:08:29,480
But what you do run the risk of with any of these loss functions,
9784
08:08:29,480 --> 08:08:33,320
with anything that we're trying to do, is a problem known as overfitting.
9785
08:08:33,320 --> 08:08:36,320
And overfitting is a big problem that you can encounter in machine learning,
9786
08:08:36,320 --> 08:08:41,280
which happens anytime a model fits too closely with a data set,
9787
08:08:41,280 --> 08:08:44,360
and as a result, fails to generalize.
9788
08:08:44,360 --> 08:08:48,040
We would like our model to be able to accurately predict
9789
08:08:48,040 --> 08:08:52,280
data and inputs and output pairs for the data that we have access to.
9790
08:08:52,280 --> 08:08:55,160
But the reason we wanted to do so is because we
9791
08:08:55,160 --> 08:08:59,520
want our model to generalize well to data that we haven't seen before.
9792
08:08:59,520 --> 08:09:01,760
I would like to take data from the past year
9793
08:09:01,760 --> 08:09:03,760
of whether it was raining or not raining,
9794
08:09:03,760 --> 08:09:06,360
and use that data to generalize it towards the future.
9795
08:09:06,360 --> 08:09:09,080
Say, in the future, is it going to be raining or not raining?
9796
08:09:09,080 --> 08:09:12,520
Or if I have a whole bunch of data on what counterfeit and not counterfeit
9797
08:09:12,520 --> 08:09:16,560
US dollar bills look like in the past when people have encountered them,
9798
08:09:16,560 --> 08:09:19,440
I'd like to train a computer to be able to, in the future,
9799
08:09:19,440 --> 08:09:24,840
generalize to other dollar bills that I might see as well.
9800
08:09:24,840 --> 08:09:28,080
And the problem with overfitting is that if you try and tie yourself
9801
08:09:28,080 --> 08:09:32,240
too closely to the data set that you're training your model on,
9802
08:09:32,240 --> 08:09:35,000
you can end up not generalizing very well.
9803
08:09:35,000 --> 08:09:36,120
So what does this look like?
9804
08:09:36,120 --> 08:09:38,520
Well, we might imagine the rainy day and not rainy day
9805
08:09:38,520 --> 08:09:41,640
example again from here, where the blue points indicate rainy days
9806
08:09:41,640 --> 08:09:43,920
and the red points indicate not rainy days.
9807
08:09:43,920 --> 08:09:47,160
And we decided that we felt pretty comfortable with drawing a line
9808
08:09:47,160 --> 08:09:52,000
like this as the decision boundary between rainy days and not rainy days.
9809
08:09:52,000 --> 08:09:55,000
So we can pretty comfortably say that points on this side
9810
08:09:55,000 --> 08:09:57,960
more likely to be rainy days, points on that side more
9811
08:09:57,960 --> 08:09:59,800
likely to be not rainy days.
9812
08:09:59,800 --> 08:10:04,360
But the loss, the empirical loss, isn't zero in this particular case
9813
08:10:04,360 --> 08:10:07,040
because we didn't categorize everything perfectly.
9814
08:10:07,040 --> 08:10:10,600
There was this one outlier, this one day that it wasn't raining,
9815
08:10:10,600 --> 08:10:13,520
but yet our model still predicts that it is raining.
9816
08:10:13,520 --> 08:10:15,640
But that doesn't necessarily mean our model is bad.
9817
08:10:15,640 --> 08:10:18,760
It just means the model isn't 100% accurate.
9818
08:10:18,760 --> 08:10:21,620
If you really wanted to try and find a hypothesis that
9819
08:10:21,620 --> 08:10:25,000
resulted in minimizing the loss, you could come up
9820
08:10:25,000 --> 08:10:26,500
with a different decision boundary.
9821
08:10:26,500 --> 08:10:30,040
It wouldn't be a line, but it would look something like this.
9822
08:10:30,040 --> 08:10:34,040
This decision boundary does separate all of the red points
9823
08:10:34,040 --> 08:10:37,720
from all of the blue points because the red points fall
9824
08:10:37,720 --> 08:10:40,320
on this side of this decision boundary, the blue points
9825
08:10:40,320 --> 08:10:42,640
fall on the other side of the decision boundary.
9826
08:10:42,640 --> 08:10:47,480
But this, we would probably argue, is not as good of a prediction.
9827
08:10:47,480 --> 08:10:50,400
Even though it seems to be more accurate based
9828
08:10:50,400 --> 08:10:53,120
on all of the available training data that we
9829
08:10:53,120 --> 08:10:55,520
have for training this machine learning model,
9830
08:10:55,520 --> 08:10:58,280
we might say that it's probably not going to generalize well.
9831
08:10:58,280 --> 08:11:00,680
That if there were other data points like here and there,
9832
08:11:00,680 --> 08:11:03,600
we might still want to consider those to be rainy days
9833
08:11:03,600 --> 08:11:06,600
because we think this was probably just an outlier.
9834
08:11:06,600 --> 08:11:10,400
So if the only thing you care about is minimizing the loss on the data
9835
08:11:10,400 --> 08:11:13,280
you have available to you, you run the risk of overfitting.
9836
08:11:13,280 --> 08:11:15,480
And this can happen in the classification case.
9837
08:11:15,480 --> 08:11:18,400
It can also happen in the regression case,
9838
08:11:18,400 --> 08:11:21,720
that here we predicted what we thought was a pretty good line relating
9839
08:11:21,720 --> 08:11:24,600
advertising to sales, trying to predict what sales were going
9840
08:11:24,600 --> 08:11:26,840
to be for a given amount of advertising.
9841
08:11:26,840 --> 08:11:29,560
But I could come up with a line that does a better job of predicting
9842
08:11:29,560 --> 08:11:32,640
the training data, and it would be something that looks like this,
9843
08:11:32,640 --> 08:11:35,560
just connecting all of the various different data points.
9844
08:11:35,560 --> 08:11:37,680
And now there is no loss at all.
9845
08:11:37,680 --> 08:11:41,520
Now I've perfectly predicted, given any advertising, what sales are.
9846
08:11:41,520 --> 08:11:45,360
And for all the data available to me, it's going to be accurate.
9847
08:11:45,360 --> 08:11:47,920
But it's probably not going to generalize very well.
9848
08:11:47,920 --> 08:11:52,920
I have overfit my model on the training data that is available to me.
9849
08:11:52,920 --> 08:11:54,960
And so in general, we want to avoid overfitting.
9850
08:11:54,960 --> 08:11:58,480
We'd like strategies to make sure that we haven't overfit our model
9851
08:11:58,480 --> 08:12:00,120
to a particular data set.
9852
08:12:00,120 --> 08:12:02,720
And there are a number of ways that you could try to do this.
9853
08:12:02,720 --> 08:12:05,760
One way is by examining what it is that we're optimizing for.
9854
08:12:05,760 --> 08:12:10,000
In an optimization problem, all we do is we say, there is some cost,
9855
08:12:10,000 --> 08:12:12,520
and I want to minimize that cost.
9856
08:12:12,520 --> 08:12:17,360
And so far, we've defined that cost function, the cost of a hypothesis,
9857
08:12:17,360 --> 08:12:21,160
just as being equal to the empirical loss of that hypothesis,
9858
08:12:21,160 --> 08:12:25,120
like how far away are the actual data points, the outputs,
9859
08:12:25,120 --> 08:12:29,440
away from what I predicted them to be based on that particular hypothesis.
9860
08:12:29,440 --> 08:12:32,400
And if all you're trying to do is minimize cost, meaning minimizing
9861
08:12:32,400 --> 08:12:36,960
the loss in this case, then the result is going to be that you might overfit,
9862
08:12:36,960 --> 08:12:41,360
that to minimize cost, you're going to try and find a way to perfectly match
9863
08:12:41,360 --> 08:12:42,760
all the input data.
9864
08:12:42,760 --> 08:12:46,000
And that might happen as a result of overfitting
9865
08:12:46,000 --> 08:12:48,560
on that particular input data.
9866
08:12:48,560 --> 08:12:52,600
So in order to address this, you could add something to the cost function.
9867
08:12:52,600 --> 08:12:56,440
What counts as cost will be not just loss, but also
9868
08:12:56,440 --> 08:12:59,560
some measure of the complexity of the hypothesis.
9869
08:12:59,560 --> 08:13:02,040
Where the complexity of the hypothesis is something
9870
08:13:02,040 --> 08:13:06,160
that you would need to define, for how complicated our line looks.
9871
08:13:06,160 --> 08:13:08,600
This is sort of an Occam's razor-style approach
9872
08:13:08,600 --> 08:13:12,080
where we want to give preference to a simpler decision boundary,
9873
08:13:12,080 --> 08:13:15,920
like a straight line, for example, some simpler curve, as opposed
9874
08:13:15,920 --> 08:13:19,760
to something far more complex that might represent the training data better
9875
08:13:19,760 --> 08:13:21,400
but might not generalize as well.
9876
08:13:21,400 --> 08:13:26,280
We'll generally say that a simpler solution is probably the better solution
9877
08:13:26,280 --> 08:13:31,280
and probably the one that is more likely to generalize well to other inputs.
9878
08:13:31,280 --> 08:13:34,960
So we measure what the loss is, but we also measure the complexity.
9879
08:13:34,960 --> 08:13:38,720
And now that all gets taken into account when we consider the overall cost,
9880
08:13:38,720 --> 08:13:42,000
that yes, something might have less loss if it better predicts the training
9881
08:13:42,000 --> 08:13:45,400
data, but if it's much more complex, it still
9882
08:13:45,400 --> 08:13:48,080
might not be the best option that we have.
9883
08:13:48,080 --> 08:13:51,880
And we need to come up with some balance between loss and complexity.
9884
08:13:51,880 --> 08:13:54,120
And for that reason, you'll often see this represented
9885
08:13:54,120 --> 08:13:58,400
as multiplying the complexity by some parameter that we have to choose,
9886
08:13:58,400 --> 08:14:02,520
parameter lambda in this case, where we're saying if lambda is a greater
9887
08:14:02,520 --> 08:14:06,840
value, then we really want to penalize more complex hypotheses.
9888
08:14:06,840 --> 08:14:10,560
Whereas if lambda is smaller, we're going to penalize more complex hypotheses
9889
08:14:10,560 --> 08:14:14,560
only a little bit, and it's up to the machine learning programmer
9890
08:14:14,560 --> 08:14:17,400
to decide where they want to set that value of lambda
9891
08:14:17,400 --> 08:14:21,360
for how much do I want to penalize a more complex hypothesis that
9892
08:14:21,360 --> 08:14:23,360
might fit the data a little better.
9893
08:14:23,360 --> 08:14:25,920
And again, there's no one right answer to a lot of these things,
9894
08:14:25,920 --> 08:14:29,320
but depending on the data set, depending on the data you have available to you
9895
08:14:29,320 --> 08:14:32,240
and the problem you're trying to solve, your choice of these parameters
9896
08:14:32,240 --> 08:14:34,600
may vary, and you may need to experiment a little bit
9897
08:14:34,600 --> 08:14:38,720
to figure out what the right choice of that is ultimately going to be.
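The cost function just described, loss plus a lambda-weighted complexity penalty, can be sketched numerically. The specific loss and complexity values here are made up purely for illustration:

```python
def cost(loss, complexity, lam):
    """Regularized cost: empirical loss plus lambda times complexity.
    With lam = 0, this reduces to plain loss minimization."""
    return loss + lam * complexity

# A wiggly boundary fits the training data perfectly (loss 0) but is complex;
# a straight line misses a few points (loss 4) but is simple.
wiggly = cost(loss=0, complexity=10, lam=1.0)
line   = cost(loss=4, complexity=1,  lam=1.0)
print(line < wiggly)  # True: with this lambda, the simpler line wins
```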
9898
08:14:38,720 --> 08:14:41,600
This process, then, of considering not only loss,
9899
08:14:41,600 --> 08:14:45,920
but also some measure of the complexity is known as regularization.
9900
08:14:45,920 --> 08:14:49,680
Regularization is the process of penalizing a hypothesis that
9901
08:14:49,680 --> 08:14:54,200
is more complex in order to favor a simpler hypothesis that is more
9902
08:14:54,200 --> 08:14:56,600
likely to generalize well, more likely to be
9903
08:14:56,600 --> 08:15:01,120
able to apply to other situations that are dealing with other input points
9904
08:15:01,120 --> 08:15:04,480
unlike the ones that we've necessarily seen before.
9905
08:15:04,480 --> 08:15:08,440
So oftentimes, you'll see us add some regularizing term
9906
08:15:08,440 --> 08:15:14,120
to what we're trying to minimize in order to avoid this problem of overfitting.
9907
08:15:14,120 --> 08:15:17,240
Now, another way of making sure we don't overfit
9908
08:15:17,240 --> 08:15:20,400
is to run some experiments and to see whether or not
9909
08:15:20,400 --> 08:15:25,320
we are able to generalize our model that we've created to other data sets
9910
08:15:25,320 --> 08:15:26,160
as well.
9911
08:15:26,160 --> 08:15:28,480
And it's for that reason that oftentimes when you're
9912
08:15:28,480 --> 08:15:30,720
doing a machine learning experiment, when you've got some data
9913
08:15:30,720 --> 08:15:33,360
and you want to try and come up with some function that predicts,
9914
08:15:33,360 --> 08:15:36,120
given some input, what the output is going to be,
9915
08:15:36,120 --> 08:15:39,720
you don't necessarily want to do your training on all of the data
9916
08:15:39,720 --> 08:15:42,120
you have available to you. Instead, you could employ
9917
08:15:42,120 --> 08:15:45,360
a method known as holdout cross-validation,
9918
08:15:45,360 --> 08:15:48,400
where in holdout cross-validation, we split up our data.
9919
08:15:48,400 --> 08:15:53,400
We split up our data into a training set and a testing set.
9920
08:15:53,400 --> 08:15:55,240
The training set is the set of data that we're
9921
08:15:55,240 --> 08:15:57,800
going to use to train our machine learning model.
9922
08:15:57,800 --> 08:16:00,460
And the testing set is the set of data that we're
9923
08:16:00,460 --> 08:16:04,160
going to use in order to test to see how well our machine learning
9924
08:16:04,160 --> 08:16:06,600
model actually performed.
9925
08:16:06,600 --> 08:16:08,680
So the learning happens on the training set.
9926
08:16:08,680 --> 08:16:10,520
We figure out what the parameters should be.
9927
08:16:10,520 --> 08:16:12,600
We figure out what the right model is.
9928
08:16:12,600 --> 08:16:15,280
And then we see, all right, now that we've trained the model,
9929
08:16:15,280 --> 08:16:17,920
we'll see how well it does at predicting things
9930
08:16:17,920 --> 08:16:22,200
inside of the testing set, some set of data that we haven't seen before.
9931
08:16:22,200 --> 08:16:24,040
And the hope then is that we're going to be
9932
08:16:24,040 --> 08:16:26,360
able to predict the testing set pretty well
9933
08:16:26,360 --> 08:16:29,380
if we're able to generalize based on the training
9934
08:16:29,380 --> 08:16:31,000
data that's available to us.
9935
08:16:31,000 --> 08:16:32,760
If we've overfit the training data, though,
9936
08:16:32,760 --> 08:16:36,360
and we're not able to generalize, well, then when we look at the testing set,
9937
08:16:36,360 --> 08:16:38,000
it's likely going to be the case that we're not
9938
08:16:38,000 --> 08:16:42,000
going to predict things in the testing set nearly as effectively.
9939
08:16:42,000 --> 08:16:44,160
So this is one method of cross-validation,
9940
08:16:44,160 --> 08:16:46,720
validating to make sure that the work we have done
9941
08:16:46,720 --> 08:16:49,680
is actually going to generalize to other data sets as well.
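Holdout cross-validation as described can be sketched with scikit-learn's `train_test_split`. The data below is a synthetic stand-in (a linearly separable toy problem), since no real data set is attached at this point:

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two integer features, label 1 when their sum
# exceeds 10 (linearly separable, so a perceptron can learn it)
X = [[i, j] for i in range(10) for j in range(10)]
y = [1 if i + j > 10 else 0 for i, j in X]

# Hold out half of the data for testing; learning happens only on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = Perceptron()
model.fit(X_train, y_train)

# Accuracy on data the model never saw during training
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```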
9942
08:16:49,680 --> 08:16:52,520
And there are other statistical techniques we can use as well.
9943
08:16:52,520 --> 08:16:55,800
One of the downsides of this simple holdout cross-validation
9944
08:16:55,800 --> 08:17:00,160
is if you say I just split it 50-50, I train using 50% of the data
9945
08:17:00,160 --> 08:17:04,000
and test using the other 50%, or you could choose other percentages as well,
9946
08:17:04,000 --> 08:17:08,560
is that there is a fair amount of data that I am now not using to train,
9947
08:17:08,560 --> 08:17:12,560
with which I might otherwise be able to get a better model, for example.
9948
08:17:12,560 --> 08:17:16,440
So one approach is known as k-fold cross-validation.
9949
08:17:16,440 --> 08:17:20,640
In k-fold cross-validation, rather than just divide things into two sets
9950
08:17:20,640 --> 08:17:24,920
and run one experiment, we divide things into k different sets.
9951
08:17:24,920 --> 08:17:27,720
So maybe I divide things up into 10 different sets
9952
08:17:27,720 --> 08:17:30,320
and then run 10 different experiments.
9953
08:17:30,320 --> 08:17:33,680
So if I split up my data into 10 different sets of data,
9954
08:17:33,680 --> 08:17:37,360
then what I'll do is each time for each of my 10 experiments,
9955
08:17:37,360 --> 08:17:40,360
I will hold out one of those sets of data, where I'll say,
9956
08:17:40,360 --> 08:17:43,240
let me train my model on these nine sets,
9957
08:17:43,240 --> 08:17:47,000
and then test to see how well it predicts on set number 10.
9958
08:17:47,000 --> 08:17:50,120
And then pick another set of nine sets to train on,
9959
08:17:50,120 --> 08:17:52,240
and then test it on the other one that I held out,
9960
08:17:52,240 --> 08:17:55,400
where each time I train the model on everything
9961
08:17:55,400 --> 08:17:57,840
minus the one set that I'm holding out, and then
9962
08:17:57,840 --> 08:18:02,040
test to see how well our model performs on the set that I did hold out.
9963
08:18:02,040 --> 08:18:04,240
And what you end up getting is 10 different results,
9964
08:18:04,240 --> 08:18:07,400
10 different answers for how accurately our model worked.
9965
08:18:07,400 --> 08:18:09,800
And oftentimes, you could just take the average of those 10
9966
08:18:09,800 --> 08:18:14,040
to get an approximation for how well we think our model performs overall.
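The k-fold procedure just described is what scikit-learn's `cross_val_score` automates; a minimal sketch on the same kind of synthetic, linearly separable stand-in data:

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: label 1 when the two features sum past 10
X = [[i, j] for i in range(10) for j in range(10)]
y = [1 if i + j > 10 else 0 for i, j in X]

# 10-fold cross-validation: each fold is held out once while the model
# trains on the other nine, yielding 10 accuracy scores to average
scores = cross_val_score(Perceptron(), X, y, cv=10)
print(f"Mean accuracy across 10 folds: {scores.mean():.2f}")
```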
9967
08:18:14,040 --> 08:18:18,200
But the key idea is separating the training data from the testing data,
9968
08:18:18,200 --> 08:18:20,600
because you want to test your model on data
9969
08:18:20,600 --> 08:18:23,360
that is different from what you trained the model on.
9970
08:18:23,360 --> 08:18:25,360
Because during training, you want to avoid overfitting.
9971
08:18:25,360 --> 08:18:26,880
You want to be able to generalize.
9972
08:18:26,880 --> 08:18:29,480
And the way you test whether you're able to generalize
9973
08:18:29,480 --> 08:18:32,520
is by looking at some data that you haven't seen before
9974
08:18:32,520 --> 08:18:36,200
and seeing how well we're actually able to perform.
9975
08:18:36,200 --> 08:18:38,960
And so if we want to actually implement any of these techniques
9976
08:18:38,960 --> 08:18:42,720
inside of a programming language like Python, there are a number of ways we could do that.
9977
08:18:42,720 --> 08:18:45,000
We could write this from scratch on our own,
9978
08:18:45,000 --> 08:18:46,760
but there are libraries out there that allow
9979
08:18:46,760 --> 08:18:50,240
us to take advantage of existing implementations of these algorithms,
9980
08:18:50,240 --> 08:18:53,000
that we can use the same types of algorithms
9981
08:18:53,000 --> 08:18:54,880
in a lot of different situations.
9982
08:18:54,880 --> 08:18:58,280
And so there's a very popular library known as scikit-learn,
9983
08:18:58,280 --> 08:19:01,520
which allows us in Python to be able to very quickly get
9984
08:19:01,520 --> 08:19:03,920
set up with a lot of these different machine learning models.
9985
08:19:03,920 --> 08:19:06,440
This library has already-written algorithms
9986
08:19:06,440 --> 08:19:09,360
for nearest neighbor classification, for doing perceptron learning,
9987
08:19:09,360 --> 08:19:12,800
for doing a bunch of other types of inference and supervised learning
9988
08:19:12,800 --> 08:19:14,360
that we haven't yet talked about.
9989
08:19:14,360 --> 08:19:19,760
But using it, we can begin to try actually testing how these methods work
9990
08:19:19,760 --> 08:19:22,240
and how accurately they perform.
9991
08:19:22,240 --> 08:19:24,480
So let's go ahead and take a look at one approach
9992
08:19:24,480 --> 08:19:26,840
to trying to solve this type of problem.
9993
08:19:26,840 --> 08:19:30,360
All right, so I'm first going to pull up banknotes.csv, which
9994
08:19:30,360 --> 08:19:33,020
is a whole bunch of data provided by UC Irvine, which
9995
08:19:33,020 --> 08:19:36,080
is information about various different banknotes.
9996
08:19:36,080 --> 08:19:38,360
People took pictures of various different banknotes
9997
08:19:38,360 --> 08:19:41,440
and measured various different properties of those banknotes.
9998
08:19:41,440 --> 08:19:45,120
And in particular, some human categorized each of those banknotes
9999
08:19:45,120 --> 08:19:48,720
as either a counterfeit banknote or as not counterfeit.
10000
08:19:48,720 --> 08:19:52,480
And so what you're looking at here is each row represents one banknote.
10001
08:19:52,480 --> 08:19:55,960
This is formatted as a CSV spreadsheet, with just comma-separated values
10002
08:19:55,960 --> 08:19:58,680
separating each of these various different fields.
10003
08:19:58,680 --> 08:20:03,000
We have four different input values for each of these data points,
10004
08:20:03,000 --> 08:20:06,400
just information, some measurement that was made on the banknote.
10005
08:20:06,400 --> 08:20:09,280
And what those measurements exactly are aren't as important as the fact
10006
08:20:09,280 --> 08:20:11,280
that we do have access to this data.
10007
08:20:11,280 --> 08:20:14,880
But more importantly, we have access for each of these data points
10008
08:20:14,880 --> 08:20:19,160
to a label, where 0 indicates something like this was not a counterfeit bill,
10009
08:20:19,160 --> 08:20:20,840
meaning it was an authentic bill.
10010
08:20:20,840 --> 08:20:25,440
And a data point labeled 1 means that it is a counterfeit bill,
10011
08:20:25,440 --> 08:20:29,080
at least according to the human researcher who labeled this particular data.
10012
08:20:29,080 --> 08:20:31,280
So we have a whole bunch of data representing
10013
08:20:31,280 --> 08:20:33,860
a whole bunch of different data points, each of which
10014
08:20:33,860 --> 08:20:35,600
has these various different measurements that
10015
08:20:35,600 --> 08:20:38,000
were made on that particular bill, and each of which
10016
08:20:38,000 --> 08:20:44,200
has an output value, 0 or 1, 0 meaning it was a genuine bill, 1 meaning
10017
08:20:44,200 --> 08:20:46,000
it was a counterfeit bill.
10018
08:20:46,000 --> 08:20:48,560
And what we would like to do is use supervised learning
10019
08:20:48,560 --> 08:20:51,600
to begin to predict or model some sort of function that
10020
08:20:51,600 --> 08:20:55,480
can take these four values as input and predict what the output would be.
10021
08:20:55,480 --> 08:20:58,600
We want our learning algorithm to find some sort of pattern
10022
08:20:58,600 --> 08:21:01,040
that is able to predict based on these measurements, something
10023
08:21:01,040 --> 08:21:03,640
that you could measure just by taking a photo of a bill,
10024
08:21:03,640 --> 08:21:09,200
predict whether that bill is authentic or whether that bill is counterfeit.
10025
08:21:09,200 --> 08:21:10,560
And so how can we do that?
10026
08:21:10,560 --> 08:21:13,700
Well, I'm first going to open up banknote0.py
10027
08:21:13,700 --> 08:21:15,960
and see how it is that we do this.
10028
08:21:15,960 --> 08:21:18,960
I'm first importing a lot of things from scikit-learn,
10029
08:21:18,960 --> 08:21:23,480
but importantly, I'm going to set my model equal to the perceptron model,
10030
08:21:23,480 --> 08:21:25,360
which is one of those models that we talked about before.
10031
08:21:25,360 --> 08:21:28,080
We're just going to try and figure out some setting of weights
10032
08:21:28,080 --> 08:21:31,880
that is able to divide our data into two different groups.
10033
08:21:31,880 --> 08:21:36,200
Then I'm going to go ahead and read data in from my file, banknotes.csv.
10034
08:21:36,200 --> 08:21:39,600
And basically, for every row, I'm going to separate that row
10035
08:21:39,600 --> 08:21:44,400
into the first four values of that row, which is the evidence for that row.
10036
08:21:44,400 --> 08:21:49,860
And then the label, where if the final column in that row is a 0,
10037
08:21:49,860 --> 08:21:51,240
the label is authentic.
10038
08:21:51,240 --> 08:21:53,680
And otherwise, it's going to be counterfeit.
10039
08:21:53,680 --> 08:21:56,820
So I'm effectively reading data in from the CSV file,
10040
08:21:56,820 --> 08:22:00,280
dividing into a whole bunch of rows where each row has some evidence,
10041
08:22:00,280 --> 08:22:04,320
those four input values that are going to be inputs to my hypothesis function.
10042
08:22:04,320 --> 08:22:07,680
And then the label, the output, whether it is authentic or counterfeit,
10043
08:22:07,680 --> 08:22:10,120
that is the thing that I am then trying to predict.
10044
08:22:10,120 --> 08:22:12,880
So the next step is that I would like to split up my data set
10045
08:22:12,880 --> 08:22:15,960
into a training set and a testing set, some set of data
10046
08:22:15,960 --> 08:22:18,320
that I would like to train my machine learning model on,
10047
08:22:18,320 --> 08:22:21,040
and some set of data that I would like to use to test that model,
10048
08:22:21,040 --> 08:22:22,440
see how well it performed.
10049
08:22:22,440 --> 08:22:25,360
So what I'll do is I'll go ahead and figure out length of the data,
10050
08:22:25,360 --> 08:22:27,080
how many data points do I have.
10051
08:22:27,080 --> 08:22:30,400
I'll go ahead and take half of them and save that number as a variable called holdout.
10052
08:22:30,400 --> 08:22:33,440
That is how many items I'm going to hold out for my data set
10053
08:22:33,440 --> 08:22:35,320
to save for the testing phase.
10054
08:22:35,320 --> 08:22:38,180
I'll randomly shuffle the data so it's in some random order.
10055
08:22:38,180 --> 08:22:43,360
And then I'll say my testing set will be all of the data up to the holdout.
10056
08:22:43,360 --> 08:22:47,720
So I'll take holdout many data items, and that will be my testing set.
10057
08:22:47,720 --> 08:22:51,000
My training data will be everything else, the information
10058
08:22:51,000 --> 08:22:53,800
that I'm going to train my model on.
10059
08:22:53,800 --> 08:22:58,960
And then I'll say I need to divide my training data into two different sets.
10060
08:22:58,960 --> 08:23:03,680
I need to divide it into my x values, where x here represents the inputs.
10061
08:23:03,680 --> 08:23:06,600
So the x values, the ones that I'm going to train on,
10062
08:23:06,600 --> 08:23:09,040
are basically for every row in my training set,
10063
08:23:09,040 --> 08:23:12,080
I'm going to get the evidence for that row, those four values,
10064
08:23:12,080 --> 08:23:14,840
where it's basically a vector of four numbers, where
10065
08:23:14,840 --> 08:23:16,920
that is going to be all of the input.
10066
08:23:16,920 --> 08:23:18,400
And then I need the y values.
10067
08:23:18,400 --> 08:23:20,400
What are the outputs that I want to learn from,
10068
08:23:20,400 --> 08:23:23,780
the labels that belong to each of these various different input points?
10069
08:23:23,780 --> 08:23:26,600
Well, that's going to be the same thing for each row in the training data.
10070
08:23:26,600 --> 08:23:29,360
But this time, I take that row and get what its label is,
10071
08:23:29,360 --> 08:23:31,640
whether it is authentic or counterfeit.
10072
08:23:31,640 --> 08:23:36,200
So I end up with one list of all of these vectors of my input data,
10073
08:23:36,200 --> 08:23:38,720
and one list, which follows the same order,
10074
08:23:38,720 --> 08:23:42,640
but is all of the labels that correspond with each of those vectors.
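Separating the training rows into those two parallel lists can look something like this; the `training` list of dicts is the same assumed structure as above, not the course's literal code.

```python
# Two parallel lists: X_training holds the four-number evidence
# vectors, y_training holds the label for the row in the same position.
training = [
    {"evidence": [3.62, 8.67, -2.81, -0.45], "label": "Authentic"},
    {"evidence": [-2.34, 5.21, 1.10, 0.64], "label": "Counterfeit"},
]

X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
```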
10075
08:23:42,640 --> 08:23:46,720
And then to train my model, which in this case is just this perceptron model,
10076
08:23:46,720 --> 08:23:49,960
I just call model.fit, pass in the training data,
10077
08:23:49,960 --> 08:23:52,640
and what the labels for those training data are.
10078
08:23:52,640 --> 08:23:54,960
And scikit-learn will take care of fitting the model,
10079
08:23:54,960 --> 08:23:57,080
will do the entire algorithm for me.
10080
08:23:57,080 --> 08:24:01,240
And then when it's done, I can then test to see how well that model performed.
10081
08:24:01,240 --> 08:24:04,200
So I can say, let me get all of these input vectors
10082
08:24:04,200 --> 08:24:05,880
for what I want to test on.
10083
08:24:05,880 --> 08:24:09,800
So for each row in my testing data set, go ahead and get the evidence.
10084
08:24:09,800 --> 08:24:13,400
And the y values, those are what the actual values were
10085
08:24:13,400 --> 08:24:17,520
for each of the rows in the testing data set, what the actual label is.
10086
08:24:17,520 --> 08:24:19,800
But then I'm going to generate some predictions.
10087
08:24:19,800 --> 08:24:22,280
I'm going to use this model and try and predict,
10088
08:24:22,280 --> 08:24:26,840
based on the testing vectors, I want to predict what the output is.
10089
08:24:26,840 --> 08:24:31,160
And my goal then is to now compare y testing with predictions.
10090
08:24:31,160 --> 08:24:34,360
I want to see how well my predictions, based on the model,
10091
08:24:34,360 --> 08:24:38,240
actually reflect what the y values were, the outputs
10092
08:24:38,240 --> 08:24:39,480
that were actually labeled.
10093
08:24:39,480 --> 08:24:44,320
Because I now have this label data, I can assess how well the algorithm worked.
10094
08:24:44,320 --> 08:24:47,060
And so now I can just compute how well we did.
10095
08:24:47,060 --> 08:24:49,960
This zip function basically just lets
10096
08:24:49,960 --> 08:24:53,440
me loop through two different lists in parallel, one pair at a time.
10097
08:24:53,440 --> 08:24:57,160
So for each actual value and for each predicted value,
10098
08:24:57,160 --> 08:24:59,200
if the actual is the same thing as what I predicted,
10099
08:24:59,200 --> 08:25:01,400
I'll go ahead and increment the counter by one.
10100
08:25:01,400 --> 08:25:04,760
Otherwise, I'll increment my incorrect counter by one.
10101
08:25:04,760 --> 08:25:06,880
And so at the end, I can print out, here are the results,
10102
08:25:06,880 --> 08:25:09,380
here's how many I got right, here's how many I got wrong,
10103
08:25:09,380 --> 08:25:12,560
and here was my overall accuracy, for example.
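The counting step with zip can be sketched like this, using small made-up label lists in place of the real testing data:

```python
# Walk the actual labels and the predicted labels in parallel,
# counting how many predictions matched.
y_testing = ["Authentic", "Counterfeit", "Authentic", "Authentic"]
predictions = ["Authentic", "Counterfeit", "Counterfeit", "Authentic"]

correct = 0
incorrect = 0
for actual, predicted in zip(y_testing, predictions):
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

print(f"Correct: {correct}")      # Correct: 3
print(f"Incorrect: {incorrect}")  # Incorrect: 1
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")  # 75.00%
```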
10104
08:25:12,560 --> 08:25:14,000
So I can go ahead and run this.
10105
08:25:14,000 --> 08:25:17,720
I can run python banknote0.py.
10106
08:25:17,720 --> 08:25:20,000
And it's going to train on half the data set
10107
08:25:20,000 --> 08:25:21,760
and then test on half the data set.
10108
08:25:21,760 --> 08:25:24,040
And here are the results for my perceptron model.
10109
08:25:24,040 --> 08:25:29,020
In this case, it was able to correctly classify 679 bills
10110
08:25:29,020 --> 08:25:33,400
as either authentic or counterfeit and incorrectly classified seven of them,
10111
08:25:33,400 --> 08:25:37,000
for an overall accuracy of close to 99%.
10112
08:25:37,000 --> 08:25:40,160
So on this particular data set, using this perceptron model,
10113
08:25:40,160 --> 08:25:44,240
we were able to predict very well what the output was going to be.
10114
08:25:44,240 --> 08:25:46,600
And we can try different models, too; scikit-learn
10115
08:25:46,600 --> 08:25:50,880
makes it very easy to swap out one model for another.
10116
08:25:50,880 --> 08:25:55,640
So instead of the perceptron model, I can use a support vector machine
10117
08:25:55,640 --> 08:25:59,440
via the SVC, otherwise known as a support vector classifier,
10118
08:25:59,440 --> 08:26:01,880
which uses a support vector machine to classify things
10119
08:26:01,880 --> 08:26:03,640
into two different groups.
10120
08:26:03,640 --> 08:26:07,120
And now see, all right, how well does this perform?
10121
08:26:07,120 --> 08:26:10,560
And all right, this time, we were able to correctly predict 682
10122
08:26:10,560 --> 08:26:15,200
and incorrectly predicted four, for an accuracy of 99.4%.
10123
08:26:15,200 --> 08:26:20,680
And we could even try the k-neighbors classifier as the model instead.
10124
08:26:20,680 --> 08:26:24,160
And this takes a parameter, n_neighbors, for how many neighbors
10125
08:26:24,160 --> 08:26:25,160
do you want to look at?
10126
08:26:25,160 --> 08:26:27,480
Let's just look at one neighbor, the one nearest neighbor,
10127
08:26:27,480 --> 08:26:29,000
and use that to predict.
10128
08:26:29,000 --> 08:26:31,080
Go ahead and run this as well.
10129
08:26:31,080 --> 08:26:33,520
And it looks like, based on the k-neighbors classifier,
10130
08:26:33,520 --> 08:26:36,400
looking at just one neighbor, we were able to correctly classify
10131
08:26:36,400 --> 08:26:40,360
685 data points, incorrectly classified one.
10132
08:26:40,360 --> 08:26:43,560
Maybe let's try three neighbors instead, instead of just using one neighbor.
10133
08:26:43,560 --> 08:26:45,360
Do more of a k-nearest neighbors approach,
10134
08:26:45,360 --> 08:26:48,640
where I look at the three nearest neighbors and see how that performs.
10135
08:26:48,640 --> 08:26:54,240
And that one, in this case, seems to have gotten 100% of the predictions
10136
08:26:54,240 --> 08:26:58,280
correct, classifying every banknote as either authentic
10137
08:26:58,280 --> 08:27:00,280
or counterfeit.
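Because every scikit-learn classifier shares the same fit/predict interface, swapping models really is a one-line change. A minimal sketch with an invented, tiny two-feature dataset (the real banknotes data has four features):

```python
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Tiny invented dataset: two well-separated clusters.
X_training = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_training = ["Counterfeit", "Counterfeit", "Authentic", "Authentic"]

# Each model is trained and queried through the same interface.
results = {}
for model in (Perceptron(), SVC(), KNeighborsClassifier(n_neighbors=1)):
    model.fit(X_training, y_training)
    results[type(model).__name__] = model.predict([[5, 5]])[0]
    print(type(model).__name__, "->", results[type(model).__name__])
```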
10138
08:27:00,280 --> 08:27:02,400
And we could run these experiments multiple times,
10139
08:27:02,400 --> 08:27:05,120
because I'm randomly reorganizing the data every time.
10140
08:27:05,120 --> 08:27:07,640
We're technically training these on slightly different data sets.
10141
08:27:07,640 --> 08:27:10,440
And so you might want to run multiple experiments to really see
10142
08:27:10,440 --> 08:27:12,200
how well they're actually going to perform.
10143
08:27:12,200 --> 08:27:14,160
But in short, they all perform very well.
10144
08:27:14,160 --> 08:27:16,560
And while some of them perform slightly better than others here,
10145
08:27:16,560 --> 08:27:19,160
that might not always be the case for every data set.
10146
08:27:19,160 --> 08:27:22,180
But you can begin to test now by very quickly putting together
10147
08:27:22,180 --> 08:27:24,720
these machine learning models using scikit-learn
10148
08:27:24,720 --> 08:27:27,120
to be able to train on some training set and then
10149
08:27:27,120 --> 08:27:29,920
test on some testing set as well.
10150
08:27:29,920 --> 08:27:33,040
And this splitting up into training groups and testing groups and testing
10151
08:27:33,040 --> 08:27:37,000
happens so often that scikit-learn has functions built in to do it.
10152
08:27:37,000 --> 08:27:39,040
I did it all by hand just now.
10153
08:27:39,040 --> 08:27:41,520
But if we take a look at banknote1.py, we
10154
08:27:41,520 --> 08:27:45,920
take advantage of some other features that exist in scikit-learn,
10155
08:27:45,920 --> 08:27:48,320
where we can really simplify a lot of our logic,
10156
08:27:48,320 --> 08:27:52,440
that there is a function built into scikit-learn called train_test_split,
10157
08:27:52,440 --> 08:27:56,080
which will automatically split data into a training group and a testing group.
10158
08:27:56,080 --> 08:27:59,680
I just have to say what proportion should be in the testing group, something
10159
08:27:59,680 --> 08:28:02,920
like 0.5, half the data inside the testing group.
10160
08:28:02,920 --> 08:28:05,320
Then I can fit the model on the training data,
10161
08:28:05,320 --> 08:28:08,800
make the predictions on the testing data, and then just count up.
10162
08:28:08,800 --> 08:28:11,760
And scikit-learn has some nice methods for just counting up
10163
08:28:11,760 --> 08:28:15,040
how many times our testing data match the predictions,
10164
08:28:15,040 --> 08:28:18,280
how many times our testing data didn't match the predictions.
10165
08:28:18,280 --> 08:28:21,600
So very quickly, you can write programs with not all that many lines of code.
10166
08:28:21,600 --> 08:28:25,480
It's maybe like 40 lines of code to get through all of these predictions.
10167
08:28:25,480 --> 08:28:28,440
And then as a result, see how well we're able to do.
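A sketch of the train_test_split version, again with invented toy data; the stratify and random_state arguments are added here only to make this tiny example reproducible and are assumptions, not necessarily what the course file does.

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Invented toy data standing in for the banknotes evidence and labels.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["Counterfeit"] * 3 + ["Authentic"] * 3

# Put half the data in the testing group.
X_training, X_testing, y_training, y_testing = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

model = Perceptron()
model.fit(X_training, y_training)
predictions = model.predict(X_testing)

# NumPy comparisons count matches and mismatches directly.
correct = (predictions == y_testing).sum()
incorrect = (predictions != y_testing).sum()
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")
```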
10168
08:28:28,440 --> 08:28:31,520
So these types of libraries can allow us, without really knowing
10169
08:28:31,520 --> 08:28:33,920
the implementation details of these algorithms,
10170
08:28:33,920 --> 08:28:36,920
to be able to use the algorithms in a very practical way
10171
08:28:36,920 --> 08:28:40,120
to be able to solve these types of problems.
10172
08:28:40,120 --> 08:28:42,880
So that then was supervised learning, this task
10173
08:28:42,880 --> 08:28:45,960
of given a whole set of data, some input output pairs,
10174
08:28:45,960 --> 08:28:50,040
we would like to learn some function that maps those inputs to those outputs.
10175
08:28:50,040 --> 08:28:52,560
But turns out there are other forms of learning as well.
10176
08:28:52,560 --> 08:28:55,840
And another popular type of machine learning, especially nowadays,
10177
08:28:55,840 --> 08:28:58,080
is known as reinforcement learning.
10178
08:28:58,080 --> 08:29:00,920
And the idea of reinforcement learning is rather than just
10179
08:29:00,920 --> 08:29:04,160
being given a whole data set at the beginning of input output pairs,
10180
08:29:04,160 --> 08:29:07,600
reinforcement learning is all about learning from experience.
10181
08:29:07,600 --> 08:29:10,320
In reinforcement learning, our agent, whether it's
10182
08:29:10,320 --> 08:29:13,000
like a physical robot that's trying to make actions in the world
10183
08:29:13,000 --> 08:29:16,680
or just some virtual agent that is a program running somewhere,
10184
08:29:16,680 --> 08:29:20,480
our agent is going to be given a set of rewards or punishments
10185
08:29:20,480 --> 08:29:22,040
in the form of numerical values.
10186
08:29:22,040 --> 08:29:24,360
But you can think of them as reward or punishment.
10187
08:29:24,360 --> 08:29:28,440
And based on that, it learns what actions to take in the future,
10188
08:29:28,440 --> 08:29:32,400
that our agent, our AI, will be put in some sort of environment.
10189
08:29:32,400 --> 08:29:33,640
It will make some actions.
10190
08:29:33,640 --> 08:29:36,280
And based on the actions that it makes, it learns something.
10191
08:29:36,280 --> 08:29:38,480
It either gets a reward when it does something well,
10192
08:29:38,480 --> 08:29:40,640
it gets a punishment when it does something poorly,
10193
08:29:40,640 --> 08:29:44,640
and it learns what to do or what not to do in the future
10194
08:29:44,640 --> 08:29:47,880
based on those individual experiences.
10195
08:29:47,880 --> 08:29:50,400
And so what this will often look like is it will often
10196
08:29:50,400 --> 08:29:54,000
start with some agent, some AI, which might, again, be a physical robot,
10197
08:29:54,000 --> 08:29:56,200
if you're imagining a physical robot moving around,
10198
08:29:56,200 --> 08:29:58,120
but it can also just be a program.
10199
08:29:58,120 --> 08:30:01,160
And our agent is situated in their environment,
10200
08:30:01,160 --> 08:30:04,040
where the environment is where they're going to make their actions,
10201
08:30:04,040 --> 08:30:06,760
and it's what's going to give them rewards or punishments
10202
08:30:06,760 --> 08:30:09,080
for the various actions that they take.
10203
08:30:09,080 --> 08:30:12,160
So for example, the environment is going to start off
10204
08:30:12,160 --> 08:30:14,920
by putting our agent inside of a state.
10205
08:30:14,920 --> 08:30:17,280
Our agent has some state that, in a game,
10206
08:30:17,280 --> 08:30:19,840
might be the state of the game that the agent is playing.
10207
08:30:19,840 --> 08:30:21,800
In a world that the agent is exploring might
10208
08:30:21,800 --> 08:30:24,760
be some position inside of a grid representing the world
10209
08:30:24,760 --> 08:30:25,720
that they're exploring.
10210
08:30:25,720 --> 08:30:28,000
But the agent is in some sort of state.
10211
08:30:28,000 --> 08:30:32,080
And in that state, the agent needs to choose to take an action.
10212
08:30:32,080 --> 08:30:34,600
The agent likely has multiple actions they can choose from,
10213
08:30:34,600 --> 08:30:36,240
but they pick an action.
10214
08:30:36,240 --> 08:30:39,240
So they take an action in a particular state.
10215
08:30:39,240 --> 08:30:42,080
And as a result of that, the agent will generally
10216
08:30:42,080 --> 08:30:44,960
get two things in response as we model them.
10217
08:30:44,960 --> 08:30:47,680
The agent gets a new state that they find themselves in.
10218
08:30:47,680 --> 08:30:50,040
After being in this state, taking one action,
10219
08:30:50,040 --> 08:30:52,120
they end up in some other state.
10220
08:30:52,120 --> 08:30:55,300
And they're also given some sort of numerical reward,
10221
08:30:55,300 --> 08:30:58,560
positive meaning reward, meaning it was a good thing,
10222
08:30:58,560 --> 08:31:00,920
negative generally meaning they did something bad,
10223
08:31:00,920 --> 08:31:03,200
they received some sort of punishment.
10224
08:31:03,200 --> 08:31:06,100
And that is all the information the agent has.
10225
08:31:06,100 --> 08:31:08,160
It's told what state it's in.
10226
08:31:08,160 --> 08:31:10,040
It makes some sort of action.
10227
08:31:10,040 --> 08:31:12,040
And based on that, it ends up in another state.
10228
08:31:12,040 --> 08:31:14,440
And it ends up getting some particular reward.
10229
08:31:14,440 --> 08:31:17,440
And it needs to learn, based on that information, what actions
10230
08:31:17,440 --> 08:31:19,640
to begin to take in the future.
10231
08:31:19,640 --> 08:31:21,640
And so you could imagine generalizing this to a lot
10232
08:31:21,640 --> 08:31:22,880
of different situations.
10233
08:31:22,880 --> 08:31:26,240
This is oftentimes how you train those robots you may have seen that
10234
08:31:26,240 --> 08:31:29,040
are now able to walk around the way humans do.
10235
08:31:29,040 --> 08:31:32,400
It would be quite difficult to program the robot in exactly the right way
10236
08:31:32,400 --> 08:31:34,240
to get it to walk the way humans do.
10237
08:31:34,240 --> 08:31:36,840
You could instead train it through reinforcement learning,
10238
08:31:36,840 --> 08:31:40,320
give it some sort of numerical reward every time it does something good,
10239
08:31:40,320 --> 08:31:43,640
like take steps forward, and punish it every time it does something
10240
08:31:43,640 --> 08:31:46,520
bad, like fall over, and then let the AI just
10241
08:31:46,520 --> 08:31:48,880
learn based on that sequence of rewards, based
10242
08:31:48,880 --> 08:31:51,260
on trying to take various different actions.
10243
08:31:51,260 --> 08:31:54,480
You can begin to have the agent learn what to do in the future
10244
08:31:54,480 --> 08:31:56,120
and what not to do.
10245
08:31:56,120 --> 08:31:59,480
So in order to begin to formalize this, the first thing we need to do
10246
08:31:59,480 --> 08:32:03,620
is formalize this notion of what we mean about states and actions and rewards,
10247
08:32:03,620 --> 08:32:05,720
like what does this world look like?
10248
08:32:05,720 --> 08:32:07,920
And oftentimes, we'll formulate this world
10249
08:32:07,920 --> 08:32:11,720
as what's known as a Markov decision process, similar in spirit
10250
08:32:11,720 --> 08:32:14,360
to Markov chains, which you might recall from before.
10251
08:32:14,360 --> 08:32:16,940
But a Markov decision process is a model that we
10252
08:32:16,940 --> 08:32:19,700
can use for decision making, for an agent trying
10253
08:32:19,700 --> 08:32:21,500
to make decisions in its environment.
10254
08:32:21,500 --> 08:32:25,200
And it's a model that allows us to represent the various different states
10255
08:32:25,200 --> 08:32:28,840
that an agent can be in, the various different actions that they can take,
10256
08:32:28,840 --> 08:32:35,120
and also what the reward is for taking one action as opposed to another action.
10257
08:32:35,120 --> 08:32:37,520
So what then does it actually look like?
10258
08:32:37,520 --> 08:32:40,580
Well, if you recall a Markov chain from before,
10259
08:32:40,580 --> 08:32:43,200
a Markov chain looked a little something like this,
10260
08:32:43,200 --> 08:32:45,760
where we had a whole bunch of these individual states,
10261
08:32:45,760 --> 08:32:48,760
and each state immediately transitioned to another state
10262
08:32:48,760 --> 08:32:50,840
based on some probability distribution.
10263
08:32:50,840 --> 08:32:54,000
We saw this in the context of the weather before, where if it was sunny,
10264
08:32:54,000 --> 08:32:56,720
we said with some probability, it'll be sunny the next day.
10265
08:32:56,720 --> 08:32:59,840
With some other probability, it'll be rainy, for example.
10266
08:32:59,840 --> 08:33:02,320
But we could also imagine generalizing this.
10267
08:33:02,320 --> 08:33:04,000
It's not just sun and rain anymore.
10268
08:33:04,000 --> 08:33:07,160
We just have these states, where one state leads to another state
10269
08:33:07,160 --> 08:33:09,760
according to some probability distribution.
10270
08:33:09,760 --> 08:33:12,280
But in this original model, there was no agent
10271
08:33:12,280 --> 08:33:14,440
that had any control over this process.
10272
08:33:14,440 --> 08:33:17,720
It was just entirely probability based, where with some probability,
10273
08:33:17,720 --> 08:33:18,960
we moved to this next state.
10274
08:33:18,960 --> 08:33:22,400
But maybe it's going to be some other state with some other probability.
10275
08:33:22,400 --> 08:33:26,280
What we'll now have is the ability for the agent in this state
10276
08:33:26,280 --> 08:33:29,480
to choose from a set of actions, where maybe instead of just one path
10277
08:33:29,480 --> 08:33:33,240
forward, they have three different choices of actions that each lead
10278
08:33:33,240 --> 08:33:34,120
down different paths.
10279
08:33:34,120 --> 08:33:36,480
And even this is a bit of an oversimplification,
10280
08:33:36,480 --> 08:33:39,240
because in each of these states, you might imagine more branching points
10281
08:33:39,240 --> 08:33:42,040
where there are more decisions that can be taken as well.
10282
08:33:42,040 --> 08:33:46,360
So we've extended the Markov chain to say that from a state,
10283
08:33:46,360 --> 08:33:48,360
you now have available action choices.
10284
08:33:48,360 --> 08:33:50,760
And each of those actions might be associated
10285
08:33:50,760 --> 08:33:55,880
with its own probability distribution of going to various different states.
10286
08:33:55,880 --> 08:33:58,840
Then in addition, we'll add another extension,
10287
08:33:58,840 --> 08:34:01,840
where any time you move from a state, taking an action,
10288
08:34:01,840 --> 08:34:07,000
going into this other state, we can associate a reward with that outcome,
10289
08:34:07,000 --> 08:34:10,120
saying either r is positive, meaning some positive reward,
10290
08:34:10,120 --> 08:34:13,320
or r is negative, meaning there was some sort of punishment.
10291
08:34:13,320 --> 08:34:16,440
And this then is what we'll consider to be a Markov decision process.
10292
08:34:16,440 --> 08:34:18,960
That a Markov decision process has some initial set
10293
08:34:18,960 --> 08:34:21,600
of states, states in the world that we can be in.
10294
08:34:21,600 --> 08:34:24,560
We have some set of actions that, given a state,
10295
08:34:24,560 --> 08:34:28,040
I can say, what are the actions that are available to me in that state,
10296
08:34:28,040 --> 08:34:30,560
an action that I can choose from?
10297
08:34:30,560 --> 08:34:32,480
Then we have some transition model.
10298
08:34:32,480 --> 08:34:36,160
The transition model before just said that, given my current state,
10299
08:34:36,160 --> 08:34:39,880
what is the probability that I end up in that next state or this other state?
10300
08:34:39,880 --> 08:34:44,080
The transition model now has effectively two things we're conditioning on.
10301
08:34:44,080 --> 08:34:48,080
We're saying, given that I'm in this state and that I take this action,
10302
08:34:48,080 --> 08:34:52,280
what's the probability that I end up in this next state?
10303
08:34:52,280 --> 08:34:56,120
Now maybe we live in a very deterministic world in this Markov decision process.
10304
08:34:56,120 --> 08:34:58,120
We're given a state and given an action.
10305
08:34:58,120 --> 08:35:00,680
We know for sure what next state we'll end up in.
10306
08:35:00,680 --> 08:35:02,480
But maybe there's some randomness in the world
10307
08:35:02,480 --> 08:35:04,560
that when you're in a state and you take an action,
10308
08:35:04,560 --> 08:35:07,200
you might not always end up in the exact same state.
10309
08:35:07,200 --> 08:35:09,800
There might be some probabilities involved there as well.
10310
08:35:09,800 --> 08:35:14,200
The Markov decision process can handle both of those possible cases.
10311
08:35:14,200 --> 08:35:18,280
And then finally, we have a reward function, generally called r,
10312
08:35:18,280 --> 08:35:21,960
that in this case says, what is the reward for being in this state,
10313
08:35:21,960 --> 08:35:26,760
taking this action, and then getting to s prime, this next state?
10314
08:35:26,760 --> 08:35:27,960
So I'm in this original state.
10315
08:35:27,960 --> 08:35:28,720
I take this action.
10316
08:35:28,720 --> 08:35:29,880
I get to this next state.
10317
08:35:29,880 --> 08:35:32,600
What is the reward for doing that process?
10318
08:35:32,600 --> 08:35:35,360
And you can add up these rewards every time you take an action
10319
08:35:35,360 --> 08:35:38,080
to get the total amount of rewards that an agent might
10320
08:35:38,080 --> 08:35:41,440
get from interacting in a particular environment
10321
08:35:41,440 --> 08:35:44,080
modeled using this Markov decision process.
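One way to represent such a Markov decision process in code is with plain dictionaries for the transition model and the reward function. The states, actions, and reward numbers below are invented for illustration; they are not the lecture's grid world.

```python
# Transition model: maps (state, action) to a probability
# distribution over next states, i.e. P(s' | s, a).
transitions = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"overheated": 1.0},
}

# Reward function: maps (state, action, next_state) to a number;
# positive is a reward, negative a punishment.
rewards = {
    ("cool", "slow", "cool"): 1,
    ("cool", "fast", "cool"): 2,
    ("cool", "fast", "hot"): 2,
    ("hot", "slow", "cool"): 1,
    ("hot", "slow", "hot"): 1,
    ("hot", "fast", "overheated"): -10,
}

def actions(state):
    """The set of actions available in a given state."""
    return {a for (s, a) in transitions if s == state}
```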
10322
08:35:44,080 --> 08:35:46,760
So what might this actually look like in practice?
10323
08:35:46,760 --> 08:35:49,200
Well, let's just create a little simulated world here
10324
08:35:49,200 --> 08:35:52,160
where I have this agent that is just trying to navigate its way.
10325
08:35:52,160 --> 08:35:55,040
This agent is this yellow dot here, like a robot in the world,
10326
08:35:55,040 --> 08:35:57,160
trying to navigate its way through this grid.
10327
08:35:57,160 --> 08:36:00,160
And ultimately, it's trying to find its way to the goal.
10328
08:36:00,160 --> 08:36:04,280
And if it gets to the green goal, then it's going to get some sort of reward.
10329
08:36:04,280 --> 08:36:08,200
But then we might also have some red squares that are places
10330
08:36:08,200 --> 08:36:11,280
where you get some sort of punishment, some bad place where we don't want
10331
08:36:11,280 --> 08:36:12,400
the agent to go.
10332
08:36:12,400 --> 08:36:14,940
And if it ends up in the red square, then our agent
10333
08:36:14,940 --> 08:36:18,240
is going to get some sort of punishment as a result of that.
10334
08:36:18,240 --> 08:36:21,560
But the agent originally doesn't know all of these details.
10335
08:36:21,560 --> 08:36:24,280
It doesn't know that these states are associated with punishments.
10336
08:36:24,280 --> 08:36:27,120
But maybe it does know that this state is associated with a reward.
10337
08:36:27,120 --> 08:36:28,120
Maybe it doesn't.
10338
08:36:28,120 --> 08:36:30,680
But it just needs to sort of interact with the environment
10339
08:36:30,680 --> 08:36:33,960
to try and figure out what to do and what not to do.
10340
08:36:33,960 --> 08:36:35,800
So the first thing the agent might do is,
10341
08:36:35,800 --> 08:36:39,120
given no additional information, if it doesn't know what the punishments are,
10342
08:36:39,120 --> 08:36:43,080
it doesn't know where the rewards are, it just might try and take an action.
10343
08:36:43,080 --> 08:36:45,640
And it takes an action and ends up realizing
10344
08:36:45,640 --> 08:36:47,560
that it got some sort of punishment.
10345
08:36:47,560 --> 08:36:49,760
And so what does it learn from that experience?
10346
08:36:49,760 --> 08:36:53,480
Well, it might learn that when you're in this state in the future,
10347
08:36:53,480 --> 08:36:57,200
don't take the action of moving to the right; that is a bad action to take.
10348
08:36:57,200 --> 08:36:59,840
That in the future, if you ever find yourself back in the state,
10349
08:36:59,840 --> 08:37:02,200
don't take this action of going to the right
10350
08:37:02,200 --> 08:37:05,280
when you're in this particular state, because that leads to punishment.
10351
08:37:05,280 --> 08:37:06,780
That might be the intuition at least.
10352
08:37:06,780 --> 08:37:08,560
And so you could try doing other actions.
10353
08:37:08,560 --> 08:37:11,160
You move up, all right, that didn't lead to any immediate rewards.
10354
08:37:11,160 --> 08:37:12,840
Maybe try something else.
10355
08:37:12,840 --> 08:37:14,680
Then maybe try something else.
10356
08:37:14,680 --> 08:37:17,160
And all right, now you found that you got another punishment.
10357
08:37:17,160 --> 08:37:18,840
And so you learn something from that experience.
10358
08:37:18,840 --> 08:37:20,800
So the next time you do this whole process,
10359
08:37:20,800 --> 08:37:22,960
you know that if you ever end up in this square,
10360
08:37:22,960 --> 08:37:26,040
you shouldn't take the down action, because being in this state
10361
08:37:26,040 --> 08:37:30,800
and taking that action ultimately leads to some sort of punishment,
10362
08:37:30,800 --> 08:37:33,040
a negative reward, in other words.
10363
08:37:33,040 --> 08:37:34,080
And this process repeats.
10364
08:37:34,080 --> 08:37:37,200
You might imagine just letting our agent explore the world,
10365
08:37:37,200 --> 08:37:41,200
learning over time what states tend to correspond with poor actions,
10367
08:37:43,960 --> 08:37:47,240
until eventually, if it tries enough things randomly,
10368
08:37:47,240 --> 08:37:50,600
it might find that eventually when you get to this state,
10369
08:37:50,600 --> 08:37:53,120
if you take the up action in this state, you
10370
08:37:53,120 --> 08:37:56,120
actually get a reward from that.
10371
08:37:56,120 --> 08:37:59,800
And what it can learn from that is that if you're in this state,
10372
08:37:59,800 --> 08:38:02,560
you should take the up action, because that leads to a reward.
10373
08:38:02,560 --> 08:38:05,160
And over time, you can also learn that if you're in this state,
10374
08:38:05,160 --> 08:38:08,520
you should take the left action, because that leads to this state that also
10375
08:38:08,520 --> 08:38:10,280
lets you eventually get to the reward.
10376
08:38:10,280 --> 08:38:14,080
So you begin to learn over time not only which actions
10377
08:38:14,080 --> 08:38:18,360
are good in particular states, but also which actions are bad,
10378
08:38:18,360 --> 08:38:20,680
such that once you know some sequence of good actions that
10379
08:38:20,680 --> 08:38:24,960
leads you to some sort of reward, our agent can just follow those
10380
08:38:24,960 --> 08:38:27,680
instructions, follow the experience that it has learned.
10381
08:38:27,680 --> 08:38:30,240
We didn't tell the agent what the goal was.
10382
08:38:30,240 --> 08:38:32,800
We didn't tell the agent where the punishments were.
10383
08:38:32,800 --> 08:38:35,680
But the agent can begin to learn from this experience
10384
08:38:35,680 --> 08:38:40,720
and learn to begin to perform these sorts of tasks better in the future.
10385
08:38:40,720 --> 08:38:43,840
And so let's now try to formalize this idea, formalize the idea
10386
08:38:43,840 --> 08:38:47,440
that we would like to be able to learn in this state taking this action,
10387
08:38:47,440 --> 08:38:49,120
is that a good thing or a bad thing?
10388
08:38:49,120 --> 08:38:51,760
There are lots of different models for reinforcement learning.
10389
08:38:51,760 --> 08:38:53,600
We're just going to look at one of them today.
10390
08:38:53,600 --> 08:38:57,280
And the one that we're going to look at is a method known as Q-learning.
10391
08:38:57,280 --> 08:38:59,880
And what Q-learning is all about is about learning
10392
08:38:59,880 --> 08:39:05,440
a function Q(s, a), that takes inputs s and a, where s is a state
10393
08:39:05,440 --> 08:39:07,760
and a is an action that you take in that state.
10394
08:39:07,760 --> 08:39:12,280
And what this Q function is going to do is it is going to estimate the value.
10395
08:39:12,280 --> 08:39:18,880
How much reward will I get from taking this action in this state?
10396
08:39:18,880 --> 08:39:21,800
Originally, we don't know what this Q function should be.
10397
08:39:21,800 --> 08:39:24,800
But over time, based on experience, based on trying things out
10398
08:39:24,800 --> 08:39:28,160
and seeing what the result is, I would like to try and learn
10399
08:39:28,160 --> 08:39:32,680
what Q(s, a) is for any particular state and any particular action
10400
08:39:32,680 --> 08:39:34,680
that I might take in that state.
10401
08:39:34,680 --> 08:39:35,800
So what is the approach?
10402
08:39:35,800 --> 08:39:40,960
Well, the approach originally is we'll start with Q(s, a) equal to 0 for all
10403
08:39:40,960 --> 08:39:43,840
states s and for all actions a. That initially,
10404
08:39:43,840 --> 08:39:47,200
before I've ever started anything, before I've had any experiences,
10405
08:39:47,200 --> 08:39:50,760
I don't know the value of taking any action in any given state.
10406
08:39:50,760 --> 08:39:55,240
So I'm going to assume that the value is just 0 all across the board.
10407
08:39:55,240 --> 08:39:59,720
But then as I interact with the world, as I experience rewards or punishments,
10408
08:39:59,720 --> 08:40:03,400
or maybe I go to a cell where I don't get either reward or a punishment,
10409
08:40:03,400 --> 08:40:07,240
I want to somehow update my estimate of Q(s, a).
10410
08:40:07,240 --> 08:40:10,160
I want to continually update my estimate of Q(s, a)
10411
08:40:10,160 --> 08:40:13,680
based on the experiences and rewards and punishments that I've received,
10412
08:40:13,680 --> 08:40:17,160
such that in the future, my knowledge of what actions are good
10413
08:40:17,160 --> 08:40:19,160
in what states will be better.
10414
08:40:19,160 --> 08:40:22,240
So when we take an action and receive some sort of reward,
10415
08:40:22,240 --> 08:40:25,680
I want to estimate the new value of Q SA.
10416
08:40:25,680 --> 08:40:28,360
And I estimate that based on a couple of different things.
10417
08:40:28,360 --> 08:40:32,040
I estimate it based on the reward that I'm getting from taking this action
10418
08:40:32,040 --> 08:40:33,760
and getting into the next state.
10419
08:40:33,760 --> 08:40:37,520
But assuming the situation isn't over, assuming there are still
10420
08:40:37,520 --> 08:40:40,000
future actions that I might take as well,
10421
08:40:40,000 --> 08:40:44,640
I also need to take into account the expected future rewards.
10422
08:40:44,640 --> 08:40:47,520
That if you imagine an agent interacting with the environment,
10423
08:40:47,520 --> 08:40:49,960
then sometimes you'll take an action and get a reward,
10424
08:40:49,960 --> 08:40:52,920
but then you can keep taking more actions and get more rewards,
10425
08:40:52,920 --> 08:40:55,240
that these both are relevant, both the current reward
10426
08:40:55,240 --> 08:40:58,520
I'm getting from this current step and also my future reward.
10427
08:40:58,520 --> 08:41:01,160
And it might be the case that I'll want to take a step that
10428
08:41:01,160 --> 08:41:05,080
doesn't immediately lead to a reward, because later on down the line,
10429
08:41:05,080 --> 08:41:07,600
I know it will lead to more rewards as well.
10430
08:41:07,600 --> 08:41:10,480
So there's a balancing act between current rewards
10431
08:41:10,480 --> 08:41:13,400
that the agent experiences and future rewards
10432
08:41:13,400 --> 08:41:16,800
that the agent experiences as well.
10433
08:41:16,800 --> 08:41:19,560
And then we need to update QSA.
10434
08:41:19,560 --> 08:41:22,560
So we estimate the value of QSA based on the current reward
10435
08:41:22,560 --> 08:41:24,360
and the expected future rewards.
10436
08:41:24,360 --> 08:41:26,920
And then we need to update this Q function
10437
08:41:26,920 --> 08:41:29,480
to take into account this new estimate.
10438
08:41:29,480 --> 08:41:31,680
Now, as we go through this process,
10439
08:41:31,680 --> 08:41:35,120
we'll already have an estimate for what we think the value is.
10440
08:41:35,120 --> 08:41:37,120
Now we have a new estimate, and then somehow we
10441
08:41:37,120 --> 08:41:39,520
need to combine these two estimates together,
10442
08:41:39,520 --> 08:41:43,040
and we'll look at more formal ways that we can actually begin to do that.
10443
08:41:43,040 --> 08:41:45,720
So to actually show you what this formula looks like,
10444
08:41:45,720 --> 08:41:47,760
here is the approach we'll take with Q learning.
10445
08:41:47,760 --> 08:41:52,440
We're going to, again, start with Q of S and A being equal to 0 for all states.
10446
08:41:52,440 --> 08:41:59,760
And then every time we take an action A in state S and observe a reward R,
10447
08:41:59,760 --> 08:42:04,160
we're going to update our value, our estimate, for Q of SA.
10448
08:42:04,160 --> 08:42:06,720
And the idea is that we're going to figure out
10449
08:42:06,720 --> 08:42:12,120
what the new value estimate is minus what our existing value estimate is.
10450
08:42:12,120 --> 08:42:15,720
And so we have some preconceived notion for what the value is
10451
08:42:15,720 --> 08:42:17,400
for taking this action in this state.
10452
08:42:17,400 --> 08:42:21,400
Maybe our expectation is we currently think the value is 10.
10453
08:42:21,400 --> 08:42:24,480
But then we're going to estimate what we now think it's going to be.
10454
08:42:24,480 --> 08:42:27,200
Maybe the new value estimate is something like 20.
10455
08:42:27,200 --> 08:42:30,520
So there's a delta of 10 that our new value estimate
10456
08:42:30,520 --> 08:42:35,200
is 10 points higher than what our current value estimate happens to be.
10457
08:42:35,200 --> 08:42:37,120
And so we have a couple of options here.
10458
08:42:37,120 --> 08:42:40,020
We need to decide how much we want to adjust
10459
08:42:40,020 --> 08:42:42,800
our current expectation of what the value is
10460
08:42:42,800 --> 08:42:45,640
of taking this action in this particular state.
10461
08:42:45,640 --> 08:42:49,560
And what that difference is, how much we add or subtract
10462
08:42:49,560 --> 08:42:52,720
from our existing notion of how much do we expect the value to be,
10463
08:42:52,720 --> 08:42:56,680
is dependent on this parameter alpha, also called a learning rate.
10464
08:42:56,680 --> 08:43:01,200
And alpha represents, in effect, how much we value new information
10465
08:43:01,200 --> 08:43:04,680
compared to how much we value old information.
10466
08:43:04,680 --> 08:43:08,320
An alpha value of 1 means we really value new information.
10467
08:43:08,320 --> 08:43:10,520
But if we have a new estimate, then it doesn't
10468
08:43:10,520 --> 08:43:12,160
matter what our old estimate is.
10469
08:43:12,160 --> 08:43:14,080
We're only going to consider our new estimate
10470
08:43:14,080 --> 08:43:18,240
because we always just want to take into consideration our new information.
10471
08:43:18,240 --> 08:43:21,960
So the way that works is that if you imagine alpha being 1,
10472
08:43:21,960 --> 08:43:25,760
well, then we're taking the old value of QSA
10473
08:43:25,760 --> 08:43:29,840
and then adding 1 times the new value minus the old value.
10474
08:43:29,840 --> 08:43:31,800
And that just leaves us with the new value.
10475
08:43:31,800 --> 08:43:34,940
So when alpha is 1, all we take into consideration
10476
08:43:34,940 --> 08:43:37,600
is what our new estimate happens to be.
10477
08:43:37,600 --> 08:43:40,400
But over time, as we go through a lot of experiences,
10478
08:43:40,400 --> 08:43:42,520
we already have some existing information.
10479
08:43:42,520 --> 08:43:46,000
We might have tried taking this action nine times already.
10480
08:43:46,000 --> 08:43:48,000
And now we just tried it a 10th time.
10481
08:43:48,000 --> 08:43:51,160
And we don't only want to consider this 10th experience.
10482
08:43:51,160 --> 08:43:54,800
I also want to consider the fact that my prior nine experiences, those
10483
08:43:54,800 --> 08:43:55,640
were meaningful, too.
10484
08:43:55,640 --> 08:43:58,240
And that's data I don't necessarily want to lose.
10485
08:43:58,240 --> 08:44:01,080
And so this alpha controls that decision,
10486
08:44:01,080 --> 08:44:03,400
controls how important is the new information.
10487
08:44:03,400 --> 08:44:06,480
0 would mean ignore all the new information.
10488
08:44:06,480 --> 08:44:09,320
Just keep this Q value the same.
10489
08:44:09,320 --> 08:44:13,120
1 means replace the old information entirely with the new information.
10490
08:44:13,120 --> 08:44:17,920
And somewhere in between, keep some sort of balance between these two values.
10491
08:44:17,920 --> 08:44:21,000
We can put this equation a little bit more formally as well.
10492
08:44:21,000 --> 08:44:23,880
The old value estimate is our old estimate
10493
08:44:23,880 --> 08:44:27,600
for what the value is of taking this action in a particular state.
10494
08:44:27,600 --> 08:44:30,040
That's just Q of S and A.
10495
08:44:30,040 --> 08:44:33,120
So we have it once here, and we're going to add something to it.
10496
08:44:33,120 --> 08:44:35,580
We're going to add alpha times the new value estimate
10497
08:44:35,580 --> 08:44:37,680
minus the old value estimate.
10498
08:44:37,680 --> 08:44:42,280
But the old value estimate, we just look up by calling this Q function.
10499
08:44:42,280 --> 08:44:44,240
And what then is the new value estimate?
10500
08:44:44,240 --> 08:44:46,440
Based on this experience we have just taken,
10501
08:44:46,440 --> 08:44:48,800
what is our new estimate for the value of taking
10502
08:44:48,800 --> 08:44:51,480
this action in this particular state?
10503
08:44:51,480 --> 08:44:54,000
Well, it's going to be composed of two parts.
10504
08:44:54,000 --> 08:44:56,940
It's going to be composed of what reward did I just
10505
08:44:56,940 --> 08:45:00,000
get from taking this action in this state.
10506
08:45:00,000 --> 08:45:03,320
And then it's going to be, what can I expect my future rewards
10507
08:45:03,320 --> 08:45:05,600
to be from this point forward?
10508
08:45:05,600 --> 08:45:10,200
So it's going to be R, some reward I'm getting right now,
10509
08:45:10,200 --> 08:45:14,280
plus whatever I estimate I'm going to get in the future.
10510
08:45:14,280 --> 08:45:16,760
And how do I estimate what I'm going to get in the future?
10511
08:45:16,760 --> 08:45:19,960
Well, it's a bit of another call to this Q function.
10512
08:45:19,960 --> 08:45:23,940
It's going to be take the maximum across all possible actions
10513
08:45:23,940 --> 08:45:27,920
I could take next and say, all right, of all of these possible actions
10514
08:45:27,920 --> 08:45:31,480
I could take, which one is going to have the highest reward?
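Putting those pieces together, the update being described can be sketched in Python. This is a minimal illustration, not the lecture's actual code; the table q, the learning rate alpha, and the helper available_actions are assumed names.

```python
# A minimal sketch of the Q-learning update: Q(s, a) gets nudged toward
# (reward now) + (best estimated future reward), scaled by alpha.
# q, alpha, and available_actions are illustrative assumptions.
from collections import defaultdict

q = defaultdict(float)   # Q(s, a) starts at 0 for every state-action pair
alpha = 0.5              # learning rate: how much we value new information

def best_future_reward(state, available_actions):
    """Max of Q(state, a) over all actions we could take next; 0 if none remain."""
    actions = available_actions(state)
    if not actions:
        return 0
    return max(q[(state, a)] for a in actions)

def update(old_state, action, new_state, reward, available_actions):
    """Q(s, a) <- old estimate + alpha * (new estimate - old estimate)."""
    old = q[(old_state, action)]
    new_estimate = reward + best_future_reward(new_state, available_actions)
    q[(old_state, action)] = old + alpha * (new_estimate - old)
```

With alpha = 1 this would replace the old estimate entirely; with alpha = 0 it would ignore the new information, matching the discussion of the learning rate above.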
10515
08:45:31,480 --> 08:45:33,600
And so this then looks a little bit complicated.
10516
08:45:33,600 --> 08:45:35,400
This is going to be our notion for how we're
10517
08:45:35,400 --> 08:45:37,680
going to perform this kind of update.
10518
08:45:37,680 --> 08:45:41,680
I have some estimate, some old estimate, for what the value is
10519
08:45:41,680 --> 08:45:44,040
of taking this action in this state.
10520
08:45:44,040 --> 08:45:46,920
And I'm going to update it based on new information
10521
08:45:46,920 --> 08:45:48,680
that I experience some reward.
10522
08:45:48,680 --> 08:45:51,240
I predict what my future reward is going to be.
10523
08:45:51,240 --> 08:45:54,600
And using that I update what I estimate the reward will
10524
08:45:54,600 --> 08:45:57,880
be for taking this action in this particular state.
10525
08:45:57,880 --> 08:46:00,760
And there are other additions you might make to this algorithm as well.
10526
08:46:00,760 --> 08:46:03,200
Sometimes it might not be the case that future rewards
10527
08:46:03,200 --> 08:46:05,760
you want to weight equally with current rewards.
10528
08:46:05,760 --> 08:46:10,360
Maybe you want an agent that values reward now over reward later.
10529
08:46:10,360 --> 08:46:13,940
And so sometimes you can even add another term in here, some other parameter,
10530
08:46:13,940 --> 08:46:17,800
where you discount future rewards and say future rewards are not
10531
08:46:17,800 --> 08:46:19,840
as valuable as rewards immediately.
10532
08:46:19,840 --> 08:46:21,640
That getting reward in the current time step
10533
08:46:21,640 --> 08:46:24,520
is better than waiting a year and getting rewards later.
10534
08:46:24,520 --> 08:46:26,200
But that's something up to the programmer
10535
08:46:26,200 --> 08:46:29,240
to decide what that parameter ought to be.
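That discounting parameter is conventionally a factor between 0 and 1, usually called gamma; the lecture only calls it "another parameter," so the name here is an assumption.

```python
# Sketch of the discounted variant: a factor gamma (an assumed name) shrinks
# future rewards relative to the immediate reward.
gamma = 0.9  # gamma = 1 weights future rewards equally; gamma = 0 ignores them

def new_value_estimate(reward, best_future_reward):
    # current reward plus discounted estimate of future rewards
    return reward + gamma * best_future_reward
```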
10536
08:46:29,240 --> 08:46:32,060
But the big picture idea of this entire formula
10537
08:46:32,060 --> 08:46:35,600
is to say that every time we experience some new reward,
10538
08:46:35,600 --> 08:46:36,840
we take that into account.
10539
08:46:36,840 --> 08:46:40,760
We update our estimate of how good is this action.
10540
08:46:40,760 --> 08:46:44,040
And then in the future, we can make decisions based on that algorithm.
10541
08:46:44,040 --> 08:46:48,160
Once we have some good estimate for every state and for every action,
10542
08:46:48,160 --> 08:46:50,920
what the value is of taking that action, then we
10543
08:46:50,920 --> 08:46:54,920
can do something like implement a greedy decision making policy.
10544
08:46:54,920 --> 08:46:57,960
That if I am in a state and I want to know what action
10545
08:46:57,960 --> 08:47:00,160
should I take in that state, well, then I
10546
08:47:00,160 --> 08:47:05,600
consider for all of my possible actions, what is the value of QSA?
10547
08:47:05,600 --> 08:47:08,960
What is my estimated value of taking that action in that state?
10548
08:47:08,960 --> 08:47:12,920
And I will just pick the action that has the highest value
10549
08:47:12,920 --> 08:47:15,360
after I evaluate that expression.
10550
08:47:15,360 --> 08:47:17,560
So I pick the action that has the highest value.
10551
08:47:17,560 --> 08:47:19,960
And based on that, that tells me what action I should take.
10552
08:47:19,960 --> 08:47:24,880
At any given state that I'm in, I can just greedily say across all my actions,
10553
08:47:24,880 --> 08:47:27,960
this action gives me the highest expected value.
10554
08:47:27,960 --> 08:47:33,320
And so I'll go ahead and choose that action as the action that I take as well.
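The greedy decision-making policy just described can be sketched as follows, assuming the learned Q values live in a dictionary named q (an assumed name, not the lecture's code).

```python
# Greedy policy sketch: in a given state, evaluate Q(state, a) for every
# available action and pick the highest-valued one.
from collections import defaultdict

q = defaultdict(float)  # assumed table of learned Q-value estimates

def greedy_action(state, actions):
    """Choose the action with the highest estimated value in this state."""
    return max(actions, key=lambda a: q[(state, a)])
```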
10555
08:47:33,320 --> 08:47:36,160
But there is a downside to this kind of approach.
10556
08:47:36,160 --> 08:47:38,760
And that downside comes up in a situation like this,
10557
08:47:38,760 --> 08:47:44,080
where we know that there is some solution that gets me to the reward.
10558
08:47:44,080 --> 08:47:46,400
And our agent has been able to figure that out.
10559
08:47:46,400 --> 08:47:49,920
But it might not necessarily be the best way or the fastest way.
10560
08:47:49,920 --> 08:47:52,640
If the agent is allowed to explore a little bit more,
10561
08:47:52,640 --> 08:47:55,160
it might find that it can get the reward faster
10562
08:47:55,160 --> 08:47:59,600
by taking some other route instead, by going through this particular path
10563
08:47:59,600 --> 08:48:04,240
that is a faster way to get to that ultimate goal.
10564
08:48:04,240 --> 08:48:07,640
And maybe we would like for the agent to be able to figure that out as well.
10565
08:48:07,640 --> 08:48:11,560
But if the agent always takes the actions that it knows to be best,
10566
08:48:11,560 --> 08:48:13,840
well, when it gets to this particular square,
10567
08:48:13,840 --> 08:48:17,680
it doesn't know that this is a good action because it's never really tried it.
10568
08:48:17,680 --> 08:48:21,840
But it knows that going down eventually leads its way to this reward.
10569
08:48:21,840 --> 08:48:25,160
So it might learn in the future that it should just always take this route
10570
08:48:25,160 --> 08:48:29,760
and it's never going to explore and go along that route instead.
10571
08:48:29,760 --> 08:48:32,080
So in reinforcement learning, there is this tension
10572
08:48:32,080 --> 08:48:35,360
between exploration and exploitation.
10573
08:48:35,360 --> 08:48:40,240
And exploitation generally refers to using knowledge that the AI already has.
10574
08:48:40,240 --> 08:48:43,520
The AI already knows that this is a move that leads to reward.
10575
08:48:43,520 --> 08:48:45,400
So we'll go ahead and use that move.
10576
08:48:45,400 --> 08:48:49,280
And exploration is all about exploring other actions
10577
08:48:49,280 --> 08:48:51,720
that we may not have explored as thoroughly before
10578
08:48:51,720 --> 08:48:54,920
because maybe one of these actions, even if I don't know anything about it,
10579
08:48:54,920 --> 08:49:00,200
might lead to better rewards faster or to more rewards in the future.
10580
08:49:00,200 --> 08:49:04,440
And so an agent that only ever exploits information and never explores
10581
08:49:04,440 --> 08:49:07,680
might be able to get reward, but it might not maximize its rewards
10582
08:49:07,680 --> 08:49:10,800
because it doesn't know what other possibilities are out there,
10583
08:49:10,800 --> 08:49:15,840
possibilities that we only know about by taking advantage of exploration.
10584
08:49:15,840 --> 08:49:17,640
And so how can we try and address this?
10585
08:49:17,640 --> 08:49:21,480
Well, one possible solution is known as the Epsilon greedy algorithm,
10586
08:49:21,480 --> 08:49:26,000
where we set Epsilon equal to how often we want to just make a random move,
10587
08:49:26,000 --> 08:49:29,600
where occasionally we will just make a random move in order to say,
10588
08:49:29,600 --> 08:49:33,200
let's try to explore and see what happens.
10589
08:49:33,200 --> 08:49:38,200
And then the logic of the algorithm will be with probability 1 minus Epsilon,
10590
08:49:38,200 --> 08:49:40,760
choose the estimated best move.
10591
08:49:40,760 --> 08:49:43,600
In a greedy case, we'd always choose the best move.
10592
08:49:43,600 --> 08:49:46,960
But in Epsilon greedy, we're most of the time
10593
08:49:46,960 --> 08:49:50,480
going to choose the estimated best move.
10594
08:49:50,480 --> 08:49:53,040
But sometimes with probability Epsilon, we're
10595
08:49:53,040 --> 08:49:56,120
going to choose a random move instead.
10596
08:49:56,120 --> 08:49:58,760
So every time we're faced with the ability to take an action,
10597
08:49:58,760 --> 08:50:00,880
sometimes we're going to choose the best move.
10598
08:50:00,880 --> 08:50:03,840
Sometimes we're just going to choose a random move.
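The epsilon-greedy rule just described can be sketched like this; the q table is an assumed dictionary of learned estimates, carried over from the earlier discussion.

```python
# Epsilon-greedy sketch: with probability epsilon make a random (exploratory)
# move; otherwise exploit the estimated best move. q is an assumed Q table.
import random
from collections import defaultdict

q = defaultdict(float)  # assumed Q-value estimates, (state, action) -> value
epsilon = 0.1           # probability of choosing a random move

def choose_action(state, actions, eps=epsilon):
    """With probability eps explore; with probability 1 - eps exploit."""
    if random.random() < eps:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: q[(state, a)])       # exploit
```

Decreasing eps over time, as suggested below, shifts the agent from exploration early on toward exploitation once it is more confident in its estimates.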
10599
08:50:03,840 --> 08:50:07,520
So this type of algorithm can be quite powerful in a reinforcement learning
10600
08:50:07,520 --> 08:50:11,480
context by not always just choosing the best possible move right now,
10601
08:50:11,480 --> 08:50:14,480
but sometimes, especially early on, allowing yourself
10602
08:50:14,480 --> 08:50:18,160
to make random moves that allow you to explore various different possible
10603
08:50:18,160 --> 08:50:20,920
states and actions more, and maybe over time,
10604
08:50:20,920 --> 08:50:23,280
you might decrease your value of Epsilon.
10605
08:50:23,280 --> 08:50:25,240
More and more often, choosing the best move
10606
08:50:25,240 --> 08:50:27,440
after you're more confident that you've explored
10607
08:50:27,440 --> 08:50:30,640
what all of the possibilities actually are.
10608
08:50:30,640 --> 08:50:32,160
So we can put this into practice.
10609
08:50:32,160 --> 08:50:34,760
And one very common application of reinforcement learning
10610
08:50:34,760 --> 08:50:38,320
is in game playing, that if you want to teach an agent how to play a game,
10611
08:50:38,320 --> 08:50:41,200
you just let the agent play the game a whole bunch.
10612
08:50:41,200 --> 08:50:44,160
And then the reward signal happens at the end of the game.
10613
08:50:44,160 --> 08:50:47,360
When the game is over, if our AI won the game,
10614
08:50:47,360 --> 08:50:49,640
it gets a reward of like 1, for example.
10615
08:50:49,640 --> 08:50:53,040
And if it lost the game, it gets a reward of negative 1.
10616
08:50:53,040 --> 08:50:56,080
And from that, it begins to learn what actions are good
10617
08:50:56,080 --> 08:50:57,080
and what actions are bad.
10618
08:50:57,080 --> 08:50:59,560
You don't have to tell the AI what's good and what's bad,
10619
08:50:59,560 --> 08:51:01,840
but the AI figures it out based on that reward.
10620
08:51:01,840 --> 08:51:04,960
Winning the game is some signal, losing the game is some signal,
10621
08:51:04,960 --> 08:51:07,240
and based on all of that, it begins to figure out
10622
08:51:07,240 --> 08:51:09,960
what decisions it should actually make.
10623
08:51:09,960 --> 08:51:13,040
So one very simple game, which you may have played before, is a game called
10624
08:51:13,040 --> 08:51:13,800
Nim.
10625
08:51:13,800 --> 08:51:16,360
And in the game of Nim, you've got a whole bunch of objects
10626
08:51:16,360 --> 08:51:18,520
in a whole bunch of different piles, where here I've
10627
08:51:18,520 --> 08:51:20,880
represented each pile as an individual row.
10628
08:51:20,880 --> 08:51:22,720
So you've got one object in the first pile,
10629
08:51:22,720 --> 08:51:26,280
three in the second pile, five in the third pile, seven in the fourth pile.
10630
08:51:26,280 --> 08:51:28,360
And the game of Nim is a two player game
10631
08:51:28,360 --> 08:51:31,880
where players take turns removing objects from piles.
10632
08:51:31,880 --> 08:51:34,160
And the rule is that on any given turn, you
10633
08:51:34,160 --> 08:51:39,120
are allowed to remove as many objects as you want from any one of these piles,
10634
08:51:39,120 --> 08:51:40,240
any one of these rows.
10635
08:51:40,240 --> 08:51:42,160
You have to remove at least one object, but you
10636
08:51:42,160 --> 08:51:46,800
remove as many as you want from exactly one of the piles.
10637
08:51:46,800 --> 08:51:50,720
And whoever takes the last object loses.
10638
08:51:50,720 --> 08:51:54,600
So player one might remove four from this pile here.
10639
08:51:54,600 --> 08:51:57,640
Player two might remove four from this pile here.
10640
08:51:57,640 --> 08:52:00,520
So now we've got four piles left, one, three, one, and three.
10641
08:52:00,520 --> 08:52:03,960
Player one might remove the entirety of the second pile.
10642
08:52:03,960 --> 08:52:09,840
Player two, if they're being strategic, might remove two from the third pile.
10643
08:52:09,840 --> 08:52:13,080
Now we've got three piles left, each with one object left.
10644
08:52:13,080 --> 08:52:15,360
Player one might remove one from one pile.
10645
08:52:15,360 --> 08:52:17,720
Player two removes one from the other pile.
10646
08:52:17,720 --> 08:52:22,120
And now player one is left with choosing this one object from the last pile,
10647
08:52:22,120 --> 08:52:24,640
at which point player one loses the game.
10648
08:52:24,640 --> 08:52:25,920
So fairly simple game.
10649
08:52:25,920 --> 08:52:28,960
Piles of objects, any turn you choose how many objects
10650
08:52:28,960 --> 08:52:33,240
to remove from a pile, whoever removes the last object loses.
10651
08:52:33,240 --> 08:52:36,980
And this is the type of game you could encode into an AI fairly easily,
10652
08:52:36,980 --> 08:52:39,480
because the states are really just four numbers.
10653
08:52:39,480 --> 08:52:43,080
Every state is just how many objects in each of the four piles.
10654
08:52:43,080 --> 08:52:45,440
And the actions are things like, how many
10655
08:52:45,440 --> 08:52:49,040
am I going to remove from each one of these individual piles?
10656
08:52:49,040 --> 08:52:51,440
And the reward happens at the end, that if you
10657
08:52:51,440 --> 08:52:53,920
were the player that had to remove the last object,
10658
08:52:53,920 --> 08:52:55,920
then you get some sort of punishment.
10659
08:52:55,920 --> 08:52:57,760
But if you were not, and the other player
10660
08:52:57,760 --> 08:53:01,760
had to remove the last object, well, then you get some sort of reward.
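One possible encoding of those states and actions, shown for illustration; the names and representation here are assumptions, not necessarily the lecture's implementation.

```python
# Nim sketch: a state is a tuple of pile sizes; an action is a pair
# (pile index, number of objects to remove). Names are assumptions.
initial_state = (1, 3, 5, 7)

def available_actions(piles):
    """All (pile, count) pairs: take at least 1 object from exactly one pile."""
    return [(i, n) for i, pile in enumerate(piles)
                   for n in range(1, pile + 1)]

def apply_action(piles, action):
    """Return the new state after removing `count` objects from one pile."""
    i, count = action
    return tuple(p - count if j == i else p for j, p in enumerate(piles))
```

Because a state is just a tuple of numbers, it can be used directly as a key into a Q table like the one sketched earlier.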
10661
08:53:01,760 --> 08:53:04,600
So we could actually try and show a demonstration of this,
10662
08:53:04,600 --> 08:53:08,720
that I've implemented an AI to play the game of Nim.
10663
08:53:08,720 --> 08:53:11,640
All right, so here, what we're going to do is create an AI
10664
08:53:11,640 --> 08:53:15,160
as a result of training the AI on some number of games,
10665
08:53:15,160 --> 08:53:18,360
that the AI is going to play against itself, where the idea is the AI will
10666
08:53:18,360 --> 08:53:22,000
play games against itself, learn from each of those experiences,
10667
08:53:22,000 --> 08:53:23,640
and learn what to do in the future.
10668
08:53:23,640 --> 08:53:26,280
And then I, the human, will play against the AI.
10669
08:53:26,280 --> 08:53:28,320
So initially, we'll say train zero times,
10670
08:53:28,320 --> 08:53:32,120
meaning we're not going to let the AI play any practice games against itself
10671
08:53:32,120 --> 08:53:34,040
in order to learn from its experiences.
10672
08:53:34,040 --> 08:53:36,840
We're just going to see how well it plays.
10673
08:53:36,840 --> 08:53:38,400
And it looks like there are four piles.
10674
08:53:38,400 --> 08:53:41,520
I can choose how many I remove from any one of the piles.
10675
08:53:41,520 --> 08:53:46,920
So maybe from pile three, I will remove five objects, for example.
10676
08:53:46,920 --> 08:53:50,280
So now, the AI chose to take one item from pile zero.
10677
08:53:50,280 --> 08:53:53,000
So I'm left with these piles now, for example.
10678
08:53:53,000 --> 08:53:55,440
And so here, I could choose maybe to say, I
10679
08:53:55,440 --> 08:54:00,200
would like to remove from pile two, I'll remove all five of them,
10680
08:54:00,200 --> 08:54:01,720
for example.
10681
08:54:01,720 --> 08:54:04,240
And so the AI chose to take two away from pile one.
10682
08:54:04,240 --> 08:54:08,360
Now I'm left with one pile that has one object, one pile that has two objects.
10683
08:54:08,360 --> 08:54:11,880
So from pile three, I will remove two objects.
10684
08:54:11,880 --> 08:54:15,240
And now I've left the AI with no choice but to take that last one.
10685
08:54:15,240 --> 08:54:17,680
And so the game is over, and I was able to win.
10686
08:54:17,680 --> 08:54:20,120
But I did so because the AI was really just playing randomly.
10687
08:54:20,120 --> 08:54:23,040
It didn't have any prior experience that it was using in order
10688
08:54:23,040 --> 08:54:24,840
to make these sorts of judgments.
10689
08:54:24,840 --> 08:54:29,120
Now let me let the AI train itself on 10,000 games.
10690
08:54:29,120 --> 08:54:32,920
I'm going to let the AI play 10,000 games of Nim against itself.
10691
08:54:32,920 --> 08:54:36,120
Every time it wins or loses, it's going to learn from that experience
10692
08:54:36,120 --> 08:54:39,760
and learn in the future what to do and what not to do.
10693
08:54:39,760 --> 08:54:42,560
So here then, I'll go ahead and run this again.
10694
08:54:42,560 --> 08:54:45,720
And now you see the AI running through a whole bunch of training games,
10695
08:54:45,720 --> 08:54:47,680
10,000 training games against itself.
10696
08:54:47,680 --> 08:54:50,440
And now it's going to let me make these sorts of decisions.
10697
08:54:50,440 --> 08:54:52,560
So now I'm going to play against the AI.
10698
08:54:52,560 --> 08:54:55,880
Maybe I'll remove one from pile three.
10699
08:54:55,880 --> 08:54:59,520
And the AI took everything from pile three, so I'm left with three piles.
10700
08:54:59,520 --> 08:55:04,640
I'll go ahead and from pile two maybe remove three items.
10701
08:55:04,640 --> 08:55:07,280
And the AI removes one item from pile zero.
10702
08:55:07,280 --> 08:55:10,480
I'm left with two piles, each of which has two items in it.
10703
08:55:10,480 --> 08:55:14,400
I'll remove one from pile one, I guess.
10704
08:55:14,400 --> 08:55:17,520
And the AI took two from pile two, leaving me with no choice
10705
08:55:17,520 --> 08:55:20,440
but to take one away from pile one.
10706
08:55:20,440 --> 08:55:24,600
So it seems like after playing 10,000 games of Nim against itself,
10707
08:55:24,600 --> 08:55:28,960
the AI has learned something about what states and what actions tend to be good
10708
08:55:28,960 --> 08:55:31,280
and has begun to learn some sort of pattern for how
10709
08:55:31,280 --> 08:55:33,720
to predict what actions are going to be good
10710
08:55:33,720 --> 08:55:37,200
and what actions are going to be bad in any given state.
10711
08:55:37,200 --> 08:55:39,880
So reinforcement learning can be a very powerful technique
10712
08:55:39,880 --> 08:55:42,480
for achieving these sorts of game-playing agents, agents
10713
08:55:42,480 --> 08:55:45,960
that are able to play a game well just by learning from experience,
10714
08:55:45,960 --> 08:55:47,880
whether that's playing against other people
10715
08:55:47,880 --> 08:55:51,880
or by playing against itself and learning from those experiences as well.
10716
08:55:51,880 --> 08:55:55,440
Now, Nim is a bit of an easy game to use reinforcement learning for
10717
08:55:55,440 --> 08:55:57,040
because there are so few states.
10718
08:55:57,040 --> 08:55:59,960
There are only as many states as there are configurations
10719
08:55:59,960 --> 08:56:02,080
of objects in each of these various different piles.
10720
08:56:02,080 --> 08:56:06,120
You might imagine that it's going to be harder if you think of a game like chess
10721
08:56:06,120 --> 08:56:09,960
or games where there are many, many more states and many, many more actions
10722
08:56:09,960 --> 08:56:11,840
that you can imagine taking, where it's not
10723
08:56:11,840 --> 08:56:15,600
going to be as easy to learn for every state and for every action
10724
08:56:15,600 --> 08:56:17,520
what the value is going to be.
10725
08:56:17,520 --> 08:56:20,040
So oftentimes in that case, we can't necessarily
10726
08:56:20,040 --> 08:56:23,960
learn exactly what the value is for every state and for every action,
10727
08:56:23,960 --> 08:56:25,400
but we can approximate it.
10728
08:56:25,400 --> 08:56:28,720
Much as we saw with minimax, where we could use a depth-limiting approach
10729
08:56:28,720 --> 08:56:31,760
to stop calculating at a certain point in time,
10730
08:56:31,760 --> 08:56:34,360
we can do a similar type of approximation known
10731
08:56:34,360 --> 08:56:37,640
as function approximation in a reinforcement learning context
10732
08:56:37,640 --> 08:56:42,800
where instead of learning a value of q for every state and every action,
10733
08:56:42,800 --> 08:56:46,160
we just have some function that estimates what the value is
10734
08:56:46,160 --> 08:56:49,000
for taking this action in this particular state that
10735
08:56:49,000 --> 08:56:53,240
might be based on various different features of the state
10736
08:56:53,240 --> 08:56:55,960
that the agent happens to be in, where you might have
10737
08:56:55,960 --> 08:56:58,400
to choose what those features actually are.
10738
08:56:58,400 --> 08:57:02,400
But you can begin to learn some patterns that generalize beyond one
10739
08:57:02,400 --> 08:57:05,840
specific state and one specific action, so that you can learn
10740
08:57:05,840 --> 08:57:08,480
if certain features tend to be good things or bad things.
10741
08:57:08,480 --> 08:57:11,760
Reinforcement learning can allow you, using a very similar mechanism,
10742
08:57:11,760 --> 08:57:14,320
to generalize beyond one particular state and say,
10743
08:57:14,320 --> 08:57:17,080
if this other state looks kind of like this state,
10744
08:57:17,080 --> 08:57:20,000
then maybe the similar types of actions that worked in one state
10745
08:57:20,000 --> 08:57:23,240
will also work in another state as well.
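A bare-bones sketch of that function approximation idea, using a hand-chosen feature function and a linear combination of weights; the feature choices and parameter values here are illustrative assumptions, not the lecture's code.

```python
# Function approximation sketch: instead of storing Q(s, a) for every pair,
# estimate it as a weighted sum of features of the (state, action) pair.
def features(state, action):
    """Map a (state, action) pair to a small numeric feature vector."""
    return [1.0, float(state), float(action)]  # bias term + toy features

weights = [0.0, 0.5, -0.2]  # learned parameters, shown fixed for illustration

def q_estimate(state, action):
    """Approximate Q(s, a) as a dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features(state, action)))
```

Because similar states produce similar feature vectors, this kind of estimate generalizes across states the agent has never visited, which a lookup table cannot do.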
10746
08:57:23,240 --> 08:57:25,360
And so this type of approach can be quite helpful
10747
08:57:25,360 --> 08:57:27,680
as you begin to deal with reinforcement learning that
10748
08:57:27,680 --> 08:57:31,600
exist in larger and larger state spaces where it's just not feasible
10749
08:57:31,600 --> 08:57:36,120
to explore all of the possible states that could actually exist.
10750
08:57:36,120 --> 08:57:39,400
So there, then, are two of the main categories of machine learning.
10751
08:57:39,400 --> 08:57:42,760
Supervised learning, where you have labeled input and output pairs,
10752
08:57:42,760 --> 08:57:46,480
and reinforcement learning, where an agent learns from rewards or punishments
10753
08:57:46,480 --> 08:57:47,480
that it receives.
10754
08:57:47,480 --> 08:57:49,640
The third major category of machine learning
10755
08:57:49,640 --> 08:57:53,400
that we'll just touch on briefly is known as unsupervised learning.
10756
08:57:53,400 --> 08:57:56,520
And unsupervised learning happens when we have data
10757
08:57:56,520 --> 08:57:59,280
without any additional feedback, without labels,
10758
08:57:59,280 --> 08:58:02,760
that in the supervised learning case, all of our data had labels.
10759
08:58:02,760 --> 08:58:06,520
We labeled the data point with whether that was a rainy day or not rainy day.
10760
08:58:06,520 --> 08:58:09,560
And using those labels, we were able to infer what the pattern was.
10761
08:58:09,560 --> 08:58:13,160
Or we labeled data as a counterfeit banknote or not a counterfeit.
10762
08:58:13,160 --> 08:58:16,720
And using those labels, we were able to draw inferences and patterns
10763
08:58:16,720 --> 08:58:20,840
to figure out what does a banknote look like versus not.
10764
08:58:20,840 --> 08:58:25,240
In unsupervised learning, we don't have any access to any of those labels.
10765
08:58:25,240 --> 08:58:28,320
But we still would like to learn some of those patterns.
10766
08:58:28,320 --> 08:58:31,920
And one of the tasks that you might want to perform in unsupervised learning
10767
08:58:31,920 --> 08:58:34,680
is something like clustering, where clustering is just
10768
08:58:34,680 --> 08:58:37,800
the task of, given some set of objects, organizing them
10769
08:58:37,800 --> 08:58:42,160
into distinct clusters, groups of objects that are similar to one another.
10770
08:58:42,160 --> 08:58:44,440
And there's lots of applications for clustering.
10771
08:58:44,440 --> 08:58:47,480
It comes up in genetic research, where you might have
10772
08:58:47,480 --> 08:58:50,840
a whole bunch of different genes and you want to cluster them into similar genes
10773
08:58:50,840 --> 08:58:54,480
if you're trying to analyze them across a population or across species.
10774
08:58:54,480 --> 08:58:57,480
It comes up with images, if you want to take all the pixels of an image,
10775
08:58:57,480 --> 08:58:59,520
cluster them into different parts of the image.
10776
08:58:59,520 --> 08:59:03,400
It comes up a lot in market research if you want to divide your consumers
10777
08:59:03,400 --> 08:59:06,640
into different groups so you know which groups to target with certain types
10778
08:59:06,640 --> 08:59:10,240
of product advertisements, for example, and a number of other contexts
10779
08:59:10,240 --> 08:59:13,280
as well in which clustering can be very applicable.
10780
08:59:13,280 --> 08:59:17,880
One technique for clustering is an algorithm known as k-means clustering.
10781
08:59:17,880 --> 08:59:20,240
And what k-means clustering is going to do
10782
08:59:20,240 --> 08:59:24,720
is it is going to divide all of our data points into k different clusters.
10783
08:59:24,720 --> 08:59:28,640
And it's going to do so by repeating this process of assigning points
10784
08:59:28,640 --> 08:59:32,640
to clusters and then moving around those clusters' centers.
10785
08:59:32,640 --> 08:59:36,600
We're going to define a cluster by its center, the middle of the cluster,
10786
08:59:36,600 --> 08:59:39,760
and then assign points to that cluster based on which
10787
08:59:39,760 --> 08:59:42,360
center is closest to that point.
10788
08:59:42,360 --> 08:59:44,560
And I'll show you an example of that now.
10789
08:59:44,560 --> 08:59:47,960
Here, for example, I have a whole bunch of unlabeled data,
10790
08:59:47,960 --> 08:59:51,760
just various data points that are in some sort of graphical space.
10791
08:59:51,760 --> 08:59:55,560
And I would like to group them into various different clusters.
10792
08:59:55,560 --> 08:59:57,400
But I don't know how to do that originally.
10793
08:59:57,400 --> 09:00:00,400
And let's say I want to assign like three clusters to this group.
10794
09:00:00,400 --> 09:00:03,400
And you have to choose how many clusters you want in k-means clustering
10795
09:00:03,400 --> 09:00:06,800
though you could try multiple values and see how well those perform.
10796
09:00:06,800 --> 09:00:09,960
But I'll start just by randomly picking some places
10797
09:00:09,960 --> 09:00:12,040
to put the centers of those clusters.
10798
09:00:12,040 --> 09:00:15,600
Maybe I have a blue cluster, a red cluster, and a green cluster.
10799
09:00:15,600 --> 09:00:18,040
And I'm going to start with the centers of those clusters
10800
09:00:18,040 --> 09:00:20,600
just being in these three locations here.
10801
09:00:20,600 --> 09:00:23,040
And what k-means clustering tells us to do
10802
09:00:23,040 --> 09:00:25,720
is once I have the centers of the clusters,
10803
09:00:25,720 --> 09:00:32,440
assign every point to a cluster based on which cluster center it is closest to.
10804
09:00:32,440 --> 09:00:35,920
So we end up with something like this, where all of these points
10805
09:00:35,920 --> 09:00:40,240
are closer to the blue cluster center than any other cluster center.
10806
09:00:40,240 --> 09:00:43,880
All of these points here are closer to the green cluster
10807
09:00:43,880 --> 09:00:45,800
center than any other cluster center.
10808
09:00:45,800 --> 09:00:48,360
And then these two points plus these points over here,
10809
09:00:48,360 --> 09:00:53,200
those are all closest to the red cluster center instead.
10810
09:00:53,200 --> 09:00:57,000
So here then is one possible assignment of all these points
10811
09:00:57,000 --> 09:00:58,880
to three different clusters.
10812
09:00:58,880 --> 09:01:01,560
But it's not great, in that it seems like in this red cluster,
10813
09:01:01,560 --> 09:01:02,960
these points are kind of far apart.
10814
09:01:02,960 --> 09:01:05,800
In this green cluster, these points are kind of far apart.
10815
09:01:05,800 --> 09:01:08,760
It might not be my ideal choice of how I would cluster
10816
09:01:08,760 --> 09:01:10,640
these various different data points.
10817
09:01:10,640 --> 09:01:13,720
But k-means clustering is an iterative process
10818
09:01:13,720 --> 09:01:16,360
that after I do this, there is a next step, which
10819
09:01:16,360 --> 09:01:19,920
is that after I've assigned all of the points to the cluster center
10820
09:01:19,920 --> 09:01:24,280
that it is nearest to, we are going to re-center the clusters,
10821
09:01:24,280 --> 09:01:27,560
meaning take the cluster centers, these diamond shapes here,
10822
09:01:27,560 --> 09:01:30,420
and move them to the middle, or the average,
10823
09:01:30,420 --> 09:01:33,960
effectively, of all of the points that are in that cluster.
10824
09:01:33,960 --> 09:01:36,080
So we'll take this blue point, this blue center,
10825
09:01:36,080 --> 09:01:39,800
and go ahead and move it to the middle or to the center of all
10826
09:01:39,800 --> 09:01:41,920
of the points that were assigned to the blue cluster,
10827
09:01:41,920 --> 09:01:43,920
moving it slightly to the right in this case.
10828
09:01:43,920 --> 09:01:45,040
And we'll do the same thing for red.
10829
09:01:45,040 --> 09:01:49,840
We'll move the cluster center to the middle of all of these points,
10830
09:01:49,840 --> 09:01:51,560
weighted by how many points there are.
10831
09:01:51,560 --> 09:01:55,040
There are more points over here, so the red center ends up
10832
09:01:55,040 --> 09:01:56,720
moving a little bit further that way.
10833
09:01:56,720 --> 09:01:59,300
And likewise, for the green center, there are many more points
10834
09:01:59,300 --> 09:02:01,200
on this side of the green center.
10835
09:02:01,200 --> 09:02:04,440
So the green center ends up being pulled a little bit further
10836
09:02:04,440 --> 09:02:06,160
in this direction.
10837
09:02:06,160 --> 09:02:10,020
So we re-center all of the clusters, and then we repeat the process.
10838
09:02:10,020 --> 09:02:14,480
We go ahead and now reassign all of the points to the cluster center
10839
09:02:14,480 --> 09:02:16,080
that they are now closest to.
10840
09:02:16,080 --> 09:02:18,560
And now that we've moved around the cluster centers,
10841
09:02:18,560 --> 09:02:20,640
these cluster assignments might change.
10842
09:02:20,640 --> 09:02:23,840
That this point originally was closer to the red cluster center,
10843
09:02:23,840 --> 09:02:26,840
but now it's actually closer to the blue cluster center.
10844
09:02:26,840 --> 09:02:28,520
Same goes for this point as well.
10845
09:02:28,520 --> 09:02:31,620
And these three points that were originally closer to the green cluster
10846
09:02:31,620 --> 09:02:36,600
center are now closer to the red cluster center instead.
10847
09:02:36,600 --> 09:02:41,320
So we can reassign what colors or which clusters each of these data points
10848
09:02:41,320 --> 09:02:43,960
belongs to, and then repeat the process again,
10849
09:02:43,960 --> 09:02:47,520
moving each of these cluster means, the middles of the clusters,
10850
09:02:47,520 --> 09:02:52,720
to the mean, the average, of all of the other points that happen to be there,
10851
09:02:52,720 --> 09:02:54,320
and repeat the process again.
10852
09:02:54,320 --> 09:02:57,060
Go ahead and assign each of the points to the cluster
10853
09:02:57,060 --> 09:02:58,440
that they are closest to.
10854
09:02:58,440 --> 09:03:01,600
So once we reach a point where we've assigned all the points
10855
09:03:01,600 --> 09:03:05,140
to the cluster that they are nearest to, and nothing changed,
10856
09:03:05,140 --> 09:03:07,840
we've reached a sort of equilibrium in this situation,
10857
09:03:07,840 --> 09:03:09,960
where no points are changing their allegiance.
10858
09:03:09,960 --> 09:03:12,800
And as a result, we can declare this algorithm is now over.
10859
09:03:12,800 --> 09:03:15,840
And we now have some assignment of each of these points
10860
09:03:15,840 --> 09:03:17,160
into three different clusters.
10861
09:03:17,160 --> 09:03:19,480
And it looks like we did a pretty good job of trying
10862
09:03:19,480 --> 09:03:22,880
to identify which points are more similar to one another
10863
09:03:22,880 --> 09:03:24,640
than they are to points in other groups.
10864
09:03:24,640 --> 09:03:27,920
So we have the green cluster down here, this blue cluster here,
10865
09:03:27,920 --> 09:03:30,480
and then this red cluster over there as well.
10866
09:03:30,480 --> 09:03:33,400
And we did so without any access to some labels
10867
09:03:33,400 --> 09:03:35,960
to tell us what these various different clusters were.
10868
09:03:35,960 --> 09:03:38,800
We just used an algorithm in an unsupervised sense
10869
09:03:38,800 --> 09:03:41,760
without any of those labels to figure out which points
10870
09:03:41,760 --> 09:03:43,640
belonged to which categories.
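The procedure just walked through can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the lecture; the function name `k_means`, the random initialization from the data points, and the use of squared Euclidean distance are my own choices.

```python
import random

def k_means(points, k, iterations=100):
    """Cluster 2D points into k groups by repeatedly assigning each point
    to the nearest cluster center and re-centering each cluster at the
    mean (the average) of its assigned points."""
    # Start by randomly picking k of the points as initial cluster centers.
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster whose center is closest.
        clusters = [[] for _ in range(k)]
        for (x, y) in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for (cx, cy) in centers]
            clusters[distances.index(min(distances))].append((x, y))
        # Re-centering step: move each center to the average of its points.
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append((sum(x for x, _ in cluster) / len(cluster),
                                    sum(y for _, y in cluster) / len(cluster)))
            else:
                new_centers.append(old)  # keep an empty cluster's old center
        # Equilibrium: if no center moved, no point will change allegiance,
        # so the algorithm is over.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

A production version (for example scikit-learn's `KMeans`) would also handle multiple random restarts and smarter initialization.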
10871
09:03:43,640 --> 09:03:47,680
And again, lots of applications for this type of clustering technique.
10872
09:03:47,680 --> 09:03:50,760
And there are many more algorithms in each of these various different fields
10873
09:03:50,760 --> 09:03:54,240
within machine learning, supervised and reinforcement and unsupervised.
10874
09:03:54,240 --> 09:03:57,240
But those are many of the big picture foundational ideas
10875
09:03:57,240 --> 09:04:00,320
that underlie a lot of these techniques, where these are the problems
10876
09:04:00,320 --> 09:04:01,520
that we're trying to solve.
10877
09:04:01,520 --> 09:04:03,640
And we try and solve those problems using
10878
09:04:03,640 --> 09:04:06,560
a number of different methods of trying to take data and learn
10879
09:04:06,560 --> 09:04:08,800
patterns in that data, whether that's trying
10880
09:04:08,800 --> 09:04:10,960
to find neighboring data points that are similar
10881
09:04:10,960 --> 09:04:13,840
or trying to minimize some sort of loss function
10882
09:04:13,840 --> 09:04:17,080
or any number of other techniques that allow us to begin to try
10883
09:04:17,080 --> 09:04:19,360
to solve these sorts of problems.
10884
09:04:19,360 --> 09:04:21,360
That then was a look at some of the principles
10885
09:04:21,360 --> 09:04:23,800
that are at the foundation of modern machine learning,
10886
09:04:23,800 --> 09:04:26,760
this ability to take data and learn from that data
10887
09:04:26,760 --> 09:04:28,840
so that the computer can perform a task even
10888
09:04:28,840 --> 09:04:31,240
if it hasn't explicitly been given instructions
10889
09:04:31,240 --> 09:04:32,440
in order to do so.
10890
09:04:32,440 --> 09:04:35,200
Next time, we'll continue this conversation about machine learning,
10891
09:04:35,200 --> 09:04:38,600
looking at other techniques we can use for solving these sorts of problems.
10892
09:04:38,600 --> 09:04:41,320
We'll see you then.
10893
09:04:41,320 --> 09:05:01,360
All right, welcome back, everyone, to an introduction
10894
09:05:01,360 --> 09:05:03,360
to artificial intelligence with Python.
10895
09:05:03,360 --> 09:05:05,640
Now, last time, we took a look at machine learning,
10896
09:05:05,640 --> 09:05:09,280
a set of techniques that computers can use in order to take a set of data
10897
09:05:09,280 --> 09:05:11,480
and learn some patterns inside of that data,
10898
09:05:11,480 --> 09:05:14,560
learn how to perform a task even if we the programmers didn't
10899
09:05:14,560 --> 09:05:18,760
give the computer explicit instructions for how to perform that task.
10900
09:05:18,760 --> 09:05:21,780
Today, we transition to one of the most popular techniques and tools
10901
09:05:21,780 --> 09:05:24,600
within machine learning, that of neural networks.
10902
09:05:24,600 --> 09:05:27,600
And neural networks were inspired as early as the 1940s
10903
09:05:27,600 --> 09:05:30,600
by researchers who were thinking about how it is that humans learn,
10904
09:05:30,600 --> 09:05:33,000
studying neuroscience in the human brain and trying
10905
09:05:33,000 --> 09:05:36,160
to see whether or not we could apply those same ideas to computers
10906
09:05:36,160 --> 09:05:39,440
as well and model computer learning off of human learning.
10907
09:05:39,440 --> 09:05:41,480
So how is the brain structured?
10908
09:05:41,480 --> 09:05:45,080
Well, very simply put, the brain consists of a whole bunch of neurons.
10909
09:05:45,080 --> 09:05:47,480
And those neurons are connected to one another
10910
09:05:47,480 --> 09:05:49,800
and communicate with one another in some way.
10911
09:05:49,800 --> 09:05:52,800
In particular, if you think about the structure of a biological neural
10912
09:05:52,800 --> 09:05:55,800
network, something like this, there are a couple of key properties
10913
09:05:55,800 --> 09:05:57,320
that scientists observed.
10914
09:05:57,320 --> 09:05:59,680
One was that these neurons are connected to each other
10915
09:05:59,680 --> 09:06:01,880
and receive electrical signals from one another,
10916
09:06:01,880 --> 09:06:06,120
that one neuron can propagate electrical signals to another neuron.
10917
09:06:06,120 --> 09:06:09,080
And another point is that neurons process those input signals
10918
09:06:09,080 --> 09:06:12,760
and then can be activated, that a neuron becomes activated at a certain point
10919
09:06:12,760 --> 09:06:16,840
and then can propagate further signals onto neurons in the future.
10920
09:06:16,840 --> 09:06:18,600
And so the question then became, could we
10921
09:06:18,600 --> 09:06:22,160
take this biological idea of how it is that humans learn with brains
10922
09:06:22,160 --> 09:06:25,440
and with neurons and apply that to a machine as well,
10923
09:06:25,440 --> 09:06:29,520
in effect designing an artificial neural network, or an ANN,
10924
09:06:29,520 --> 09:06:31,760
which will be a mathematical model for learning
10925
09:06:31,760 --> 09:06:34,860
that is inspired by these biological neural networks?
10926
09:06:34,860 --> 09:06:37,360
And what artificial neural networks will allow us to do
10927
09:06:37,360 --> 09:06:40,600
is they will first be able to model some sort of mathematical function.
10928
09:06:40,600 --> 09:06:42,440
Every time you look at a neural network, which
10929
09:06:42,440 --> 09:06:44,520
we'll see more of later today, each one of them
10930
09:06:44,520 --> 09:06:46,720
is really just some mathematical function that
10931
09:06:46,720 --> 09:06:50,160
is mapping certain inputs to particular outputs based
10932
09:06:50,160 --> 09:06:53,240
on the structure of the network, that depending on where we place
10933
09:06:53,240 --> 09:06:55,800
particular units inside of this neural network,
10934
09:06:55,800 --> 09:06:59,640
that's going to determine how it is that the network is going to function.
10935
09:06:59,640 --> 09:07:01,800
And in particular, artificial neural networks
10936
09:07:01,800 --> 09:07:05,760
are going to lend themselves to a way that we can learn what the network's
10937
09:07:05,760 --> 09:07:07,240
parameters should be.
10938
09:07:07,240 --> 09:07:08,920
We'll see more on that in just a moment.
10939
09:07:08,920 --> 09:07:11,000
But in effect, we want a model such that it
10940
09:07:11,000 --> 09:07:13,500
is easy for us to be able to write some code that
10941
09:07:13,500 --> 09:07:16,040
allows for the network to be able to figure out
10942
09:07:16,040 --> 09:07:18,480
how to model the right mathematical function given
10943
09:07:18,480 --> 09:07:20,840
a particular set of input data.
10944
09:07:20,840 --> 09:07:23,120
So in order to create our artificial neural network,
10945
09:07:23,120 --> 09:07:25,360
instead of using biological neurons, we're just
10946
09:07:25,360 --> 09:07:28,240
going to use what we're going to call units, units inside of a neural
10947
09:07:28,240 --> 09:07:31,460
network, which we can represent kind of like a node in a graph, which
10948
09:07:31,460 --> 09:07:34,520
will here be represented just by a blue circle like this.
10949
09:07:34,520 --> 09:07:37,520
And these artificial units, these artificial neurons,
10950
09:07:37,520 --> 09:07:39,320
can be connected to one another.
10951
09:07:39,320 --> 09:07:41,560
So here, for instance, we have two units that
10952
09:07:41,560 --> 09:07:46,240
are connected by this edge inside of this graph, effectively.
10953
09:07:46,240 --> 09:07:48,020
And so what we're going to do now is think
10954
09:07:48,020 --> 09:07:51,680
of this idea as some sort of mapping from inputs to outputs.
10955
09:07:51,680 --> 09:07:54,800
So we have one unit that is connected to another unit
10956
09:07:54,800 --> 09:07:58,460
where we might think of this side as the input and that side as the output.
10957
09:07:58,460 --> 09:08:00,680
And what we're trying to do then is to figure out
10958
09:08:00,680 --> 09:08:04,000
how to solve a problem, how to model some sort of mathematical function.
10959
09:08:04,000 --> 09:08:05,680
And this might take the form of something
10960
09:08:05,680 --> 09:08:08,640
we saw last time, which was something like we have certain inputs,
10961
09:08:08,640 --> 09:08:10,680
like variables x1 and x2.
10962
09:08:10,680 --> 09:08:13,800
And given those inputs, we want to perform some sort of task,
10963
09:08:13,800 --> 09:08:16,820
a task like predicting whether or not it's going to rain.
10964
09:08:16,820 --> 09:08:20,120
And ideally, we'd like some way, given these inputs, x1 and x2,
10965
09:08:20,120 --> 09:08:23,120
which stand for some sort of variables to do with the weather,
10966
09:08:23,120 --> 09:08:27,000
we would like to be able to predict, in this case, a Boolean classification.
10967
09:08:27,000 --> 09:08:30,160
Is it going to rain, or is it not going to rain?
10968
09:08:30,160 --> 09:08:33,360
And we did this last time by way of a mathematical function.
10969
09:08:33,360 --> 09:08:36,880
We defined some function, h, for our hypothesis function,
10970
09:08:36,880 --> 09:08:41,160
that took as input x1 and x2, the two inputs that we cared about processing,
10971
09:08:41,160 --> 09:08:44,080
in order to determine whether we thought it was going to rain
10972
09:08:44,080 --> 09:08:46,160
or whether we thought it was not going to rain.
10973
09:08:46,160 --> 09:08:48,840
The question then becomes, what does this hypothesis function
10974
09:08:48,840 --> 09:08:51,400
do in order to make that determination?
10975
09:08:51,400 --> 09:08:56,520
And we decided last time to use a linear combination of these input variables
10976
09:08:56,520 --> 09:08:58,160
to determine what the output should be.
10977
09:08:58,160 --> 09:09:02,680
So our hypothesis function was equal to something like this.
10978
09:09:02,680 --> 09:09:07,560
Weight 0 plus weight 1 times x1 plus weight 2 times x2.
10979
09:09:07,560 --> 09:09:11,960
So what's going on here is that x1 and x2, those are input variables,
10980
09:09:11,960 --> 09:09:15,040
the inputs to this hypothesis function.
10981
09:09:15,040 --> 09:09:17,960
And each of those input variables is being multiplied
10982
09:09:17,960 --> 09:09:20,400
by some weight, which is just some number.
10983
09:09:20,400 --> 09:09:25,240
So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2.
10984
09:09:25,240 --> 09:09:27,660
And we have this additional weight, weight 0,
10985
09:09:27,660 --> 09:09:30,160
that doesn't get multiplied by an input variable at all,
10986
09:09:30,160 --> 09:09:32,040
that just serves to either move the function up
10987
09:09:32,040 --> 09:09:33,840
or move the function's value down.
10988
09:09:33,840 --> 09:09:36,160
You can think of this as either a weight that's just
10989
09:09:36,160 --> 09:09:38,720
multiplied by some dummy value, like the number 1.
10990
09:09:38,720 --> 09:09:41,840
It's multiplied by 1, and so it's not multiplied by anything.
10991
09:09:41,840 --> 09:09:43,840
Or sometimes, you'll see in the literature,
10992
09:09:43,840 --> 09:09:46,280
people call this variable weight 0 a bias,
10993
09:09:46,280 --> 09:09:48,780
so that you can think of these variables as slightly different.
10994
09:09:48,780 --> 09:09:50,920
We have weights that are multiplied by the input,
10995
09:09:50,920 --> 09:09:54,560
and we separately add some bias to the result as well.
10996
09:09:54,560 --> 09:09:56,240
You'll hear both of those terminologies used
10997
09:09:56,240 --> 09:09:59,960
when people talk about neural networks and machine learning.
10998
09:09:59,960 --> 09:10:02,160
So in effect, what we've done here is that in order
10999
09:10:02,160 --> 09:10:06,240
to define a hypothesis function, we just need to decide and figure out
11000
09:10:06,240 --> 09:10:08,640
what these weights should be to determine
11001
09:10:08,640 --> 09:10:12,520
what values to multiply by our inputs to get some sort of result.
11002
09:10:12,520 --> 09:10:14,600
Of course, at the end of this, what we need to do
11003
09:10:14,600 --> 09:10:18,120
is make some sort of classification, like rainy or not rainy.
11004
09:10:18,120 --> 09:10:20,400
And to do that, we use some sort of function
11005
09:10:20,400 --> 09:10:22,400
that defines some sort of threshold.
11006
09:10:22,400 --> 09:10:25,040
And so we saw, for instance, the step function,
11007
09:10:25,040 --> 09:10:30,120
which is defined as 1 if the result of multiplying the weights by the inputs
11008
09:10:30,120 --> 09:10:32,360
is at least 0, otherwise it's 0.
11009
09:10:32,360 --> 09:10:34,200
And you can think of this line down the middle
11010
09:10:34,200 --> 09:10:35,560
as kind of like a dotted line.
11011
09:10:35,560 --> 09:10:38,560
Effectively, it stays at 0 all the way up to one point,
11012
09:10:38,560 --> 09:10:41,320
and then the function steps or jumps up to 1.
11013
09:10:41,320 --> 09:10:43,720
So it's 0 before it reaches some threshold,
11014
09:10:43,720 --> 09:10:46,960
and then it's 1 after it reaches a particular threshold.
11015
09:10:46,960 --> 09:10:49,040
And so this was one way we could define what
11016
09:10:49,040 --> 09:10:51,680
will come to call an activation function, a function that
11017
09:10:51,680 --> 09:10:56,400
determines when it is that this output becomes active, changes to 1
11018
09:10:56,400 --> 09:10:58,280
instead of being a 0.
11019
09:10:58,280 --> 09:11:02,120
But we also saw that if we didn't just want a purely binary classification,
11020
09:11:02,120 --> 09:11:04,800
we didn't want purely 1 or 0, but we wanted
11021
09:11:04,800 --> 09:11:07,880
to allow for some in-between real numbered values,
11022
09:11:07,880 --> 09:11:09,300
we could use a different function.
11023
09:11:09,300 --> 09:11:11,720
And there are a number of choices, but the one that we looked at
11024
09:11:11,720 --> 09:11:15,760
was the logistic sigmoid function that has sort of an s-shaped curve,
11025
09:11:15,760 --> 09:11:18,160
where we could represent this as a probability that
11026
09:11:18,160 --> 09:11:20,920
may be somewhere in between; maybe the probability of rain
11027
09:11:20,920 --> 09:11:22,640
is something like 0.5.
11028
09:11:22,640 --> 09:11:25,680
Maybe a little bit later, the probability of rain is 0.8.
11029
09:11:25,680 --> 09:11:29,600
And so rather than just have a binary classification of 0 or 1,
11030
09:11:29,600 --> 09:11:32,320
we could allow for numbers that are in between as well.
11031
09:11:32,320 --> 09:11:35,000
And it turns out there are many other different types of activation
11032
09:11:35,000 --> 09:11:37,560
functions, where an activation function just
11033
09:11:37,560 --> 09:11:41,000
takes the output of multiplying the weights together and adding that bias,
11034
09:11:41,000 --> 09:11:43,800
and then figuring out what the actual output should be.
11035
09:11:43,800 --> 09:11:48,040
Another popular one is the rectified linear unit, otherwise known as ReLU.
11036
09:11:48,040 --> 09:11:50,480
And the way that works is that it just takes its input
11037
09:11:50,480 --> 09:11:52,920
and takes the maximum of that input and 0.
11038
09:11:52,920 --> 09:11:55,120
So if it's positive, it remains unchanged.
11039
09:11:55,120 --> 09:11:59,000
But if it's negative, it goes ahead and levels out at 0.
11040
09:11:59,000 --> 09:12:02,400
And there are other activation functions that we could choose as well.
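The three activation functions described so far, the step function, the logistic sigmoid, and ReLU, can be sketched directly in Python. This is an illustrative sketch, not code from the lecture:

```python
import math

def step(x):
    # Hard threshold: the output jumps from 0 to 1 once x reaches 0
    return 1 if x >= 0 else 0

def sigmoid(x):
    # Logistic sigmoid: an s-shaped curve squeezing any real number
    # into a value between 0 and 1, readable as a probability
    return 1 / (1 + math.exp(-x))

def relu(x):
    # Rectified linear unit: the maximum of the input and 0, so positive
    # inputs pass through unchanged and negative inputs level out at 0
    return max(0, x)
```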
11041
09:12:02,400 --> 09:12:04,720
But in short, each of these activation functions,
11042
09:12:04,720 --> 09:12:07,880
you can just think of as a function that gets applied
11043
09:12:07,880 --> 09:12:10,360
to the result of all of this computation.
11044
09:12:10,360 --> 09:12:15,400
We take some function g and apply it to the result of all of that calculation.
11045
09:12:15,400 --> 09:12:17,380
And this then is what we saw last time, the way
11046
09:12:17,380 --> 09:12:20,920
of defining some hypothesis function that takes in inputs,
11047
09:12:20,920 --> 09:12:23,980
calculate some linear combination of those inputs,
11048
09:12:23,980 --> 09:12:28,760
and then passes it through some sort of activation function to get our output.
11049
09:12:28,760 --> 09:12:32,800
And this actually turns out to be the model for the simplest of neural
11050
09:12:32,800 --> 09:12:36,440
networks, that we're going to instead represent this mathematical idea
11051
09:12:36,440 --> 09:12:39,320
graphically by using a structure like this.
11052
09:12:39,320 --> 09:12:42,000
Here then is a neural network that has two inputs.
11053
09:12:42,000 --> 09:12:44,400
We can think of this as x1 and this as x2.
11054
09:12:44,400 --> 09:12:48,120
And then one output, which you can think of as classifying whether or not
11055
09:12:48,120 --> 09:12:50,960
we think it's going to rain or not rain, for example,
11056
09:12:50,960 --> 09:12:52,640
in this particular instance.
11057
09:12:52,640 --> 09:12:54,600
And so how exactly does this model work?
11058
09:12:54,600 --> 09:12:57,640
Well, each of these two inputs represents one of our input variables,
11059
09:12:57,640 --> 09:12:59,660
x1 and x2.
11060
09:12:59,660 --> 09:13:05,040
And notice that these inputs are connected to this output via these edges,
11061
09:13:05,040 --> 09:13:06,980
which are going to be defined by their weights.
11062
09:13:06,980 --> 09:13:10,840
So these edges each have a weight associated with them, weight 1 and weight
11063
09:13:10,840 --> 09:13:12,000
2.
11064
09:13:12,000 --> 09:13:14,320
And then this output unit, what it's going to do
11065
09:13:14,320 --> 09:13:17,680
is it is going to calculate an output based on those inputs
11066
09:13:17,680 --> 09:13:19,240
and based on those weights.
11067
09:13:19,240 --> 09:13:23,320
This output unit is going to multiply all the inputs by their weights,
11068
09:13:23,320 --> 09:13:26,760
add in this bias term, which you can think of as an extra w0 term
11069
09:13:26,760 --> 09:13:31,120
that gets added into it, and then we pass it through an activation function.
11070
09:13:31,120 --> 09:13:34,640
So this then is just a graphical way of representing the same idea
11071
09:13:34,640 --> 09:13:36,800
we saw last time just mathematically.
11072
09:13:36,800 --> 09:13:40,120
And we're going to call this a very simple neural network.
11073
09:13:40,120 --> 09:13:42,000
And we'd like for this neural network to be
11074
09:13:42,000 --> 09:13:44,560
able to learn how to calculate some function,
11075
09:13:44,560 --> 09:13:46,960
that we want some function for the neural network to learn.
11076
09:13:46,960 --> 09:13:50,600
And the neural network is going to learn what should the values of w0,
11077
09:13:50,600 --> 09:13:52,240
w1, and w2 be?
11078
09:13:52,240 --> 09:13:54,560
What should the activation function be in order
11079
09:13:54,560 --> 09:13:57,240
to get the result that we would expect?
11080
09:13:57,240 --> 09:13:59,440
So we can actually take a look at an example of this.
11081
09:13:59,440 --> 09:14:02,440
What then is a very simple function that we might calculate?
11082
09:14:02,440 --> 09:14:06,040
Well, if we recall back from when we were looking at propositional logic,
11083
09:14:06,040 --> 09:14:07,920
one of the simplest functions we looked at
11084
09:14:07,920 --> 09:14:12,160
was something like the or function that takes two inputs, x and y,
11085
09:14:12,160 --> 09:14:16,640
and outputs 1, otherwise known as true, if either one of the inputs
11086
09:14:16,640 --> 09:14:22,280
or both of them are 1, and outputs 0 if both of the inputs are 0 or false.
11087
09:14:22,280 --> 09:14:23,800
So this then is the or function.
11088
09:14:23,800 --> 09:14:25,880
And this was the truth table for the or function,
11089
09:14:25,880 --> 09:14:29,840
that as long as either of the inputs are 1, the output of the function is 1,
11090
09:14:29,840 --> 09:14:34,520
and the only case where the output is 0 is where both of the inputs are 0.
11091
09:14:34,520 --> 09:14:38,200
So the question is, how could we take this and train a neural network
11092
09:14:38,200 --> 09:14:40,600
to be able to learn this particular function?
11093
09:14:40,600 --> 09:14:42,440
What would those weights look like?
11094
09:14:42,440 --> 09:14:44,280
Well, we could do something like this.
11095
09:14:44,280 --> 09:14:45,960
Here's our neural network.
11096
09:14:45,960 --> 09:14:48,920
And I'll propose that in order to calculate the or function,
11097
09:14:48,920 --> 09:14:52,880
we're going to use a value of 1 for each of the weights.
11098
09:14:52,880 --> 09:14:55,800
And we'll use a bias of negative 1.
11099
09:14:55,800 --> 09:14:59,520
And then we'll just use this step function as our activation function.
11100
09:14:59,520 --> 09:15:00,720
How then does this work?
11101
09:15:00,720 --> 09:15:04,240
Well, if I wanted to calculate something like 0 or 0,
11102
09:15:04,240 --> 09:15:08,000
which we know to be 0 because false or false is false, then what are we going
11103
09:15:08,000 --> 09:15:08,720
to do?
11104
09:15:08,720 --> 09:15:12,400
Well, our output unit is going to calculate this input multiplied
11105
09:15:12,400 --> 09:15:14,680
by the weight, 0 times 1, that's 0.
11106
09:15:14,680 --> 09:15:17,440
Same thing here, 0 times 1, that's 0.
11107
09:15:17,440 --> 09:15:21,360
And we'll add to that the bias minus 1.
11108
09:15:21,360 --> 09:15:23,800
So that'll give us a result of negative 1.
11109
09:15:23,800 --> 09:15:26,920
If we plot that on our activation function, negative 1 is here.
11110
09:15:26,920 --> 09:15:30,520
It's before the threshold, which means the output is 0.
11111
09:15:30,520 --> 09:15:32,400
It's only 1 after the threshold.
11112
09:15:32,400 --> 09:15:34,720
Since negative 1 is before the threshold,
11113
09:15:34,720 --> 09:15:38,440
the output that this unit provides is going to be 0.
11114
09:15:38,440 --> 09:15:43,480
And that's what we would expect it to be, that 0 or 0 should be 0.
11115
09:15:43,480 --> 09:15:47,360
What if instead we had had 1 or 0, where this is the number 1?
11116
09:15:47,360 --> 09:15:50,720
Well, in this case, in order to calculate what the output is going to be,
11117
09:15:50,720 --> 09:15:55,720
we again have to do this weighted sum, 1 times 1, that's 1.
11118
09:15:55,720 --> 09:15:57,400
0 times 1, that's 0.
11119
09:15:57,400 --> 09:15:59,480
Sum of that so far is 1.
11120
09:15:59,480 --> 09:16:00,880
Add negative 1 to that.
11121
09:16:00,880 --> 09:16:02,440
Well, then the result is 0.
11122
09:16:02,440 --> 09:16:05,600
And if we plot 0 on the step function, 0 ends up being here.
11123
09:16:05,600 --> 09:16:07,320
It's just at the threshold.
11124
09:16:07,320 --> 09:16:11,440
And so the output here is going to be 1, because the output of 1 or 0,
11125
09:16:11,440 --> 09:16:12,240
that's 1.
11126
09:16:12,240 --> 09:16:13,960
So that's what we would expect as well.
11127
09:16:13,960 --> 09:16:17,800
And just for one more example, if I had 1 or 1, what would the result be?
11128
09:16:17,800 --> 09:16:19,520
Well, 1 times 1 is 1.
11129
09:16:19,520 --> 09:16:20,520
1 times 1 is 1.
11130
09:16:20,520 --> 09:16:22,120
The sum of those is 2.
11131
09:16:22,120 --> 09:16:23,440
I add the bias term to that.
11132
09:16:23,440 --> 09:16:24,720
I get the number 1.
11133
09:16:24,720 --> 09:16:27,000
1 plotted on this graph is way over there.
11134
09:16:27,000 --> 09:16:28,760
That's well beyond the threshold.
11135
09:16:28,760 --> 09:16:31,040
And so this output is going to be 1 as well.
11136
09:16:31,040 --> 09:16:34,160
The output is always 0 or 1, depending on whether or not
11137
09:16:34,160 --> 09:16:35,480
we're past the threshold.
11138
09:16:35,480 --> 09:16:39,840
And this neural network then models the OR function, a very simple function,
11139
09:16:39,840 --> 09:16:40,720
definitely.
11140
09:16:40,720 --> 09:16:42,520
But it still is able to model it correctly.
11141
09:16:42,520 --> 09:16:48,000
If I give it the inputs, it will tell me what x1 or x2 happens to be.
11142
09:16:48,000 --> 09:16:50,840
And you could imagine trying to do this for other functions as well.
11143
09:16:50,840 --> 09:16:55,440
A function like the AND function, for instance, that takes two inputs
11144
09:16:55,440 --> 09:16:59,360
and calculates whether both x and y are true.
11145
09:16:59,360 --> 09:17:04,080
So if x is 1 and y is 1, then the output of x and y is 1.
11146
09:17:04,080 --> 09:17:07,160
But in all the other cases, the output is 0.
11147
09:17:07,160 --> 09:17:10,400
How could we model that inside of a neural network as well?
11148
09:17:10,400 --> 09:17:13,200
Well, it turns out we could do it in the same way,
11149
09:17:13,200 --> 09:17:16,480
except instead of negative 1 as the bias,
11150
09:17:16,480 --> 09:17:20,000
we can use negative 2 as the bias instead.
11151
09:17:20,000 --> 09:17:21,360
What does that end up looking like?
11152
09:17:21,360 --> 09:17:25,960
Well, if I had 1 and 1, that should be 1, because true and true
11153
09:17:25,960 --> 09:17:27,080
is equal to true.
11154
09:17:27,080 --> 09:17:29,000
Well, I take 1 times 1, that's 1.
11155
09:17:29,000 --> 09:17:30,200
1 times 1 is 1.
11156
09:17:30,200 --> 09:17:32,240
I get a total sum of 2 so far.
11157
09:17:32,240 --> 09:17:35,880
Now I add the bias of negative 2, and I get the value 0.
11158
09:17:35,880 --> 09:17:38,480
And 0, when I plot it on the activation function,
11159
09:17:38,480 --> 09:17:42,480
is just past that threshold, and so the output is going to be 1.
11160
09:17:42,480 --> 09:17:46,800
But if I had any other input, for example, like 1 and 0,
11161
09:17:46,800 --> 09:17:51,040
well, the weighted sum of these is 1 plus 0 is going to be 1.
11162
09:17:51,040 --> 09:17:53,960
Minus 2 is going to give us negative 1, and negative 1
11163
09:17:53,960 --> 09:17:58,760
is not past that threshold, and so the output is going to be 0.
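The OR and AND units just described can be sketched in a few lines of Python (my own illustration, not code from the course): a weighted sum of the inputs plus a bias, passed through a step activation that outputs 1 once the threshold of 0 is reached.

```python
# Step activation: output 1 at or past the threshold of 0, else 0.
def step(value):
    return 1 if value >= 0 else 0

# A single unit: weighted sum of the inputs plus a bias,
# passed through the activation function.
def unit(x1, x2, w1, w2, bias):
    return step(w1 * x1 + w2 * x2 + bias)

# Weights of 1 with a bias of -1 model OR; a bias of -2 models AND.
def or_gate(x1, x2):
    return unit(x1, x2, 1, 1, -1)

def and_gate(x1, x2):
    return unit(x1, x2, 1, 1, -2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "OR:", or_gate(a, b), "AND:", and_gate(a, b))
```

Walking through OR with inputs 1 and 0: the weighted sum is 1, adding the bias of negative 1 gives 0, which is at the threshold, so the output is 1, just as in the example above.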
11164
09:17:58,760 --> 09:18:01,400
So those then are some very simple functions
11165
09:18:01,400 --> 09:18:05,720
that we can model using a neural network that has two inputs and one output,
11166
09:18:05,720 --> 09:18:08,720
where our goal is to be able to figure out what those weights should be
11167
09:18:08,720 --> 09:18:11,080
in order to determine what the output should be.
11168
09:18:11,080 --> 09:18:14,480
And you could imagine generalizing this to calculate more complex functions
11169
09:18:14,480 --> 09:18:17,160
as well, that maybe, given the humidity and the pressure,
11170
09:18:17,160 --> 09:18:20,000
we want to calculate what's the probability that it's going to rain,
11171
09:18:20,000 --> 09:18:20,840
for example.
11172
09:18:20,840 --> 09:18:22,880
Or we might want to do a regression-style problem.
11173
09:18:22,880 --> 09:18:26,440
We're given some amount of advertising, and given what month it is maybe,
11174
09:18:26,440 --> 09:18:28,480
we want to predict what our expected sales are
11175
09:18:28,480 --> 09:18:30,400
going to be for that particular month.
11176
09:18:30,400 --> 09:18:34,000
So you could imagine these inputs and outputs being different as well.
11177
09:18:34,000 --> 09:18:36,360
And it turns out that in some problems, we're not just
11178
09:18:36,360 --> 09:18:39,840
going to have two inputs, and the nice thing about these neural networks
11179
09:18:39,840 --> 09:18:42,480
is that we can compose multiple units together,
11180
09:18:42,480 --> 09:18:46,280
make our networks more complex just by adding more units
11181
09:18:46,280 --> 09:18:48,720
into this particular neural network.
11182
09:18:48,720 --> 09:18:52,880
So the network we've been looking at has two inputs and one output.
11183
09:18:52,880 --> 09:18:56,280
But we could just as easily say, let's go ahead and have three inputs in there,
11184
09:18:56,280 --> 09:18:58,600
or have even more inputs, where we could arbitrarily
11185
09:18:58,600 --> 09:19:02,520
decide however many inputs there are to our problem, all going
11186
09:19:02,520 --> 09:19:06,480
to be calculating some sort of output that we care about figuring out
11187
09:19:06,480 --> 09:19:07,720
the value of.
11188
09:19:07,720 --> 09:19:10,480
How then does the math work for figuring out that output?
11189
09:19:10,480 --> 09:19:12,440
Well, it's going to work in a very similar way.
11190
09:19:12,440 --> 09:19:16,760
In the case of two inputs, we had two weights indicated by these edges,
11191
09:19:16,760 --> 09:19:20,280
and we multiplied the weights by the numbers, adding this bias term.
11192
09:19:20,280 --> 09:19:22,840
And we'll do the same thing in the other cases as well.
11193
09:19:22,840 --> 09:19:25,160
If I have three inputs, you'll imagine multiplying
11194
09:19:25,160 --> 09:19:27,920
each of these three inputs by each of these weights.
11195
09:19:27,920 --> 09:19:31,080
If I had five inputs instead, we're going to do the same thing.
11196
09:19:31,080 --> 09:19:35,920
Here I'm saying sum up from 1 to 5, xi multiplied by weight i.
11197
09:19:35,920 --> 09:19:38,880
So take each of the five input variables, multiply them
11198
09:19:38,880 --> 09:19:41,880
by their corresponding weight, and then add the bias to that.
11199
09:19:41,880 --> 09:19:45,280
So this would be a case where there are five inputs into this neural network,
11200
09:19:45,280 --> 09:19:46,000
for example.
11201
09:19:46,000 --> 09:19:48,200
But there could be more, arbitrarily many nodes
11202
09:19:48,200 --> 09:19:51,120
that we want inside of this neural network, where each time we're just
11203
09:19:51,120 --> 09:19:54,960
going to sum up all of those input variables multiplied by their weight
11204
09:19:54,960 --> 09:19:57,760
and then add the bias term at the very end.
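That summation generalizes directly (again, a sketch of the idea rather than course code): take each input xi multiplied by weight wi, sum them all, and add the bias at the end, for however many inputs there happen to be.

```python
def step(value):
    return 1 if value >= 0 else 0

# Weighted sum over any number of inputs: sum of x_i * w_i, plus the bias,
# passed through an activation function.
def unit_output(inputs, weights, bias, activation=step):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)

# Five inputs, five weights, one bias -- the same rule as the two-input case.
print(unit_output([1, 0, 1, 0, 1], [1, 1, 1, 1, 1], -3))  # prints 1
```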
11205
09:19:57,760 --> 09:20:00,240
And so this allows us to be able to represent problems
11206
09:20:00,240 --> 09:20:05,440
that have even more inputs just by growing the size of our neural network.
11207
09:20:05,440 --> 09:20:08,480
Now, the next question we might ask is a question about how it
11208
09:20:08,480 --> 09:20:10,840
is that we train these neural networks.
11209
09:20:10,840 --> 09:20:13,160
In the case of the OR function and the AND function,
11210
09:20:13,160 --> 09:20:16,080
they were simple enough functions that I could just tell you,
11211
09:20:16,080 --> 09:20:17,480
like here, what the weights should be.
11212
09:20:17,480 --> 09:20:19,360
And you could probably reason through it yourself
11213
09:20:19,360 --> 09:20:23,280
what the weights should be in order to calculate the output that you want.
11214
09:20:23,280 --> 09:20:26,000
But in general, with functions like predicting sales
11215
09:20:26,000 --> 09:20:27,960
or predicting whether or not it's going to rain,
11216
09:20:27,960 --> 09:20:30,640
these are much trickier functions to be able to figure out.
11217
09:20:30,640 --> 09:20:33,160
We would like the computer to have some mechanism
11218
09:20:33,160 --> 09:20:36,040
of calculating what it is that the weights should be,
11219
09:20:36,040 --> 09:20:39,280
how it is to set the weights so that our neural network is
11220
09:20:39,280 --> 09:20:41,800
able to accurately model the function that we
11221
09:20:41,800 --> 09:20:43,320
care about trying to estimate.
11222
09:20:43,320 --> 09:20:45,400
And it turns out that the strategy for doing this,
11223
09:20:45,400 --> 09:20:49,600
inspired by the domain of calculus, is a technique called gradient descent.
11224
09:20:49,600 --> 09:20:52,320
And what gradient descent is, it is an algorithm
11225
09:20:52,320 --> 09:20:55,920
for minimizing loss when you're training a neural network.
11226
09:20:55,920 --> 09:20:59,920
And recall that loss refers to how bad our hypothesis
11227
09:20:59,920 --> 09:21:03,520
function happens to be, that we can define certain loss functions.
11228
09:21:03,520 --> 09:21:06,720
And we saw some examples of loss functions last time that just give us
11229
09:21:06,720 --> 09:21:09,360
a number for any particular hypothesis, saying,
11230
09:21:09,360 --> 09:21:11,360
how poorly does it model the data?
11231
09:21:11,360 --> 09:21:13,200
How many examples does it get wrong?
11232
09:21:13,200 --> 09:21:17,640
How much worse or less bad is it as compared to other hypothesis functions
11233
09:21:17,640 --> 09:21:19,120
that we might define?
11234
09:21:19,120 --> 09:21:22,640
And this loss function is just a mathematical function.
11235
09:21:22,640 --> 09:21:24,360
And when you have a mathematical function,
11236
09:21:24,360 --> 09:21:26,160
in calculus what you could do is calculate
11237
09:21:26,160 --> 09:21:29,280
something known as the gradient, which you can think of as like a slope.
11238
09:21:29,280 --> 09:21:32,960
It's the direction the loss function is moving at any particular point.
11239
09:21:32,960 --> 09:21:36,200
And what it's going to tell us is, in which direction
11240
09:21:36,200 --> 09:21:41,120
should we be moving these weights in order to minimize the amount of loss?
11241
09:21:41,120 --> 09:21:43,920
And so generally speaking, we won't get into the calculus of it.
11242
09:21:43,920 --> 09:21:46,240
But the high level idea for gradient descent
11243
09:21:46,240 --> 09:21:47,880
is going to look something like this.
11244
09:21:47,880 --> 09:21:51,080
If we want to train a neural network, we'll go ahead and start just
11245
09:21:51,080 --> 09:21:52,840
by choosing the weights randomly.
11246
09:21:52,840 --> 09:21:56,180
Just pick random weights for all of the weights in the neural network.
11247
09:21:56,180 --> 09:21:58,440
And then we'll use the input data that we have access
11248
09:21:58,440 --> 09:22:00,560
to in order to train the network, in order
11249
09:22:00,560 --> 09:22:02,880
to figure out what the weights should actually be.
11250
09:22:02,880 --> 09:22:05,360
So we'll repeat this process again and again.
11251
09:22:05,360 --> 09:22:08,200
The first step is we're going to calculate the gradient based
11252
09:22:08,200 --> 09:22:09,320
on all of the data points.
11253
09:22:09,320 --> 09:22:11,240
So we'll look at all the data and figure out
11254
09:22:11,240 --> 09:22:13,760
what the gradient is at the place where we currently
11255
09:22:13,760 --> 09:22:15,760
are for the current setting of the weights, which
11256
09:22:15,760 --> 09:22:19,440
means in which direction should we move the weights in order
11257
09:22:19,440 --> 09:22:24,480
to minimize the total amount of loss, in order to make our solution better.
11258
09:22:24,480 --> 09:22:26,820
And once we've calculated that gradient, which direction
11259
09:22:26,820 --> 09:22:29,120
we should move in the loss function, well,
11260
09:22:29,120 --> 09:22:32,240
then we can just update those weights according to the gradient.
11261
09:22:32,240 --> 09:22:35,200
Take a small step in the direction of those weights
11262
09:22:35,200 --> 09:22:37,800
in order to try to make our solution a little bit better.
11263
09:22:37,800 --> 09:22:40,200
And the size of the step that we take, that's going to vary.
11264
09:22:40,200 --> 09:22:43,120
And you can choose that when you're training a particular neural network.
11265
09:22:43,120 --> 09:22:46,240
But in short, the idea is going to be take all the data points,
11266
09:22:46,240 --> 09:22:48,840
figure out based on those data points in what direction
11267
09:22:48,840 --> 09:22:52,240
the weights should move, and then move the weights one small step
11268
09:22:52,240 --> 09:22:53,160
in that direction.
11269
09:22:53,160 --> 09:22:55,600
And if you repeat that process over and over again,
11270
09:22:55,600 --> 09:22:58,760
adjusting the weights a little bit at a time based on all the data points,
11271
09:22:58,760 --> 09:23:02,320
eventually you should end up with a pretty good solution
11272
09:23:02,320 --> 09:23:04,300
to trying to solve this sort of problem.
11273
09:23:04,300 --> 09:23:06,760
At least that's what we would hope to happen.
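That loop can be sketched on a toy problem (an illustration of the idea, not the course's code): fit a line y = w*x + b to some points by repeatedly computing the gradient of the squared loss over all of the data points and taking one small step.

```python
# Gradient descent sketch: start from some arbitrary weights, then
# repeatedly compute the gradient of the loss over ALL data points
# and move the weights one small step in the direction that reduces loss.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # points lying on y = 2x + 1

w, b = 0.5, 0.0          # initial (effectively random) weights
learning_rate = 0.05     # the size of each step, chosen by us

for _ in range(2000):
    # Gradient of mean squared loss with respect to w and b,
    # computed using every data point.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Update the weights according to the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```

The step size (learning rate) is exactly the knob mentioned above: too large and the updates overshoot, too small and training takes many more iterations.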
11274
09:23:06,760 --> 09:23:08,600
Now, if you look at this algorithm, a good question
11275
09:23:08,600 --> 09:23:10,920
to ask anytime you're analyzing an algorithm
11276
09:23:10,920 --> 09:23:14,640
is what is going to be the expensive part of doing the calculation?
11277
09:23:14,640 --> 09:23:17,160
What's going to take a lot of work to try to figure out?
11278
09:23:17,160 --> 09:23:19,680
What is going to be expensive to calculate?
11279
09:23:19,680 --> 09:23:22,060
And in particular, in the case of gradient descent,
11280
09:23:22,060 --> 09:23:26,240
the really expensive part is this all data points part right here,
11281
09:23:26,240 --> 09:23:30,200
having to take all of the data points and using all of those data points
11282
09:23:30,200 --> 09:23:34,000
figure out what the gradient is at this particular setting of all
11283
09:23:34,000 --> 09:23:34,500
of the weights.
11284
09:23:34,500 --> 09:23:37,040
Because odds are in a big machine learning problem
11285
09:23:37,040 --> 09:23:39,580
where you're trying to solve a big problem with a lot of data,
11286
09:23:39,580 --> 09:23:41,960
you have a lot of data points in order to calculate.
11287
09:23:41,960 --> 09:23:44,840
And figuring out the gradient based on all of those data points
11288
09:23:44,840 --> 09:23:46,160
is going to be expensive.
11289
09:23:46,160 --> 09:23:47,680
And you'll have to do it many times.
11290
09:23:47,680 --> 09:23:50,640
You'll likely repeat this process again and again and again,
11291
09:23:50,640 --> 09:23:54,320
going through all the data points, taking one small step over and over
11292
09:23:54,320 --> 09:23:57,780
as you try and figure out what the optimal setting of those weights
11293
09:23:57,780 --> 09:23:59,280
happens to be.
11294
09:23:59,280 --> 09:24:02,120
It turns out that we would ideally like to be
11295
09:24:02,120 --> 09:24:04,000
able to train our neural networks faster,
11296
09:24:04,000 --> 09:24:07,680
to be able to more quickly converge to some sort of solution that
11297
09:24:07,680 --> 09:24:10,000
is going to be a good solution to the problem.
11298
09:24:10,000 --> 09:24:13,000
So in that case, there are alternatives to just standard gradient descent,
11299
09:24:13,000 --> 09:24:15,280
which looks at all of the data points at once.
11300
09:24:15,280 --> 09:24:18,640
We can employ a method like stochastic gradient descent,
11301
09:24:18,640 --> 09:24:22,320
which will randomly just choose one data point at a time
11302
09:24:22,320 --> 09:24:25,240
to calculate the gradient based on, instead of calculating it
11303
09:24:25,240 --> 09:24:27,160
based on all of the data points.
11304
09:24:27,160 --> 09:24:30,180
So the idea there is that we have some setting of the weights.
11305
09:24:30,180 --> 09:24:31,560
We pick a data point.
11306
09:24:31,560 --> 09:24:34,720
And based on that one data point, we figure out in which direction
11307
09:24:34,720 --> 09:24:37,400
should we move all of the weights and move them one small step in that
11308
09:24:37,400 --> 09:24:39,800
direction, then take another data point and do that again
11309
09:24:39,800 --> 09:24:41,600
and repeat this process again and again,
11310
09:24:41,600 --> 09:24:44,240
maybe looking at each of the data points multiple times,
11311
09:24:44,240 --> 09:24:48,640
but each time only using one data point to calculate the gradient,
11312
09:24:48,640 --> 09:24:51,720
to calculate which direction we should move in.
11313
09:24:51,720 --> 09:24:55,040
Now, just using one data point instead of all of the data points
11314
09:24:55,040 --> 09:24:58,920
probably gives us a less accurate estimate of what the gradient actually
11315
09:24:58,920 --> 09:24:59,760
is.
11316
09:24:59,760 --> 09:25:01,880
But on the plus side, it's going to be much faster
11317
09:25:01,880 --> 09:25:04,520
to be able to calculate, that we can much more quickly calculate
11318
09:25:04,520 --> 09:25:07,200
what the gradient is based on one data point,
11319
09:25:07,200 --> 09:25:09,880
instead of calculating based on all of the data points
11320
09:25:09,880 --> 09:25:13,400
and having to do all of that computational work again and again.
11321
09:25:13,400 --> 09:25:16,120
So there are trade-offs here between looking at all of the data points
11322
09:25:16,120 --> 09:25:18,160
and just looking at one data point.
11323
09:25:18,160 --> 09:25:21,000
And it turns out that a middle ground that is also quite popular
11324
09:25:21,000 --> 09:25:24,600
is a technique called mini-batch gradient descent, where the idea there
11325
09:25:24,600 --> 09:25:28,080
is instead of looking at all of the data versus just a single point,
11326
09:25:28,080 --> 09:25:32,080
we instead divide our data set up into small batches, groups of data points,
11327
09:25:32,080 --> 09:25:34,840
where you can decide how big a particular batch is.
11328
09:25:34,840 --> 09:25:37,680
But in short, you're just going to look at a small number of points
11329
09:25:37,680 --> 09:25:41,280
at any given time, hopefully getting a more accurate estimate of the gradient,
11330
09:25:41,280 --> 09:25:44,960
but also not requiring all of the computational effort needed
11331
09:25:44,960 --> 09:25:48,800
to look at every single one of these data points.
11332
09:25:48,800 --> 09:25:50,960
So gradient descent, then, is this technique
11333
09:25:50,960 --> 09:25:53,800
that we can use in order to train these neural networks,
11334
09:25:53,800 --> 09:25:56,680
in order to figure out what the setting of all of these weights
11335
09:25:56,680 --> 09:25:59,520
should be if we want some way to try and get
11336
09:25:59,520 --> 09:26:02,760
an accurate notion of how it is that this function should work,
11337
09:26:02,760 --> 09:26:08,400
some way of modeling how to transform the inputs into particular outputs.
11338
09:26:08,400 --> 09:26:11,320
Now, so far, the networks that we've taken a look at
11339
09:26:11,320 --> 09:26:13,600
have all been structured similar to this.
11340
09:26:13,600 --> 09:26:17,080
We have some number of inputs, maybe two or three or five or more.
11341
09:26:17,080 --> 09:26:21,240
And then we have one output that is just predicting like rain or no rain
11342
09:26:21,240 --> 09:26:23,680
or just predicting one particular value.
11343
09:26:23,680 --> 09:26:25,600
But often in machine learning problems, we
11344
09:26:25,600 --> 09:26:27,840
don't just care about one output.
11345
09:26:27,840 --> 09:26:31,040
We might care about an output that has multiple different values
11346
09:26:31,040 --> 09:26:32,320
associated with it.
11347
09:26:32,320 --> 09:26:35,040
So in the same way that we could take a neural network
11348
09:26:35,040 --> 09:26:40,160
and add units to the input layer, we can likewise add outputs
11349
09:26:40,160 --> 09:26:41,760
to the output layer as well.
11350
09:26:41,760 --> 09:26:44,760
Instead of just one output, you could imagine we have two outputs,
11351
09:26:44,760 --> 09:26:47,120
or we could have four outputs, for example,
11352
09:26:47,120 --> 09:26:50,880
where in each case, as we add more inputs or add more outputs,
11353
09:26:50,880 --> 09:26:54,360
if we want to keep this network fully connected between these two layers,
11354
09:26:54,360 --> 09:26:58,840
we just need to add more weights, that now each of these input nodes
11355
09:26:58,840 --> 09:27:02,820
has four weights associated with each of the four outputs.
11356
09:27:02,820 --> 09:27:06,320
And that's true for each of these various different input nodes.
11357
09:27:06,320 --> 09:27:09,120
So as we add nodes, we add more weights in order
11358
09:27:09,120 --> 09:27:11,480
to make sure that each of the inputs can somehow
11359
09:27:11,480 --> 09:27:14,800
be connected to each of the outputs so that each output
11360
09:27:14,800 --> 09:27:19,760
value can be calculated based on what the value of the input happens to be.
11361
09:27:19,760 --> 09:27:23,720
So what might a case be where we want multiple different output values?
11362
09:27:23,720 --> 09:27:26,600
Well, you might consider that in the case of weather predicting,
11363
09:27:26,600 --> 09:27:30,720
for example, we might not just care whether it's raining or not raining.
11364
09:27:30,720 --> 09:27:33,500
There might be multiple different categories of weather
11365
09:27:33,500 --> 09:27:35,600
that we would like to categorize the weather into.
11366
09:27:35,600 --> 09:27:39,360
With just a single output variable, we can do a binary classification,
11367
09:27:39,360 --> 09:27:42,920
like rain or no rain, for instance, 1 or 0.
11368
09:27:42,920 --> 09:27:45,600
But it doesn't allow us to do much more than that.
11369
09:27:45,600 --> 09:27:47,560
With multiple output variables, I might be
11370
09:27:47,560 --> 09:27:50,480
able to use each one to predict something a little different.
11371
09:27:50,480 --> 09:27:54,040
Maybe I want to categorize the weather into one of four different categories,
11372
09:27:54,040 --> 09:27:58,000
something like is it going to be raining or sunny or cloudy or snowy.
11373
09:27:58,000 --> 09:27:59,960
And I now have four output variables that
11374
09:27:59,960 --> 09:28:03,800
can be used to represent maybe the probability that it is
11375
09:28:03,800 --> 09:28:08,520
rainy as opposed to sunny as opposed to cloudy or as opposed to snowy.
11376
09:28:08,520 --> 09:28:10,560
How then would this neural network work?
11377
09:28:10,560 --> 09:28:13,320
Well, we have some input variables that represent some data
11378
09:28:13,320 --> 09:28:15,240
that we have collected about the weather.
11379
09:28:15,240 --> 09:28:18,760
Each of those inputs gets multiplied by each of these various different weights.
11380
09:28:18,760 --> 09:28:20,980
We have more multiplications to do, but these
11381
09:28:20,980 --> 09:28:24,040
are fairly quick mathematical operations to perform.
11382
09:28:24,040 --> 09:28:25,800
And then what we get is after passing them
11383
09:28:25,800 --> 09:28:28,400
through some sort of activation function in the outputs,
11384
09:28:28,400 --> 09:28:32,160
we end up getting some sort of number, where that number, you might imagine,
11385
09:28:32,160 --> 09:28:36,000
you could interpret as a probability, like a probability that it is one
11386
09:28:36,000 --> 09:28:38,360
category as opposed to another category.
11387
09:28:38,360 --> 09:28:40,640
So here we're saying that based on the inputs,
11388
09:28:40,640 --> 09:28:45,000
we think there is a 10% chance that it's raining, a 60% chance that it's sunny,
11389
09:28:45,000 --> 09:28:48,720
a 20% chance of cloudy, a 10% chance that it's snowy.
11390
09:28:48,720 --> 09:28:52,800
And given that output, if these represent a probability distribution,
11391
09:28:52,800 --> 09:28:55,920
well, then you could just pick whichever one has the highest value,
11392
09:28:55,920 --> 09:28:58,720
in this case, sunny, and say that, well, most likely, we
11393
09:28:58,720 --> 09:29:04,040
think that this categorization of inputs means that the output
11395
09:29:04,040 --> 09:29:05,120
should be sunny.
11395
09:29:05,120 --> 09:29:09,800
And that is what we would expect the weather to be in this particular instance.
11396
09:29:09,800 --> 09:29:13,760
And so this allows us to do these sort of multi-class classifications,
11397
09:29:13,760 --> 09:29:17,620
where instead of just having a binary classification, 1 or 0,
11398
09:29:17,620 --> 09:29:20,680
we can have as many different categories as we want.
11399
09:29:20,680 --> 09:29:23,640
And we can have our neural network output these probabilities
11400
09:29:23,640 --> 09:29:27,700
over which categories are more likely than other categories.
11401
09:29:27,700 --> 09:29:30,820
And using that data, we're able to draw some sort of inference
11402
09:29:30,820 --> 09:29:33,200
on what it is that we should do.
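One common way to turn several raw output values into a probability distribution and pick the most likely category is a softmax, which is an assumption on my part here, since the lecture only says the outputs pass through some activation function:

```python
import math

# Softmax (an assumed choice, not named in the lecture): turn raw
# output-unit values into a probability distribution that sums to 1.
def softmax(values):
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["rain", "sun", "cloud", "snow"]
raw_outputs = [0.5, 2.3, 1.2, 0.5]           # hypothetical network outputs

probs = softmax(raw_outputs)
best = categories[probs.index(max(probs))]   # pick the most likely category
print(best, [round(p, 2) for p in probs])
```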
11403
09:29:33,200 --> 09:29:35,800
So this was sort of the idea of supervised machine learning.
11404
09:29:35,800 --> 09:29:38,860
I can give this neural network a whole bunch of data,
11405
09:29:38,860 --> 09:29:42,920
a whole bunch of input data corresponding to some label, some output data,
11406
09:29:42,920 --> 09:29:45,000
like we know that it was raining on this day,
11407
09:29:45,000 --> 09:29:46,960
we know that it was sunny on that day.
11408
09:29:46,960 --> 09:29:49,400
And using all of that data, the algorithm
11409
09:29:49,400 --> 09:29:52,400
can use gradient descent to figure out what all of the weights
11410
09:29:52,400 --> 09:29:55,840
should be in order to create some sort of model that hopefully allows us
11411
09:29:55,840 --> 09:29:59,280
a way to predict what we think the weather is going to be.
11412
09:29:59,280 --> 09:30:02,160
But neural networks have a lot of other applications as well.
11413
09:30:02,160 --> 09:30:06,520
You could imagine applying the same sort of idea to a reinforcement learning
11414
09:30:06,520 --> 09:30:09,280
sort of example as well, where you remember that in reinforcement
11415
09:30:09,280 --> 09:30:13,080
learning, what we wanted to do is train some sort of agent
11416
09:30:13,080 --> 09:30:16,080
to learn what action to take, depending on what state
11417
09:30:16,080 --> 09:30:17,400
they currently happen to be in.
11418
09:30:17,400 --> 09:30:19,640
So depending on the current state of the world,
11419
09:30:19,640 --> 09:30:23,040
we wanted the agent to pick from one of the available actions
11420
09:30:23,040 --> 09:30:24,800
that is available to them.
11421
09:30:24,800 --> 09:30:28,280
And you might model that by having each of these input variables
11422
09:30:28,280 --> 09:30:33,240
represent some information about the state, some data about what state
11423
09:30:33,240 --> 09:30:34,920
our agent is currently in.
11424
09:30:34,920 --> 09:30:37,320
And then the output, for example, could be each
11425
09:30:37,320 --> 09:30:40,160
of the various different actions that our agent could take,
11426
09:30:40,160 --> 09:30:42,560
action 1, 2, 3, and 4.
11427
09:30:42,560 --> 09:30:45,560
And you might imagine that this network would work in the same way,
11428
09:30:45,560 --> 09:30:48,840
but based on these particular inputs, we go ahead and calculate values
11429
09:30:48,840 --> 09:30:50,080
for each of these outputs.
11430
09:30:50,080 --> 09:30:53,960
And those outputs could model which action is better than other actions.
11431
09:30:53,960 --> 09:30:56,440
And we could just choose, based on looking at those outputs,
11432
09:30:56,440 --> 09:30:59,120
which action we should take.
11433
09:30:59,120 --> 09:31:01,840
And so these neural networks are very broadly applicable,
11434
09:31:01,840 --> 09:31:05,240
that all they're really doing is modeling some mathematical function.
11435
09:31:05,240 --> 09:31:07,600
So anything that we can frame as a mathematical function,
11436
09:31:07,600 --> 09:31:11,320
something like classifying inputs into various different categories
11437
09:31:11,320 --> 09:31:15,120
or figuring out based on some input state what action we should take,
11438
09:31:15,120 --> 09:31:18,680
these are all mathematical functions that we could attempt to model
11439
09:31:18,680 --> 09:31:21,360
by taking advantage of this neural network structure,
11440
09:31:21,360 --> 09:31:25,000
and in particular, taking advantage of this technique, gradient descent,
11441
09:31:25,000 --> 09:31:27,600
that we can use in order to figure out what the weights should
11442
09:31:27,600 --> 09:31:31,280
be in order to do this sort of calculation.
11443
09:31:31,280 --> 09:31:33,960
Now, how is it that you would go about training a neural network that
11444
09:31:33,960 --> 09:31:36,800
has multiple outputs instead of just one?
11445
09:31:36,800 --> 09:31:40,320
Well, with just a single output, we could see what the output for that value
11446
09:31:40,320 --> 09:31:44,360
should be, and then you update all of the weights that corresponded to it.
11447
09:31:44,360 --> 09:31:47,920
And when we have multiple outputs, at least in this particular case,
11448
09:31:47,920 --> 09:31:51,520
we can really think of this as four separate neural networks,
11449
09:31:51,520 --> 09:31:55,520
that really we just have one network here that has these three inputs
11450
09:31:55,520 --> 09:32:00,040
and these three weights corresponding to this one output value.
11451
09:32:00,040 --> 09:32:02,440
And the same thing is true for this output value.
11452
09:32:02,440 --> 09:32:06,040
This output value effectively defines yet another neural network
11453
09:32:06,040 --> 09:32:09,600
that has these same three inputs, but a different set of weights
11454
09:32:09,600 --> 09:32:11,160
that correspond to this output.
11455
09:32:11,160 --> 09:32:14,200
And likewise, this output has its own set of weights as well,
11456
09:32:14,200 --> 09:32:17,080
and same thing for the fourth output too.
11457
09:32:17,080 --> 09:32:20,760
And so if you wanted to train a neural network that had four outputs instead
11458
09:32:20,760 --> 09:32:23,720
of just one, in this case where the inputs are directly
11459
09:32:23,720 --> 09:32:25,640
connected to the outputs, you could really
11460
09:32:25,640 --> 09:32:28,840
think of this as just training four independent neural networks.
11461
09:32:28,840 --> 09:32:31,040
We know what the outputs for each of these four
11462
09:32:31,040 --> 09:32:34,280
should be based on our input data, and using that data,
11463
09:32:34,280 --> 09:32:37,680
we can begin to figure out what all of these individual weights should be.
11464
09:32:37,680 --> 09:32:39,560
And maybe there's an additional step at the end
11465
09:32:39,560 --> 09:32:43,680
to make sure that we turn these values into a probability distribution such
11466
09:32:43,680 --> 09:32:46,160
that we can interpret which one is better than another
11467
09:32:46,160 --> 09:32:50,360
or more likely than another as a category or something like that.
11468
09:32:50,360 --> 09:32:53,560
So this then seems like it does a pretty good job of taking inputs
11469
09:32:53,560 --> 09:32:55,480
and trying to predict what outputs should be.
11470
09:32:55,480 --> 09:32:58,800
And we'll see some real examples of this in just a moment as well.
11471
09:32:58,800 --> 09:33:01,360
But it's important then to think about what the limitations
11472
09:33:01,360 --> 09:33:05,520
of this sort of approach is, of just taking some linear combination
11473
09:33:05,520 --> 09:33:09,200
of inputs and passing it into some sort of activation function.
11474
09:33:09,200 --> 09:33:12,440
And it turns out that when we do this in the case of binary classification,
11475
09:33:12,440 --> 09:33:16,720
trying to predict does it belong to one category or another,
11476
09:33:16,720 --> 09:33:20,240
we can only predict things that are linearly separable.
11477
09:33:20,240 --> 09:33:22,800
Because we're taking a linear combination of inputs
11478
09:33:22,800 --> 09:33:26,520
and using that to define some decision boundary or threshold,
11479
09:33:26,520 --> 09:33:29,960
then what we get is a situation where if we have this set of data,
11480
09:33:29,960 --> 09:33:35,080
we can predict a line that linearly separates the red points from the blue
11481
09:33:35,080 --> 09:33:39,840
points, but a single unit that is making a binary classification, otherwise
11482
09:33:39,840 --> 09:33:44,880
known as a perceptron, can't deal with a situation like this. We've
11483
09:33:44,880 --> 09:33:48,400
seen this type of situation before, where there is no straight line
11484
09:33:48,400 --> 09:33:51,540
through the data that will divide the red points away
11485
09:33:51,540 --> 09:33:52,680
from the blue points.
11486
09:33:52,680 --> 09:33:55,000
It's a more complex decision boundary.
11487
09:33:55,000 --> 09:33:58,280
The decision boundary somehow needs to capture the things inside of this
11488
09:33:58,280 --> 09:33:59,280
circle.
11489
09:33:59,280 --> 09:34:03,160
And there isn't really a line that will allow us to deal with that.
11490
09:34:03,160 --> 09:34:05,640
So this is the limitation of the perceptron,
11491
09:34:05,640 --> 09:34:08,800
these units that just make these binary decisions based on their inputs,
11492
09:34:08,800 --> 09:34:12,520
that a single perceptron is only capable of learning
11493
09:34:12,520 --> 09:34:15,280
a linearly separable decision boundary.
11494
09:34:15,280 --> 09:34:17,480
All it can do is define a line.
11495
09:34:17,480 --> 09:34:19,440
And sure, it can give us probabilities based
11496
09:34:19,440 --> 09:34:21,880
on how close to that decision boundary we are,
11497
09:34:21,880 --> 09:34:26,760
but it can only really decide based on a linear decision boundary.
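The perceptron's limitation can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: a single unit with a step activation learns OR, which is linearly separable, but no setting of its weights can ever fit XOR.

```python
def perceptron_output(weights, bias, x):
    # A single unit: linear combination of inputs passed through a step function.
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else 0

def train_perceptron(data, epochs=20, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - perceptron_output(weights, bias, x)
            # Perceptron update rule: nudge weights toward the correct side.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# OR is linearly separable: a single line divides the 1s from the 0s.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_perceptron(or_data)

# XOR is not linearly separable, so the same unit can never fit it exactly.
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w2, b2 = train_perceptron(xor_data)
```

After training, the OR perceptron classifies every point correctly, while the XOR perceptron always gets at least one point wrong, no matter how long it trains.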
11498
09:34:26,760 --> 09:34:29,600
And so this doesn't seem like it's going to generalize well
11499
09:34:29,600 --> 09:34:32,160
to situations where real world data is involved,
11500
09:34:32,160 --> 09:34:34,880
because real world data often isn't linearly separable.
11501
09:34:34,880 --> 09:34:38,240
It often isn't the case that we can just draw a line through the data
11502
09:34:38,240 --> 09:34:41,280
and be able to divide it up into multiple groups.
11503
09:34:41,280 --> 09:34:43,320
So what then is the solution to this?
11504
09:34:43,320 --> 09:34:47,640
Well, what was proposed was the idea of a multilayer neural network,
11505
09:34:47,640 --> 09:34:49,840
that so far all of the neural networks we've seen
11506
09:34:49,840 --> 09:34:52,480
have had a set of inputs and a set of outputs,
11507
09:34:52,480 --> 09:34:55,440
and the inputs are connected to those outputs.
11508
09:34:55,440 --> 09:34:57,840
But in a multilayer neural network, this is going
11509
09:34:57,840 --> 09:35:00,800
to be an artificial neural network that has an input layer still.
11510
09:35:00,800 --> 09:35:06,200
It has an output layer, but also has one or more hidden layers in between.
11511
09:35:06,200 --> 09:35:09,320
These are other layers of artificial neurons, or units,
11512
09:35:09,320 --> 09:35:12,160
that are going to calculate their own values as well.
11513
09:35:12,160 --> 09:35:15,280
So instead of a neural network that looks like this with three inputs
11514
09:35:15,280 --> 09:35:17,800
and one output, you might imagine in the middle
11515
09:35:17,800 --> 09:35:21,520
here injecting a hidden layer, something like this.
11516
09:35:21,520 --> 09:35:23,520
This is a hidden layer that has four nodes.
11517
09:35:23,520 --> 09:35:26,760
You could choose how many nodes or units end up going into the hidden layer.
11518
09:35:26,760 --> 09:35:29,480
You can have multiple hidden layers as well.
11519
09:35:29,480 --> 09:35:33,680
And so now each of these inputs isn't directly connected to the output.
11520
09:35:33,680 --> 09:35:36,440
Each of the inputs is connected to this hidden layer.
11521
09:35:36,440 --> 09:35:38,520
And then all of the nodes in the hidden layer, those
11522
09:35:38,520 --> 09:35:41,200
are connected to the one output.
11523
09:35:41,200 --> 09:35:43,920
And so this is just another step that we can
11524
09:35:43,920 --> 09:35:46,480
take towards calculating more complex functions.
11525
09:35:46,480 --> 09:35:49,920
Each of these hidden units will calculate its output value,
11526
09:35:49,920 --> 09:35:53,920
otherwise known as its activation, based on a linear combination
11527
09:35:53,920 --> 09:35:55,320
of all the inputs.
11528
09:35:55,320 --> 09:35:57,600
And once we have values for all of these nodes,
11529
09:35:57,600 --> 09:36:00,720
as opposed to this just being the output, we do the same thing again.
11530
09:36:00,720 --> 09:36:04,240
Calculate the output for this node based on multiplying
11531
09:36:04,240 --> 09:36:07,960
each of the values for these units by their weights as well.
11532
09:36:07,960 --> 09:36:10,520
So in effect, the way this works is that we start with inputs.
11533
09:36:10,520 --> 09:36:14,120
They get multiplied by weights in order to calculate values for the hidden nodes.
11534
09:36:14,120 --> 09:36:16,880
Those get multiplied by weights in order to figure out
11535
09:36:16,880 --> 09:36:19,840
what the ultimate output is going to be.
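In code, that forward pass, inputs multiplied by weights to get hidden values, which are multiplied by weights again to get the output, might look like this. The weight values are made-up numbers, and sigmoid is one possible choice of activation function:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each unit takes a linear combination of all inputs, then an activation.
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

# Hypothetical weights for a 3-input, 4-hidden-unit, 1-output network.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2],
            [-0.1, 0.8, 0.4], [0.3, -0.6, 0.7]]
hidden_b = [0.0, 0.1, -0.1, 0.2]
output_w = [[0.6, -0.3, 0.5, 0.2]]
output_b = [0.0]

inputs = [1.0, 0.5, -1.0]
hidden = layer(inputs, hidden_w, hidden_b)   # values for the four hidden units
output = layer(hidden, output_w, output_b)   # output computed from the hidden values
```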
11536
09:36:19,840 --> 09:36:22,360
And the advantage of layering things like this
11537
09:36:22,360 --> 09:36:25,640
is it gives us an ability to model more complex functions,
11538
09:36:25,640 --> 09:36:29,560
that instead of just having a single decision boundary, a single line
11539
09:36:29,560 --> 09:36:33,600
dividing the red points from the blue points, each of these hidden nodes
11540
09:36:33,600 --> 09:36:35,960
can learn a different decision boundary.
11541
09:36:35,960 --> 09:36:37,840
And we can combine those decision boundaries
11542
09:36:37,840 --> 09:36:41,000
to figure out what the ultimate output is going to be.
11543
09:36:41,000 --> 09:36:43,480
And as we begin to imagine more complex situations,
11544
09:36:43,480 --> 09:36:47,200
you could imagine each of these nodes learning some useful property
11545
09:36:47,200 --> 09:36:50,560
or learning some useful feature of all of the inputs
11546
09:36:50,560 --> 09:36:53,440
and us somehow learning how to combine those features together
11547
09:36:53,440 --> 09:36:56,320
in order to get the output that we actually want.
11548
09:36:56,320 --> 09:36:59,120
Now, the natural question when we begin to look at this now
11549
09:36:59,120 --> 09:37:02,160
is to ask the question of, how do we train a neural network that
11550
09:37:02,160 --> 09:37:04,440
has hidden layers inside of it?
11551
09:37:04,440 --> 09:37:07,120
And this turns out to initially be a bit of a tricky question,
11552
09:37:07,120 --> 09:37:10,440
because in the input data that we are given, we
11553
09:37:10,440 --> 09:37:13,520
are given values for all of the inputs, and we're
11554
09:37:13,520 --> 09:37:16,960
given what the value of the output should be, what the category is,
11555
09:37:16,960 --> 09:37:18,120
for example.
11556
09:37:18,120 --> 09:37:22,160
But the input data doesn't tell us what the values for all of these nodes
11557
09:37:22,160 --> 09:37:22,880
should be.
11558
09:37:22,880 --> 09:37:26,520
So we don't know how far off each of these nodes actually
11559
09:37:26,520 --> 09:37:29,760
is because we're only given data for the inputs and the outputs.
11560
09:37:29,760 --> 09:37:31,640
The reason this is called the hidden layer
11561
09:37:31,640 --> 09:37:34,040
is because the data that is made available to us
11562
09:37:34,040 --> 09:37:38,200
doesn't tell us what the values for all of these intermediate nodes
11563
09:37:38,200 --> 09:37:39,760
should actually be.
11564
09:37:39,760 --> 09:37:42,160
And so the strategy people came up with was
11565
09:37:42,160 --> 09:37:48,120
to say that if you know what the error or the loss is on the output node,
11566
09:37:48,120 --> 09:37:50,280
well, then based on what these weights are,
11567
09:37:50,280 --> 09:37:52,280
if one of these weights is higher than another,
11568
09:37:52,280 --> 09:37:55,120
you can calculate an estimate for how much
11569
09:37:55,120 --> 09:38:00,840
the error from this node was due to this part of the hidden layer,
11570
09:38:00,840 --> 09:38:03,280
or this part of the hidden layer, or that part,
11571
09:38:03,280 --> 09:38:05,680
based on the values of these weights, in effect saying
11572
09:38:05,680 --> 09:38:10,120
that based on the error from the output, I can backpropagate the error
11573
09:38:10,120 --> 09:38:14,240
and figure out an estimate for what the error is for each of these nodes
11574
09:38:14,240 --> 09:38:15,400
in the hidden layer as well.
11575
09:38:15,400 --> 09:38:18,480
And there's some more calculus here that we won't get into the details of,
11576
09:38:18,480 --> 09:38:21,840
but the idea of this algorithm is known as backpropagation.
11577
09:38:21,840 --> 09:38:24,040
It's an algorithm for training a neural network
11578
09:38:24,040 --> 09:38:26,200
with multiple different hidden layers.
11579
09:38:26,200 --> 09:38:28,240
And the idea for this, the pseudocode for it,
11580
09:38:28,240 --> 09:38:31,960
will be as follows, if we want to run gradient descent with backpropagation.
11581
09:38:31,960 --> 09:38:35,200
We'll start with a random choice of weights, as we did before.
11582
09:38:35,200 --> 09:38:38,680
And now we'll go ahead and repeat the training process again and again.
11583
09:38:38,680 --> 09:38:41,080
But what we're going to do each time is now
11584
09:38:41,080 --> 09:38:43,960
we're going to calculate the error for the output layer first.
11585
09:38:43,960 --> 09:38:45,920
We know the output and what it should be,
11586
09:38:45,920 --> 09:38:49,680
and we know what we calculated so we can figure out what the error there is.
11587
09:38:49,680 --> 09:38:52,340
But then we're going to repeat for every layer,
11588
09:38:52,340 --> 09:38:55,360
starting with the output layer, moving back into the hidden layer,
11589
09:38:55,360 --> 09:38:58,160
then the hidden layer before that if there are multiple hidden layers,
11590
09:38:58,160 --> 09:39:00,480
going back all the way to the very first hidden layer,
11591
09:39:00,480 --> 09:39:05,000
assuming there are multiple, we're going to propagate the error back one layer.
11592
09:39:05,000 --> 09:39:07,280
Whatever the error was from the output, figure out
11593
09:39:07,280 --> 09:39:09,200
what the error should be a layer before that
11594
09:39:09,200 --> 09:39:11,800
based on what the values of those weights are.
11595
09:39:11,800 --> 09:39:14,900
And then we can update those weights.
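That pseudocode can be sketched concretely for a network with one hidden layer, trained on XOR, the kind of function a single perceptron can't learn. The layer size, learning rate, epoch count, squared-error loss, and sigmoid activations are illustrative choices, not details from the lecture:

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# XOR is not linearly separable, so a hidden layer is required.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

random.seed(0)
n_hidden = 4
# Start with a random choice of weights, as in the pseudocode.
wh = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
bh = [random.uniform(-1, 1) for _ in range(n_hidden)]
wo = [random.uniform(-1, 1) for _ in range(n_hidden)]
bo = random.uniform(-1, 1)

def forward(x):
    # Inputs -> hidden layer -> output.
    h = [sigmoid(bh[j] + sum(wh[j][i] * x[i] for i in range(2)))
         for j in range(n_hidden)]
    return h, sigmoid(bo + sum(wo[j] * h[j] for j in range(n_hidden)))

lr = 0.5
for _ in range(10000):
    for x, t in data:
        h, o = forward(x)
        # Error at the output layer (squared-error loss, sigmoid derivative).
        d_o = (o - t) * o * (1 - o)
        # Propagate the error back one layer, weighted by the output weights.
        d_h = [d_o * wo[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
        # Gradient descent update for every weight in the network.
        for j in range(n_hidden):
            wo[j] -= lr * d_o * h[j]
            for i in range(2):
                wh[j][i] -= lr * d_h[j] * x[i]
            bh[j] -= lr * d_h[j]
        bo -= lr * d_o

def predict(x):
    return forward(x)[1]
```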
11596
09:39:14,900 --> 09:39:17,000
So graphically, the way you might think about this
11597
09:39:17,000 --> 09:39:18,720
is that we first start with the output.
11598
09:39:18,720 --> 09:39:20,360
We know what the output should be.
11599
09:39:20,360 --> 09:39:22,240
We know what output we calculated.
11600
09:39:22,240 --> 09:39:23,720
And based on that, we can figure out, all right,
11601
09:39:23,720 --> 09:39:25,280
how do we need to update those weights?
11602
09:39:25,280 --> 09:39:28,600
Backpropagating the error to these nodes.
11603
09:39:28,600 --> 09:39:31,560
And using that, we can figure out how we should update these weights.
11604
09:39:31,560 --> 09:39:33,480
And you might imagine if there are multiple layers,
11605
09:39:33,480 --> 09:39:35,760
we could repeat this process again and again
11606
09:39:35,760 --> 09:39:39,600
to begin to figure out how all of these weights should be updated.
11607
09:39:39,600 --> 09:39:41,520
And this backpropagation algorithm is really
11608
09:39:41,520 --> 09:39:44,360
the key algorithm that makes neural networks possible.
11609
09:39:44,360 --> 09:39:47,840
It makes it possible to take these multi-level structures
11610
09:39:47,840 --> 09:39:50,240
and be able to train those structures depending
11611
09:39:50,240 --> 09:39:52,800
on what the values of these weights are in order
11612
09:39:52,800 --> 09:39:56,360
to figure out how it is that we should go about updating those weights in
11613
09:39:56,360 --> 09:39:59,520
order to create some function that is able to minimize
11614
09:39:59,520 --> 09:40:02,800
the total amount of loss, to figure out some good setting of the weights
11615
09:40:02,800 --> 09:40:07,600
that will take the inputs and translate it into the output that we expect.
11616
09:40:07,600 --> 09:40:10,640
And this works, as we said, not just for a single hidden layer.
11617
09:40:10,640 --> 09:40:13,800
But you can imagine multiple hidden layers, where each hidden layer we just
11618
09:40:13,800 --> 09:40:17,440
define however many nodes we want, where each of the nodes in one layer,
11619
09:40:17,440 --> 09:40:19,680
we can connect to the nodes in the next layer,
11620
09:40:19,680 --> 09:40:22,160
defining more and more complex networks that
11621
09:40:22,160 --> 09:40:26,320
are able to model more and more complex types of functions.
11622
09:40:26,320 --> 09:40:30,160
And so this type of network is what we might call a deep neural network,
11623
09:40:30,160 --> 09:40:33,480
part of a larger family of deep learning algorithms,
11624
09:40:33,480 --> 09:40:34,760
if you've ever heard that term.
11625
09:40:34,760 --> 09:40:38,520
And all deep learning is about is using multiple layers
11626
09:40:38,520 --> 09:40:41,560
to be able to predict and be able to model higher level
11627
09:40:41,560 --> 09:40:44,120
features inside of the input, to be able to figure out
11628
09:40:44,120 --> 09:40:45,280
what the output should be.
11629
09:40:45,280 --> 09:40:47,520
And so a deep neural network is just a neural network
11630
09:40:47,520 --> 09:40:49,240
that has multiple of these hidden layers,
11631
09:40:49,240 --> 09:40:52,200
where we start at the input, calculate values for this layer,
11632
09:40:52,200 --> 09:40:55,640
then this layer, then this layer, and then ultimately get an output.
11633
09:40:55,640 --> 09:40:59,280
And this allows us to be able to model more and more sophisticated types
11634
09:40:59,280 --> 09:41:02,600
of functions, that each of these layers can calculate something
11635
09:41:02,600 --> 09:41:05,800
a little bit different, and we can combine that information
11636
09:41:05,800 --> 09:41:08,560
to figure out what the output should be.
11637
09:41:08,560 --> 09:41:11,120
Of course, as with any situation of machine learning,
11638
09:41:11,120 --> 09:41:13,640
as we begin to make our models more and more complex,
11639
09:41:13,640 --> 09:41:17,200
to model more and more complex functions, the risk we run
11640
09:41:17,200 --> 09:41:18,960
is something like overfitting.
11641
09:41:18,960 --> 09:41:22,840
And we talked about overfitting last time in the context
11642
09:41:22,840 --> 09:41:25,480
of training our models to be
11643
09:41:25,480 --> 09:41:27,720
able to learn some sort of decision boundary,
11644
09:41:27,720 --> 09:41:31,800
where overfitting happens when we fit too closely to the training data.
11645
09:41:31,800 --> 09:41:36,160
And as a result, we don't generalize well to other situations.
11646
09:41:36,160 --> 09:41:40,280
And one of the risks we run with a far more complex neural network that
11647
09:41:40,280 --> 09:41:44,160
has many, many different nodes is that we might overfit based on the input
11648
09:41:44,160 --> 09:41:44,660
data.
11649
09:41:44,660 --> 09:41:46,800
We might grow over-reliant on certain nodes
11650
09:41:46,800 --> 09:41:49,880
to calculate things just purely based on the input data that
11651
09:41:49,880 --> 09:41:53,520
doesn't allow us to generalize very well to the output.
11652
09:41:53,520 --> 09:41:56,440
And there are a number of strategies for dealing with overfitting.
11653
09:41:56,440 --> 09:41:59,280
But one of the most popular in the context of neural networks
11654
09:41:59,280 --> 09:42:01,160
is a technique known as dropout.
11655
09:42:01,160 --> 09:42:04,440
And what dropout does is, when we're training the neural network,
11656
09:42:04,440 --> 09:42:08,000
temporarily remove units,
11657
09:42:08,000 --> 09:42:11,560
temporarily remove these artificial neurons from our network chosen at
11658
09:42:11,560 --> 09:42:12,640
random.
11659
09:42:12,640 --> 09:42:16,360
And the goal here is to prevent over-reliance on certain units.
11660
09:42:16,360 --> 09:42:18,480
What generally happens in overfitting is that we
11661
09:42:18,480 --> 09:42:21,920
begin to over-rely on certain units inside the neural network
11662
09:42:21,920 --> 09:42:24,880
to be able to tell us how to interpret the input data.
11663
09:42:24,880 --> 09:42:28,160
What dropout will do is randomly remove some of these units
11664
09:42:28,160 --> 09:42:31,520
in order to reduce the chance that we over-rely on certain units
11665
09:42:31,520 --> 09:42:35,480
to make our neural network more robust, to be able to handle the situations
11666
09:42:35,480 --> 09:42:39,360
even when we just drop out particular neurons entirely.
11667
09:42:39,360 --> 09:42:42,120
So the way that might work is we have a network like this.
11668
09:42:42,120 --> 09:42:44,240
And as we're training it, when we go about trying
11669
09:42:44,240 --> 09:42:47,120
to update the weights the first time, we'll just randomly pick
11670
09:42:47,120 --> 09:42:49,600
some percentage of the nodes to drop out of the network.
11671
09:42:49,600 --> 09:42:51,440
It's as if those nodes aren't there at all.
11672
09:42:51,440 --> 09:42:54,640
It's as if the weights associated with those nodes aren't there at all.
11673
09:42:54,640 --> 09:42:56,080
And we'll train it this way.
11674
09:42:56,080 --> 09:42:58,440
Then the next time we update the weights, we'll pick a different set
11675
09:42:58,440 --> 09:42:59,960
and just go ahead and train that way.
11676
09:42:59,960 --> 09:43:02,720
And then again, randomly choose and train with other nodes
11677
09:43:02,720 --> 09:43:04,600
that have been dropped out as well.
11678
09:43:04,600 --> 09:43:07,240
And the goal of that is that after the training process,
11679
09:43:07,240 --> 09:43:10,640
if you train by dropping out random nodes inside of this neural network,
11680
09:43:10,640 --> 09:43:13,480
you hopefully end up with a network that's a little bit more robust,
11681
09:43:13,480 --> 09:43:16,880
that doesn't rely too heavily on any one particular node,
11682
09:43:16,880 --> 09:43:21,800
but more generally learns how to approximate a function in general.
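A sketch of how dropout might be applied to one layer's activations. This uses the common "inverted dropout" scaling, which the lecture doesn't describe, and the drop probability is an illustrative choice:

```python
import random

def dropout(activations, p_drop, training):
    """Randomly zero units while training; keep everything at inference time."""
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    # "Inverted" dropout: surviving activations are scaled up by 1/keep so the
    # expected value of each unit stays the same as without dropout.
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(42)  # fixed seed so the example is reproducible
hidden = [0.8, 0.4, 0.9, 0.2, 0.7]        # activations of one hidden layer
trained = dropout(hidden, p_drop=0.5, training=True)    # some units removed
inference = dropout(hidden, p_drop=0.5, training=False) # full network used
```

Each training update would call this with a fresh random mask, so a different subset of units is dropped every time, which is exactly the "pick a different set each time" idea described above.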
11683
09:43:21,800 --> 09:43:24,040
So that then is a look at some of these techniques
11684
09:43:24,040 --> 09:43:27,400
that we can use in order to implement a neural network,
11685
09:43:27,400 --> 09:43:30,320
to get at the idea of taking this input, passing it
11686
09:43:30,320 --> 09:43:34,120
through these various different layers in order to produce some sort of output.
11687
09:43:34,120 --> 09:43:37,460
And what we'd like to do now is take those ideas and put them into code.
11688
09:43:37,460 --> 09:43:40,300
And to do that, there are a number of different machine learning libraries,
11689
09:43:40,300 --> 09:43:44,160
neural network libraries that we can use that allow us to get access
11690
09:43:44,160 --> 09:43:47,840
to someone's implementation of backpropagation and all of these hidden
11691
09:43:47,840 --> 09:43:48,440
layers.
11692
09:43:48,440 --> 09:43:52,060
And one of the most popular, developed by Google, is known as TensorFlow,
11693
09:43:52,060 --> 09:43:55,640
a library that we can use for quickly creating neural networks and modeling
11694
09:43:55,640 --> 09:43:59,520
them and running them on some sample data to see what the output is going
11695
09:43:59,520 --> 09:44:00,040
to be.
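As a preview of what that can look like, here is a minimal, hypothetical Keras model with one hidden layer. The input size, layer sizes, activations, and loss function are assumptions for illustration, not the course's actual code:

```python
import tensorflow as tf

# Four input features, one hidden layer of 8 units, one sigmoid output unit
# producing a value between 0 and 1 for binary classification.
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# TensorFlow handles backpropagation internally; we just pick an optimizer
# and a loss function, and training would then call model.fit on labeled data.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```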
11696
09:44:00,040 --> 09:44:01,840
And before we actually start writing code,
11697
09:44:01,840 --> 09:44:04,640
we'll go ahead and take a look at TensorFlow's playground, which
11698
09:44:04,640 --> 09:44:08,000
will be an opportunity for us just to play around with this idea of neural
11699
09:44:08,000 --> 09:44:10,880
networks with different layers, just to get a sense for what
11700
09:44:10,880 --> 09:44:15,200
it is that we can do by taking advantage of neural networks.
11701
09:44:15,200 --> 09:44:18,440
So let's go ahead and go into TensorFlow's playground, which
11702
09:44:18,440 --> 09:44:20,920
you can go to by visiting that URL from before.
11703
09:44:20,920 --> 09:44:24,720
And what we're going to do now is we're going to try and learn the decision
11704
09:44:24,720 --> 09:44:27,500
boundary for this particular output.
11705
09:44:27,500 --> 09:44:30,960
I want to learn to separate the orange points from the blue points.
11706
09:44:30,960 --> 09:44:34,000
And I'd like to learn some sort of setting of weights inside of a neural
11707
09:44:34,000 --> 09:44:37,840
network that will be able to separate those from each other.
11708
09:44:37,840 --> 09:44:40,200
The features we have access to, our input data,
11709
09:44:40,200 --> 09:44:44,960
are the x value and the y value, so the two values along each of the two axes.
11710
09:44:44,960 --> 09:44:47,320
And what I'll do now is I can set particular parameters,
11711
09:44:47,320 --> 09:44:50,080
like what activation function I would like to use.
11712
09:44:50,080 --> 09:44:53,960
And I'll just go ahead and press play and see what happens.
11713
09:44:53,960 --> 09:44:56,360
And what happens here is that you'll see that just
11714
09:44:56,360 --> 09:45:00,640
by using these two input features, the x value and the y value,
11715
09:45:00,640 --> 09:45:04,120
with no hidden layers in between,
11715
09:45:04,120 --> 09:45:06,240
we can try to figure out what the decision boundary is.
11717
09:45:06,240 --> 09:45:08,840
Our neural network learns pretty quickly that in order
11718
09:45:08,840 --> 09:45:11,400
to divide these two points, we should just use this line.
11719
09:45:11,400 --> 09:45:13,820
This line acts as a decision boundary that
11720
09:45:13,820 --> 09:45:16,720
separates this group of points from that group of points,
11721
09:45:16,720 --> 09:45:17,720
and it does it very well.
11722
09:45:17,720 --> 09:45:19,420
You can see up here what the loss is.
11723
09:45:19,420 --> 09:45:24,000
The training loss is 0, meaning we were able to perfectly model separating
11724
09:45:24,000 --> 09:45:27,720
these two points from each other inside of our training data.
11725
09:45:27,720 --> 09:45:30,420
So this was a fairly simple case of trying
11726
09:45:30,420 --> 09:45:33,840
to apply a neural network because the data is very clean.
11727
09:45:33,840 --> 09:45:35,880
It's very nicely linearly separable.
11728
09:45:35,880 --> 09:45:39,960
We could just draw a line that separates all of those points from each other.
11729
09:45:39,960 --> 09:45:42,160
Let's now consider a more complex case.
11730
09:45:42,160 --> 09:45:44,640
So I'll go ahead and pause the simulation,
11731
09:45:44,640 --> 09:45:47,840
and we'll go ahead and look at this data set here.
11732
09:45:47,840 --> 09:45:50,320
This data set is a little bit more complex now.
11733
09:45:50,320 --> 09:45:52,520
In this data set, we still have blue and orange points
11734
09:45:52,520 --> 09:45:54,440
that we'd like to separate from each other.
11735
09:45:54,440 --> 09:45:56,680
But there's no single line that we can draw
11736
09:45:56,680 --> 09:45:59,640
that is going to be able to figure out how to separate the blue from the orange,
11737
09:45:59,640 --> 09:46:02,720
because the blue is located in these two quadrants,
11738
09:46:02,720 --> 09:46:04,900
and the orange is located here and here.
11739
09:46:04,900 --> 09:46:07,920
It's a more complex function to be able to learn.
11740
09:46:07,920 --> 09:46:09,000
So let's see what happens.
11741
09:46:09,000 --> 09:46:13,440
If we just try and predict based on those inputs, the x and y coordinates,
11742
09:46:13,440 --> 09:46:16,800
what the output should be, I'll press Play.
11743
09:46:16,800 --> 09:46:18,640
And what you'll notice is that we're not really
11744
09:46:18,640 --> 09:46:21,800
able to draw much of a conclusion, that we're not
11745
09:46:21,800 --> 09:46:25,780
able to very cleanly see how we should divide the orange points from the blue
11746
09:46:25,780 --> 09:46:30,040
points, and you don't see a very clean separation there.
11747
09:46:30,040 --> 09:46:34,320
So it seems like we don't have enough sophistication inside of our network
11748
09:46:34,320 --> 09:46:37,080
to be able to model something that is that complex.
11749
09:46:37,080 --> 09:46:39,800
We need a better model for this neural network.
11750
09:46:39,800 --> 09:46:42,960
And I'll do that by adding a hidden layer.
11751
09:46:42,960 --> 09:46:45,960
So now I have a hidden layer that has two neurons inside of it.
11752
09:46:45,960 --> 09:46:49,080
So I have two inputs that then go to two neurons
11753
09:46:49,080 --> 09:46:52,240
inside of a hidden layer that then go to our output.
11754
09:46:52,240 --> 09:46:54,040
And now I'll press Play.
11755
09:46:54,040 --> 09:46:57,800
And what you'll notice here is that we're able to do slightly better.
11756
09:46:57,800 --> 09:47:00,680
We're able to now say, all right, these points are definitely blue.
11757
09:47:00,680 --> 09:47:02,620
These points are definitely orange.
11758
09:47:02,620 --> 09:47:05,760
We're still struggling a little bit with these points up here, though.
11759
09:47:05,760 --> 09:47:08,920
And what we can do is we can see for each of these hidden neurons,
11760
09:47:08,920 --> 09:47:11,720
what is it exactly that these hidden neurons are doing?
11761
09:47:11,720 --> 09:47:15,120
Each hidden neuron is learning its own decision boundary.
11762
09:47:15,120 --> 09:47:16,840
And we can see what that boundary is.
11763
09:47:16,840 --> 09:47:19,600
This first neuron is learning, all right,
11764
09:47:19,600 --> 09:47:22,680
this line that seems to separate some of the blue points
11765
09:47:22,680 --> 09:47:24,760
from the rest of the points.
11766
09:47:24,760 --> 09:47:27,360
This other hidden neuron is learning another line
11767
09:47:27,360 --> 09:47:29,960
that seems to be separating the orange points in the lower right
11768
09:47:29,960 --> 09:47:31,680
from the rest of the points.
11769
09:47:31,680 --> 09:47:36,480
So that's why we're able to figure out these two areas in the bottom region.
11770
09:47:36,480 --> 09:47:40,360
But we're still not able to perfectly classify all of the points.
11771
09:47:40,360 --> 09:47:42,920
So let's go ahead and add another neuron.
11772
09:47:42,920 --> 09:47:46,160
Now we've got three neurons inside of our hidden layer
11773
09:47:46,160 --> 09:47:48,520
and see what we're able to learn now.
11774
09:47:48,520 --> 09:47:50,680
All right, well, now we seem to be doing a better job.
11775
09:47:50,680 --> 09:47:53,240
By learning three different decision boundaries, one with
11776
09:47:53,240 --> 09:47:55,800
each of the three neurons inside of our hidden layer,
11777
09:47:55,800 --> 09:47:59,820
we're able to much better figure out how to separate these blue points
11778
09:47:59,820 --> 09:48:00,820
from the orange points.
11779
09:48:00,820 --> 09:48:03,600
And we can see what each of these hidden neurons is learning.
11780
09:48:03,600 --> 09:48:06,480
Each one is learning a slightly different decision boundary.
11781
09:48:06,480 --> 09:48:09,120
And then we're combining those decision boundaries together
11782
09:48:09,120 --> 09:48:11,960
to figure out what the overall output should be.
11783
09:48:11,960 --> 09:48:15,640
And then we can try it one more time by adding a fourth neuron there
11784
09:48:15,640 --> 09:48:17,120
and try learning that.
11785
09:48:17,120 --> 09:48:19,400
And it seems like now we can do even better at trying
11786
09:48:19,400 --> 09:48:21,600
to separate the blue points from the orange points.
11787
09:48:21,600 --> 09:48:24,520
But we were only able to do this by adding a hidden layer,
11788
09:48:24,520 --> 09:48:27,400
by adding some layer that is learning some other boundaries
11789
09:48:27,400 --> 09:48:30,340
and combining those boundaries to determine the output.
11790
09:48:30,340 --> 09:48:33,680
And the strength, the size and thickness, of these lines
11791
09:48:33,680 --> 09:48:37,040
indicate how high these weights are, how important each of these inputs
11792
09:48:37,040 --> 09:48:40,320
is for making this sort of calculation.
11793
09:48:40,320 --> 09:48:42,880
And we can do maybe one more simulation.
11794
09:48:42,880 --> 09:48:46,200
Let's go ahead and try this on a data set that looks like this.
11795
09:48:46,200 --> 09:48:47,960
Go ahead and get rid of the hidden layer.
11796
09:48:47,960 --> 09:48:51,040
Here now we're trying to separate the blue points from the orange points
11797
09:48:51,040 --> 09:48:53,080
where all the blue points are located, again,
11798
09:48:53,080 --> 09:48:54,960
inside of a circle effectively.
11799
09:48:54,960 --> 09:48:57,280
So we're not going to be able to learn a line.
11800
09:48:57,280 --> 09:48:58,480
Notice I press Play.
11801
09:48:58,480 --> 09:49:01,400
And we're really not able to draw any sort of classification at all
11802
09:49:01,400 --> 09:49:04,800
because there is no line that cleanly separates the blue points
11803
09:49:04,800 --> 09:49:06,920
from the orange points.
11804
09:49:06,920 --> 09:49:10,600
So let's try to solve this by introducing a hidden layer.
11805
09:49:10,600 --> 09:49:12,760
I'll go ahead and press Play.
11806
09:49:12,760 --> 09:49:14,920
And all right, with two neurons in a hidden layer,
11807
09:49:14,920 --> 09:49:17,080
we're able to do a little better because we effectively
11808
09:49:17,080 --> 09:49:18,880
learned two different decision boundaries.
11809
09:49:18,880 --> 09:49:20,640
We learned this line here.
11810
09:49:20,640 --> 09:49:23,020
And we learned this line on the right-hand side.
11811
09:49:23,020 --> 09:49:25,160
And right now we're just saying, all right, well, if it's in between,
11812
09:49:25,160 --> 09:49:25,960
we'll call it blue.
11813
09:49:25,960 --> 09:49:27,760
And if it's outside, we'll call it orange.
11814
09:49:27,760 --> 09:49:30,240
So not great, but certainly better than before,
11815
09:49:30,240 --> 09:49:33,000
that we're learning one decision boundary and another.
11816
09:49:33,000 --> 09:49:36,920
And based on those, we can figure out what the output should be.
11817
09:49:36,920 --> 09:49:42,200
But let's now go ahead and add a third neuron and see what happens now.
11818
09:49:42,200 --> 09:49:43,400
I go ahead and train it.
11819
09:49:43,400 --> 09:49:46,240
And now, using three different decision boundaries
11820
09:49:46,240 --> 09:49:48,160
that are learned by each of these hidden neurons,
11821
09:49:48,160 --> 09:49:51,040
we're able to much more accurately model this distinction
11822
09:49:51,040 --> 09:49:53,080
between blue points and orange points.
11823
09:49:53,080 --> 09:49:56,000
We're able to figure out maybe with these three decision boundaries,
11824
09:49:56,000 --> 09:49:58,800
combining them together, you can imagine figuring out
11825
09:49:58,800 --> 09:50:02,360
what the output should be and how to make that sort of classification.
11826
09:50:02,360 --> 09:50:05,720
And so the goal here is just to get a sense for having more neurons
11827
09:50:05,720 --> 09:50:09,720
in these hidden layers allows us to learn more structure in the data,
11828
09:50:09,720 --> 09:50:12,640
allows us to figure out what the relevant and important decision
11829
09:50:12,640 --> 09:50:13,600
boundaries are.
11830
09:50:13,600 --> 09:50:15,840
And then using this backpropagation algorithm,
11831
09:50:15,840 --> 09:50:18,840
we're able to figure out what the values of these weights should be
11832
09:50:18,840 --> 09:50:23,640
in order to train this network to be able to classify one category of points
11833
09:50:23,640 --> 09:50:26,360
away from another category of points instead.
11834
09:50:26,360 --> 09:50:28,280
And this is ultimately what we're going to be trying
11835
09:50:28,280 --> 09:50:32,120
to do whenever we're training a neural network.
11836
09:50:32,120 --> 09:50:34,600
So let's go ahead and actually see an example of this.
11837
09:50:34,600 --> 09:50:38,160
You'll recall from last time that we had this banknotes file
11838
09:50:38,160 --> 09:50:41,360
that included information about counterfeit banknotes as opposed
11839
09:50:41,360 --> 09:50:45,960
to authentic banknotes, where I had four different values for each banknote
11840
09:50:45,960 --> 09:50:48,920
and then a categorization of whether that banknote is considered
11841
09:50:48,920 --> 09:50:51,560
to be authentic or a counterfeit note.
11842
09:50:51,560 --> 09:50:55,160
And what I wanted to do was, based on that input information,
11843
09:50:55,160 --> 09:50:57,120
figure out some function that could calculate
11844
09:50:57,120 --> 09:51:00,480
based on the input information what category it belonged to.
11845
09:51:00,480 --> 09:51:02,840
And what I've written here in banknotes.py
11846
09:51:02,840 --> 09:51:05,760
is a neural network that will learn just that, a network that
11847
09:51:05,760 --> 09:51:08,560
learns based on all of the input whether or not
11848
09:51:08,560 --> 09:51:13,040
we should categorize a banknote as authentic or as counterfeit.
11849
09:51:13,040 --> 09:51:15,520
The first step is the same as what we saw from last time.
11850
09:51:15,520 --> 09:51:17,840
I'm really just reading the data in and getting it
11851
09:51:17,840 --> 09:51:19,320
into an appropriate format.
11852
09:51:19,320 --> 09:51:22,960
And so this is where more of the writing Python code on your own
11853
09:51:22,960 --> 09:51:25,080
comes in, in terms of manipulating this data,
11854
09:51:25,080 --> 09:51:28,320
massaging the data into a format that will be understood
11855
09:51:28,320 --> 09:51:32,200
by a machine learning library like scikit-learn or like TensorFlow.
11856
09:51:32,200 --> 09:51:35,960
And so here I separate it into a training and a testing set.
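The splitting step described here can be sketched as follows. Since the actual banknotes file isn't shown, this uses synthetic rows in its place; the 60/40 split proportion and variable names are illustrative assumptions.

```python
import csv
import random

# Synthetic stand-in for the banknotes data: 4 numeric features per note,
# plus a 0/1 label (0 = authentic, 1 = counterfeit). The lecture's code
# reads these rows from a CSV file instead.
random.seed(0)
data = [([random.random() for _ in range(4)], random.randint(0, 1))
        for _ in range(1000)]

# Hold out 40% of the rows as a testing set (an assumed proportion).
random.shuffle(data)
split = int(0.6 * len(data))
training, testing = data[:split], data[split:]

X_training = [row[0] for row in training]
y_training = [row[1] for row in training]
X_testing = [row[0] for row in testing]
y_testing = [row[1] for row in testing]

print(len(X_training), len(X_testing))
```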
11857
09:51:35,960 --> 09:51:40,280
And now what I'm doing down below is I'm creating a neural network.
11858
09:51:40,280 --> 09:51:42,760
Here I'm using TF, which stands for TensorFlow.
11859
09:51:42,760 --> 09:51:47,520
Up above, I wrote import tensorflow as tf, tf just an abbreviation that we'll
11860
09:51:47,520 --> 09:51:49,560
often use so we don't need to write out TensorFlow
11861
09:51:49,560 --> 09:51:52,840
every time we want to use anything inside of the library.
11862
09:51:52,840 --> 09:51:55,160
I'm using tf.keras.
11863
09:51:55,160 --> 09:51:57,600
Keras is an API, a set of functions that we
11864
09:51:57,600 --> 09:52:02,000
can use in order to manipulate neural networks inside of TensorFlow.
11865
09:52:02,000 --> 09:52:04,440
And it turns out there are other machine learning libraries
11866
09:52:04,440 --> 09:52:06,720
that also use the Keras API.
11867
09:52:06,720 --> 09:52:08,920
But here I'm saying, all right, go ahead and give me
11868
09:52:08,920 --> 09:52:12,480
a model that is a sequential model, a sequential neural network,
11869
09:52:12,480 --> 09:52:14,920
meaning one layer after another.
11870
09:52:14,920 --> 09:52:17,960
And now I'm going to add to that model what layers
11871
09:52:17,960 --> 09:52:20,200
I want inside of my neural network.
11872
09:52:20,200 --> 09:52:22,080
So here I'm saying model.add.
11873
09:52:22,080 --> 09:52:24,400
Go ahead and add a dense layer.
11874
09:52:24,400 --> 09:52:28,040
And when we say a dense layer, we mean a layer where each
11875
09:52:28,040 --> 09:52:30,400
of the nodes inside of the layer is going to be connected
11876
09:52:30,400 --> 09:52:32,240
to each of the nodes from the previous layer.
11877
09:52:32,240 --> 09:52:35,600
So we have a densely connected layer.
11878
09:52:35,600 --> 09:52:38,280
This layer is going to have eight units inside of it.
11879
09:52:38,280 --> 09:52:40,840
So it's going to be a hidden layer inside of a neural network
11880
09:52:40,840 --> 09:52:43,720
with eight different units, eight artificial neurons, each of which
11881
09:52:43,720 --> 09:52:45,040
might learn something different.
11882
09:52:45,040 --> 09:52:47,000
And I just sort of chose eight arbitrarily.
11883
09:52:47,000 --> 09:52:50,760
You could choose a different number of hidden nodes inside of the layer.
11884
09:52:50,760 --> 09:52:53,520
And as we saw before, depending on the number of units
11885
09:52:53,520 --> 09:52:56,480
there are inside of your hidden layer, more units
11886
09:52:56,480 --> 09:52:58,400
means you can learn more complex functions.
11887
09:52:58,400 --> 09:53:01,600
So maybe you can more accurately model the training data.
11888
09:53:01,600 --> 09:53:02,720
But it comes at a cost.
11889
09:53:02,720 --> 09:53:05,760
More units means more weights that you need to figure out how to update.
11890
09:53:05,760 --> 09:53:08,280
So it might be more expensive to do that calculation.
11891
09:53:08,280 --> 09:53:10,640
And you also run the risk of overfitting on the data.
11892
09:53:10,640 --> 09:53:13,040
If you have too many units and you learn to just
11893
09:53:13,040 --> 09:53:15,600
overfit on the training data, that's not good either.
11894
09:53:15,600 --> 09:53:16,520
So there is a balance.
11895
09:53:16,520 --> 09:53:20,160
And there's often a testing process where you'll train on some data
11896
09:53:20,160 --> 09:53:23,240
and maybe validate how well you're doing on a separate set of data,
11897
09:53:23,240 --> 09:53:26,840
often called a validation set, to see, all right, which setting of parameters.
11898
09:53:26,840 --> 09:53:28,080
How many layers should I have?
11899
09:53:28,080 --> 09:53:29,800
How many units should be in each layer?
11900
09:53:29,800 --> 09:53:32,600
Which one of those performs the best on the validation set?
11901
09:53:32,600 --> 09:53:36,680
So you can do some testing to figure out what these so-called hyperparameters
11902
09:53:36,680 --> 09:53:38,840
should be equal to.
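The validation process just described can be sketched like this. It uses scikit-learn's MLPClassifier as a stand-in for the lecture's network, with synthetic data and arbitrary candidate unit counts, since none of those specifics are given here.

```python
# Sketch of hyperparameter selection on a validation set.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple learnable rule

# Split off a validation set, separate from any final testing data.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Try several settings of one hyperparameter: units per hidden layer.
scores = {}
for units in (2, 8, 32):
    model = MLPClassifier(hidden_layer_sizes=(units,), max_iter=2000,
                          random_state=0)
    model.fit(X_train, y_train)
    scores[units] = model.score(X_val, y_val)

# Keep whichever setting performed best on the validation set.
best_units = max(scores, key=scores.get)
print(scores, best_units)
```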
11903
09:53:38,840 --> 09:53:41,480
Next, I specify what the input shape is.
11904
09:53:41,480 --> 09:53:43,480
Meaning, all right, what does my input look like?
11905
09:53:43,480 --> 09:53:44,840
My input has four values.
11906
09:53:44,840 --> 09:53:48,920
And so the input shape is just four, because we have four inputs.
11907
09:53:48,920 --> 09:53:51,240
And then I specify what the activation function is.
11908
09:53:51,240 --> 09:53:53,040
And the activation function, again, we can choose.
11909
09:53:53,040 --> 09:53:55,440
There are a number of different activation functions.
11910
09:53:55,440 --> 09:53:59,200
Here I'm using ReLU, which you might recall from earlier.
11911
09:53:59,200 --> 09:54:01,560
And then I'll add an output layer.
11912
09:54:01,560 --> 09:54:02,920
So I have my hidden layer.
11913
09:54:02,920 --> 09:54:05,960
Now I'm adding one more layer that will just have one unit,
11914
09:54:05,960 --> 09:54:07,960
because all I want to do is predict something
11915
09:54:07,960 --> 09:54:10,520
like counterfeit bill or authentic bill.
11916
09:54:10,520 --> 09:54:12,280
So I just need a single unit.
11917
09:54:12,280 --> 09:54:14,480
And the activation function I'm going to use here
11918
09:54:14,480 --> 09:54:16,840
is that sigmoid activation function, which, again,
11919
09:54:16,840 --> 09:54:20,800
was that S-shaped curve that just gave us a probability of what
11920
09:54:20,800 --> 09:54:24,160
is the probability that this is a counterfeit bill,
11921
09:54:24,160 --> 09:54:26,400
as opposed to an authentic bill.
11922
09:54:26,400 --> 09:54:29,240
So that, then, is the structure of my neural network,
11923
09:54:29,240 --> 09:54:32,880
a sequential neural network that has one hidden layer with eight units inside
11924
09:54:32,880 --> 09:54:37,040
of it, and then one output layer that just has a single unit inside of it.
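The structure just described can be sketched in Keras. The layer sizes and activations match the description above; the variable names are just illustrative.

```python
import tensorflow as tf

# A sequential network: one layer after another.
model = tf.keras.models.Sequential()

# Hidden layer: 8 densely connected units, 4 inputs, ReLU activation.
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))

# Output layer: a single unit with a sigmoid activation,
# giving a probability that the bill is counterfeit.
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()
```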
11925
09:54:37,040 --> 09:54:38,760
And I can choose how many units there are.
11926
09:54:38,760 --> 09:54:40,960
I can choose the activation function.
11927
09:54:40,960 --> 09:54:44,240
Then I'm going to compile this model.
11928
09:54:44,240 --> 09:54:48,040
TensorFlow gives you a choice of how you would like to optimize the weights.
11929
09:54:48,040 --> 09:54:50,160
There are various different algorithms for doing that.
11930
09:54:50,160 --> 09:54:52,040
You can also choose what type of loss function you want to use.
11931
09:54:52,040 --> 09:54:54,120
Again, many different options for doing that.
11932
09:54:54,120 --> 09:54:57,300
And then how I want to evaluate my model, well, I care about accuracy.
11933
09:54:57,300 --> 09:55:01,920
I care about how many of my points am I able to classify correctly
11934
09:55:01,920 --> 09:55:04,600
versus not correctly as counterfeit or not counterfeit.
11935
09:55:04,600 --> 09:55:09,920
And I would like it to report to me how accurate my model is performing.
11936
09:55:09,920 --> 09:55:12,360
Then, now that I've defined that model, I
11937
09:55:12,360 --> 09:55:15,520
call model.fit to say go ahead and train the model.
11938
09:55:15,520 --> 09:55:19,480
Train it on all the training data plus all of the training labels.
11939
09:55:19,480 --> 09:55:22,360
So labels for each of those pieces of training data.
11940
09:55:22,360 --> 09:55:25,440
And I'm saying run it for 20 epochs, meaning go ahead and go
11941
09:55:25,440 --> 09:55:28,080
through each of these training points 20 times, effectively.
11942
09:55:28,080 --> 09:55:31,480
Go through the data 20 times and keep trying to update the weights.
11943
09:55:31,480 --> 09:55:33,720
If I did it for more, I could train for even longer
11944
09:55:33,720 --> 09:55:36,040
and maybe get a more accurate result.
11945
09:55:36,040 --> 09:55:39,640
But then after I fit it on all the data, I'll go ahead and just test it.
11946
09:55:39,640 --> 09:55:43,720
I'll evaluate my model using model.evaluate built into TensorFlow
11947
09:55:43,720 --> 09:55:47,380
that is just going to tell me how well do I perform on the testing data.
11948
09:55:47,380 --> 09:55:50,420
So ultimately, this is just going to give me some numbers that tell me
11949
09:55:50,420 --> 09:55:54,320
how well we did in this particular case.
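The compile / fit / evaluate steps just described can be sketched as below, run on small synthetic data in place of the actual banknotes file. The optimizer, loss, and metric here are common choices for a binary classifier, not necessarily the exact ones in banknotes.py.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the training and testing data.
rng = np.random.default_rng(0)
X_training = rng.normal(size=(200, 4)).astype("float32")
y_training = (X_training[:, 0] > 0).astype("float32")
X_testing = rng.normal(size=(50, 4)).astype("float32")
y_testing = (X_testing[:, 0] > 0).astype("float32")

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Choose how to optimize the weights, which loss to use,
# and that we want accuracy reported.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Train: go through the training data 20 times (20 epochs).
model.fit(X_training, y_training, epochs=20, verbose=0)

# Evaluate on the held-out testing data; returns [loss, accuracy].
loss, accuracy = model.evaluate(X_testing, y_testing, verbose=0)
print(accuracy)
```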
11950
09:55:54,320 --> 09:55:57,840
So now what I'm going to do is go into banknotes and go ahead and run
11951
09:55:57,840 --> 09:55:59,240
banknotes.py.
11952
09:55:59,240 --> 09:56:02,280
And what's going to happen now is it's going to read in all of that training
11953
09:56:02,280 --> 09:56:02,880
data.
11954
09:56:02,880 --> 09:56:05,880
It's going to generate a neural network with all my inputs,
11955
09:56:05,880 --> 09:56:10,240
my eight hidden units inside my layer, and then an output unit.
11956
09:56:10,240 --> 09:56:11,880
And now what it's doing is it's training.
11957
09:56:11,880 --> 09:56:13,600
It's training 20 times.
11958
09:56:13,600 --> 09:56:17,200
And each time you can see how my accuracy is increasing on my training data.
11959
09:56:17,200 --> 09:56:20,200
It starts off the very first time not very accurate,
11960
09:56:20,200 --> 09:56:23,920
though better than random, something like 79% of the time.
11961
09:56:23,920 --> 09:56:26,880
It's able to accurately classify one bill from another.
11962
09:56:26,880 --> 09:56:29,600
But as I keep training, notice this accuracy value
11963
09:56:29,600 --> 09:56:33,320
improves and improves and improves until after I've trained through all
11964
09:56:33,320 --> 09:56:39,600
the data points 20 times, it looks like my accuracy is above 99% on the training
11965
09:56:39,600 --> 09:56:40,480
data.
11966
09:56:40,480 --> 09:56:43,840
And here's where I tested it on a whole bunch of testing data.
11967
09:56:43,840 --> 09:56:48,440
And it looks like in this case, I was also like 99.8% accurate.
11968
09:56:48,440 --> 09:56:51,200
So just using that, I was able to generate a neural network that
11969
09:56:51,200 --> 09:56:54,480
can detect counterfeit bills from authentic bills based on this input
11970
09:56:54,480 --> 09:56:59,280
data 99.8% of the time, at least based on this particular testing data.
11971
09:56:59,280 --> 09:57:01,480
And I might want to test it with more data as well,
11972
09:57:01,480 --> 09:57:03,040
just to be confident about that.
11973
09:57:03,040 --> 09:57:06,960
But this is really the value of using a machine learning library like TensorFlow.
11974
09:57:06,960 --> 09:57:10,000
And there are others available for Python and other languages as well.
11975
09:57:10,000 --> 09:57:13,520
But all I have to do is define the structure of the network
11976
09:57:13,520 --> 09:57:16,560
and define the data that I'm going to pass into the network.
11977
09:57:16,560 --> 09:57:19,840
And then TensorFlow runs the backpropagation algorithm
11978
09:57:19,840 --> 09:57:22,040
for learning what all of those weights should be,
11979
09:57:22,040 --> 09:57:24,640
for figuring out how to train this neural network to be
11980
09:57:24,640 --> 09:57:27,240
able to accurately, as accurately as possible,
11981
09:57:27,240 --> 09:57:31,920
figure out what the output values should be there as well.
11982
09:57:31,920 --> 09:57:36,400
And so this then was a look at what it is that neural networks can do just
11983
09:57:36,400 --> 09:57:39,520
using these sequences of layer after layer after layer.
11984
09:57:39,520 --> 09:57:43,240
And you can begin to imagine applying these to much more general problems.
11985
09:57:43,240 --> 09:57:45,920
And one big problem in computing and artificial intelligence
11986
09:57:45,920 --> 09:57:49,280
more generally is the problem of computer vision.
11987
09:57:49,280 --> 09:57:51,840
Computer vision is all about computational methods
11988
09:57:51,840 --> 09:57:54,600
for analyzing and understanding images.
11989
09:57:54,600 --> 09:57:57,400
You might have pictures that you want the computer to figure out
11990
09:57:57,400 --> 09:57:59,480
how to deal with, how to process those images
11991
09:57:59,480 --> 09:58:02,960
and figure out how to produce some sort of useful result out of this.
11992
09:58:02,960 --> 09:58:05,360
You've seen this in the context of social media websites
11993
09:58:05,360 --> 09:58:08,360
that are able to look at a photo that contains a whole bunch of faces.
11994
09:58:08,360 --> 09:58:10,520
And it's able to figure out what's a picture of whom
11995
09:58:10,520 --> 09:58:13,320
and label those and tag them with appropriate people.
11996
09:58:13,320 --> 09:58:15,360
This is becoming increasingly relevant as we
11997
09:58:15,360 --> 09:58:19,280
begin to discuss self-driving cars, that these cars now have cameras.
11998
09:58:19,280 --> 09:58:22,080
And we would like for the computer to have some sort of algorithm
11999
09:58:22,080 --> 09:58:26,760
that looks at the image and figures out what color is the light, what cars
12000
09:58:26,760 --> 09:58:29,200
are around us and in what direction, for example.
12001
09:58:29,200 --> 09:58:33,160
And so computer vision is all about taking an image and figuring out
12002
09:58:33,160 --> 09:58:35,600
what sort of computation, what sort of calculation
12003
09:58:35,600 --> 09:58:36,880
we can do with that image.
12004
09:58:36,880 --> 09:58:40,720
It's also relevant in the context of something like handwriting recognition.
12005
09:58:40,720 --> 09:58:43,800
This, what you're looking at, is an example of the MNIST data set.
12006
09:58:43,800 --> 09:58:46,240
It's a big data set just of handwritten digits
12007
09:58:46,240 --> 09:58:48,800
that we could use to ideally try and figure out
12008
09:58:48,800 --> 09:58:52,480
how to predict, given someone's handwriting, given a photo of a digit
12009
09:58:52,480 --> 09:58:57,120
that they have drawn, can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8,
12010
09:58:57,120 --> 09:58:58,320
or 9, for example.
12011
09:58:58,320 --> 09:59:01,080
So this sort of handwriting recognition is yet another task
12012
09:59:01,080 --> 09:59:04,280
that we might want to use computer vision tools
12013
09:59:04,280 --> 09:59:05,720
to be able to solve.
12014
09:59:05,720 --> 09:59:08,840
This might be a task that we might care about.
12015
09:59:08,840 --> 09:59:11,360
So how, then, can we use neural networks to be
12016
09:59:11,360 --> 09:59:13,080
able to solve a problem like this?
12017
09:59:13,080 --> 09:59:15,600
Well, neural networks rely upon some sort of input
12018
09:59:15,600 --> 09:59:17,600
where that input is just numerical data.
12019
09:59:17,600 --> 09:59:19,880
We have a whole bunch of units where each one of them
12020
09:59:19,880 --> 09:59:22,080
just represents some sort of number.
12021
09:59:22,080 --> 09:59:24,920
And so in the context of something like handwriting recognition
12022
09:59:24,920 --> 09:59:29,200
or in the context of just an image, you might imagine that an image is really
12023
09:59:29,200 --> 09:59:34,160
just a grid of pixels, a grid of dots where each dot has some sort of color.
12024
09:59:34,160 --> 09:59:36,800
And in the context of something like handwriting recognition,
12025
09:59:36,800 --> 09:59:39,880
you might imagine that if you just fill in each of these dots in a particular
12026
09:59:39,880 --> 09:59:42,640
way, you can generate a 2 or an 8, for example,
12027
09:59:42,640 --> 09:59:46,680
based on which dots happen to be shaded in and which dots are not.
12028
09:59:46,680 --> 09:59:50,400
And we can represent each of these pixel values just using numbers.
12029
09:59:50,400 --> 09:59:55,360
So for a particular pixel, for example, 0 might represent entirely black.
12030
09:59:55,360 --> 09:59:57,300
Depending on how you're representing color,
12031
09:59:57,300 --> 10:00:02,000
it's often common to represent color values on a 0 to 255 range
12032
10:00:02,000 --> 10:00:06,160
so that you can represent a color using 8 bits for a particular value,
12033
10:00:06,160 --> 10:00:08,400
like how much white is in the image.
12034
10:00:08,400 --> 10:00:10,920
So 0 might represent all black.
12035
10:00:10,920 --> 10:00:14,200
255 might represent entirely white as a pixel.
12036
10:00:14,200 --> 10:00:18,400
And somewhere in between might represent some shade of gray, for example.
12037
10:00:18,400 --> 10:00:20,760
But you might imagine not just having a single slider that
12038
10:00:20,760 --> 10:00:22,640
determines how much white is in the image,
12039
10:00:22,640 --> 10:00:24,760
but if you had a color image, you might imagine
12040
10:00:24,760 --> 10:00:28,080
three different numerical values, a red, green, and blue value,
12041
10:00:28,080 --> 10:00:30,760
where the red value controls how much red is in the image.
12042
10:00:30,760 --> 10:00:33,800
We have one value for controlling how much green is in the pixel
12043
10:00:33,800 --> 10:00:36,440
and one value for how much blue is in the pixel as well.
12044
10:00:36,440 --> 10:00:40,240
And depending on how it is that you set these values of red, green, and blue,
12045
10:00:40,240 --> 10:00:42,000
you can get a different color.
12046
10:00:42,000 --> 10:00:45,720
And so any pixel can really be represented, in this case,
12047
10:00:45,720 --> 10:00:50,640
by three numerical values, a red value, a green value, and a blue value.
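The pixel representations just described can be illustrated concretely. The particular values here are arbitrary examples; the point is that grayscale pixels are single 0-255 numbers and color pixels are (red, green, blue) triples.

```python
import numpy as np

# A 2x2 grayscale image: 0 is entirely black, 255 is entirely white.
grayscale = np.array([[0, 128],
                      [200, 255]], dtype=np.uint8)

# A 2x2 color image: each pixel holds three values, one per channel.
color = np.array([[[255, 0, 0], [0, 255, 0]],       # red, green
                  [[0, 0, 255], [128, 128, 128]]],  # blue, gray
                 dtype=np.uint8)

# Every color pixel is described by 3 numbers, so the grid as a whole
# is just 2 * 2 * 3 = 12 numerical values we could feed to a network.
print(color.shape, color.size)
```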
12048
10:00:50,640 --> 10:00:54,160
And if you take a whole bunch of these pixels, assemble them together
12049
10:00:54,160 --> 10:00:56,840
inside of a grid of pixels, then you really
12050
10:00:56,840 --> 10:00:59,040
just have a whole bunch of numerical values
12051
10:00:59,040 --> 10:01:03,120
that you can use in order to perform some sort of prediction task.
12052
10:01:03,120 --> 10:01:05,800
And so what you might imagine doing is using the same techniques
12053
10:01:05,800 --> 10:01:08,680
we talked about before, just design a neural network
12054
10:01:08,680 --> 10:01:12,120
with a lot of inputs, that for each of the pixels,
12055
10:01:12,120 --> 10:01:13,960
we might have one or three different inputs
12056
10:01:13,960 --> 10:01:16,840
in the case of a color image, a different input that
12057
10:01:16,840 --> 10:01:20,080
is just connected to a deep neural network, for example.
12058
10:01:20,080 --> 10:01:22,920
And this deep neural network might take all of the pixels
12059
10:01:22,920 --> 10:01:27,040
inside of the image of what digit a person drew.
12060
10:01:27,040 --> 10:01:29,160
And the output might be like 10 neurons that
12061
10:01:29,160 --> 10:01:32,360
classify it as a 0, or a 1, or a 2, or a 3,
12062
10:01:32,360 --> 10:01:36,760
or just tells us in some way what that digit happens to be.
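A deep network of this flat, pixels-in / ten-outputs-out shape might be sketched as below. The 28x28 image size (MNIST's), the hidden-layer width, and the softmax output are all illustrative assumptions, not details given here.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Turn the 28x28 grid of pixels into a flat list of 784 inputs.
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # An assumed hidden layer.
    tf.keras.layers.Dense(128, activation="relu"),
    # 10 output units: one score per digit 0 through 9.
    tf.keras.layers.Dense(10, activation="softmax"),
])

print(model.output_shape)
```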
12063
10:01:36,760 --> 10:01:39,080
Now, there are a couple of drawbacks to this approach.
12064
10:01:39,080 --> 10:01:42,680
The first drawback to the approach is just the size of this input array,
12065
10:01:42,680 --> 10:01:44,600
that we have a whole bunch of inputs.
12066
10:01:44,600 --> 10:01:47,160
If we have a big image that has a lot of different channels,
12067
10:01:47,160 --> 10:01:50,040
we're looking at a lot of inputs, and therefore a lot of weights
12068
10:01:50,040 --> 10:01:51,960
that we have to calculate.
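A quick back-of-the-envelope count shows how fast the weights add up. The image size and hidden-layer width here are assumed for illustration (a 28x28 image densely connected to 100 hidden units).

```python
inputs = 28 * 28   # one input per pixel: 784
hidden = 100       # an arbitrary hidden-layer size

# A dense layer connects every input to every hidden unit,
# plus one bias weight per hidden unit.
weights = inputs * hidden + hidden
print(weights)  # 78500

# With a color image there are three channels, tripling the inputs.
color_weights = inputs * 3 * hidden + hidden
print(color_weights)  # 235300
```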
12069
10:01:51,960 --> 10:01:55,680
And a second problem is the fact that by flattening everything
12070
10:01:55,680 --> 10:01:58,040
into just this structure of all the pixels,
12071
10:01:58,040 --> 10:02:00,760
we've lost access to a lot of the information
12072
10:02:00,760 --> 10:02:03,280
about the structure of the image that's relevant,
12073
10:02:03,280 --> 10:02:05,800
that really, when a person looks at an image,
12074
10:02:05,800 --> 10:02:08,040
they're looking at particular features of the image.
12075
10:02:08,040 --> 10:02:09,000
They're looking at curves.
12076
10:02:09,000 --> 10:02:09,880
They're looking at shapes.
12077
10:02:09,880 --> 10:02:11,720
They're looking at what things can you identify
12078
10:02:11,720 --> 10:02:14,640
in different regions of the image, and maybe put those things together
12079
10:02:14,640 --> 10:02:18,200
in order to get a better picture of what the overall image is about.
12080
10:02:18,200 --> 10:02:22,200
And by just turning it into pixel values for each of the pixels,
12081
10:02:22,200 --> 10:02:24,600
sure, you might be able to learn that structure,
12082
10:02:24,600 --> 10:02:26,520
but it might be challenging in order to do so.
12083
10:02:26,520 --> 10:02:28,880
It might be helpful to take advantage of the fact
12084
10:02:28,880 --> 10:02:31,400
that you can use properties of the image itself, the fact
12085
10:02:31,400 --> 10:02:33,660
that it's structured in a particular way, to be
12086
10:02:33,660 --> 10:02:37,400
able to improve the way that we learn based on that image too.
12087
10:02:37,400 --> 10:02:40,480
So in order to figure out how we can train our neural networks to better
12088
10:02:40,480 --> 10:02:43,760
be able to deal with images, we'll introduce a couple of ideas,
12089
10:02:43,760 --> 10:02:45,960
a couple of algorithms that we can apply that
12090
10:02:45,960 --> 10:02:50,080
allow us to take the image and extract some useful information out
12091
10:02:50,080 --> 10:02:50,880
of that image.
12092
10:02:50,880 --> 10:02:54,720
And the first idea we'll introduce is the notion of image convolution.
12093
10:02:54,720 --> 10:02:58,240
And what image convolution is all about is it's about filtering an image,
12094
10:02:58,240 --> 10:03:01,600
sort of extracting useful or relevant features out of the image.
12095
10:03:01,600 --> 10:03:05,680
And the way we do that is by applying a particular filter that
12096
10:03:05,680 --> 10:03:09,040
basically combines the value of every pixel with the values
12097
10:03:09,040 --> 10:03:11,480
of all of its neighboring pixels, according
12098
10:03:11,480 --> 10:03:14,080
to some sort of kernel matrix, which we'll see in a moment,
12099
10:03:14,080 --> 10:03:17,560
is going to allow us to weight these pixels in various different ways.
12100
10:03:17,560 --> 10:03:19,560
And the goal of image convolution, then, is
12101
10:03:19,560 --> 10:03:22,960
to extract some sort of interesting or useful features out of an image,
12102
10:03:22,960 --> 10:03:26,320
to be able to take a pixel and, based on its neighboring pixels,
12103
10:03:26,320 --> 10:03:29,200
maybe predict some sort of valuable information.
12104
10:03:29,200 --> 10:03:32,120
Something like taking a pixel and looking at its neighboring pixels,
12105
10:03:32,120 --> 10:03:33,560
you might be able to predict whether or not
12106
10:03:33,560 --> 10:03:35,360
there's some sort of curve inside the image,
12107
10:03:35,360 --> 10:03:38,440
or whether it's forming the outline of a particular line or a shape,
12108
10:03:38,440 --> 10:03:39,280
for example.
12109
10:03:39,280 --> 10:03:42,040
And that might be useful if you're trying to use
12110
10:03:42,040 --> 10:03:44,680
all of these various different features to combine them
12111
10:03:44,680 --> 10:03:48,120
to say something meaningful about an image as a whole.
12112
10:03:48,120 --> 10:03:50,080
So how, then, does image convolution work?
12113
10:03:50,080 --> 10:03:52,280
Well, we start with a kernel matrix.
12114
10:03:52,280 --> 10:03:54,440
And the kernel matrix looks something like this.
12115
10:03:54,440 --> 10:03:58,080
And the idea of this is that, given a pixel that will be the middle pixel,
12116
10:03:58,080 --> 10:04:00,960
we're going to multiply each of the neighboring pixels
12117
10:04:00,960 --> 10:04:04,440
by these values in order to get some sort of result
12118
10:04:04,440 --> 10:04:06,820
by summing up all the numbers together.
12119
10:04:06,820 --> 10:04:09,320
So if I take this kernel, which you can think of as a filter
12120
10:04:09,320 --> 10:04:13,440
that I'm going to apply to the image, and let's say that I take this image.
12121
10:04:13,440 --> 10:04:14,760
This is a 4 by 4 image.
12122
10:04:14,760 --> 10:04:16,680
We'll think of it as just a black and white image,
12123
10:04:16,680 --> 10:04:19,920
where each one is just a single pixel value.
12124
10:04:19,920 --> 10:04:22,840
So somewhere between 0 and 255, for example.
12125
10:04:22,840 --> 10:04:25,720
So we have a whole bunch of individual pixel values like this.
12126
10:04:25,720 --> 10:04:30,560
And what I'd like to do is apply this kernel, this filter, so to speak,
12127
10:04:30,560 --> 10:04:32,440
to this image.
12128
10:04:32,440 --> 10:04:35,040
And the way I'll do that is, all right, the kernel is 3 by 3.
12129
10:04:35,040 --> 10:04:38,200
You can imagine a 5 by 5 kernel or a larger kernel, too.
12130
10:04:38,200 --> 10:04:41,280
And I'll take it and just first apply it to the first 3
12131
10:04:41,280 --> 10:04:43,720
by 3 section of the image.
12132
10:04:43,720 --> 10:04:46,960
And what I'll do is I'll take each of these pixel values,
12133
10:04:46,960 --> 10:04:50,200
multiply it by its corresponding value in the filter matrix,
12134
10:04:50,200 --> 10:04:53,200
and add all of the results together.
12135
10:04:53,200 --> 10:04:59,320
So here, for example, I'll say 10 times 0, plus 20 times negative 1,
12136
10:04:59,320 --> 10:05:03,720
plus 30 times 0, so on and so forth, doing all of this calculation.
12137
10:05:03,720 --> 10:05:05,480
And at the end, if I take all these values,
12138
10:05:05,480 --> 10:05:08,240
multiply them by their corresponding value in the kernel,
12139
10:05:08,240 --> 10:05:11,680
add the results together, for this particular set of 9 pixels,
12140
10:05:11,680 --> 10:05:14,800
I get the value of 10, for example.
12141
10:05:14,800 --> 10:05:19,880
And then what I'll do is I'll slide this 3 by 3 grid, effectively, over.
12142
10:05:19,880 --> 10:05:24,520
I'll slide the kernel by 1 to look at the next 3 by 3 section.
12143
10:05:24,520 --> 10:05:26,440
Here, I'm just sliding it over by 1 pixel.
12144
10:05:26,440 --> 10:05:28,240
But you might imagine a different stride length,
12145
10:05:28,240 --> 10:05:31,040
or maybe I jump by multiple pixels at a time if you really wanted to.
12146
10:05:31,040 --> 10:05:32,400
You have different options here.
12147
10:05:32,400 --> 10:05:35,920
But here, I'm just sliding over, looking at the next 3 by 3 section.
12148
10:05:35,920 --> 10:05:40,240
And I'll do the same math, 20 times 0, plus 30 times negative 1,
12149
10:05:40,240 --> 10:05:45,240
plus 40 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5.
12150
10:05:45,240 --> 10:05:47,240
And what I end up getting is the number 20.
12151
10:05:47,240 --> 10:05:50,520
Then you can imagine shifting over to this one, doing the same thing,
12152
10:05:50,520 --> 10:05:54,320
calculating the number 40, for example, and then doing the same thing here,
12153
10:05:54,320 --> 10:05:56,920
and calculating a value there as well.
12154
10:05:56,920 --> 10:06:00,640
And so what we have now is what we'll call a feature map.
12155
10:06:00,640 --> 10:06:03,600
We have taken this kernel, applied it to each
12156
10:06:03,600 --> 10:06:06,320
of these various different regions, and what we get
12157
10:06:06,320 --> 10:06:11,240
is some representation of a filtered version of that image.
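The sliding computation just walked through can be sketched directly. The kernel (0s and -1s around a 5) and image values below are assumptions chosen to be consistent with the arithmetic in the walkthrough, which produced 10, 20, and 40 for the first three windows.

```python
import numpy as np

image = np.array([[10, 20, 30, 40],
                  [10, 20, 30, 40],
                  [20, 30, 40, 50],
                  [20, 30, 40, 50]])

kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]])

# Slide the 3x3 kernel over the image one pixel at a time (stride 1),
# multiplying each overlapping pair of values and summing the results.
h, w = image.shape
k = kernel.shape[0]
feature_map = np.zeros((h - k + 1, w - k + 1), dtype=int)
for i in range(h - k + 1):
    for j in range(w - k + 1):
        feature_map[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)

print(feature_map)  # [[10 20]
                    #  [40 50]]
```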
12158
10:06:11,240 --> 10:06:13,040
And so to give a more concrete example of why
12159
10:06:13,040 --> 10:06:14,920
it is that this kind of thing could be useful,
12160
10:06:14,920 --> 10:06:18,480
let's take this kernel matrix, for example, which is quite a famous one,
12161
10:06:18,480 --> 10:06:22,240
that has an 8 in the middle, and then all of the neighboring pixels
12162
10:06:22,240 --> 10:06:23,680
get a negative 1.
12163
10:06:23,680 --> 10:06:26,920
And let's imagine we wanted to apply that to a 3
12164
10:06:26,920 --> 10:06:31,320
by 3 part of an image that looks like this, where all the values are the same.
12165
10:06:31,320 --> 10:06:33,560
They're all 20, for instance.
12166
10:06:33,560 --> 10:06:38,160
Well, in this case, if you do 20 times 8, and then subtract 20, subtract 20,
12167
10:06:38,160 --> 10:06:40,920
subtract 20 for each of the eight neighbors, well, the result of that
12168
10:06:40,920 --> 10:06:44,680
is you just get that expression, which comes out to be 0.
12169
10:06:44,680 --> 10:06:47,200
You multiplied 20 by 8, but then you subtracted
12170
10:06:47,200 --> 10:06:50,200
20 eight times, according to that particular kernel.
12171
10:06:50,200 --> 10:06:52,400
The result of all that is just 0.
12172
10:06:52,400 --> 10:06:56,400
So the takeaway here is that when a lot of the pixels are the same value,
12173
10:06:56,400 --> 10:06:59,320
we end up getting a value close to 0.
12174
10:06:59,320 --> 10:07:02,720
If, though, we had something like this, 20 is along this first row,
12175
10:07:02,720 --> 10:07:05,720
then 50 is in the second row, and 50 is in the third row, well,
12176
10:07:05,720 --> 10:07:08,920
then when you do this, because it's the same kind of math, 20 times negative 1,
12177
10:07:08,920 --> 10:07:12,680
20 times negative 1, so on and so forth, then I get a higher value,
12178
10:07:12,680 --> 10:07:15,680
a value like 90 in this particular case.
12179
10:07:15,680 --> 10:07:21,040
And so the more general idea here is that by applying this kernel, negative 1s,
12180
10:07:21,040 --> 10:07:23,800
8 in the middle, and then negative 1s, what I get
12181
10:07:23,800 --> 10:07:29,240
is when this middle value is very different from the neighboring values,
12182
10:07:29,240 --> 10:07:31,880
like 50 is greater than these 20s, then you'll
12183
10:07:31,880 --> 10:07:34,640
end up with a value higher than 0.
12184
10:07:34,640 --> 10:07:36,760
If this number is higher than its neighbors,
12185
10:07:36,760 --> 10:07:38,280
you end up getting a bigger output.
12186
10:07:38,280 --> 10:07:41,360
But if this value is the same as all of its neighbors,
12187
10:07:41,360 --> 10:07:43,920
then you get a lower output, something like 0.
12188
10:07:43,920 --> 10:07:46,440
And it turns out that this sort of filter can therefore
12189
10:07:46,440 --> 10:07:49,720
be used in something like detecting edges in an image.
12190
10:07:49,720 --> 10:07:53,120
Or if I want to detect the boundaries between various different objects
12191
10:07:53,120 --> 10:07:54,160
inside of an image.
12192
10:07:54,160 --> 10:07:57,200
I might use a filter like this, which is able to tell
12193
10:07:57,200 --> 10:08:00,000
whether the value of this pixel is different
12194
10:08:00,000 --> 10:08:02,080
from the values of the neighboring pixels,
12195
10:08:02,080 --> 10:08:06,800
if it's greater than the values of the pixels that happen to surround it.
12196
10:08:06,800 --> 10:08:09,480
And so we can use this in terms of image filtering.
12197
10:08:09,480 --> 10:08:11,520
And so I'll show you an example of that.
12198
10:08:11,520 --> 10:08:17,680
I have here in filter.py a file that uses the Python Imaging Library,
12199
10:08:17,680 --> 10:08:21,400
or PIL, to do some image filtering.
12200
10:08:21,400 --> 10:08:23,080
I go ahead and open an image.
12201
10:08:23,080 --> 10:08:26,840
And then all I'm going to do is apply a kernel to that image.
12202
10:08:26,840 --> 10:08:30,520
It's going to be a 3 by 3 kernel, same kind of kernel we saw before.
12203
10:08:30,520 --> 10:08:31,960
And here is the kernel.
12204
10:08:31,960 --> 10:08:34,880
This is just a list representation of the same matrix
12205
10:08:34,880 --> 10:08:36,160
that I showed you a moment ago.
12206
10:08:36,160 --> 10:08:38,160
It's negative 1, negative 1, negative 1.
12207
10:08:38,160 --> 10:08:40,920
The second row is negative 1, 8, negative 1.
12208
10:08:40,920 --> 10:08:43,200
And the third row is all negative 1s.
12209
10:08:43,200 --> 10:08:47,840
And then at the end, I'm going to go ahead and show the filtered image.
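A minimal sketch of what filter.py might contain, assuming the Pillow fork of PIL; apply_edge_filter is my own name for the step, and the real file may be organized differently.

```python
from PIL import Image, ImageFilter

def apply_edge_filter(image):
    """Apply the 3x3 edge-detection kernel to an image via Pillow."""
    return image.filter(ImageFilter.Kernel(
        size=(3, 3),
        kernel=[-1, -1, -1,
                -1,  8, -1,
                -1, -1, -1],
        scale=1  # the weights sum to 0, so supply an explicit scale
    ))

# A uniform grayscale image comes out all zeros, as in the example.
demo = Image.new("L", (5, 5), color=20)
print(apply_edge_filter(demo).getpixel((2, 2)))  # 0
```

For a real image like bridge.png, you would open it with Image.open, apply the filter, and call .show() on the result.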
12210
10:08:47,840 --> 10:08:53,640
So if, for example, I go into convolution directory
12211
10:08:53,640 --> 10:08:56,600
and I open up an image, like bridge.png, this
12212
10:08:56,600 --> 10:09:02,560
is what an input image might look like, just an image of a bridge over a river.
12213
10:09:02,560 --> 10:09:07,640
Now I'm going to go ahead and run this filter program on the bridge.
12214
10:09:07,640 --> 10:09:10,080
And what I get is this image here.
12215
10:09:10,080 --> 10:09:13,280
Just by taking the original image and applying that filter
12216
10:09:13,280 --> 10:09:17,480
to each 3 by 3 grid, I've extracted all of the boundaries,
12217
10:09:17,480 --> 10:09:20,800
all of the edges inside the image that separate one part of the image
12218
10:09:20,800 --> 10:09:21,360
from another.
12219
10:09:21,360 --> 10:09:24,000
So here I've got a representation of boundaries
12220
10:09:24,000 --> 10:09:26,320
between particular parts of the image.
12221
10:09:26,320 --> 10:09:28,880
And you might imagine that if a machine learning algorithm is
12222
10:09:28,880 --> 10:09:33,120
trying to learn what an image is of, a filter like this could be pretty useful.
12223
10:09:33,120 --> 10:09:35,920
Maybe the machine learning algorithm doesn't
12224
10:09:35,920 --> 10:09:38,440
care about all of the details of the image.
12225
10:09:38,440 --> 10:09:40,480
It just cares about certain useful features.
12226
10:09:40,480 --> 10:09:42,640
It cares about particular shapes that are
12227
10:09:42,640 --> 10:09:45,160
able to help it determine that based on the image,
12228
10:09:45,160 --> 10:09:47,680
this is going to be a bridge, for example.
12229
10:09:47,680 --> 10:09:50,080
And so this type of idea of image convolution
12230
10:09:50,080 --> 10:09:55,480
can allow us to apply filters to images that allow us to extract useful results
12231
10:09:55,480 --> 10:09:59,680
out of those images, taking an image and extracting its edges, for example.
12232
10:09:59,680 --> 10:10:01,760
And you might imagine many other filters that
12233
10:10:01,760 --> 10:10:05,200
could be applied to an image that are able to extract particular values as
12234
10:10:05,200 --> 10:10:05,700
well.
12235
10:10:05,700 --> 10:10:08,880
And a filter might have separate kernels for the red values, the green values,
12236
10:10:08,880 --> 10:10:11,400
and the blue values that are all summed together at the end,
12237
10:10:11,400 --> 10:10:14,000
such that you could have particular filters looking for,
12238
10:10:14,000 --> 10:10:15,720
is there red in this part of the image?
12239
10:10:15,720 --> 10:10:17,560
Is there green in other parts of the image?
12240
10:10:17,560 --> 10:10:20,880
You can begin to assemble these relevant and useful filters
12241
10:10:20,880 --> 10:10:24,400
that are able to do these calculations as well.
12242
10:10:24,400 --> 10:10:26,840
So that then was the idea of image convolution,
12243
10:10:26,840 --> 10:10:29,400
applying some sort of filter to an image to be
12244
10:10:29,400 --> 10:10:32,760
able to extract some useful features out of that image.
12245
10:10:32,760 --> 10:10:35,840
But all the while, these images are still pretty big.
12246
10:10:35,840 --> 10:10:38,000
There's a lot of pixels involved in the image.
12247
10:10:38,000 --> 10:10:40,560
And realistically speaking, if you've got a really big image,
12248
10:10:40,560 --> 10:10:42,200
that poses a couple of problems.
12249
10:10:42,200 --> 10:10:45,080
One, it means a lot of input going into the neural network.
12250
10:10:45,080 --> 10:10:48,280
But two, it also means that we really have
12251
10:10:48,280 --> 10:10:50,600
to care about what's in each particular pixel.
12252
10:10:50,600 --> 10:10:54,200
Whereas realistically, we often, if you're looking at an image,
12253
10:10:54,200 --> 10:10:58,120
you don't care whether something is in one particular pixel versus the pixel
12254
10:10:58,120 --> 10:10:59,400
immediately to the right of it.
12255
10:10:59,400 --> 10:11:01,000
They're pretty close together.
12256
10:11:01,000 --> 10:11:03,920
You really just care about whether there's a particular feature
12257
10:11:03,920 --> 10:11:05,720
in some region of the image.
12258
10:11:05,720 --> 10:11:09,480
And maybe you don't care about exactly which pixel it happens to be in.
12259
10:11:09,480 --> 10:11:11,960
And so there's a technique we can use known as pooling.
12260
10:11:11,960 --> 10:11:15,920
And what pooling is, is it means reducing the size of an input
12261
10:11:15,920 --> 10:11:18,600
by sampling from regions inside of the input.
12262
10:11:18,600 --> 10:11:22,160
So we're going to take a big image and turn it into a smaller image
12263
10:11:22,160 --> 10:11:23,160
by using pooling.
12264
10:11:23,160 --> 10:11:25,800
And in particular, one of the most popular types of pooling
12265
10:11:25,800 --> 10:11:27,160
is called max pooling.
12266
10:11:27,160 --> 10:11:29,760
And what max pooling does is it pools just
12267
10:11:29,760 --> 10:11:33,640
by choosing the maximum value in a particular region.
12268
10:11:33,640 --> 10:11:36,800
So for example, let's imagine I had this 4 by 4 image.
12269
10:11:36,800 --> 10:11:38,640
But I wanted to reduce its dimensions.
12270
10:11:38,640 --> 10:11:42,640
I wanted to make it a smaller image so that I have fewer inputs to work with.
12271
10:11:42,640 --> 10:11:47,120
Well, what I could do is I could apply a 2 by 2 max pool,
12272
10:11:47,120 --> 10:11:50,880
where the idea would be that I'm going to first look at this 2 by 2 region
12273
10:11:50,880 --> 10:11:53,240
and say, what is the maximum value in that region?
12274
10:11:53,240 --> 10:11:54,600
Well, it's the number 50.
12275
10:11:54,600 --> 10:11:57,080
So we'll go ahead and just use the number 50.
12276
10:11:57,080 --> 10:11:58,600
And then we'll look at this 2 by 2 region.
12277
10:11:58,600 --> 10:12:00,160
What is the maximum value here?
12278
10:12:00,160 --> 10:12:02,360
It's 110, so that's going to be my value.
12279
10:12:02,360 --> 10:12:04,680
Likewise here, the maximum value looks like 20.
12280
10:12:04,680 --> 10:12:05,960
Go ahead and put that there.
12281
10:12:05,960 --> 10:12:09,200
Then for this last region, the maximum value was 40.
12282
10:12:09,200 --> 10:12:10,800
So we'll go ahead and use that.
12283
10:12:10,800 --> 10:12:14,520
And what I have now is a smaller representation
12284
10:12:14,520 --> 10:12:17,520
of this same original image that I obtained just
12285
10:12:17,520 --> 10:12:21,960
by picking the maximum value from each of these regions.
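The 2 by 2 max pooling step can be sketched in plain Python; the grid values here are hypothetical, chosen so that the four regional maxima are the 50, 110, 20, and 40 from the example.

```python
def max_pool(grid, size=2):
    """Replace each size-by-size region of the grid with its maximum value."""
    return [[max(grid[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(grid[0]), size)]
            for i in range(0, len(grid), size)]

# A 4x4 image reduced to a 2x2 image by taking regional maxima.
image = [[10,  50,  30, 110],
         [20,  40,  60, 100],
         [ 5,  20,  15,  30],
         [10,  15,  40,  35]]
print(max_pool(image))  # [[50, 110], [20, 40]]
```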
12286
10:12:21,960 --> 10:12:25,160
So again, the advantages here are now I only
12287
10:12:25,160 --> 10:12:27,960
have to deal with a 2 by 2 input instead of a 4 by 4.
12288
10:12:27,960 --> 10:12:31,160
And you can imagine shrinking the size of an image even more.
12289
10:12:31,160 --> 10:12:36,120
But in addition to that, I'm now able to make my analysis
12290
10:12:36,120 --> 10:12:40,280
independent of whether a particular value was in this pixel or this pixel.
12291
10:12:40,280 --> 10:12:42,720
I don't care if the 50 was here or here.
12292
10:12:42,720 --> 10:12:45,200
As long as it was generally in this region,
12293
10:12:45,200 --> 10:12:47,240
I'll still get access to that value.
12294
10:12:47,240 --> 10:12:51,480
So it makes our algorithms a little bit more robust as well.
12295
10:12:51,480 --> 10:12:54,520
So that then is pooling, taking the size of the image,
12296
10:12:54,520 --> 10:12:58,040
reducing it a little bit by just sampling from particular regions
12297
10:12:58,040 --> 10:12:59,520
inside of the image.
12298
10:12:59,520 --> 10:13:03,320
And now we can put all of these ideas together, pooling, image convolution,
12299
10:13:03,320 --> 10:13:06,960
and neural networks all together into another type of neural network
12300
10:13:06,960 --> 10:13:10,920
called a convolutional neural network, or a CNN, which
12301
10:13:10,920 --> 10:13:14,400
is a neural network that uses this convolution step usually
12302
10:13:14,400 --> 10:13:18,080
in the context of analyzing an image, for example.
12303
10:13:18,080 --> 10:13:20,600
And so the way that a convolutional neural network works
12304
10:13:20,600 --> 10:13:24,440
is that we start with some sort of input image, some grid of pixels.
12305
10:13:24,440 --> 10:13:27,840
But rather than immediately put that into the neural network layers
12306
10:13:27,840 --> 10:13:31,160
that we've seen before, we'll start by applying a convolution step,
12307
10:13:31,160 --> 10:13:33,440
where the convolution step involves applying
12308
10:13:33,440 --> 10:13:36,680
some number of different image filters to our original image
12309
10:13:36,680 --> 10:13:40,160
in order to get what we call a feature map, the result of applying
12310
10:13:40,160 --> 10:13:41,920
some filter to an image.
12311
10:13:41,920 --> 10:13:45,120
And we could do this once, but in general, we'll do this multiple times,
12312
10:13:45,120 --> 10:13:48,480
getting a whole bunch of different feature maps, each of which
12313
10:13:48,480 --> 10:13:51,600
might extract some different relevant feature out of the image,
12314
10:13:51,600 --> 10:13:53,920
some different important characteristic of the image
12315
10:13:53,920 --> 10:13:56,600
that we might care about using in order to calculate
12316
10:13:56,600 --> 10:13:58,160
what the result should be.
12317
10:13:58,160 --> 10:14:01,040
And in the same way that when we train neural networks,
12318
10:14:01,040 --> 10:14:04,520
we can train neural networks to learn the weights between particular units
12319
10:14:04,520 --> 10:14:07,240
inside of the neural networks, we can also train neural networks
12320
10:14:07,240 --> 10:14:09,560
to learn what those filters should be, what
12321
10:14:09,560 --> 10:14:11,840
the values of the filters should be in order
12322
10:14:11,840 --> 10:14:15,840
to get the most useful, most relevant information out of the original image
12323
10:14:15,840 --> 10:14:18,800
just by figuring out what setting of those filter values,
12324
10:14:18,800 --> 10:14:23,000
the values inside of that kernel, results in minimizing the loss function,
12325
10:14:23,000 --> 10:14:26,520
minimizing how poorly our hypothesis actually
12326
10:14:26,520 --> 10:14:30,920
performs in figuring out the classification of a particular image,
12327
10:14:30,920 --> 10:14:32,080
for example.
12328
10:14:32,080 --> 10:14:34,800
So we first apply this convolution step, get a whole bunch
12329
10:14:34,800 --> 10:14:36,760
of these various different feature maps.
12330
10:14:36,760 --> 10:14:38,720
But these feature maps are quite large.
12331
10:14:38,720 --> 10:14:41,480
There's a lot of pixel values that happen to be here.
12332
10:14:41,480 --> 10:14:44,720
And so a logical next step to take is a pooling step,
12333
10:14:44,720 --> 10:14:48,040
where we reduce the size of these images by using max pooling,
12334
10:14:48,040 --> 10:14:51,840
for example, extracting the maximum value from any particular region.
12335
10:14:51,840 --> 10:14:53,720
There are other pooling methods that exist as well,
12336
10:14:53,720 --> 10:14:54,880
depending on the situation.
12337
10:14:54,880 --> 10:14:57,040
You could use something like average pooling,
12338
10:14:57,040 --> 10:14:59,480
where instead of taking the maximum value from a region,
12339
10:14:59,480 --> 10:15:03,240
you take the average value from a region, which has its uses as well.
12340
10:15:03,240 --> 10:15:07,280
But in effect, what pooling will do is it will take these feature maps
12341
10:15:07,280 --> 10:15:09,460
and reduce their dimensions so that we end up
12342
10:15:09,460 --> 10:15:12,080
with smaller grids with fewer pixels.
12343
10:15:12,080 --> 10:15:14,320
And this then is going to be easier for us to deal with.
12344
10:15:14,320 --> 10:15:16,960
It's going to mean fewer inputs that we have to worry about.
12345
10:15:16,960 --> 10:15:19,480
And it's also going to mean we're more resilient,
12346
10:15:19,480 --> 10:15:22,560
more robust against potential movements of particular values,
12347
10:15:22,560 --> 10:15:24,680
just by one pixel, when ultimately we really
12348
10:15:24,680 --> 10:15:27,520
don't care about those one-pixel differences that
12349
10:15:27,520 --> 10:15:30,160
might arise in the original image.
12350
10:15:30,160 --> 10:15:32,120
And now, after we've done this pooling step,
12351
10:15:32,120 --> 10:15:36,500
now we have a whole bunch of values that we can then flatten out and just put
12352
10:15:36,500 --> 10:15:38,560
into a more traditional neural network.
12353
10:15:38,560 --> 10:15:40,320
So we go ahead and flatten it, and then we
12354
10:15:40,320 --> 10:15:42,240
end up with a traditional neural network that
12355
10:15:42,240 --> 10:15:46,480
has one input for each of these values in each of these resulting feature
12356
10:15:46,480 --> 10:15:51,400
maps after we do the convolution and after we do the pooling step.
12357
10:15:51,400 --> 10:15:54,720
And so this then is the general structure of a convolutional network.
12358
10:15:54,720 --> 10:15:58,200
We begin with the image, apply convolution, apply pooling,
12359
10:15:58,200 --> 10:16:01,200
flatten the results, and then put that into a more traditional neural
12360
10:16:01,200 --> 10:16:03,440
network that might itself have hidden layers.
12361
10:16:03,440 --> 10:16:05,540
You can have deep convolutional networks that
12362
10:16:05,540 --> 10:16:09,760
have hidden layers in between this flattened layer and the eventual output
12363
10:16:09,760 --> 10:16:13,360
to be able to calculate various different features of those values.
12364
10:16:13,360 --> 10:16:17,360
But this then can help us to be able to use convolution and pooling
12365
10:16:17,360 --> 10:16:19,760
to use our knowledge about the structure of an image
12366
10:16:19,760 --> 10:16:23,280
to be able to get better results, to be able to train our networks faster
12367
10:16:23,280 --> 10:16:27,320
in order to better capture particular parts of the image.
12368
10:16:27,320 --> 10:16:30,640
And there's no reason necessarily why you can only use these steps once.
12369
10:16:30,640 --> 10:16:33,520
In fact, in practice, you'll often use convolution and pooling
12370
10:16:33,520 --> 10:16:36,440
multiple times in multiple different steps.
12371
10:16:36,440 --> 10:16:39,560
So what you might imagine doing is starting with an image,
12372
10:16:39,560 --> 10:16:42,240
first applying convolution to get a whole bunch of maps,
12373
10:16:42,240 --> 10:16:45,360
then applying pooling, then applying convolution again,
12374
10:16:45,360 --> 10:16:48,000
because these maps are still pretty big.
12375
10:16:48,000 --> 10:16:51,760
You can apply convolution to try and extract relevant features out
12376
10:16:51,760 --> 10:16:55,240
of this result. Then take those results, apply pooling
12377
10:16:55,240 --> 10:16:57,820
in order to reduce their dimensions, and then take that
12378
10:16:57,820 --> 10:17:01,280
and feed it into a neural network that maybe has fewer inputs.
12379
10:17:01,280 --> 10:17:04,040
So here I have two different convolution and pooling steps.
12380
10:17:04,040 --> 10:17:08,280
I do convolution and pooling once, and then I do convolution and pooling
12381
10:17:08,280 --> 10:17:11,400
a second time, each time extracting useful features
12382
10:17:11,400 --> 10:17:14,200
from the layer before it, each time using pooling
12383
10:17:14,200 --> 10:17:17,280
to reduce the dimensions of what you're ultimately looking at.
12384
10:17:17,280 --> 10:17:21,160
And the goal now of this sort of model is that in each of these steps,
12385
10:17:21,160 --> 10:17:25,400
you can begin to learn different types of features of the original image.
12386
10:17:25,400 --> 10:17:28,400
That maybe in the first step, you learn very low level features.
12387
10:17:28,400 --> 10:17:31,940
just looking for features like edges and curves and shapes,
12388
10:17:31,940 --> 10:17:36,000
because based on pixels and their neighboring values, you can figure out,
12389
10:17:36,000 --> 10:17:37,320
all right, what are the edges?
12390
10:17:37,320 --> 10:17:38,040
What are the curves?
12391
10:17:38,040 --> 10:17:41,000
What are the various different shapes that might be present there?
12392
10:17:41,000 --> 10:17:43,760
But then once you have a mapping that just represents
12393
10:17:43,760 --> 10:17:46,520
where the edges and curves and shapes happen to be,
12394
10:17:46,520 --> 10:17:49,160
you can imagine applying the same sort of process again
12395
10:17:49,160 --> 10:17:51,760
to begin to look for higher level features, look for objects,
12396
10:17:51,760 --> 10:17:55,320
maybe look for people's eyes in facial recognition, for example.
12397
10:17:55,320 --> 10:17:59,200
Maybe look for more complex shapes like the curves on a particular number
12398
10:17:59,200 --> 10:18:02,440
if you're trying to recognize a digit in a handwriting recognition sort
12399
10:18:02,440 --> 10:18:03,620
of scenario.
12400
10:18:03,620 --> 10:18:06,680
And then after all of that, now that you have these results that
12401
10:18:06,680 --> 10:18:08,760
represent these higher level features, you
12402
10:18:08,760 --> 10:18:12,240
can pass them into a neural network, which is really just a deep neural
12403
10:18:12,240 --> 10:18:14,680
network that looks like this, where you might imagine
12404
10:18:14,680 --> 10:18:18,360
making a binary classification or classifying into multiple categories
12405
10:18:18,360 --> 10:18:23,400
or performing various different tasks on this sort of model.
12406
10:18:23,400 --> 10:18:26,600
So convolutional neural networks can be quite powerful and quite popular
12407
10:18:26,600 --> 10:18:28,800
when it comes to trying to analyze images.
12408
10:18:28,800 --> 10:18:29,920
We don't strictly need them.
12409
10:18:29,920 --> 10:18:32,320
We could have just used a vanilla neural network
12410
10:18:32,320 --> 10:18:35,640
that just operates with layer after layer, as we've seen before.
12411
10:18:35,640 --> 10:18:38,240
But these convolutional neural networks can be quite helpful,
12412
10:18:38,240 --> 10:18:40,400
in particular, because of the way they model
12413
10:18:40,400 --> 10:18:43,280
the way a human might look at an image, that instead of a human looking
12414
10:18:43,280 --> 10:18:46,440
at every single pixel simultaneously and trying to convolve all of them
12415
10:18:46,440 --> 10:18:48,560
by multiplying them together, you might imagine
12416
10:18:48,560 --> 10:18:50,920
that what convolution is really doing is looking
12417
10:18:50,920 --> 10:18:53,120
at various different regions of the image
12418
10:18:53,120 --> 10:18:56,040
and extracting relevant information and features out
12419
10:18:56,040 --> 10:18:57,600
of those parts of the image, the same way
12420
10:18:57,600 --> 10:18:59,920
that a human might have visual receptors that
12421
10:18:59,920 --> 10:19:02,240
are looking at particular parts of what they see
12422
10:19:02,240 --> 10:19:04,720
and combining them to figure out
12423
10:19:04,720 --> 10:19:09,320
what meaning they can draw from all of those various different inputs.
12424
10:19:09,320 --> 10:19:11,840
And so you might imagine applying this to a situation
12425
10:19:11,840 --> 10:19:13,760
like handwriting recognition.
12426
10:19:13,760 --> 10:19:16,200
So we'll go ahead and see an example of that now,
12427
10:19:16,200 --> 10:19:19,160
where I'll go ahead and open up handwriting.py.
12428
10:19:19,160 --> 10:19:23,040
Again, what we do here is we first import TensorFlow.
12429
10:19:23,040 --> 10:19:26,680
And then TensorFlow, it turns out, has a few data sets
12430
10:19:26,680 --> 10:19:30,360
that are built into the library that you can just immediately access.
12431
10:19:30,360 --> 10:19:33,160
And one of the most famous data sets in machine learning
12432
10:19:33,160 --> 10:19:35,360
is the MNIST data set, which is just a data
12433
10:19:35,360 --> 10:19:38,560
set of a whole bunch of samples of people's handwritten digits.
12434
10:19:38,560 --> 10:19:41,200
I showed you a slide of that a little while ago.
12435
10:19:41,200 --> 10:19:43,720
And what we can do is just immediately access
12436
10:19:43,720 --> 10:19:45,880
that data set which is built into the library
12437
10:19:45,880 --> 10:19:47,760
so that if I want to do something like train
12438
10:19:47,760 --> 10:19:50,960
on a whole bunch of handwritten digits, I can just use the data set
12439
10:19:50,960 --> 10:19:52,040
that is provided to me.
12440
10:19:52,040 --> 10:19:55,400
Of course, if I had my own data set of handwritten images,
12441
10:19:55,400 --> 10:19:56,920
I can apply the same idea.
12442
10:19:56,920 --> 10:19:59,700
I'd first just need to take those images and turn them
12443
10:19:59,700 --> 10:20:02,640
into an array of pixels, because that's the way that these
12444
10:20:02,640 --> 10:20:03,380
are going to be formatted.
12445
10:20:03,380 --> 10:20:05,240
They're going to be formatted as, effectively,
12446
10:20:05,240 --> 10:20:08,320
an array of individual pixels.
12447
10:20:08,320 --> 10:20:10,560
Now there's a bit of reshaping I need to do,
12448
10:20:10,560 --> 10:20:12,520
just turning the data into a format that I
12449
10:20:12,520 --> 10:20:14,480
can put into my convolutional neural network.
12450
10:20:14,480 --> 10:20:17,400
So this is doing things like taking all the values
12451
10:20:17,400 --> 10:20:19,200
and dividing them by 255.
12452
10:20:19,200 --> 10:20:22,840
If you remember, these color values tend to range from 0 to 255.
12453
10:20:22,840 --> 10:20:25,200
So I can divide them by 255 just to put them
12454
10:20:25,200 --> 10:20:29,560
into 0 to 1 range, which might be a little bit easier to train on.
12455
10:20:29,560 --> 10:20:32,200
And then doing various other modifications to the data
12456
10:20:32,200 --> 10:20:34,560
just to get it into a nice usable format.
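The reshaping and scaling described here might look like the following, assuming NumPy; the random array is a stand-in for the real MNIST pixel data.

```python
import numpy as np

# Pretend we have 10 grayscale images with pixel values from 0 to 255.
x = np.random.randint(0, 256, size=(10, 28, 28))

# Scale into the 0-to-1 range and add the single channel dimension
# that a convolutional layer expects: (samples, 28, 28, 1).
x = x.reshape(-1, 28, 28, 1).astype("float32") / 255
print(x.shape)  # (10, 28, 28, 1)
```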
12457
10:20:34,560 --> 10:20:37,000
But here's the interesting and important part.
12458
10:20:37,000 --> 10:20:41,200
Here is where I create the convolutional neural network, the CNN,
12459
10:20:41,200 --> 10:20:44,160
where here I'm saying, go ahead and use a sequential model.
12460
10:20:44,160 --> 10:20:47,880
And whereas before I could use model.add to say add a layer, add a layer, add a layer,
12461
10:20:47,880 --> 10:20:50,840
another way I could define it is just by passing as input
12462
10:20:50,840 --> 10:20:55,920
to this sequential neural network a list of all of the layers that I want.
12463
10:20:55,920 --> 10:21:00,120
And so here, the very first layer in my model is a convolution layer,
12464
10:21:00,120 --> 10:21:03,360
where I'm first going to apply convolution to my image.
12465
10:21:03,360 --> 10:21:05,640
I'm going to use 32 different filters.
12466
10:21:05,640 --> 10:21:09,680
So my model is going to learn 32 different filters
12467
10:21:09,680 --> 10:21:13,360
that I would like to learn on the input image, where each filter is going
12468
10:21:13,360 --> 10:21:15,120
to be a 3 by 3 kernel.
12469
10:21:15,120 --> 10:21:17,400
So we saw those 3 by 3 kernels before, where
12470
10:21:17,400 --> 10:21:20,560
we could multiply each value in a 3 by 3 grid by a value,
12471
10:21:20,560 --> 10:21:22,800
multiply it, and add all the results together.
12472
10:21:22,800 --> 10:21:27,600
So here, I'm going to learn 32 different of these 3 by 3 filters.
12473
10:21:27,600 --> 10:21:29,920
I can, again, specify my activation function.
12474
10:21:29,920 --> 10:21:32,560
And I specify what my input shape is.
12475
10:21:32,560 --> 10:21:34,880
My input shape in the banknotes case was just 4.
12476
10:21:34,880 --> 10:21:36,400
I had 4 inputs.
12477
10:21:36,400 --> 10:21:40,280
My input shape here is going to be 28, 28, 1,
12478
10:21:40,280 --> 10:21:42,920
because for each of these handwritten digits,
12479
10:21:42,920 --> 10:21:46,320
it turns out that's how the MNIST data set organizes its data.
12480
10:21:46,320 --> 10:21:49,080
Each image is a 28 by 28 pixel grid.
12481
10:21:49,080 --> 10:21:51,360
So we're going to have a 28 by 28 pixel grid.
12482
10:21:51,360 --> 10:21:54,640
And each one of those images only has one channel value.
12483
10:21:54,640 --> 10:21:56,720
These handwritten digits are just black and white.
12484
10:21:56,720 --> 10:21:59,220
So there's just a single color value representing
12485
10:21:59,220 --> 10:22:00,800
how much black or how much white.
12486
10:22:00,800 --> 10:22:02,800
You might imagine that in a color image, if you
12487
10:22:02,800 --> 10:22:05,560
were doing this sort of thing, you might have three different channels,
12488
10:22:05,560 --> 10:22:07,840
a red, a green, and a blue channel, for example.
12489
10:22:07,840 --> 10:22:09,960
But in the case of just handwriting recognition,
12490
10:22:09,960 --> 10:22:12,960
recognizing a digit, we're just going to use a single value for,
12491
10:22:12,960 --> 10:22:14,880
like, shaded in or not shaded in.
12492
10:22:14,880 --> 10:22:18,440
And it might range, but it's just a single color value.
12493
10:22:18,440 --> 10:22:22,040
And that, then, is the very first layer of our neural network,
12494
10:22:22,040 --> 10:22:24,560
a convolutional layer that will take the input
12495
10:22:24,560 --> 10:22:26,400
and learn a whole bunch of different filters
12496
10:22:26,400 --> 10:22:30,920
that we can apply to the input to extract meaningful features.
12497
10:22:30,920 --> 10:22:34,360
Next step is going to be a max pooling layer, also built right
12498
10:22:34,360 --> 10:22:37,640
into TensorFlow, where this is going to be a layer that
12499
10:22:37,640 --> 10:22:40,400
is going to use a pool size of 2 by 2, meaning
12500
10:22:40,400 --> 10:22:43,080
we're going to look at 2 by 2 regions inside of the image
12501
10:22:43,080 --> 10:22:45,160
and just extract the maximum value.
12502
10:22:45,160 --> 10:22:47,320
Again, we've seen why this can be helpful.
12503
10:22:47,320 --> 10:22:49,920
It'll help to reduce the size of our input.
12504
10:22:49,920 --> 10:22:53,120
And once we've done that, we'll go ahead and flatten all of the units
12505
10:22:53,120 --> 10:22:55,480
just into a single layer that we can then
12506
10:22:55,480 --> 10:22:57,560
pass into the rest of the neural network.
12507
10:22:57,560 --> 10:23:00,200
And now, here's the rest of the neural network.
12508
10:23:00,200 --> 10:23:02,880
Here, I'm saying, let's add a hidden layer to my neural network
12509
10:23:02,880 --> 10:23:06,160
with 128 units, so a whole bunch of hidden units
12510
10:23:06,160 --> 10:23:07,840
inside of the hidden layer.
12511
10:23:07,840 --> 10:23:11,400
And just to prevent overfitting, I can add a dropout to that.
12512
10:23:11,400 --> 10:23:14,200
Say, you know what, when you're training, randomly drop out half
12513
10:23:14,200 --> 10:23:16,520
of the nodes from this hidden layer just to make sure
12514
10:23:16,520 --> 10:23:19,440
we don't become over-reliant on any particular node,
12515
10:23:19,440 --> 10:23:22,820
we begin to really generalize and stop ourselves from overfitting.
12516
10:23:22,820 --> 10:23:25,640
So TensorFlow allows us, just by adding a single line,
12517
10:23:25,640 --> 10:23:28,920
to add dropout into our model as well, such that when it's training,
12518
10:23:28,920 --> 10:23:31,360
it will perform this dropout step in order
12519
10:23:31,360 --> 10:23:36,000
to help make sure that we don't overfit on this particular data.
12520
10:23:36,000 --> 10:23:38,760
And then finally, I add an output layer.
12521
10:23:38,760 --> 10:23:42,840
The output layer is going to have 10 units, one for each category
12522
10:23:42,840 --> 10:23:45,640
that I would like to classify digits into, so 0 through 9,
12523
10:23:45,640 --> 10:23:47,560
10 different categories.
12524
10:23:47,560 --> 10:23:49,960
And the activation function I'm going to use here
12525
10:23:49,960 --> 10:23:52,880
is called the softmax activation function.
12526
10:23:52,880 --> 10:23:55,760
And in short, what the softmax activation function is going to do
12527
10:23:55,760 --> 10:23:57,760
is it's going to take the output and turn it
12528
10:23:57,760 --> 10:23:59,600
into a probability distribution.
12529
10:23:59,600 --> 10:24:01,600
So ultimately, it's going to tell me, what
12530
10:24:01,600 --> 10:24:03,620
did we estimate the probability is that this
12531
10:24:03,620 --> 10:24:06,180
is a 2 versus a 3 versus a 4.
12532
10:24:06,180 --> 10:24:10,320
And so it will turn it into that probability distribution for me.
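Softmax itself is simple enough to sketch directly; the three scores here are made-up raw outputs, not values from a real network.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalize so the values sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0, 0.5])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # sums to 1, up to floating-point rounding
```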
12533
10:24:10,320 --> 10:24:12,680
Next up, I'll go ahead and compile my model
12534
10:24:12,680 --> 10:24:15,680
and fit it on all of my training data.
12535
10:24:15,680 --> 10:24:19,760
And then I can evaluate how well the neural network performs.
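Assembling the layers just described, a sketch of the model using TensorFlow's Keras API might look like this; the compile settings are typical choices for this kind of task, not necessarily the exact ones in handwriting.py.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Convolution: learn 32 different 3x3 filters over the input image
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(28, 28, 1)),
    # Max pooling: shrink each feature map by taking 2x2 maxima
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the feature maps into a single layer of inputs
    tf.keras.layers.Flatten(),
    # Hidden layer, with dropout to guard against overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    # Output layer: one unit per digit, softmax for probabilities
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

From here, model.fit on the training data and model.evaluate on the test data complete the pipeline.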
12536
10:24:19,760 --> 10:24:21,800
And then I've added to my Python program,
12537
10:24:21,800 --> 10:24:24,560
if I've provided a command line argument like the name of a file,
12538
10:24:24,560 --> 10:24:27,440
I'm going to go ahead and save the model to a file.
12539
10:24:27,440 --> 10:24:29,040
And so this can be quite useful too.
12540
10:24:29,040 --> 10:24:31,960
Once you've done the training step, which could take some time,
12541
10:24:31,960 --> 10:24:34,400
going through the data,
12542
10:24:34,400 --> 10:24:38,240
running back propagation with gradient descent to be able to say, all right,
12543
10:24:38,240 --> 10:24:40,720
how should we adjust the weights of this particular model?
12544
10:24:40,720 --> 10:24:42,840
You end up calculating values for these weights,
12545
10:24:42,840 --> 10:24:44,880
calculating values for these filters.
12546
10:24:44,880 --> 10:24:47,720
You'd like to remember that information so you can use it later.
12547
10:24:47,720 --> 10:24:51,480
And so TensorFlow allows us to just save a model to a file,
12548
10:24:51,480 --> 10:24:53,880
such that later, if we want to use the model we've learned,
12549
10:24:53,880 --> 10:24:57,280
use the weights that we've learned to make some sort of new prediction,
12550
10:24:57,280 --> 10:25:00,800
we can just use the model that already exists.
12551
10:25:00,800 --> 10:25:03,800
So what we're doing here is after we've done all the calculation,
12552
10:25:03,800 --> 10:25:07,320
we go ahead and save the model to a file, such
12553
10:25:07,320 --> 10:25:09,480
that we can use it a little bit later.
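In TensorFlow this is a single call, model.save(filename), with tf.keras.models.load_model(filename) to restore it later. As a minimal sketch of the underlying idea, using Python's pickle instead of TensorFlow's format (the weight and filter values are hypothetical):

```python
import os
import pickle
import tempfile

# Suppose training produced these weights and filters (made-up values).
model = {
    "weights": [[0.2, -0.5], [0.7, 0.1]],
    "filters": [[[1, 0], [0, -1]]],
}

# Save the learned values after training...
path = os.path.join(tempfile.gettempdir(), "handwriting_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and later, reload them instead of retraining from scratch.
with open(path, "rb") as f:
    restored = pickle.load(f)
```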
12554
10:25:09,480 --> 10:25:17,240
So for example, if I go into digits, I'm going to run handwriting.py.
12555
10:25:17,240 --> 10:25:18,200
I won't save it this time.
12556
10:25:18,200 --> 10:25:20,440
We'll just run it and go ahead and see what happens.
12557
10:25:20,440 --> 10:25:22,880
What will happen is we need to go through the model in order
12558
10:25:22,880 --> 10:25:26,120
to train on all of these samples of handwritten digits.
12559
10:25:26,120 --> 10:25:28,760
The MNIST data set gives us thousands and thousands
12560
10:25:28,760 --> 10:25:31,320
of sample handwritten digits in the same format
12561
10:25:31,320 --> 10:25:33,080
that we can use in order to train.
12562
10:25:33,080 --> 10:25:35,640
And so now what you're seeing is this training process.
12563
10:25:35,640 --> 10:25:39,320
And unlike the banknotes case, where there were far fewer data points
12564
10:25:39,320 --> 10:25:42,280
and the data was much simpler, here the data is more complex
12565
10:25:42,280 --> 10:25:44,280
and this training process takes time.
12566
10:25:44,280 --> 10:25:48,920
And so this is another case that shows why, when training neural networks,
12567
10:25:48,920 --> 10:25:52,280
computational power is so important: oftentimes you
12568
10:25:52,280 --> 10:25:55,440
see people wanting to use sophisticated GPUs in order
12569
10:25:55,440 --> 10:25:59,320
to do this sort of neural network training more efficiently.
12570
10:25:59,320 --> 10:26:02,120
It also speaks to the reason why more data can be helpful.
12571
10:26:02,120 --> 10:26:04,560
The more sample data points you have, the better
12572
10:26:04,560 --> 10:26:06,280
you can begin to do this training.
12573
10:26:06,280 --> 10:26:10,680
So here we're going through 60,000 different samples of handwritten digits.
12574
10:26:10,680 --> 10:26:13,120
And I said we're going to go through them 10 times.
12575
10:26:13,120 --> 10:26:16,040
We're going to go through the data set 10 times, training each time,
12576
10:26:16,040 --> 10:26:18,640
hopefully improving upon our weights with every time
12577
10:26:18,640 --> 10:26:20,080
we run through this data set.
12578
10:26:20,080 --> 10:26:23,480
And we can see over here on the right what the accuracy is each time
12579
10:26:23,480 --> 10:26:26,200
we go ahead and run this model, that the first time it
12580
10:26:26,200 --> 10:26:29,600
looks like we got an accuracy of about 92% of the digits
12581
10:26:29,600 --> 10:26:31,600
correct based on this training set.
12582
10:26:31,600 --> 10:26:34,840
We increased that to 96% or 97%.
12583
10:26:34,840 --> 10:26:38,400
And every time we run this, we're going to see hopefully the accuracy
12584
10:26:38,400 --> 10:26:41,520
improve as we continue to try and use that gradient descent,
12585
10:26:41,520 --> 10:26:43,720
that process of trying to run the algorithm,
12586
10:26:43,720 --> 10:26:46,960
to minimize the loss that we get in order to more accurately
12587
10:26:46,960 --> 10:26:49,120
predict what the output should be.
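A toy sketch of that repeated gradient descent process, fitting a single weight over 10 passes through the data (the data set, learning rate, and one-weight model are made up, just to show the loss falling each epoch the way the accuracy climbs here):

```python
# Toy version of the training loop: fit y = w * x by gradient descent,
# making 10 passes (epochs) over the data, like model.fit(..., epochs=10).
data = [(x, 3.0 * x) for x in range(1, 6)]   # true weight is 3.0
w, lr = 0.0, 0.01
losses = []
for epoch in range(10):
    total_loss = 0.0
    for x, y in data:
        pred = w * x
        total_loss += (pred - y) ** 2
        w -= lr * 2 * (pred - y) * x         # gradient of the squared error
    losses.append(total_loss)
# Each pass through the data set shrinks the loss, the same way the
# accuracy in the lecture improved with every run through the data.
```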
12588
10:26:49,120 --> 10:26:52,360
And what this process is doing is it's learning not only the weights,
12589
10:26:52,360 --> 10:26:55,320
but it's learning the features to use, the kernel matrix
12590
10:26:55,320 --> 10:26:57,560
to use when performing that convolution step.
12591
10:26:57,560 --> 10:26:59,800
Because this is a convolutional neural network,
12592
10:26:59,800 --> 10:27:02,080
where I'm first performing those convolutions
12593
10:27:02,080 --> 10:27:05,400
and then doing the more traditional neural network structure,
12594
10:27:05,400 --> 10:27:09,280
this is going to learn all of those individual steps as well.
12595
10:27:09,280 --> 10:27:12,720
And so here we see that TensorFlow provides me with some very nice output,
12596
10:27:12,720 --> 10:27:15,560
telling me how many seconds are left in each of these training
12597
10:27:15,560 --> 10:27:18,880
runs, which allows me to see just how well we're doing.
12598
10:27:18,880 --> 10:27:21,240
So we'll go ahead and see how this network performs.
12599
10:27:21,240 --> 10:27:23,760
It looks like we've gone through the data set seven times.
12600
10:27:23,760 --> 10:27:26,560
We're going through it an eighth time now.
12601
10:27:26,560 --> 10:27:28,560
And at this point, the accuracy is pretty high.
12602
10:27:28,560 --> 10:27:32,200
We saw we went from 92% up to 97%.
12603
10:27:32,200 --> 10:27:33,760
Now it looks like 98%.
12604
10:27:33,760 --> 10:27:36,440
And at this point, it seems like things are starting to level out.
12605
10:27:36,440 --> 10:27:39,360
There's probably a limit to how accurate we can ultimately be
12606
10:27:39,360 --> 10:27:41,280
without running the risk of overfitting.
12607
10:27:41,280 --> 10:27:42,600
Of course, with enough nodes, you would just
12608
10:27:42,600 --> 10:27:44,880
memorize the inputs and overfit on them.
12609
10:27:44,880 --> 10:27:46,160
But we'd like to avoid doing that.
12610
10:27:46,160 --> 10:27:48,560
And Dropout will help us with this.
12611
10:27:48,560 --> 10:27:53,920
But now we see we're almost done finishing our training step.
12612
10:27:53,920 --> 10:27:55,520
We're at 55,000.
12613
10:27:55,520 --> 10:27:56,920
All right, we finished training.
12614
10:27:56,920 --> 10:28:00,200
And now it's going to go ahead and test for us on 10,000 samples.
12615
10:28:00,200 --> 10:28:04,880
And it looks like on the testing set, we were at 98.8% accurate.
12616
10:28:04,880 --> 10:28:06,880
So we ended up doing pretty well, it seems,
12617
10:28:06,880 --> 10:28:10,280
on this testing set to see how accurately can we
12618
10:28:10,280 --> 10:28:13,320
predict these handwritten digits.
12619
10:28:13,320 --> 10:28:15,840
And so what we could do then is actually test it out.
12620
10:28:15,840 --> 10:28:19,720
I've written a program called Recognition.py using PyGame.
12621
10:28:19,720 --> 10:28:21,560
If you pass it a model that's been trained,
12622
10:28:21,560 --> 10:28:26,120
and I pre-trained an example model using this input data, what we can do
12623
10:28:26,120 --> 10:28:27,960
is see whether or not we've been able to train
12624
10:28:27,960 --> 10:28:31,720
this convolutional neural network to be able to predict handwriting,
12625
10:28:31,720 --> 10:28:32,360
for example.
12626
10:28:32,360 --> 10:28:35,320
So I can try, just like drawing a handwritten digit.
12627
10:28:35,320 --> 10:28:39,400
I'll go ahead and draw the number 2, for example.
12628
10:28:39,400 --> 10:28:40,560
So there's my number 2.
12629
10:28:40,560 --> 10:28:41,440
Again, this is messy.
12630
10:28:41,440 --> 10:28:44,320
If you tried to imagine, how would you write a program with just ifs
12631
10:28:44,320 --> 10:28:46,640
and thens to be able to do this sort of calculation,
12632
10:28:46,640 --> 10:28:48,120
it would be tricky to do so.
12633
10:28:48,120 --> 10:28:50,080
But here I'll press Classify, and all right,
12634
10:28:50,080 --> 10:28:53,600
it seems I was able to correctly classify that what I drew was the number 2.
12635
10:28:53,600 --> 10:28:55,320
I'll go ahead and reset it, try it again.
12636
10:28:55,320 --> 10:28:57,880
We'll draw an 8, for example.
12637
10:28:57,880 --> 10:29:00,480
So here is an 8.
12638
10:29:00,480 --> 10:29:01,640
Press Classify.
12639
10:29:01,640 --> 10:29:05,080
And all right, it predicts that the digit that I drew was an 8.
12640
10:29:05,080 --> 10:29:08,080
And the key here is this really begins to show the power of what
12641
10:29:08,080 --> 10:29:09,920
the neural network is doing, somehow looking
12642
10:29:09,920 --> 10:29:12,440
at various different features of these different pixels,
12643
10:29:12,440 --> 10:29:14,840
figuring out what the relevant features are,
12644
10:29:14,840 --> 10:29:17,600
and figuring out how to combine them to get a classification.
12645
10:29:17,600 --> 10:29:21,600
And this would be a difficult task to provide explicit instructions
12646
10:29:21,600 --> 10:29:24,840
to the computer on how to do, to use a whole bunch of ifs and thens
12647
10:29:24,840 --> 10:29:27,480
to process all these pixel values to figure out
12648
10:29:27,480 --> 10:29:28,920
what the handwritten digit is.
12649
10:29:28,920 --> 10:29:31,360
Everyone's going to draw their 8s a little bit differently.
12650
10:29:31,360 --> 10:29:33,920
If I drew the 8 again, it would look a little bit different.
12651
10:29:33,920 --> 10:29:37,800
And yet, ideally, we want to train a network to be robust enough
12652
10:29:37,800 --> 10:29:40,600
so that it begins to learn these patterns on its own.
12653
10:29:40,600 --> 10:29:43,200
All I said was, here is the structure of the network,
12654
10:29:43,200 --> 10:29:45,880
and here is the data on which to train the network.
12655
10:29:45,880 --> 10:29:47,880
And the network learning algorithm just tries
12656
10:29:47,880 --> 10:29:50,320
to figure out what is the optimal set of weights, what
12657
10:29:50,320 --> 10:29:52,800
is the optimal set of filters to use in order
12658
10:29:52,800 --> 10:29:57,280
to be able to accurately classify a digit into one category or another.
12659
10:29:57,280 --> 10:30:00,680
This just goes to show the power of these sorts of convolutional neural
12660
10:30:00,680 --> 10:30:02,280
networks.
12661
10:30:02,280 --> 10:30:06,560
And so that then was a look at how we can use convolutional neural networks
12662
10:30:06,560 --> 10:30:10,640
to begin to solve problems with regards to computer vision,
12663
10:30:10,640 --> 10:30:13,600
the ability to take an image and begin to analyze it.
12664
10:30:13,600 --> 10:30:15,920
So this is the type of analysis you might imagine
12665
10:30:15,920 --> 10:30:18,000
that's happening in self-driving cars that
12666
10:30:18,000 --> 10:30:21,000
are able to figure out what filters to apply to an image
12667
10:30:21,000 --> 10:30:24,040
to understand what it is that the computer is looking at,
12668
10:30:24,040 --> 10:30:26,160
or the same type of idea that might be applied
12669
10:30:26,160 --> 10:30:28,240
to facial recognition and social media to be
12670
10:30:28,240 --> 10:30:31,840
able to determine how to recognize faces in an image as well.
12671
10:30:31,840 --> 10:30:34,440
You can imagine a neural network that instead of classifying
12672
10:30:34,440 --> 10:30:38,280
into one of 10 different digits could instead classify like,
12673
10:30:38,280 --> 10:30:40,880
is this person A or is this person B, trying
12674
10:30:40,880 --> 10:30:45,000
to tell those people apart just based on convolution.
12675
10:30:45,000 --> 10:30:48,160
And so now what we'll take a look at is yet another type of neural network
12676
10:30:48,160 --> 10:30:50,520
that can be quite popular for certain types of tasks.
12677
10:30:50,520 --> 10:30:54,400
But to do so, we'll try to generalize and think about our neural network
12678
10:30:54,400 --> 10:30:55,760
a little bit more abstractly.
12679
10:30:55,760 --> 10:30:58,200
That here we have a sample deep neural network
12680
10:30:58,200 --> 10:31:01,400
where we have this input layer, a whole bunch of different hidden layers
12681
10:31:01,400 --> 10:31:04,080
that are performing certain types of calculations,
12682
10:31:04,080 --> 10:31:07,360
and then an output layer here that just generates some sort of output
12683
10:31:07,360 --> 10:31:09,600
that we care about calculating.
12684
10:31:09,600 --> 10:31:14,000
But we could imagine representing this a little more simply like this.
12685
10:31:14,000 --> 10:31:17,360
Here is just a more abstract representation of our neural network.
12686
10:31:17,360 --> 10:31:20,040
We have some input that might be like a vector
12687
10:31:20,040 --> 10:31:22,360
of a whole bunch of different values as our input.
12688
10:31:22,360 --> 10:31:25,640
That gets passed into a network that performs some sort of calculation
12689
10:31:25,640 --> 10:31:29,600
or computation, and that network produces some sort of output.
12690
10:31:29,600 --> 10:31:31,280
That output might be a single value.
12691
10:31:31,280 --> 10:31:33,120
It might be a whole bunch of different values.
12692
10:31:33,120 --> 10:31:36,040
But this is the general structure of the neural network that we've seen.
12693
10:31:36,040 --> 10:31:39,520
There is some sort of input that gets fed into the network.
12694
10:31:39,520 --> 10:31:43,440
And using that input, the network calculates what the output should be.
12695
10:31:43,440 --> 10:31:46,000
And this sort of model for a neural network
12696
10:31:46,000 --> 10:31:49,040
is what we might call a feed-forward neural network.
12697
10:31:49,040 --> 10:31:52,920
Feed-forward neural networks have connections only in one direction.
12698
10:31:52,920 --> 10:31:56,480
They move from one layer to the next layer to the layer after that,
12699
10:31:56,480 --> 10:31:59,800
such that the inputs pass through various different hidden layers
12700
10:31:59,800 --> 10:32:02,840
and then ultimately produce some sort of output.
12701
10:32:02,840 --> 10:32:05,760
So feed-forward neural networks were very helpful
12702
10:32:05,760 --> 10:32:08,640
for solving these types of classification problems that we saw before.
12703
10:32:08,640 --> 10:32:10,040
We have a whole bunch of input.
12704
10:32:10,040 --> 10:32:12,120
We want to learn what setting of weights will allow us
12705
10:32:12,120 --> 10:32:14,040
to calculate the output effectively.
12706
10:32:14,040 --> 10:32:16,560
But there are some limitations on feed-forward neural networks
12707
10:32:16,560 --> 10:32:17,680
that we'll see in a moment.
12708
10:32:17,680 --> 10:32:20,640
In particular, the input needs to be of a fixed shape,
12709
10:32:20,640 --> 10:32:23,200
like a fixed number of neurons are in the input layer.
12710
10:32:23,200 --> 10:32:24,920
And there's a fixed shape for the output,
12711
10:32:24,920 --> 10:32:28,040
like a fixed number of neurons in the output layer.
12712
10:32:28,040 --> 10:32:30,640
And that has some limitations of its own.
12713
10:32:30,640 --> 10:32:33,360
And a possible solution to this, and we'll
12714
10:32:33,360 --> 10:32:36,440
see examples of the types of problems we can solve for this in just a second,
12715
10:32:36,440 --> 10:32:38,480
is instead of just a feed-forward neural network,
12716
10:32:38,480 --> 10:32:41,840
where there are only connections in one direction from left to right
12717
10:32:41,840 --> 10:32:46,000
effectively across the network, we could also imagine a recurrent neural
12718
10:32:46,000 --> 10:32:48,720
network, one that generates
12719
10:32:48,720 --> 10:32:54,840
output that gets fed back into itself as input for future runs of that network.
12720
10:32:54,840 --> 10:32:57,080
So whereas in a traditional neural network,
12721
10:32:57,080 --> 10:33:00,920
we have inputs that get fed into the network, which then produces the output.
12722
10:33:00,920 --> 10:33:02,840
And the only thing that determines the output
12723
10:33:02,840 --> 10:33:05,560
is the original input and the calculation
12724
10:33:05,560 --> 10:33:08,040
we do inside of the network itself.
12725
10:33:08,040 --> 10:33:11,040
This goes in contrast with a recurrent neural network,
12726
10:33:11,040 --> 10:33:14,680
where in a recurrent neural network, you can imagine output from the network
12727
10:33:14,680 --> 10:33:18,320
feeding back to itself into the network again as input
12728
10:33:18,320 --> 10:33:22,360
for the next time you do the calculations inside of the network.
12729
10:33:22,360 --> 10:33:27,160
What this allows is for the network to maintain some sort of state,
12730
10:33:27,160 --> 10:33:33,080
to store some sort of information that can be used on future runs of the network.
12731
10:33:33,080 --> 10:33:35,440
Previously, the network just defined some weights,
12732
10:33:35,440 --> 10:33:38,280
and we passed inputs through the network, and it generated outputs.
12733
10:33:38,280 --> 10:33:42,000
But the network wasn't saving any information based on those inputs
12734
10:33:42,000 --> 10:33:45,400
to be able to remember for future iterations or for future runs.
12735
10:33:45,400 --> 10:33:47,320
What a recurrent neural network will let us do
12736
10:33:47,320 --> 10:33:51,040
is let the network store information that gets passed back in as input
12737
10:33:51,040 --> 10:33:55,560
to the network again the next time we try and perform some sort of action.
12738
10:33:55,560 --> 10:34:00,160
And this is particularly helpful when dealing with sequences of data.
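A minimal sketch of that idea, with a single-number state and hypothetical scalar weights standing in for a real network:

```python
import math

def rnn_step(x, hidden, w_x=0.5, w_h=0.8, bias=0.0):
    """One run of a tiny recurrent 'network': combine the new input
    with the state remembered from the previous run, squash with tanh,
    and return the new state (the scalar weights are hypothetical)."""
    return math.tanh(w_x * x + w_h * hidden + bias)

# Feed a sequence in one element at a time; the state persists
# between runs of the same network.
hidden = 0.0
for x in [1.0, 0.5, -1.0]:
    hidden = rnn_step(x, hidden)
# `hidden` now summarizes the whole sequence, not just the last input.
```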
12739
10:34:00,160 --> 10:34:02,880
So we'll see a real world example of this right now, actually.
12740
10:34:02,880 --> 10:34:07,160
Microsoft has developed an AI known as the caption bot.
12741
10:34:07,160 --> 10:34:09,440
And what the caption bot does is it says,
12742
10:34:09,440 --> 10:34:11,760
I can understand the content of any photograph,
12743
10:34:11,760 --> 10:34:13,960
and I'll try to describe it as well as any human.
12744
10:34:13,960 --> 10:34:16,280
I'll analyze your photo, but I won't store it or share it.
12745
10:34:16,280 --> 10:34:19,360
And so what Microsoft's caption bot seems to be claiming to do
12746
10:34:19,360 --> 10:34:22,880
is it can take an image and figure out what's in the image
12747
10:34:22,880 --> 10:34:25,600
and just give us a caption to describe it.
12748
10:34:25,600 --> 10:34:26,760
So let's try it out.
12749
10:34:26,760 --> 10:34:29,640
Here, for example, is an image of Harvard Square.
12750
10:34:29,640 --> 10:34:32,640
It's some people walking in front of one of the buildings at Harvard Square.
12751
10:34:32,640 --> 10:34:34,960
I'll go ahead and take the URL for that image,
12752
10:34:34,960 --> 10:34:39,000
and I'll paste it into caption bot and just press Go.
12753
10:34:39,000 --> 10:34:41,560
So caption bot is analyzing the image, and then it
12754
10:34:41,560 --> 10:34:44,760
says, I think it's a group of people walking
12755
10:34:44,760 --> 10:34:46,800
in front of a building, which seems amazing.
12756
10:34:46,800 --> 10:34:50,720
The AI is able to look at this image and figure out what's in the image.
12757
10:34:50,720 --> 10:34:52,680
And the important thing to recognize here
12758
10:34:52,680 --> 10:34:55,160
is that this is no longer just a classification task.
12759
10:34:55,160 --> 10:34:58,600
We saw being able to classify images with a convolutional neural network
12760
10:34:58,600 --> 10:35:01,800
where the job was take the image and then figure out,
12761
10:35:01,800 --> 10:35:05,920
is it a 0 or a 1 or a 2, or is it this person's face or that person's face?
12762
10:35:05,920 --> 10:35:09,320
What seems to be happening here is the input is an image,
12763
10:35:09,320 --> 10:35:12,440
and we know how to get networks to take input of images,
12764
10:35:12,440 --> 10:35:14,520
but the output is text.
12765
10:35:14,520 --> 10:35:15,240
It's a sentence.
12766
10:35:15,240 --> 10:35:19,640
It's a phrase, like a group of people walking in front of a building.
12767
10:35:19,640 --> 10:35:23,320
And this would seem to pose a challenge for our more traditional feed-forward
12768
10:35:23,320 --> 10:35:28,360
neural networks, for the reason being that in traditional neural networks,
12769
10:35:28,360 --> 10:35:31,840
we just have a fixed-size input and a fixed-size output.
12770
10:35:31,840 --> 10:35:35,160
There are a certain number of neurons in the input to our neural network
12771
10:35:35,160 --> 10:35:37,720
and a certain number of outputs for our neural network,
12772
10:35:37,720 --> 10:35:39,920
and then some calculation that goes on in between.
12773
10:35:39,920 --> 10:35:42,560
But the size of the inputs and the number of values in the input
12774
10:35:42,560 --> 10:35:44,440
and the number of values in the output, those
12775
10:35:44,440 --> 10:35:49,120
are always going to be fixed based on the structure of the neural network.
12776
10:35:49,120 --> 10:35:52,200
And that makes it difficult to imagine how a neural network could take an image
12777
10:35:52,200 --> 10:35:56,080
like this and say it's a group of people walking in front of the building
12778
10:35:56,080 --> 10:36:00,760
because the output is text, like it's a sequence of words.
12779
10:36:00,760 --> 10:36:02,800
Now, it might be possible for a neural network
12780
10:36:02,800 --> 10:36:06,880
to output one word, one word you could represent as a vector of values,
12781
10:36:06,880 --> 10:36:08,600
and you can imagine ways of doing that.
12782
10:36:08,600 --> 10:36:10,520
Next time, we'll talk a little bit more about AI
12783
10:36:10,520 --> 10:36:13,160
as it relates to language and language processing.
12784
10:36:13,160 --> 10:36:15,440
But a sequence of words is much more challenging
12785
10:36:15,440 --> 10:36:18,320
because depending on the image, you might imagine the output
12786
10:36:18,320 --> 10:36:19,800
is a different number of words.
12787
10:36:19,800 --> 10:36:22,400
We could have sequences of different lengths,
12788
10:36:22,400 --> 10:36:26,640
and somehow we still want to be able to generate the appropriate output.
12789
10:36:26,640 --> 10:36:30,560
And so the strategy here is to use a recurrent neural network,
12790
10:36:30,560 --> 10:36:34,080
a neural network that can feed its own output back into itself
12791
10:36:34,080 --> 10:36:36,200
as input for the next time.
12792
10:36:36,200 --> 10:36:40,400
And this allows us to do what we call a one-to-many relationship
12793
10:36:40,400 --> 10:36:43,960
for inputs to outputs, that in vanilla, more traditional neural networks,
12794
10:36:43,960 --> 10:36:47,080
these are what we might consider to be one-to-one neural networks.
12795
10:36:47,080 --> 10:36:49,800
You pass in one set of values as input.
12796
10:36:49,800 --> 10:36:53,240
You get one vector of values as the output.
12797
10:36:53,240 --> 10:36:56,960
But in this case, we want to pass in one value as input, the image,
12798
10:36:56,960 --> 10:36:59,560
and we want to get a sequence, many values as output,
12799
10:36:59,560 --> 10:37:02,400
where each value is like one of these words that
12800
10:37:02,400 --> 10:37:05,640
gets produced by this particular algorithm.
12801
10:37:05,640 --> 10:37:08,200
And so the way we might do this is we might imagine starting
12802
10:37:08,200 --> 10:37:11,400
by providing input, the image, into our neural network.
12803
10:37:11,400 --> 10:37:13,560
And the neural network is going to generate output,
12804
10:37:13,560 --> 10:37:16,000
but the output is not going to be the whole sequence of words,
12805
10:37:16,000 --> 10:37:18,320
because we can't represent the whole sequence of words
12806
10:37:18,320 --> 10:37:20,920
using just a fixed set of neurons.
12807
10:37:20,920 --> 10:37:24,360
Instead, the output is just going to be the first word.
12808
10:37:24,360 --> 10:37:27,800
We're going to train the network to output what the first word of the caption
12809
10:37:27,800 --> 10:37:28,080
should be.
12810
10:37:28,080 --> 10:37:30,320
And you could imagine that Microsoft has trained this
12811
10:37:30,320 --> 10:37:33,320
by running a whole bunch of training samples through the AI,
12812
10:37:33,320 --> 10:37:36,680
giving it a whole bunch of pictures and what the appropriate caption was,
12813
10:37:36,680 --> 10:37:39,800
and having the AI begin to learn from that.
12814
10:37:39,800 --> 10:37:42,080
But now, because the network generates output
12815
10:37:42,080 --> 10:37:44,280
that can be fed back into itself, you could
12816
10:37:44,280 --> 10:37:47,800
imagine the output of the network being fed back into the same network.
12817
10:37:47,800 --> 10:37:50,160
This here looks like a separate network, but it's really
12818
10:37:50,160 --> 10:37:53,440
the same network that's just getting different input,
12819
10:37:53,440 --> 10:37:57,640
that this network's output gets fed back into itself,
12820
10:37:57,640 --> 10:37:59,680
but it's going to generate another output.
12821
10:37:59,680 --> 10:38:04,160
And that other output is going to be the second word in the caption.
12822
10:38:04,160 --> 10:38:06,520
And this recurrent neural network then, this network
12823
10:38:06,520 --> 10:38:09,720
is going to generate other output that can be fed back into itself
12824
10:38:09,720 --> 10:38:12,200
to generate yet another word, fed back into itself
12825
10:38:12,200 --> 10:38:13,680
to generate another word.
12826
10:38:13,680 --> 10:38:18,200
And so recurrent neural networks allow us to represent this one-to-many
12827
10:38:18,200 --> 10:38:18,880
structure.
12828
10:38:18,880 --> 10:38:21,680
You provide one image as input, and the neural network
12829
10:38:21,680 --> 10:38:25,800
can pass data into the next run of the network, and then again and again,
12830
10:38:25,800 --> 10:38:28,240
such that you could run the network multiple times,
12831
10:38:28,240 --> 10:38:33,960
each time generating a different output still based on that original input.
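A rough sketch of that one-to-many loop; a lookup table stands in for the trained network, and the vocabulary and stop token are entirely hypothetical:

```python
# One input (the image encoding) in, a sequence of words out, produced
# by calling the SAME function again and again, each time feeding it
# the output it handed back last time.
next_word = {
    "<image>": "a",
    "a": "group",
    "group": "of",
    "of": "people",
    "people": "<stop>",
}

def caption(image_encoding, max_words=10):
    words, state = [], image_encoding
    for _ in range(max_words):
        output = next_word[state]   # one run of the "network"
        if output == "<stop>":
            break
        words.append(output)
        state = output              # output fed back in as input
    return " ".join(words)
```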
12832
10:38:33,960 --> 10:38:37,320
And this is where recurrent neural networks become particularly useful
12833
10:38:37,320 --> 10:38:40,040
when dealing with sequences of inputs or outputs.
12834
10:38:40,040 --> 10:38:43,360
Here, my output is a sequence of words, and since I can't very easily
12835
10:38:43,360 --> 10:38:45,960
represent outputting an entire sequence of words,
12836
10:38:45,960 --> 10:38:49,160
I'll instead output that sequence one word at a time
12837
10:38:49,160 --> 10:38:52,680
by allowing my network to pass information about what still
12838
10:38:52,680 --> 10:38:56,840
needs to be said about the photo into the next stage of running the network.
12839
10:38:56,840 --> 10:38:59,480
So you could run the network multiple times, the same network
12840
10:38:59,480 --> 10:39:02,960
with the same weights, just getting different input each time.
12841
10:39:02,960 --> 10:39:06,440
First, getting input from the image, and then getting input from the network
12842
10:39:06,440 --> 10:39:09,880
itself as additional information about what additionally
12843
10:39:09,880 --> 10:39:13,920
needs to be given in a particular caption, for example.
12844
10:39:13,920 --> 10:39:17,400
So this then is a one-to-many relationship inside of a recurrent neural
12845
10:39:17,400 --> 10:39:20,440
network, but it turns out there are other models that we can use,
12846
10:39:20,440 --> 10:39:23,320
other ways we can try and use recurrent neural networks
12847
10:39:23,320 --> 10:39:26,760
to be able to represent data that might be stored in other forms as well.
12848
10:39:26,760 --> 10:39:29,880
We saw how we could use neural networks in order to analyze images
12849
10:39:29,880 --> 10:39:33,200
in the context of convolutional neural networks that take an image,
12850
10:39:33,200 --> 10:39:35,280
figure out various different properties of the image,
12851
10:39:35,280 --> 10:39:38,760
and are able to draw some sort of conclusion based on that.
12852
10:39:38,760 --> 10:39:40,960
But you might imagine that something like YouTube,
12853
10:39:40,960 --> 10:39:44,080
they need to be able to do a lot of learning based on video.
12854
10:39:44,080 --> 10:39:46,920
They need to look through videos to detect if they're like copyright
12855
10:39:46,920 --> 10:39:50,160
violations, or they need to be able to look through videos to maybe identify
12856
10:39:50,160 --> 10:39:53,680
what particular items are inside of the video, for example.
12857
10:39:53,680 --> 10:39:56,680
And video, you might imagine, is much more difficult to put in
12858
10:39:56,680 --> 10:40:00,200
as input to a neural network, because whereas with an image, you could just
12859
10:40:00,200 --> 10:40:03,680
treat each pixel as a different value, videos are sequences.
12860
10:40:03,680 --> 10:40:07,760
They're sequences of images, and each sequence might be of different length.
12861
10:40:07,760 --> 10:40:10,720
And so it might be challenging to represent that entire video
12862
10:40:10,720 --> 10:40:15,320
as a single vector of values that you could pass in to a neural network.
12863
10:40:15,320 --> 10:40:17,600
And so here, too, recurrent neural networks
12864
10:40:17,600 --> 10:40:21,320
can be a valuable solution for trying to solve this type of problem.
12865
10:40:21,320 --> 10:40:25,320
Then instead of just passing in a single input into our neural network,
12866
10:40:25,320 --> 10:40:28,440
we could pass in the input one frame at a time, you might imagine.
12867
10:40:28,440 --> 10:40:32,720
First, taking the first frame of the video, passing it into the network,
12868
10:40:32,720 --> 10:40:35,520
and then maybe not having the network output anything at all yet.
12869
10:40:35,520 --> 10:40:40,120
Let it take in another input, and this time, pass it into the network.
12870
10:40:40,120 --> 10:40:43,000
But the network gets information from the last time
12871
10:40:43,000 --> 10:40:45,000
we provided an input into the network.
12872
10:40:45,000 --> 10:40:47,480
Then we pass in a third input, and then a fourth input,
12873
10:40:47,480 --> 10:40:51,200
where each time, what the network gets is it gets the most recent input,
12874
10:40:51,200 --> 10:40:53,600
like each frame of the video.
12875
10:40:53,600 --> 10:40:56,280
But it also gets information the network processed
12876
10:40:56,280 --> 10:40:58,080
from all of the previous iterations.
12877
10:40:58,080 --> 10:41:02,400
So on frame number four, you end up getting the input for frame number four
12878
10:41:02,400 --> 10:41:06,880
plus information the network has calculated from the first three frames.
12879
10:41:06,880 --> 10:41:10,000
And using all of that data combined, this recurrent neural network
12880
10:41:10,000 --> 10:41:14,160
can begin to learn how to extract patterns from a sequence of data
12881
10:41:14,160 --> 10:41:14,960
as well.
12882
10:41:14,960 --> 10:41:17,280
And so you might imagine, if you want to classify a video
12883
10:41:17,280 --> 10:41:20,040
into a number of different genres, like an educational video,
12884
10:41:20,040 --> 10:41:22,220
or a music video, or different types of videos,
12885
10:41:22,220 --> 10:41:24,400
that's a classification task, where you want
12886
10:41:24,400 --> 10:41:27,020
to take as input each of the frames of the video,
12887
10:41:27,020 --> 10:41:31,560
and you want to output something like what it is, what category
12888
10:41:31,560 --> 10:41:33,320
that it happens to belong to.
12889
10:41:33,320 --> 10:41:35,040
And you can imagine doing this sort of thing,
12890
10:41:35,040 --> 10:41:39,840
this sort of many-to-one learning, any time your input is a sequence.
12891
10:41:39,840 --> 10:41:43,240
And so the input is a sequence in the context of video.
12892
10:41:43,240 --> 10:41:45,740
It could be in the context of, like, if someone has typed a message
12893
10:41:45,740 --> 10:41:47,840
and you want to be able to categorize that message,
12894
10:41:47,840 --> 10:41:51,560
like if you're trying to take a movie review and classify it
12895
10:41:51,560 --> 10:41:54,080
as, is it a positive review or a negative review?
12896
10:41:54,080 --> 10:41:56,720
That input is a sequence of words, and the output
12897
10:41:56,720 --> 10:41:59,360
is a classification, positive or negative.
12898
10:41:59,360 --> 10:42:01,440
There, too, a recurrent neural network might
12899
10:42:01,440 --> 10:42:04,040
be helpful for analyzing sequences of words.
12900
10:42:04,040 --> 10:42:07,600
And they're quite popular when it comes to dealing with language.
12901
10:42:07,600 --> 10:42:09,880
They could even be used for spoken language as well,
12902
10:42:09,880 --> 10:42:12,480
since spoken language is an audio waveform that
12903
10:42:12,480 --> 10:42:14,800
can be segmented into distinct chunks.
12904
10:42:14,800 --> 10:42:17,440
And each of those could be passed in as an input
12905
10:42:17,440 --> 10:42:21,000
into a recurrent neural network to be able to classify someone's voice,
12906
10:42:21,000 --> 10:42:21,560
for instance.
12907
10:42:21,560 --> 10:42:24,880
If you want to do voice recognition to say, is this one person or is this
12908
10:42:24,880 --> 10:42:27,360
another, these are also cases where you might
12909
10:42:27,360 --> 10:42:32,240
want this many-to-one architecture for a recurrent neural network.
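As a rough illustration of this many-to-one idea, here is a toy recurrent step in plain Python. The weights are made-up constants standing in for what a real network would learn; this is a sketch of the control flow, not any actual library or course code.

```python
import math

def step(state, x, w_in=0.5, w_rec=0.9):
    # The new state mixes the current input with the state carried over
    # from the previous step; tanh keeps the value bounded.
    # w_in and w_rec are made-up constants, not learned parameters.
    return math.tanh(w_in * x + w_rec * state)

def many_to_one(sequence):
    state = 0.0
    for x in sequence:        # one frame, word, or audio chunk at a time
        state = step(state, x)
    return state              # the final state summarizes the whole sequence
```

A classifier would feed that final state into an output layer. Note that order matters: reversing a sequence generally changes the summary, which is exactly what lets the network pick up on sequential patterns.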
12910
10:42:32,240 --> 10:42:34,040
And then as one final problem, just to take
12911
10:42:34,040 --> 10:42:37,040
a look at in terms of what we can do with these sorts of networks,
12912
10:42:37,040 --> 10:42:39,080
imagine what Google Translate is doing.
12913
10:42:39,080 --> 10:42:42,560
So what Google Translate is doing is it's taking some text written
12914
10:42:42,560 --> 10:42:47,200
in one language and converting it into text written in some other language,
12915
10:42:47,200 --> 10:42:50,440
for example, where now this input is a sequence of data.
12916
10:42:50,440 --> 10:42:52,000
It's a sequence of words.
12917
10:42:52,000 --> 10:42:54,320
And the output is a sequence of words as well.
12918
10:42:54,320 --> 10:42:55,560
It's also a sequence.
12919
10:42:55,560 --> 10:42:58,560
So here we want effectively a many-to-many relationship.
12920
10:42:58,560 --> 10:43:02,560
Our input is a sequence and our output is a sequence as well.
12921
10:43:02,560 --> 10:43:05,000
And it's not quite going to work to just say,
12922
10:43:05,000 --> 10:43:09,840
take each word in the input and translate it into a word in the output.
12923
10:43:09,840 --> 10:43:13,040
Because ultimately, different languages put their words in different orders.
12924
10:43:13,040 --> 10:43:15,200
And maybe one language uses two words for something,
12925
10:43:15,200 --> 10:43:17,240
whereas another language only uses one.
12926
10:43:17,240 --> 10:43:22,240
So we really want some way to take this information, this input,
12927
10:43:22,240 --> 10:43:25,840
encode it somehow, and use that encoding to generate
12928
10:43:25,840 --> 10:43:27,440
what the output ultimately should be.
12929
10:43:27,440 --> 10:43:30,720
And this has been one of the big advancements in automated translation
12930
10:43:30,720 --> 10:43:34,080
technology, is the ability to use neural networks to do this instead
12931
10:43:34,080 --> 10:43:35,800
of older, more traditional methods.
12932
10:43:35,800 --> 10:43:37,920
And this has improved accuracy dramatically.
12933
10:43:37,920 --> 10:43:40,240
And the way you might imagine doing this is, again,
12934
10:43:40,240 --> 10:43:44,200
using a recurrent neural network with multiple inputs and multiple outputs.
12935
10:43:44,200 --> 10:43:45,800
We start by passing in all the input.
12936
10:43:45,800 --> 10:43:47,320
Input goes into the network.
12937
10:43:47,320 --> 10:43:49,560
Another input, like another word, goes into the network.
12938
10:43:49,560 --> 10:43:53,280
And we do this multiple times, like once for each word in the input
12939
10:43:53,280 --> 10:43:54,680
that I'm trying to translate.
12940
10:43:54,680 --> 10:43:58,000
And only after all of that is done does the network now
12941
10:43:58,000 --> 10:44:01,200
start to generate output, like the first word of the translated sentence,
12942
10:44:01,200 --> 10:44:04,240
and the next word of the translated sentence, so on and so forth,
12943
10:44:04,240 --> 10:44:08,640
where each time the network passes information to itself
12944
10:44:08,640 --> 10:44:12,480
by carrying some sort of state
12945
10:44:12,480 --> 10:44:15,120
from one run of the network to the next run,
12946
10:44:15,120 --> 10:44:17,280
assembling information about all the inputs,
12947
10:44:17,280 --> 10:44:20,600
and then passing along information about which part of the output
12948
10:44:20,600 --> 10:44:22,440
to generate next.
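A minimal sketch of that encode-then-decode control flow, in plain Python. The lookup table here is a made-up stand-in for everything a trained translation network would actually learn; only the shape of the computation matches the description above.

```python
# Consume the entire input first, threading a state through each step;
# only then emit the output one piece at a time. TOY_TABLE is a
# hypothetical stand-in for the learned behavior of a real network.
TOY_TABLE = {("hello", "world"): ["bonjour", "monde"]}

def encode(tokens):
    state = ()
    for token in tokens:          # inputs go in one at a time, no output yet
        state = state + (token,)  # the state accumulates the whole input
    return state

def decode(state):
    output = []
    for word in TOY_TABLE.get(state, []):  # outputs come out one at a time
        output.append(word)
    return output

def translate(tokens):
    return decode(encode(tokens))
```

Because the output is only generated after the whole input has been encoded, the two sequences are free to differ in length and word order, which is what word-by-word translation could not handle.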
12949
10:44:22,440 --> 10:44:25,640
And there are a number of different types of these sorts of recurrent neural
12950
10:44:25,640 --> 10:44:26,140
networks.
12951
10:44:26,140 --> 10:44:29,640
One of the most popular is known as the long short-term memory neural network,
12952
10:44:29,640 --> 10:44:31,400
otherwise known as LSTM.
12953
10:44:31,400 --> 10:44:35,160
But in general, these types of networks can be very, very powerful whenever
12954
10:44:35,160 --> 10:44:38,120
we're dealing with sequences, whether those are sequences of images
12955
10:44:38,120 --> 10:44:40,600
or especially sequences of words when it comes
12956
10:44:40,600 --> 10:44:43,600
to dealing with natural language.
12957
10:44:43,600 --> 10:44:46,160
And so those were just some of the different types
12958
10:44:46,160 --> 10:44:49,840
of neural networks that can be used to do all sorts of different computations.
12959
10:44:49,840 --> 10:44:52,080
And these are incredibly versatile tools that
12960
10:44:52,080 --> 10:44:54,200
can be applied to a number of different domains.
12961
10:44:54,200 --> 10:44:57,600
We only looked at a couple of the most popular types of neural networks
12962
10:44:57,600 --> 10:45:00,840
from more traditional feed-forward neural networks to convolutional neural
12963
10:45:00,840 --> 10:45:02,920
networks, and recurrent neural networks.
12964
10:45:02,920 --> 10:45:04,240
But there are other types as well.
12965
10:45:04,240 --> 10:45:07,120
There are adversarial networks where networks compete with each other
12966
10:45:07,120 --> 10:45:10,160
to try and be able to generate new types of data,
12967
10:45:10,160 --> 10:45:13,000
as well as other networks that can solve other tasks based
12968
10:45:13,000 --> 10:45:15,680
on what they happen to be structured and adapted for.
12969
10:45:15,680 --> 10:45:18,080
And these are very powerful tools in machine learning
12970
10:45:18,080 --> 10:45:21,880
from being able to easily learn based on some set of input data
12971
10:45:21,880 --> 10:45:25,040
and, therefore, to figure out how to calculate some function
12972
10:45:25,040 --> 10:45:28,720
from inputs to outputs, whether it's input to some sort of classification
12973
10:45:28,720 --> 10:45:32,040
like analyzing an image and getting a digit or machine translation
12974
10:45:32,040 --> 10:45:34,920
where the input is in one language and the output is in another.
12975
10:45:34,920 --> 10:45:39,320
These tools have a lot of applications for machine learning more generally.
12976
10:45:39,320 --> 10:45:42,400
Next time, we'll look at machine learning and AI in particular
12977
10:45:42,400 --> 10:45:44,120
in the context of natural language.
12978
10:45:44,120 --> 10:45:47,520
We talked a little bit about this today, but looking at how it is that our AI
12979
10:45:47,520 --> 10:45:50,000
can begin to understand natural language and can
12980
10:45:50,000 --> 10:45:53,360
begin to be able to analyze and do useful tasks with regards
12981
10:45:53,360 --> 10:45:57,040
to human language, which turns out to be a challenging and interesting task.
12982
10:45:57,040 --> 10:46:00,000
So we'll see you next time.
12983
10:46:00,000 --> 10:46:21,360
And welcome back, everybody, to our final class
12984
10:46:21,360 --> 10:46:24,320
in an introduction to artificial intelligence with Python.
12985
10:46:24,320 --> 10:46:26,720
Now, so far in this class, we've been taking problems
12986
10:46:26,720 --> 10:46:29,040
that we want to solve intelligently and framing them
12987
10:46:29,040 --> 10:46:31,720
in ways that computers are going to be able to make sense of.
12988
10:46:31,720 --> 10:46:34,840
We've been taking problems and framing them as search problems
12989
10:46:34,840 --> 10:46:38,920
or constraint satisfaction problems or optimization problems, for example.
12990
10:46:38,920 --> 10:46:40,840
In essence, we have been trying to communicate
12991
10:46:40,840 --> 10:46:45,120
about problems in ways that our computer is going to be able to understand.
12992
10:46:45,120 --> 10:46:47,560
Today, the goal is going to be to get computers
12993
10:46:47,560 --> 10:46:50,280
to understand the way you and I communicate naturally
12994
10:46:50,280 --> 10:46:53,800
via our own natural languages, languages like English.
12995
10:46:53,800 --> 10:46:57,400
But natural language contains a lot of nuance and complexity
12996
10:46:57,400 --> 10:47:00,600
that's going to make it challenging for computers to be able to understand.
12997
10:47:00,600 --> 10:47:04,080
So we'll need to explore some new tools and some new techniques
12998
10:47:04,080 --> 10:47:07,800
to allow computers to make sense of natural language.
12999
10:47:07,800 --> 10:47:10,640
So what is it exactly that we're trying to get computers to do?
13000
10:47:10,640 --> 10:47:14,520
Well, they all fall under this general heading of natural language processing,
13001
10:47:14,520 --> 10:47:17,360
getting computers to work with natural language.
13002
10:47:17,360 --> 10:47:20,840
And these include tasks like automatic summarization.
13003
10:47:20,840 --> 10:47:23,600
Given a long text, can we train the computer
13004
10:47:23,600 --> 10:47:26,240
to be able to come up with a shorter representation of it?
13005
10:47:26,240 --> 10:47:28,280
Information extraction, getting the computer
13006
10:47:28,280 --> 10:47:31,120
to pull out relevant facts or details out of some text.
13007
10:47:31,120 --> 10:47:33,400
Machine translation, like Google Translate,
13008
10:47:33,400 --> 10:47:36,680
translating some text from one language into another language.
13009
10:47:36,680 --> 10:47:39,880
Question answering, if you've ever asked a question to your phone
13010
10:47:39,880 --> 10:47:43,800
or had a conversation with an AI chatbot where you provide some text
13011
10:47:43,800 --> 10:47:47,400
to the computer, the computer is able to understand that text
13012
10:47:47,400 --> 10:47:50,360
and then generate some text in response.
13013
10:47:50,360 --> 10:47:53,680
Text classification, where we provide some text to the computer
13014
10:47:53,680 --> 10:47:56,720
and the computer assigns it a label, positive or negative,
13015
10:47:56,720 --> 10:47:58,600
inbox or spam, for example.
13016
10:47:58,600 --> 10:48:00,360
And there are several other kinds of tasks
13017
10:48:00,360 --> 10:48:03,800
that all fall under this heading of natural language processing.
13018
10:48:03,800 --> 10:48:06,240
But before we take a look at how the computer might
13019
10:48:06,240 --> 10:48:09,240
try to solve these kinds of tasks, it might be useful for us
13020
10:48:09,240 --> 10:48:11,540
to think about language in general.
13021
10:48:11,540 --> 10:48:14,360
What are the kinds of challenges that we might need to deal with
13022
10:48:14,360 --> 10:48:17,320
as we start to think about language and getting a computer
13023
10:48:17,320 --> 10:48:18,880
to be able to understand it?
13024
10:48:18,880 --> 10:48:21,080
So one part of language that we'll need to consider
13025
10:48:21,080 --> 10:48:22,760
is the syntax of language.
13026
10:48:22,760 --> 10:48:25,040
Syntax is all about the structure of language.
13027
10:48:25,040 --> 10:48:27,400
Language is composed of individual words.
13028
10:48:27,400 --> 10:48:31,280
And those words are composed together in some kind of structured whole.
13029
10:48:31,280 --> 10:48:33,960
And if our computer is going to be able to understand language,
13030
10:48:33,960 --> 10:48:37,440
it's going to need to understand something about that structure.
13031
10:48:37,440 --> 10:48:39,160
So let's take a couple of examples.
13032
10:48:39,160 --> 10:48:40,920
Here, for instance, is a sentence.
13033
10:48:40,920 --> 10:48:44,740
Just before 9 o'clock, Sherlock Holmes stepped briskly into the room.
13034
10:48:44,740 --> 10:48:46,680
That sentence is made up of words.
13035
10:48:46,680 --> 10:48:49,640
And those words together form a structured whole.
13036
10:48:49,640 --> 10:48:52,520
This is syntactically valid as a sentence.
13037
10:48:52,520 --> 10:48:55,120
But we could take some of those same words,
13038
10:48:55,120 --> 10:48:59,640
rearrange them, and come up with a sentence that is not syntactically valid.
13039
10:48:59,640 --> 10:49:03,640
Here, for example, "just before Sherlock Holmes 9 o'clock stepped briskly
13040
10:49:03,640 --> 10:49:06,640
the room" is still composed of valid words.
13041
10:49:06,640 --> 10:49:08,960
But they're not in any kind of logical whole.
13042
10:49:08,960 --> 10:49:12,800
This is not a syntactically well-formed sentence.
13043
10:49:12,800 --> 10:49:15,800
Another interesting challenge is that some sentences will
13044
10:49:15,800 --> 10:49:18,920
have multiple possible valid structures.
13045
10:49:18,920 --> 10:49:20,440
Here's a sentence, for example.
13046
10:49:20,440 --> 10:49:23,480
I saw the man on the mountain with a telescope.
13047
10:49:23,480 --> 10:49:25,200
And here, this is a valid sentence.
13048
10:49:25,200 --> 10:49:28,680
But it actually has two different possible structures
13049
10:49:28,680 --> 10:49:31,360
that lend themselves to two different interpretations
13050
10:49:31,360 --> 10:49:32,520
and two different meanings.
13051
10:49:32,520 --> 10:49:36,040
Maybe I, the one doing the seeing, am the one with the telescope.
13052
10:49:36,040 --> 10:49:39,280
Or maybe the man on the mountain is the one with the telescope.
13053
10:49:39,280 --> 10:49:41,440
And so natural language is ambiguous.
13054
10:49:41,440 --> 10:49:44,800
Sometimes the same sentence can be interpreted in multiple ways.
13055
10:49:44,800 --> 10:49:47,520
And that's something that we'll need to think about as well.
13056
10:49:47,520 --> 10:49:50,000
And this lends itself to another problem within language
13057
10:49:50,000 --> 10:49:52,480
that we'll need to think about, which is semantics.
13058
10:49:52,480 --> 10:49:55,080
While syntax is all about the structure of language,
13059
10:49:55,080 --> 10:49:57,360
semantics is about the meaning of language.
13060
10:49:57,360 --> 10:49:59,880
It's not enough for a computer just to know
13061
10:49:59,880 --> 10:50:02,040
that a sentence is well-structured if it doesn't
13062
10:50:02,040 --> 10:50:04,200
know what that sentence means.
13063
10:50:04,200 --> 10:50:06,240
And so semantics is going to concern itself
13064
10:50:06,240 --> 10:50:09,440
with the meaning of words and the meaning of sentences.
13065
10:50:09,440 --> 10:50:11,680
So if we go back to that same sentence as before,
13066
10:50:11,680 --> 10:50:16,000
just before 9 o'clock, Sherlock Holmes stepped briskly into the room,
13067
10:50:16,000 --> 10:50:19,360
I could come up with another sentence, say the sentence,
13068
10:50:19,360 --> 10:50:23,600
a few minutes before 9, Sherlock Holmes walked quickly into the room.
13069
10:50:23,600 --> 10:50:26,480
And those are two different sentences with some of the words the same
13070
10:50:26,480 --> 10:50:28,000
and some of the words different.
13071
10:50:28,000 --> 10:50:31,280
But the two sentences have essentially the same meaning.
13072
10:50:31,280 --> 10:50:33,440
And so ideally, whatever model we build, we'll
13073
10:50:33,440 --> 10:50:36,560
be able to understand that these two sentences, while different,
13074
10:50:36,560 --> 10:50:38,800
mean something very similar.
13075
10:50:38,800 --> 10:50:42,440
Some syntactically well-formed sentences don't mean anything at all.
13076
10:50:42,440 --> 10:50:44,920
A famous example from linguist Noam Chomsky
13077
10:50:44,920 --> 10:50:48,920
is the sentence, colorless green ideas sleep furiously.
13078
10:50:48,920 --> 10:50:52,120
This is a syntactically, structurally well-formed sentence.
13079
10:50:52,120 --> 10:50:55,280
We've got adjectives modifying a noun, ideas.
13080
10:50:55,280 --> 10:50:58,040
We've got a verb and an adverb in the correct positions.
13081
10:50:58,040 --> 10:51:01,880
But when taken as a whole, the sentence doesn't really mean anything.
13082
10:51:01,880 --> 10:51:05,080
And so if our computers are going to be able to work with natural language
13083
10:51:05,080 --> 10:51:07,520
and perform tasks in natural language processing,
13084
10:51:07,520 --> 10:51:09,520
these are some concerns we'll need to think about.
13085
10:51:09,520 --> 10:51:11,760
We'll need to be thinking about syntax.
13086
10:51:11,760 --> 10:51:14,520
And we'll need to be thinking about semantics.
13087
10:51:14,520 --> 10:51:17,480
So how could we go about trying to teach a computer how
13088
10:51:17,480 --> 10:51:20,280
to understand the structure of natural language?
13089
10:51:20,280 --> 10:51:22,680
Well, one approach we might take is by starting
13090
10:51:22,680 --> 10:51:25,400
by thinking about the rules of natural language.
13091
10:51:25,400 --> 10:51:27,160
Our natural languages have rules.
13092
10:51:27,160 --> 10:51:30,360
In English, for example, nouns tend to come before verbs.
13093
10:51:30,360 --> 10:51:33,240
Nouns can be modified by adjectives, for example.
13094
10:51:33,240 --> 10:51:36,040
And so if only we could formalize those rules,
13095
10:51:36,040 --> 10:51:38,280
then we could give those rules to a computer,
13096
10:51:38,280 --> 10:51:41,880
and the computer would be able to make sense of them and understand them.
13097
10:51:41,880 --> 10:51:43,720
And so let's try to do exactly that.
13098
10:51:43,720 --> 10:51:46,360
We're going to try to define a formal grammar.
13099
10:51:46,360 --> 10:51:49,400
Where a formal grammar is some system of rules
13100
10:51:49,400 --> 10:51:52,040
for generating sentences in a language.
13101
10:51:52,040 --> 10:51:56,000
This is going to be a rule-based approach to natural language processing.
13102
10:51:56,000 --> 10:51:59,400
We're going to give the computer some rules that we know about language
13103
10:51:59,400 --> 10:52:01,840
and have the computer use those rules to make
13104
10:52:01,840 --> 10:52:04,280
sense of the structure of language.
13105
10:52:04,280 --> 10:52:06,600
And there are a number of different types of formal grammars.
13106
10:52:06,600 --> 10:52:09,080
Each one of them has slightly different use cases.
13107
10:52:09,080 --> 10:52:11,080
But today, we're going to focus specifically
13108
10:52:11,080 --> 10:52:14,560
on one kind of grammar known as a context-free grammar.
13109
10:52:14,560 --> 10:52:16,480
So how does the context-free grammar work?
13110
10:52:16,480 --> 10:52:19,720
Well, here is a sentence that we might want a computer to generate.
13111
10:52:19,720 --> 10:52:21,520
She saw the city.
13112
10:52:21,520 --> 10:52:24,760
And we're going to call each of these words a terminal symbol.
13113
10:52:24,760 --> 10:52:27,920
A terminal symbol, because once our computer has generated the word,
13114
10:52:27,920 --> 10:52:29,500
there's nothing else for it to generate.
13115
10:52:29,500 --> 10:52:32,800
Once it's generated the sentence, the computer is done.
13116
10:52:32,800 --> 10:52:35,520
We're going to associate each of these terminal symbols
13117
10:52:35,520 --> 10:52:39,320
with a non-terminal symbol that generates it.
13118
10:52:39,320 --> 10:52:43,200
So here we've got N, which stands for noun, like she or city.
13119
10:42:43,200 --> 10:42:46,600
We've got V as a non-terminal symbol, which stands for a verb.
13120
10:42:46,600 --> 10:42:48,720
And then we have D, which stands for determiner.
13121
10:52:48,720 --> 10:52:52,880
A determiner is a word like the or a or an in English, for example.
13122
10:52:52,880 --> 10:52:57,040
So each of these non-terminal symbols can generate the terminal symbols
13123
10:52:57,040 --> 10:52:59,600
that we ultimately care about generating.
13124
10:52:59,600 --> 10:53:01,720
But how do we know, or how does the computer
13125
10:53:01,720 --> 10:53:05,720
know which non-terminal symbols are associated with which terminal symbols?
13126
10:53:05,720 --> 10:53:08,280
Well, to do that, we need some kind of rule.
13127
10:53:08,280 --> 10:53:11,040
Here are some what we call rewriting rules that
13128
10:53:11,040 --> 10:53:14,320
have a non-terminal symbol on the left-hand side of an arrow.
13129
10:53:14,320 --> 10:53:18,800
And on the right side is what that non-terminal symbol can be replaced with.
13130
10:53:18,800 --> 10:53:21,560
So here we're saying the non-terminal symbol N, again,
13131
10:53:21,560 --> 10:53:25,520
which stands for noun, could be replaced by any of these options separated
13132
10:53:25,520 --> 10:53:26,800
by vertical bars.
13133
10:53:26,800 --> 10:53:30,760
N could be replaced by she or city or car or Harry.
13134
10:53:30,760 --> 10:53:34,800
D for determiner could be replaced by the, a, or an, and so forth.
13135
10:53:34,800 --> 10:53:40,240
Each of these non-terminal symbols could be replaced by any of these words.
13136
10:53:40,240 --> 10:53:42,720
We can also have non-terminal symbols that
13137
10:53:42,720 --> 10:53:45,680
are replaced by other non-terminal symbols.
13138
10:53:45,680 --> 10:53:50,840
Here is an interesting rule: NP -> N | D N.
13139
10:53:50,840 --> 10:53:52,000
So what does that mean?
13140
10:53:52,000 --> 10:53:55,080
Well, NP stands for a noun phrase.
13141
10:53:55,080 --> 10:53:57,400
Sometimes when we have a noun phrase in a sentence,
13142
10:53:57,400 --> 10:54:00,200
it's not just a single word; it could be multiple words.
13143
10:54:00,200 --> 10:54:04,400
And so here we're saying a noun phrase could be just a noun,
13144
10:54:04,400 --> 10:54:07,920
or it could be a determiner followed by a noun.
13145
10:54:07,920 --> 10:54:11,200
So we might have a noun phrase that's just a noun, like she,
13146
10:54:11,200 --> 10:54:12,680
that's a noun phrase.
13147
10:54:12,680 --> 10:54:15,360
Or we could have a noun phrase that's multiple words, something
13148
10:54:15,360 --> 10:54:18,440
like the city, which also acts as a noun phrase.
13149
10:54:18,440 --> 10:54:22,440
But in this case, it's composed of two words, a determiner, the,
13150
10:54:22,440 --> 10:54:24,520
and a noun city.
13151
10:54:24,520 --> 10:54:26,480
We could do the same for verb phrases.
13152
10:54:26,480 --> 10:54:30,040
A verb phrase, or VP, might be just a verb,
13153
10:54:30,040 --> 10:54:33,160
or it might be a verb followed by a noun phrase.
13154
10:54:33,160 --> 10:54:35,920
So we could have a verb phrase that's just a single word,
13155
10:54:35,920 --> 10:54:38,760
like the word walked, or we could have a verb phrase
13156
10:54:38,760 --> 10:54:42,600
that is an entire phrase, something like saw the city,
13157
10:54:42,600 --> 10:54:45,040
as an entire verb phrase.
13158
10:54:45,040 --> 10:54:48,680
A sentence, meanwhile, we might then define as a noun phrase
13159
10:54:48,680 --> 10:54:50,840
followed by a verb phrase.
13160
10:54:50,840 --> 10:54:54,600
And so this would allow us to generate a sentence like she saw the city,
13161
10:54:54,600 --> 10:54:59,000
an entire sentence made up of a noun phrase, which is just the word she,
13162
10:54:59,000 --> 10:55:03,120
and then a verb phrase, which is saw the city, saw which is a verb,
13163
10:55:03,120 --> 10:55:07,880
and then the city, which itself is also a noun phrase.
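Since a formal grammar is a system of rules for generating sentences, the rewriting rules just described can be sketched directly in Python. The rule set below mirrors the examples above; it is illustrative, not the course's exact file.

```python
import random

# Each non-terminal maps to its list of options (the vertical bars);
# anything not in the dictionary is a terminal word.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"], ["an"]],
    "N":  [["she"], ["city"], ["car"], ["Harry"]],
    "V":  [["saw"], ["walked"]],
}

def generate(symbol="S"):
    if symbol not in RULES:          # terminal symbol: nothing left to rewrite
        return [symbol]
    words = []
    for part in random.choice(RULES[symbol]):  # pick one option, expand each part
        words.extend(generate(part))
    return words
```

Calling generate() might yield something like ['she', 'saw', 'the', 'city'], and every result is syntactically valid under these rules, even if, as with Chomsky's famous sentence, it need not mean anything.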
13164
10:55:07,880 --> 10:55:11,200
And so if we could give these rules to a computer explaining to it
13165
10:55:11,200 --> 10:55:15,080
what non-terminal symbols could be replaced by what other symbols,
13166
10:55:15,080 --> 10:55:17,400
then a computer could take a sentence and begin
13167
10:55:17,400 --> 10:55:20,520
to understand the structure of that sentence.
13168
10:55:20,520 --> 10:55:23,320
And so let's take a look at an example of how we might do that.
13169
10:55:23,320 --> 10:55:26,960
And to do that, we're going to use a Python library called NLTK,
13170
10:55:26,960 --> 10:55:30,160
or the Natural Language Toolkit, which we'll see a couple of times today.
13171
10:55:30,160 --> 10:55:33,280
It contains a lot of helpful features and functions that we can use
13172
10:55:33,280 --> 10:55:36,440
for trying to deal with and process natural language.
13173
10:55:36,440 --> 10:55:39,540
So here we'll take a look at how we can use NLTK in order
13174
10:55:39,540 --> 10:55:42,280
to parse a context-free grammar.
13175
10:55:42,280 --> 10:55:47,840
So let's go ahead and open up cfg0.py, cfg standing for context-free grammar.
13176
10:55:47,840 --> 10:55:51,680
And what you'll see in this file is that I first import NLTK, the Natural
13177
10:55:51,680 --> 10:55:53,160
Language Toolkit.
13178
10:55:53,160 --> 10:55:57,000
And the first thing I do is define a context-free grammar,
13179
10:55:57,000 --> 10:56:00,400
saying that a sentence is a noun phrase followed by a verb phrase.
13180
10:56:00,400 --> 10:56:03,840
I'm defining what a noun phrase is, defining what a verb phrase is,
13181
10:56:03,840 --> 10:56:05,800
and then giving some examples of what I can
13182
10:56:05,800 --> 10:56:10,400
do with these non-terminal symbols, D for determiner, N for noun,
13183
10:56:10,400 --> 10:56:12,280
and V for verb.
13184
10:56:12,280 --> 10:56:15,400
We're going to use NLTK to parse that grammar.
13185
10:56:15,400 --> 10:56:18,280
Then we'll ask the user for some input in the form of a sentence
13186
10:56:18,280 --> 10:56:20,360
and split it into words.
13187
10:56:20,360 --> 10:56:23,560
And then we'll use this context-free grammar parser
13188
10:56:23,560 --> 10:56:28,400
to try to parse that sentence and print out the resulting syntax tree.
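A sketch of what a file like cfg0.py could contain, using NLTK's CFG and chart parser as described; the grammar and word lists here are illustrative guesses, not the course's exact source.

```python
import nltk

# A sentence is a noun phrase followed by a verb phrase; the word
# lists are illustrative, not necessarily the ones in the course file.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> D N | N
    VP -> V | V NP
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "Harry"
    V -> "saw" | "walked"
""")
parser = nltk.ChartParser(grammar)

def parse_sentence(sentence):
    # Split the sentence into words and collect every valid syntax tree.
    return list(parser.parse(sentence.split()))

for tree in parse_sentence("she saw the city"):
    tree.pretty_print()
```

Any sentence the grammar cannot derive simply yields no trees, which is how the parser distinguishes syntactically valid input from invalid input.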
13189
10:56:28,400 --> 10:56:30,760
So let's take a look at an example.
13190
10:56:30,760 --> 10:56:35,560
We'll go ahead and go into my cfg directory, and we'll run cfg0.py.
13191
10:56:35,560 --> 10:56:37,160
And here I'm asked to type in a sentence.
13192
10:56:37,160 --> 10:56:40,600
Let's say I type in she walked.
13193
10:56:40,600 --> 10:56:43,960
And when I do that, I see that she walked is a valid sentence,
13194
10:56:43,960 --> 10:56:49,680
where she is a noun phrase, and walked is the corresponding verb phrase.
13195
10:56:49,680 --> 10:56:52,600
I could try to do this with a more complex sentence too.
13196
10:56:52,600 --> 10:56:55,920
I could do something like she saw the city.
13197
10:56:55,920 --> 10:56:58,920
And here we see that she is the noun phrase,
13198
10:56:58,920 --> 10:57:04,560
and then saw the city is the entire verb phrase that makes up this sentence.
13199
10:57:04,560 --> 10:57:06,200
So that was a very simple grammar.
13200
10:57:06,200 --> 10:57:08,840
Let's take a look at a slightly more complex grammar.
13201
10:57:08,840 --> 10:57:13,440
Here is cfg1.py, where a sentence is still a noun phrase followed
13202
10:57:13,440 --> 10:57:17,680
by a verb phrase, but I've added some other possible non-terminal symbols too.
13203
10:57:17,680 --> 10:57:22,760
I have AP for adjective phrase and PP for prepositional phrase.
13204
10:57:22,760 --> 10:57:25,480
And we specified that we could have an adjective phrase
13205
10:57:25,480 --> 10:57:30,440
before a noun phrase or a prepositional phrase after a noun, for example.
13206
10:57:30,440 --> 10:57:34,320
So lots of additional ways that we might try to structure a sentence
13207
10:57:34,320 --> 10:57:37,880
and interpret and parse one of those resulting sentences.
13208
10:57:37,880 --> 10:57:39,280
So let's see that one in action.
13209
10:57:39,280 --> 10:57:43,600
We'll go ahead and run cfg1.py with this new grammar.
13210
10:57:43,600 --> 10:57:48,400
And we'll try a sentence like she saw the wide street.
13211
10:57:48,400 --> 10:57:51,680
Here, Python's NLTK is able to parse that sentence
13212
10:57:51,680 --> 10:57:55,840
and identify that she saw the wide street has this particular structure,
13213
10:57:55,840 --> 10:57:58,400
a sentence with a noun phrase and a verb phrase,
13214
10:57:58,400 --> 10:58:00,600
where that verb phrase contains a noun phrase that itself
13215
10:58:00,600 --> 10:58:02,080
contains an adjective.
13216
10:58:02,080 --> 10:58:06,120
And so it's able to get some sense for what the structure of this language
13217
10:58:06,120 --> 10:58:07,840
actually is.
13218
10:58:07,840 --> 10:58:09,280
Let's try another example.
13219
10:58:09,280 --> 10:58:14,680
Let's say she saw the dog with the binoculars.
13220
10:58:14,680 --> 10:58:16,680
And we'll try that sentence.
13221
10:58:16,680 --> 10:58:19,840
And here, we get one possible syntax tree,
13222
10:58:19,840 --> 10:58:21,840
she saw the dog with the binoculars.
13223
10:58:21,840 --> 10:58:24,120
But notice that this sentence is actually a little bit
13224
10:58:24,120 --> 10:58:26,320
ambiguous in our own natural language.
13225
10:58:26,320 --> 10:58:27,400
Who has the binoculars?
13226
10:58:27,400 --> 10:58:31,320
Is it she who has the binoculars or the dog who has the binoculars?
13227
10:58:31,320 --> 10:58:35,880
And NLTK is able to identify both possible structures for the sentence.
13228
10:58:35,880 --> 10:58:38,720
In this case, the dog with the binoculars
13229
10:58:38,720 --> 10:58:40,440
is an entire noun phrase.
13230
10:58:40,440 --> 10:58:42,720
It's all underneath this NP here.
13231
10:58:42,720 --> 10:58:45,280
So it's the dog that has the binoculars.
13232
10:58:45,280 --> 10:58:48,720
But we also got an alternative parse tree,
13233
10:58:48,720 --> 10:58:52,440
where the dog is just the noun phrase.
13234
10:58:52,440 --> 10:58:57,080
And with the binoculars is a prepositional phrase modifying saw.
13235
10:58:57,080 --> 10:59:01,080
So she saw the dog and she used the binoculars in order
13236
10:59:01,080 --> 10:59:03,120
to see the dog as well.
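That ambiguity can be reproduced with a grammar that lets a prepositional phrase attach either to a noun phrase or to a verb phrase. This small grammar is an illustrative stand-in for cfg1.py, not the course's exact file.

```python
import nltk

# Allowing a PP to attach to either an NP or a VP makes the
# attachment of "with the binoculars" genuinely ambiguous.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N | NP PP
    VP -> V | V NP | VP PP
    PP -> P NP
    D -> "the"
    N -> "she" | "dog" | "binoculars"
    P -> "with"
    V -> "saw"
""")
parser = nltk.ChartParser(grammar)

trees = list(parser.parse("she saw the dog with the binoculars".split()))
# One parse attaches "with the binoculars" to "the dog" (NP PP);
# the other attaches it to "saw" (VP PP).
for tree in trees:
    tree.pretty_print()
```

The parser returns both syntax trees, one per interpretation, just as described above.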
13237
10:59:03,120 --> 10:59:06,120
So this allows us to get a sense for the structure of natural language.
13238
10:59:06,120 --> 10:59:08,840
But it relies on us writing all of these rules.
13239
10:59:08,840 --> 10:59:12,120
And it would take a lot of effort to write all of the rules for any possible
13240
10:59:12,120 --> 10:59:15,320
sentence that someone might write or say in the English language.
13241
10:59:15,320 --> 10:59:16,520
Language is complicated.
13242
10:59:16,520 --> 10:59:20,080
And as a result, there are going to be some very complex rules.
13243
10:59:20,080 --> 10:59:21,680
So what else might we try?
13244
10:59:21,680 --> 10:59:24,840
We might try to take a statistical lens towards approaching
13245
10:59:24,840 --> 10:59:27,320
this problem of natural language processing.
13246
10:59:27,320 --> 10:59:31,160
If we were able to give the computer a lot of existing data of sentences
13247
10:59:31,160 --> 10:59:35,160
written in the English language, what could we try to learn from that data?
13248
10:59:35,160 --> 10:59:38,480
Well, it might be difficult to try and interpret long pieces of text all
13249
10:59:38,480 --> 10:59:39,200
at once.
13250
10:59:39,200 --> 10:59:42,680
So instead, what we might want to do is break up that longer text
13251
10:59:42,680 --> 10:59:45,120
into smaller pieces of information instead.
13252
10:59:45,120 --> 10:59:50,360
In particular, we might try to create n-grams out of a longer sequence of text.
13253
10:59:50,360 --> 10:59:55,560
An n-gram is just some contiguous sequence of n items from a sample of text.
13254
10:59:55,560 --> 10:59:59,800
It might be n characters in a row or n words in a row, for example.
13255
10:59:59,800 --> 11:00:02,320
So let's take a passage from Sherlock Holmes.
13256
11:00:02,320 --> 11:00:04,560
And let's look for all of the trigrams.
13257
11:00:04,560 --> 11:00:07,640
A trigram is an n-gram where n is equal to 3.
13258
11:00:07,640 --> 11:00:11,480
So in this case, we're looking for sequences of three words in a row.
13259
11:00:11,480 --> 11:00:15,240
So the trigrams here would be phrases like "how often have."
13260
11:00:15,240 --> 11:00:16,680
That's three words in a row.
13261
11:00:16,680 --> 11:00:18,640
"Often have I" is another trigram.
13262
11:00:18,640 --> 11:00:22,080
"Have I said," "I said to," "said to you," "to you that."
13263
11:00:22,080 --> 11:00:27,040
These are all trigrams, sequences of three words that appear in sequence.
13264
11:00:27,040 --> 11:00:30,140
And if we could give the computer a large corpus of text
13265
11:00:30,140 --> 11:00:33,320
and have it pull out all of the trigrams in this case,
13266
11:00:33,320 --> 11:00:36,800
it could get a sense for what sequences of three words
13267
11:00:36,800 --> 11:00:40,720
tend to appear next to each other in our own natural language
13268
11:00:40,720 --> 11:00:45,240
and, as a result, get some sense for what the structure of the language
13269
11:00:45,240 --> 11:00:46,840
actually is.
13270
11:00:46,840 --> 11:00:48,560
So let's take a look at an example of that.
13271
11:00:48,560 --> 11:00:55,280
How can we use NLTK to try to get access to information about n-grams?
13272
11:00:55,280 --> 11:00:58,440
So here, we're going to open up ngrams.py.
13273
11:00:58,440 --> 11:01:02,440
And this is a Python program that's going to load a corpus of data, just
13274
11:01:02,440 --> 11:01:05,240
some text files, into our computer's memory.
13275
11:01:05,240 --> 11:01:08,760
And then we're going to use NLTK's ngrams function, which
13276
11:01:08,760 --> 11:01:12,520
is going to go through the corpus of text, pulling out all of the ngrams
13277
11:01:12,520 --> 11:01:14,480
for a particular value of n.
13278
11:01:14,480 --> 11:01:17,720
And then, by using Python's Counter class,
13279
11:01:17,720 --> 11:01:21,640
we're going to figure out what are the most common ngrams inside
13280
11:01:21,640 --> 11:01:24,280
of this entire corpus of text.
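As a rough sketch of what a script like ngrams.py does; the corpus here is one hard-coded sentence rather than the Sherlock Holmes files, and nltk.ngrams(tokens, n) would produce the same tuples as the plain zip used here:

```python
from collections import Counter

# Tiny stand-in corpus; the real script loads a directory of text files.
text = "how often have i said to you that how often have i said"
tokens = text.split()

# Pull out all trigrams: contiguous sequences of n = 3 words.
# nltk.ngrams(tokens, 3) yields the same tuples.
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Print the most common trigrams in the corpus.
for gram, count in trigrams.most_common(3):
    print(count, " ".join(gram))
```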
13281
11:01:24,280 --> 11:01:26,480
And we're going to need a data set in order to do this.
13282
11:01:26,480 --> 11:01:29,960
And I've prepared a data set of some of the stories of Sherlock Holmes.
13283
11:01:29,960 --> 11:01:32,000
So it's just a bunch of text files.
13284
11:01:32,000 --> 11:01:33,680
A lot of words for it to analyze.
13285
11:01:33,680 --> 11:01:38,040
And as a result, we'll get a sense for which sequences of two or three
13286
11:01:38,040 --> 11:01:42,440
words tend to be most common in natural language.
13287
11:01:42,440 --> 11:01:43,440
So let's give this a try.
13288
11:01:43,440 --> 11:01:45,360
We'll go into my ngrams directory.
13289
11:01:45,360 --> 11:01:47,440
And we'll run ngrams.py.
13290
11:01:47,440 --> 11:01:49,200
We'll try an n value of 2.
13291
11:01:49,200 --> 11:01:51,960
So we're looking for sequences of two words in a row.
13292
11:01:51,960 --> 11:01:55,760
And we'll use our corpus of stories from Sherlock Holmes.
13293
11:01:55,760 --> 11:01:59,680
And when we run this program, we get a list of the most common ngrams
13294
11:01:59,680 --> 11:02:02,440
where n is equal to 2, otherwise known as a bigram.
13295
11:02:02,440 --> 11:02:04,720
So the most common one is "of the."
13296
11:02:04,720 --> 11:02:07,440
That's a sequence of two words that appears quite frequently
13297
11:02:07,440 --> 11:02:08,720
in natural language.
13298
11:02:08,720 --> 11:02:09,720
Then "in the."
13299
11:02:09,720 --> 11:02:10,720
And "it was."
13300
11:02:10,720 --> 11:02:14,800
These are all common sequences of two words that appear in a row.
13301
11:02:14,800 --> 11:02:18,980
Let's instead now try running ngrams with n equal to 3.
13302
11:02:18,980 --> 11:02:21,760
Let's get all of the trigrams and see what we get.
13303
11:02:21,760 --> 11:02:25,360
And now we see the most common trigrams are "it was a."
13304
11:02:25,360 --> 11:02:26,520
"One of the."
13305
11:02:26,520 --> 11:02:27,760
"I think that."
13306
11:02:27,760 --> 11:02:32,040
These are all sequences of three words that appear quite frequently.
13307
11:02:32,040 --> 11:02:36,040
And we were able to do this essentially via a process known as tokenization.
13308
11:02:36,040 --> 11:02:39,440
Tokenization is the process of splitting a sequence of characters
13309
11:02:39,440 --> 11:02:40,280
into pieces.
13310
11:02:40,280 --> 11:02:44,400
In this case, we're splitting a long sequence of text into individual words
13311
11:02:44,400 --> 11:02:46,640
and then looking at sequences of those words
13312
11:02:46,640 --> 11:02:49,840
to get a sense for the structure of natural language.
13313
11:02:49,840 --> 11:02:52,400
So once we've done this, once we've done the tokenization,
13314
11:02:52,400 --> 11:02:55,520
once we've built up our corpus of ngrams, what
13315
11:02:55,520 --> 11:02:57,160
can we do with that information?
13316
11:02:57,160 --> 11:03:00,040
So the one thing that we might try is we could build a Markov chain,
13317
11:03:00,040 --> 11:03:02,680
which you might recall from when we talked about probability.
13318
11:03:02,680 --> 11:03:05,800
Recall that a Markov chain is some sequence of values
13319
11:03:05,800 --> 11:03:10,160
where we can predict one value based on the values that came before it.
13320
11:03:10,160 --> 11:03:14,760
And as a result, if we know all of the common ngrams in the English language,
13321
11:03:14,760 --> 11:03:18,480
what words tend to be associated with what other words in sequence,
13322
11:03:18,480 --> 11:03:23,520
we can use that to predict what word might come next in a sequence of words.
13323
11:03:23,520 --> 11:03:26,180
And so we could build a Markov chain for language
13324
11:03:26,180 --> 11:03:28,640
in order to try to generate natural language that
13325
11:03:28,640 --> 11:03:33,280
follows the same statistical patterns as some input data.
13326
11:03:33,280 --> 11:03:37,520
So let's take a look at that and build a Markov chain for natural language.
13327
11:03:37,520 --> 11:03:41,960
And as input, I'm going to use the works of William Shakespeare.
13328
11:03:41,960 --> 11:03:45,120
So here I have a file Shakespeare.txt, which
13329
11:03:45,120 --> 11:03:48,120
is just a bunch of the works of William Shakespeare.
13330
11:03:48,120 --> 11:03:51,440
It's a long text file, so plenty of data to analyze.
13331
11:03:51,440 --> 11:03:55,480
And here in generator.py, I'm using a third party Python library
13332
11:03:55,480 --> 11:03:57,520
in order to do this analysis.
13333
11:03:57,520 --> 11:04:00,240
We're going to read in the sample of text,
13334
11:04:00,240 --> 11:04:03,960
and then we're going to train a Markov model based on that text.
13335
11:04:03,960 --> 11:04:07,840
And then we're going to have the Markov chain generate some sentences.
13336
11:04:07,840 --> 11:04:11,520
We're going to generate a sentence that doesn't appear in the original text,
13337
11:04:11,520 --> 11:04:14,920
but that follows the same statistical patterns. It's generated
13338
11:04:14,920 --> 11:04:19,360
based on the n-grams, trying to predict what word is likely to come next
13339
11:04:19,360 --> 11:04:23,120
given those statistical patterns.
13340
11:04:23,120 --> 11:04:27,280
So we'll go ahead and go into our Markov directory,
13341
11:04:27,280 --> 11:04:31,200
run this generator with the works of William Shakespeare as input.
13342
11:04:31,200 --> 11:04:34,760
And what we're going to get are five new sentences, where
13343
11:04:34,760 --> 11:04:37,280
these sentences are not necessarily sentences
13344
11:04:37,280 --> 11:04:39,800
from the original input text itself, but sentences that
13345
11:04:39,800 --> 11:04:41,920
follow the same statistical patterns.
13346
11:04:41,920 --> 11:04:45,720
It's predicting what word is likely to come next based on the input data
13347
11:04:45,720 --> 11:04:47,720
that we've seen and the types of words that
13348
11:04:47,720 --> 11:04:50,200
tend to appear in sequence there too.
13349
11:04:50,200 --> 11:04:53,000
And so we're able to generate these sentences.
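The lecture's generator.py leans on a third-party library for this, but the underlying idea can be sketched in a few lines of plain Python; the training text below is a tiny stand-in for shakespeare.txt:

```python
import random
from collections import defaultdict

# Stand-in training text; the real demo reads shakespeare.txt.
text = "the cat sat on the mat and the cat ran"
words = text.split()

# For each word, record every word observed to follow it; sampling
# from this list reproduces the bigram frequencies of the input.
transitions = defaultdict(list)
for current, following in zip(words, words[1:]):
    transitions[current].append(following)

# Walk the chain: repeatedly pick a plausible next word at random.
random.seed(0)
word = "the"
sentence = [word]
for _ in range(8):
    if word not in transitions:
        break
    word = random.choice(transitions[word])
    sentence.append(word)
print(" ".join(sentence))
```

The output follows the statistical patterns of the input without necessarily being a sentence that appears in it.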
13350
11:04:53,000 --> 11:04:56,360
Of course, so far, there's no guarantee that any of the sentences that
13351
11:04:56,360 --> 11:04:59,040
are generated actually mean anything or make any sense.
13352
11:04:59,040 --> 11:05:01,880
They just happen to follow the statistical patterns
13353
11:05:01,880 --> 11:05:04,040
that our computer is already aware of.
13354
11:05:04,040 --> 11:05:06,520
So we'll return to this issue of how to generate text
13355
11:05:06,520 --> 11:05:09,840
in perhaps a more accurate or more meaningful way a little bit later.
13356
11:05:09,840 --> 11:05:12,800
So let's now turn our attention to a slightly different problem,
13357
11:05:12,800 --> 11:05:15,280
and that's the problem of text classification.
13358
11:05:15,280 --> 11:05:18,360
Text classification is the problem where we have some text
13359
11:05:18,360 --> 11:05:21,320
and we want to put that text into some kind of category.
13360
11:05:21,320 --> 11:05:24,240
We want to apply some sort of label to that text.
13361
11:05:24,240 --> 11:05:27,280
And this kind of problem shows up in a wide variety of places.
13362
11:05:27,280 --> 11:05:29,800
A commonplace might be your email inbox, for example.
13363
11:05:29,800 --> 11:05:31,920
You get an email and you want your computer
13364
11:05:31,920 --> 11:05:35,080
to be able to identify whether the email belongs in your inbox
13365
11:05:35,080 --> 11:05:37,320
or whether it should be filtered out into spam.
13366
11:05:37,320 --> 11:05:39,360
So we need to classify the text.
13367
11:05:39,360 --> 11:05:42,040
Is it a good email or is it spam?
13368
11:05:42,040 --> 11:05:44,760
Another common use case is sentiment analysis.
13369
11:05:44,760 --> 11:05:47,640
We might want to know whether the sentiment of some text
13370
11:05:47,640 --> 11:05:50,080
is positive or negative.
13371
11:05:50,080 --> 11:05:51,280
And so how might we do that?
13372
11:05:51,280 --> 11:05:53,920
This comes up in situations like product reviews,
13373
11:05:53,920 --> 11:05:57,120
where we might have a bunch of reviews for a product on some website.
13374
11:05:57,120 --> 11:05:58,840
"My grandson loved it. So much fun."
13375
11:05:58,840 --> 11:06:00,600
"Product broke after a few days."
13376
11:06:00,600 --> 11:06:03,800
"One of the best games I've played in a long time." And "kind of cheap
13377
11:06:03,800 --> 11:06:05,040
and flimsy, not worth it."
13378
11:06:05,040 --> 11:06:09,600
Here's some example sentences that you might see on a product review website.
13379
11:06:09,600 --> 11:06:12,680
And you and I could pretty easily look at this list of product reviews
13380
11:06:12,680 --> 11:06:15,960
and decide which ones are positive and which ones are negative.
13381
11:06:15,960 --> 11:06:17,880
We might say the first one and the third one,
13382
11:06:17,880 --> 11:06:20,160
those seem like positive sentiment messages.
13383
11:06:20,160 --> 11:06:24,160
But the second one and the fourth one seem like negative sentiment messages.
13384
11:06:24,160 --> 11:06:25,320
But how did we know that?
13385
11:06:25,320 --> 11:06:29,160
And how could we train a computer to be able to figure that out as well?
13386
11:06:29,160 --> 11:06:32,360
Well, you might have keyed in on particular words,
13387
11:06:32,360 --> 11:06:36,520
where those particular words tend to mean something positive or negative.
13388
11:06:36,520 --> 11:06:40,160
So you might have identified that words like loved and fun and best
13389
11:06:40,160 --> 11:06:42,880
tend to be associated with positive messages.
13390
11:06:42,880 --> 11:06:45,360
And words like broke and cheap and flimsy
13391
11:06:45,360 --> 11:06:48,000
tend to be associated with negative messages.
13392
11:06:48,000 --> 11:06:51,000
So if only we could train a computer to be able to learn
13393
11:06:51,000 --> 11:06:55,120
what words tend to be associated with positive versus negative messages,
13394
11:06:55,120 --> 11:06:59,000
then maybe we could train a computer to do this kind of sentiment analysis
13395
11:06:59,000 --> 11:07:00,160
as well.
13396
11:07:00,160 --> 11:07:01,760
So we're going to try to do just that.
13397
11:07:01,760 --> 11:07:05,120
We're going to use a model known as the bag of words model, which
13398
11:07:05,120 --> 11:07:09,720
is a model that represents text as just an unordered collection of words.
13399
11:07:09,720 --> 11:07:11,220
For the purpose of this model, we're not
13400
11:07:11,220 --> 11:07:13,760
going to worry about the sequence and the ordering of the words,
13401
11:07:13,760 --> 11:07:15,600
which word came first, second, or third.
13402
11:07:15,600 --> 11:07:18,440
We're just going to treat the text as a collection of words
13403
11:07:18,440 --> 11:07:19,680
in no particular order.
13404
11:07:19,680 --> 11:07:21,360
And we're losing information there, right?
13405
11:07:21,360 --> 11:07:22,880
The order of words is important.
13406
11:07:22,880 --> 11:07:24,880
And we'll come back to that a little bit later.
13407
11:07:24,880 --> 11:07:26,680
But for now, to simplify our model, it'll
13408
11:07:26,680 --> 11:07:29,440
help us tremendously just to think about text
13409
11:07:29,440 --> 11:07:32,320
as some unordered collection of words.
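A quick illustration of the bag-of-words idea, using Python's Counter as the "bag": any reordering of the same words produces the identical representation.

```python
from collections import Counter

# Represent text as an unordered collection (multiset) of words.
review = "my grandson loved it"
bag = Counter(review.split())

# Word order is discarded, so a shuffled version of the same words
# yields exactly the same bag.
shuffled = Counter("loved it my grandson".split())
print(bag == shuffled)  # True
```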
13410
11:07:32,320 --> 11:07:35,120
And in particular, we're going to use the bag of words model
13411
11:07:35,120 --> 11:07:38,240
to build something known as a naive Bayes classifier.
13412
11:07:38,240 --> 11:07:40,240
So what is a naive Bayes classifier?
13413
11:07:40,240 --> 11:07:43,960
Well, it's a tool that's going to allow us to classify text based on Bayes
13414
11:07:43,960 --> 11:07:47,200
rule, again, which you might remember from when we talked about probability.
13415
11:07:47,200 --> 11:07:51,520
Bayes rule says that the probability of B given A
13416
11:07:51,520 --> 11:07:54,920
is equal to the probability of A given B multiplied
13417
11:07:54,920 --> 11:07:59,480
by the probability of B divided by the probability of A.
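Written symbolically, the rule just stated is:

```latex
P(b \mid a) = \frac{P(a \mid b)\, P(b)}{P(a)}
```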
13418
11:07:59,480 --> 11:08:03,560
So how are we going to use this rule to be able to analyze text?
13419
11:08:03,560 --> 11:08:04,920
Well, what are we interested in?
13420
11:08:04,920 --> 11:08:07,480
We're interested in the probability that a message has
13421
11:08:07,480 --> 11:08:10,360
a positive sentiment and the probability that a message has
13422
11:08:10,360 --> 11:08:12,920
a negative sentiment, which I'm here for simplicity
13423
11:08:12,920 --> 11:08:16,120
going to represent just with these emoji, happy face and frown face,
13424
11:08:16,120 --> 11:08:18,480
as positive and negative sentiment.
13425
11:08:18,480 --> 11:08:22,320
And so if I had a review, something like "my grandson loved it,"
13426
11:08:22,320 --> 11:08:25,460
then what I'm interested in is not just the probability
13427
11:08:25,460 --> 11:08:29,600
that a message has positive sentiment, but the conditional probability
13428
11:08:29,600 --> 11:08:32,120
that a message has positive sentiment given
13429
11:08:32,120 --> 11:08:35,120
that this is the message "my grandson loved it."
13430
11:08:35,120 --> 11:08:38,360
But how do I go about calculating this value, the probability
13431
11:08:38,360 --> 11:08:42,880
that the message is positive given that the review is this sequence of words?
13432
11:08:42,880 --> 11:08:45,280
Well, here's where the bag of words model comes in.
13433
11:08:45,280 --> 11:08:49,680
Rather than treat this review as a sequence of words in order,
13434
11:08:49,680 --> 11:08:52,840
we're just going to treat it as an unordered collection of words.
13435
11:08:52,840 --> 11:08:56,600
We're going to try to calculate the probability that the review is positive
13436
11:08:56,600 --> 11:08:59,800
given that all of these words, my grandson loved it,
13437
11:08:59,800 --> 11:09:02,400
are in the review in no particular order, just
13438
11:09:02,400 --> 11:09:05,120
this unordered collection of words.
13439
11:09:05,120 --> 11:09:09,920
And this is a conditional probability, which we can then apply Bayes rule
13440
11:09:09,920 --> 11:09:11,680
to try to make sense of.
13441
11:09:11,680 --> 11:09:16,080
And so according to Bayes rule, this conditional probability is equal to what?
13442
11:09:16,080 --> 11:09:19,480
It's equal to the probability that all of these four words
13443
11:09:19,480 --> 11:09:23,180
are in the review given that the review is positive multiplied
13444
11:09:23,180 --> 11:09:27,280
by the probability that the review is positive divided by the probability
13445
11:09:27,280 --> 11:09:30,680
that all of these words happen to be in the review.
13446
11:09:30,680 --> 11:09:33,880
So this is the value now that we're going to try to calculate.
13447
11:09:33,880 --> 11:09:36,440
Now, one thing you might notice is that the denominator here,
13448
11:09:36,440 --> 11:09:40,000
the probability that all of these words appear in the review,
13449
11:09:40,000 --> 11:09:42,280
doesn't actually depend on whether or not
13450
11:09:42,280 --> 11:09:45,680
we're looking at the positive sentiment or negative sentiment case.
13451
11:09:45,680 --> 11:09:47,640
So we can actually get rid of this denominator.
13452
11:09:47,640 --> 11:09:48,880
We don't need to calculate it.
13453
11:09:48,880 --> 11:09:53,280
We can just say that this probability is proportional to the numerator.
13454
11:09:53,280 --> 11:09:56,140
And then at the end, we're going to need to normalize the probability
13455
11:09:56,140 --> 11:10:00,840
distribution to make sure that all of the values sum up to the value 1.
13456
11:10:00,840 --> 11:10:03,480
So now, how do we calculate this value?
13457
11:10:03,480 --> 11:10:08,120
Well, this is the probability of all of these words given positive times
13458
11:10:08,120 --> 11:10:09,920
the probability of positive.
13459
11:10:09,920 --> 11:10:12,640
And that, by the definition of joint probability,
13460
11:10:12,640 --> 11:10:15,680
is just one big joint probability, the probability
13461
11:10:15,680 --> 11:10:18,840
that all of these things are the case, that it's a positive review,
13462
11:10:18,840 --> 11:10:22,760
and that all four of these words are in the review.
13463
11:10:22,760 --> 11:10:26,720
But still, it's not entirely obvious how we calculate that value.
13464
11:10:26,720 --> 11:10:28,960
And here is where we need to make one more assumption.
13465
11:10:28,960 --> 11:10:32,240
And this is where the naive part of naive Bayes comes in.
13466
11:10:32,240 --> 11:10:34,880
We're going to make the assumption that all of the words
13467
11:10:34,880 --> 11:10:36,920
are independent of each other.
13468
11:10:36,920 --> 11:10:40,920
And by that, I mean that if the word grandson is in the review,
13469
11:10:40,920 --> 11:10:43,880
that doesn't change the probability that the word loved is in the review
13470
11:10:43,880 --> 11:10:46,320
or that the word it is in the review, for example.
13471
11:10:46,320 --> 11:10:48,840
And in practice, this assumption might not be true.
13472
11:10:48,840 --> 11:10:51,320
It's almost certainly the case that the probabilities of words
13473
11:10:51,320 --> 11:10:52,840
do depend on each other.
13474
11:10:52,840 --> 11:10:56,040
But it's going to simplify our analysis and still give us reasonably good
13475
11:10:56,040 --> 11:10:59,760
results just to assume that the words are independent of each other
13476
11:10:59,760 --> 11:11:03,880
and they only depend on whether it's positive or negative.
13477
11:11:03,880 --> 11:11:06,400
You might, for example, expect the word loved
13478
11:11:06,400 --> 11:11:10,480
to appear more often in a positive review than in a negative review.
13479
11:11:10,480 --> 11:11:11,640
So what does that mean?
13480
11:11:11,640 --> 11:11:13,600
Well, if we make this assumption, then we
13481
11:11:13,600 --> 11:11:16,840
can say that this value, the probability we're interested in,
13482
11:11:16,840 --> 11:11:22,200
is not directly proportional to, but it's naively proportional to this value.
13483
11:11:22,200 --> 11:11:26,280
The probability that the review is positive times the probability
13484
11:11:26,280 --> 11:11:29,120
that my is in the review, given that it's positive,
13485
11:11:29,120 --> 11:11:31,640
times the probability that grandson is in the review,
13486
11:11:31,640 --> 11:11:34,640
given that it's positive, and so on for the other two words that
13487
11:11:34,640 --> 11:11:36,320
happen to be in this review.
13488
11:11:36,320 --> 11:11:39,080
And now this value, which looks a little more complex,
13489
11:11:39,080 --> 11:11:42,720
is actually a value that we can calculate pretty easily.
13490
11:11:42,720 --> 11:11:46,320
So how are we going to estimate the probability that the review is positive?
13491
11:11:46,320 --> 11:11:50,360
Well, if we have some training data, some example data of example reviews
13492
11:11:50,360 --> 11:11:53,240
where each one has already been labeled as positive or negative,
13493
11:11:53,240 --> 11:11:56,280
then we can estimate the probability that a review is positive
13494
11:11:56,280 --> 11:11:58,760
just by counting the number of positive samples
13495
11:11:58,760 --> 11:12:02,520
and dividing by the total number of samples that we have in our training
13496
11:12:02,520 --> 11:12:03,600
data.
13497
11:12:03,600 --> 11:12:06,800
And for the conditional probabilities, the probability of loved,
13498
11:12:06,800 --> 11:12:08,760
given that it's positive, well, that's going
13499
11:12:08,760 --> 11:12:11,760
to be the number of positive samples with loved in it
13500
11:12:11,760 --> 11:12:15,360
divided by the total number of positive samples.
13501
11:12:15,360 --> 11:12:17,880
So let's take a look at an actual example to see how
13502
11:12:17,880 --> 11:12:19,760
we could try to calculate these values.
13503
11:12:19,760 --> 11:12:21,840
Here I've put together some sample data.
13504
11:12:21,840 --> 11:12:24,960
The way to interpret the sample data is that based on the training data,
13505
11:12:24,960 --> 11:12:29,200
49% of the reviews are positive, 51% are negative.
13506
11:12:29,200 --> 11:12:33,480
And then over here in this table, we have some conditional probabilities.
13507
11:12:33,480 --> 11:12:35,800
For example, if the review is positive,
13508
11:06:35,800 --> 11:06:38,720
then there is a 30% chance that "my" appears in it.
13509
11:06:38,720 --> 11:06:42,880
And if the review is negative, there is a 20% chance that "my" appears in it.
13510
11:12:42,880 --> 11:12:45,840
And based on our training data among the positive reviews,
13511
11:12:45,840 --> 11:12:48,520
1% of them contain the word grandson.
13512
11:12:48,520 --> 11:12:52,360
And among the negative reviews, 2% contain the word grandson.
13513
11:12:52,360 --> 11:12:56,400
So using this data, let's try to calculate this value,
13514
11:12:56,400 --> 11:12:57,880
the value we're interested in.
13515
11:12:57,880 --> 11:13:02,040
And to do that, we'll need to multiply all of these values together.
13516
11:13:02,040 --> 11:13:04,280
The probability of positive, and then all
13517
11:13:04,280 --> 11:13:06,960
of these positive conditional probabilities.
13518
11:13:06,960 --> 11:13:09,400
And when we do that, we get some value.
13519
11:13:09,400 --> 11:13:12,160
And then we can do the same thing for the negative case.
13520
11:13:12,160 --> 11:13:15,520
We're going to do the same thing, take the probability that it's negative,
13521
11:13:15,520 --> 11:13:18,480
multiply it by all of these conditional probabilities,
13522
11:13:18,480 --> 11:13:20,680
and we're going to get some other value.
13523
11:13:20,680 --> 11:13:22,400
And now these values don't sum to one.
13524
11:13:22,400 --> 11:13:24,520
They're not a probability distribution yet.
13525
11:13:24,520 --> 11:13:27,320
But I can normalize them and get some values.
13526
11:13:27,320 --> 11:13:31,320
And that tells me what we're going to predict for "my grandson loved it."
13527
11:13:31,320 --> 11:13:35,400
We think there's a 68% chance, probability 0.68,
13528
11:13:35,400 --> 11:13:40,080
that that is a positive sentiment review, and 0.32 probability
13529
11:13:40,080 --> 11:13:42,160
that it's a negative review.
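The arithmetic can be reproduced directly. P(positive) = 0.49, P("my" | positive) = 0.30, and P("grandson" | positive) = 0.01 come from the figures above; the conditional probabilities for "loved" and "it" are assumed illustrative values chosen to be consistent with the 0.68 result:

```python
# Numerator for the positive case: P(positive) times each word's
# conditional probability given positive. The values for "loved"
# (0.32) and "it" (0.30) are assumed for illustration.
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30

# Same product for the negative case ("loved": 0.08, "it": 0.40 assumed).
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40

# Normalize so the two values form a probability distribution.
total = p_positive + p_negative
print(round(p_positive / total, 2))  # 0.68
print(round(p_negative / total, 2))  # 0.32
```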
13530
11:13:42,160 --> 11:13:44,480
So what problems might we run into here?
13531
11:13:44,480 --> 11:13:47,920
What could potentially go wrong when doing this kind of analysis
13532
11:13:47,920 --> 11:13:51,720
in order to analyze whether text has a positive or negative sentiment?
13533
11:13:51,720 --> 11:13:53,800
Well, a couple of problems might arise.
13534
11:13:53,800 --> 11:13:57,480
One problem might be, what if the word grandson never
13535
11:13:57,480 --> 11:14:00,960
appears in any of the positive reviews?
13536
11:14:00,960 --> 11:14:03,720
If that were the case, then when we try to calculate the value,
13537
11:14:03,720 --> 11:14:06,360
the probability that we think the review is positive,
13538
11:14:06,360 --> 11:14:08,600
we're going to multiply all these values together,
13539
11:14:08,600 --> 11:14:11,120
and we're just going to get 0 for the positive case,
13540
11:14:11,120 --> 11:14:14,520
because we're ultimately going to multiply by that 0 value.
13541
11:14:14,520 --> 11:14:17,440
And so we're going to say that we think there is no chance
13542
11:14:17,440 --> 11:14:20,560
that the review is positive because it contains the word grandson.
13543
11:14:20,560 --> 11:14:23,040
And in our training data, we've never seen the word grandson
13544
11:14:23,040 --> 11:14:27,040
appear in a positive sentiment message before.
13545
11:14:27,040 --> 11:14:29,360
And that's probably not the right analysis,
13546
11:14:29,360 --> 11:14:32,080
because in cases of rare words, it might be the case
13547
11:14:32,080 --> 11:14:34,280
that in nowhere in our training data did we ever
13548
11:14:34,280 --> 11:14:38,360
see the word grandson appear in a message that has positive sentiment.
13549
11:14:38,360 --> 11:14:40,320
So what can we do to solve this problem?
13550
11:14:40,320 --> 11:14:43,160
Well, one thing we'll often do is some kind of additive smoothing,
13551
11:14:43,160 --> 11:14:46,640
where we add some value alpha to each value in our distribution
13552
11:14:46,640 --> 11:14:48,480
just to smooth out the data a little bit.
13553
11:14:48,480 --> 11:14:50,920
And a common form of this is Laplace smoothing,
13554
11:14:50,920 --> 11:14:53,680
where we add 1 to each value in our distribution.
13555
11:14:53,680 --> 11:14:56,880
In essence, we pretend we've seen each value one more time
13556
11:14:56,880 --> 11:14:58,000
than we actually have.
13557
11:14:58,000 --> 11:15:01,160
So if we've never seen the word grandson for a positive review,
13558
11:15:01,160 --> 11:15:02,400
we pretend we've seen it once.
13559
11:15:02,400 --> 11:15:04,880
If we've seen it once, we pretend we've seen it twice,
13560
11:15:04,880 --> 11:15:09,600
just to avoid the possibility that we might multiply by 0 and as a result,
13561
11:15:09,600 --> 11:15:12,560
get some results we don't want in our analysis.
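A sketch of the smoothing step, with hypothetical counts: suppose there were 100 positive training samples and zero of them contained "grandson."

```python
# Hypothetical counts from training data.
positive_samples = 100
positive_with_grandson = 0  # "grandson" never seen in a positive review

# Without smoothing, the estimate is 0, which zeroes out the entire
# product when the conditional probabilities are multiplied together.
unsmoothed = positive_with_grandson / positive_samples

# Laplace smoothing: pretend the word was seen once more than it was.
# The denominator grows by 2, one pretend sample for each outcome
# (word present, word absent).
smoothed = (positive_with_grandson + 1) / (positive_samples + 2)

print(unsmoothed)  # 0.0
print(smoothed)    # small but nonzero
```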
13562
11:15:12,560 --> 11:15:14,520
So let's see what this looks like in practice.
13563
11:15:14,520 --> 11:15:18,360
Let's try to do some naive Bayes classification in order
13564
11:15:18,360 --> 11:15:22,360
to classify text as either positive or negative.
13565
11:15:22,360 --> 11:15:25,480
We'll take a look at sentiment.py.
13566
11:15:25,480 --> 11:15:28,960
And what this is going to do is load some sample data into memory,
13567
11:15:28,960 --> 11:15:32,440
some examples of positive reviews and negative reviews.
13568
11:15:32,440 --> 11:15:35,980
And then we're going to train a naive Bayes classifier
13569
11:15:35,980 --> 11:15:39,080
on all of this training data, training data that
13570
11:15:39,080 --> 11:15:42,260
includes all of the words we see in positive reviews
13571
11:15:42,260 --> 11:15:44,920
and all of the words we see in negative reviews.
13572
11:15:44,920 --> 11:15:48,160
And then we're going to try to classify some input.
13573
11:15:48,160 --> 11:15:50,840
And so we're going to do this based on a corpus of data.
13574
11:15:50,840 --> 11:15:52,520
I have some example positive reviews.
13575
11:15:52,520 --> 11:15:53,840
Here are some positive reviews.
13576
11:15:53,840 --> 11:15:56,080
"It was great," "so much fun," for example.
13577
11:15:56,080 --> 11:15:59,060
And then some negative reviews: "not worth it," "kind of cheap."
13578
11:15:59,060 --> 11:16:02,080
These are some examples of negative reviews.
13579
11:16:02,080 --> 11:16:04,640
So now let's try to run this classifier and see
13580
11:16:04,640 --> 11:16:09,400
how it would classify particular text as either positive or negative.
13581
11:16:09,400 --> 11:16:14,360
We'll go ahead and run our sentiment analysis on this corpus.
13582
11:16:14,360 --> 11:16:16,080
And we need to provide it with a review.
13583
11:16:16,080 --> 11:16:19,600
So I'll say something like, "I enjoyed it."
13584
11:16:19,600 --> 11:16:23,880
And we see that the classifier says there is about a 0.92 probability
13585
11:16:23,880 --> 11:16:27,120
that we think that this particular review is positive.
13586
11:16:27,120 --> 11:16:28,520
Let's try something negative.
13587
11:16:28,520 --> 11:16:31,720
We'll try "kind of overpriced."
13588
11:16:31,720 --> 11:16:34,400
And we see that there is a 0.96 probability
13589
11:16:34,400 --> 11:16:37,280
now that we think that this particular review is negative.
13590
11:16:37,280 --> 11:16:40,600
And so our naive Bayes classifier has learned what kinds of words
13591
11:16:40,600 --> 11:16:43,800
tend to appear in positive reviews and what kinds of words
13592
11:16:43,800 --> 11:16:45,480
tend to appear in negative reviews.
13593
11:16:45,480 --> 11:16:49,100
And as a result of that, we've been able to design a classifier that
13594
11:16:49,100 --> 11:16:54,240
can predict whether a particular review is positive or negative.
13595
11:16:54,240 --> 11:16:56,800
And so this definitely is a useful tool that we can use
13596
11:16:56,800 --> 11:16:58,400
to try and make some predictions.
13597
11:16:58,400 --> 11:17:01,000
But we had to make some assumptions in order to get there.
13598
11:17:01,000 --> 11:17:04,160
So what if we want to now try to build some more sophisticated models,
13599
11:17:04,160 --> 11:17:07,100
use some tools from machine learning to try and take
13600
11:17:07,100 --> 11:17:09,560
better advantage of language data to be able to draw
13601
11:17:09,560 --> 11:17:12,320
more accurate conclusions and solve new kinds of tasks
13602
11:17:12,320 --> 11:17:13,840
and new kinds of problems?
13603
11:17:13,840 --> 11:17:17,280
Well, we've seen a couple of times now that when we want to take some input
13604
11:17:17,280 --> 11:17:19,480
and put it in a form that the computer is
13605
11:17:19,480 --> 11:17:22,760
going to be able to make sense of, it can be helpful to take that data
13606
11:17:22,760 --> 11:17:25,040
and turn it into numbers, ultimately.
13607
11:17:25,040 --> 11:17:27,200
And so what we might want to try to do is come up
13608
11:17:27,200 --> 11:17:30,860
with some word representation, some way to take a word
13609
11:17:30,860 --> 11:17:33,480
and translate its meaning into numbers.
13610
11:17:33,480 --> 11:17:35,940
Because, for example, if we wanted to use a neural network
13611
11:17:35,940 --> 11:17:39,080
to be able to process language, give our language to a neural network
13612
11:17:39,080 --> 11:17:42,400
and have it make some predictions or perform some analysis there,
13613
11:17:42,400 --> 11:17:45,920
a neural network takes as its input and produces as its output
13614
11:17:45,920 --> 11:17:48,520
a vector of values, a vector of numbers.
13615
11:17:48,520 --> 11:17:51,280
And so what we might want to do is take our data
13616
11:17:51,280 --> 11:17:54,800
and somehow take words and convert them into some kind
13617
11:17:54,800 --> 11:17:56,760
of numeric representation.
13618
11:17:56,760 --> 11:17:57,880
So how might we do that?
13619
11:17:57,880 --> 11:18:01,600
How might we take words and turn them into numbers?
13620
11:18:01,600 --> 11:18:03,440
Let's take a look at an example.
13621
11:18:03,440 --> 11:18:05,680
Here's a sentence, he wrote a book.
13622
11:18:05,680 --> 11:18:08,080
And let's say I wanted to take each of those words
13623
11:18:08,080 --> 11:18:10,200
and turn it into a vector of values.
13624
11:18:10,200 --> 11:18:11,640
Here's one way I might do that.
13625
11:18:11,640 --> 11:18:15,720
We'll say he is going to be a vector that has a 1 in the first position
13626
11:18:15,720 --> 11:18:17,720
and the rest of the values are 0.
13627
11:18:17,720 --> 11:18:20,680
Wrote will have a 1 in the second position and the rest of the values
13628
11:18:20,680 --> 11:18:21,560
are 0.
13629
11:18:21,560 --> 11:18:24,960
A has a 1 in the third position with the rest of the values 0.
13630
11:18:24,960 --> 11:18:28,760
And book has a 1 in the fourth position with the rest of the values 0.
13631
11:18:28,760 --> 11:18:33,360
So each of these words now has a distinct vector representation.
13632
11:18:33,360 --> 11:18:36,760
And this is what we often call a one-hot representation,
13633
11:18:36,760 --> 11:18:41,400
a representation of the meaning of a word as a vector with a single 1
13634
11:18:41,400 --> 11:18:43,920
and all of the rest of the values are 0.
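The one-hot scheme just described can be sketched in a few lines of Python. This is an illustration, not the course's own code:

```python
# One-hot encoding for the sentence "he wrote a book":
# each word gets a vector with a single 1 in its own position.
sentence = ["he", "wrote", "a", "book"]
vocabulary = list(dict.fromkeys(sentence))  # unique words, in order

def one_hot(word):
    return [1 if w == word else 0 for w in vocabulary]

print(one_hot("he"))    # [1, 0, 0, 0]
print(one_hot("book"))  # [0, 0, 0, 1]
```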
13635
11:18:43,920 --> 11:18:47,480
And so when doing this, we now have a numeric representation for every word
13636
11:18:47,480 --> 11:18:50,120
and we could pass in those vector representations
13637
11:18:50,120 --> 11:18:52,520
into a neural network or other models that
13638
11:18:52,520 --> 11:18:55,840
require some kind of numeric data as input.
13639
11:18:55,840 --> 11:18:59,080
But this one-hot representation actually has a couple of problems
13640
11:18:59,080 --> 11:19:01,360
and it's not ideal for a few reasons.
13641
11:19:01,360 --> 11:19:03,960
One reason is, here we're just looking at four words.
13642
11:19:03,960 --> 11:19:07,720
But if you imagine a vocabulary of thousands of words or more,
13643
11:19:07,720 --> 11:19:09,720
these vectors are going to get quite long in order
13644
11:19:09,720 --> 11:19:14,160
to have a distinct vector for every possible word in a vocabulary.
13645
11:19:14,160 --> 11:19:16,280
And as a result of that, these longer vectors
13646
11:19:16,280 --> 11:19:19,280
are going to be more difficult to deal with, more difficult to train,
13647
11:19:19,280 --> 11:19:19,760
and so forth.
13648
11:19:19,760 --> 11:19:21,720
And so that might be a problem.
13649
11:19:21,720 --> 11:19:24,280
Another problem is a little bit more subtle.
13650
11:19:24,280 --> 11:19:27,040
If we want to represent a word as a vector,
13651
11:19:27,040 --> 11:19:29,880
and in particular the meaning of a word as a vector,
13652
11:19:29,880 --> 11:19:33,960
then ideally it should be the case that words that have similar meanings
13653
11:19:33,960 --> 11:19:36,880
should also have similar vector representations,
13654
11:19:36,880 --> 11:19:40,800
so that they're close together inside a vector space.
13655
11:19:40,800 --> 11:19:44,400
But that's not really going to be the case with these one-hot representations,
13656
11:19:44,400 --> 11:19:46,840
because if we take some similar words, say the word
13657
11:19:46,840 --> 11:19:50,240
wrote and the word authored, which means similar things,
13658
11:19:50,240 --> 11:19:54,040
they have entirely different vector representations.
13659
11:19:54,040 --> 11:19:57,880
Likewise, book and novel, those two words mean somewhat similar things,
13660
11:19:57,880 --> 11:20:00,840
but they have entirely different vector representations
13661
11:20:00,840 --> 11:20:04,120
because they each have a one in some different position.
13662
11:20:04,120 --> 11:20:05,960
And so that's not ideal either.
13663
11:20:05,960 --> 11:20:08,080
So what we might be interested in instead
13664
11:20:08,080 --> 11:20:10,640
is some kind of distributed representation.
13665
11:20:10,640 --> 11:20:13,320
A distributed representation is the representation
13666
11:20:13,320 --> 11:20:17,200
of the meaning of a word distributed across multiple values,
13667
11:20:17,200 --> 11:20:20,720
instead of just being one-hot with a one in one position.
13668
11:20:20,720 --> 11:20:25,000
Here is what a distributed representation of words might be.
13669
11:20:25,000 --> 11:20:28,360
Each word is associated with some vector of values,
13670
11:20:28,360 --> 11:20:31,080
with the meaning distributed across multiple values,
13671
11:20:31,080 --> 11:20:34,320
ideally in such a way that similar words have
13672
11:20:34,320 --> 11:20:37,080
a similar vector representation.
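A distributed representation might look like the following sketch, where each word's meaning is spread across several values. The vectors here are hypothetical, chosen so that similar words come out numerically close:

```python
# Hypothetical distributed representations: similar words
# (wrote/authored, book/novel) get similar vectors.
embeddings = {
    "wrote":    [0.5, 0.2, 0.9],
    "authored": [0.6, 0.2, 0.8],
    "book":     [0.1, 0.9, 0.3],
    "novel":    [0.2, 0.8, 0.3],
}

def sq_dist(v1, v2):
    # Squared Euclidean distance between two vectors
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

# Unlike one-hot vectors, similar meanings are now nearby in space
print(sq_dist(embeddings["wrote"], embeddings["authored"]))  # small
print(sq_dist(embeddings["wrote"], embeddings["book"]))      # large
```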
13673
11:20:37,080 --> 11:20:39,080
But how are we going to come up with those values?
13674
11:20:39,080 --> 11:20:40,600
Where do those values come from?
13675
11:20:40,600 --> 11:20:45,800
How can we define the meaning of a word in this distributed sequence of numbers?
13676
11:20:45,800 --> 11:20:47,840
Well, to do that, we're going to draw inspiration
13677
11:20:47,840 --> 11:20:50,880
from a quote from British linguist J.R. Firth, who said,
13678
11:20:50,880 --> 11:20:54,200
you shall know a word by the company it keeps.
13679
11:20:54,200 --> 11:20:56,920
In other words, we're going to define the meaning of a word
13680
11:20:56,920 --> 11:21:01,160
based on the words that appear around it, the context words around it.
13681
11:21:01,160 --> 11:21:05,200
Take, for example, this context, for blank he ate.
13682
11:21:05,200 --> 11:21:08,760
You might wonder, what words could reasonably fill in that blank?
13683
11:21:08,760 --> 11:21:11,920
Well, it might be words like breakfast or lunch or dinner.
13684
11:21:11,920 --> 11:21:14,520
All of those could reasonably fill in that blank.
13685
11:21:14,520 --> 11:21:17,920
And so what we're going to say is because the words breakfast and lunch
13686
11:21:17,920 --> 11:21:23,240
and dinner appear in a similar context, that they must have a similar meaning.
13687
11:21:23,240 --> 11:21:26,400
And that's something our computer could understand and try to learn.
13688
11:21:26,400 --> 11:21:28,880
A computer could look at a big corpus of text,
13689
11:21:28,880 --> 11:21:32,360
look at what words tend to appear in similar context to each other,
13690
11:21:32,360 --> 11:21:35,880
and use that to identify which words have a similar meaning
13691
11:21:35,880 --> 11:21:40,240
and should therefore appear close to each other inside a vector space.
13692
11:21:40,240 --> 11:21:44,200
And so one common model for doing this is known as the Word2Vec model.
13693
11:21:44,200 --> 11:21:48,640
It's a model for generating word vectors, a vector representation for every word
13694
11:21:48,640 --> 11:21:52,960
by looking at data and looking at the context in which a word appears.
13695
11:21:52,960 --> 11:21:54,240
The idea is going to be this.
13696
11:21:54,240 --> 11:21:58,680
If you start out with all of the words just in some random position in space
13697
11:21:58,680 --> 11:22:02,640
and train it on some training data, what the Word2Vec model will do
13698
11:22:02,640 --> 11:22:05,880
is start to learn what words appear in similar contexts.
13699
11:22:05,880 --> 11:22:08,720
And it will move these vectors around in such a way
13700
11:22:08,720 --> 11:22:12,600
that hopefully words with similar meanings, breakfast, lunch, and dinner,
13701
11:22:12,600 --> 11:22:17,040
book, memoir, novel, will appear near to each other
13702
11:22:17,040 --> 11:22:19,080
as vectors as well.
13703
11:22:19,080 --> 11:22:21,280
So let's now take a look at what Word2Vec
13704
11:22:21,280 --> 11:22:24,880
might look like in practice when implemented in code.
13705
11:22:24,880 --> 11:22:29,560
What I have here inside of words.txt is a pre-trained model
13706
11:22:29,560 --> 11:22:32,640
where each of these words has some vector representation
13707
11:22:32,640 --> 11:22:33,960
trained by Word2Vec.
13708
11:22:33,960 --> 11:22:38,600
Each of these words has some sequence of values representing its meaning,
13709
11:22:38,600 --> 11:22:43,600
hopefully in such a way that similar words are represented by similar vectors.
13710
11:22:43,600 --> 11:22:47,280
I also have this file vectors.py, which is going to open up the words
13711
11:22:47,280 --> 11:22:48,800
and form them into a dictionary.
13712
11:22:48,800 --> 11:22:51,400
And we also define some useful functions like distance
13713
11:22:51,400 --> 11:22:55,160
to get the distance between two word vectors and closest words
13714
11:22:55,160 --> 11:23:00,200
to find which words are nearby in terms of having close vectors to each other.
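The course's vectors.py is not shown here, but the two helpers it describes could plausibly look like this sketch. Cosine distance is a common choice (0 means pointing the same way, larger values mean further apart), and the toy embeddings are invented stand-ins for the real pre-trained vectors:

```python
import math

def distance(v1, v2):
    # Cosine distance between two word vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    return 1 - dot / (math.sqrt(sum(a * a for a in v1)) *
                      math.sqrt(sum(b * b for b in v2)))

def closest_words(words, target, n=10):
    # Sort the whole vocabulary by distance to the target word's vector
    return sorted(words, key=lambda w: distance(words[w], words[target]))[:n]

# Toy embeddings standing in for the real pre-trained vectors
words = {
    "book":      [0.9, 0.1, 0.1],
    "novel":     [0.8, 0.2, 0.1],
    "breakfast": [0.1, 0.9, 0.4],
    "lunch":     [0.2, 0.8, 0.5],
}
print(closest_words(words, "book", n=2))  # ['book', 'novel']
```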
13715
11:23:00,200 --> 11:23:02,360
And so let's give this a try.
13716
11:23:02,360 --> 11:23:05,760
We'll go ahead and open a Python interpreter.
13717
11:23:05,760 --> 11:23:10,160
And I'm going to import these vectors.
13718
11:23:10,160 --> 11:23:13,360
And we might say, all right, what is the vector representation
13719
11:23:13,360 --> 11:23:15,680
of the word book?
13720
11:23:15,680 --> 11:23:19,520
And we get this big long vector that represents the word book
13721
11:23:19,520 --> 11:23:21,120
as a sequence of values.
13722
11:23:21,120 --> 11:23:24,320
And this sequence of values by itself is not all that meaningful.
13723
11:23:24,320 --> 11:23:27,440
But it is meaningful in the context of comparing it
13724
11:23:27,440 --> 11:23:30,400
to other vectors for other words.
13725
11:23:30,400 --> 11:23:32,280
So we could use this distance function, which
13726
11:23:32,280 --> 11:23:35,520
is going to get us the distance between two word vectors.
13727
11:23:35,520 --> 11:23:37,880
And we might say, what is the distance between the vector
13728
11:23:37,880 --> 11:23:42,200
representation for the word book and the vector representation
13729
11:23:42,200 --> 11:23:44,320
for the word novel?
13730
11:23:44,320 --> 11:23:46,280
And we see that it's 0.34.
13731
11:23:46,280 --> 11:23:49,360
You can kind of interpret 0 as being really close together and 1
13732
11:23:49,360 --> 11:23:51,040
being very far apart.
13733
11:23:51,040 --> 11:23:55,840
And so now, what is the distance between book and, let's say, breakfast?
13734
11:23:55,840 --> 11:23:58,560
Well, book and breakfast are more different from each other
13735
11:23:58,560 --> 11:23:59,840
than book and novel are.
13736
11:23:59,840 --> 11:24:02,600
So I would hopefully expect the distance to be larger.
13737
11:24:02,600 --> 11:24:05,600
And in fact, it is 0.64 approximately.
13738
11:24:05,600 --> 11:24:08,440
These two words are further away from each other.
13739
11:24:08,440 --> 11:24:13,600
And what about now the distance between, let's say, lunch and breakfast?
13740
11:24:13,600 --> 11:24:15,040
Well, that's about 0.2.
13741
11:24:15,040 --> 11:24:16,400
Those are even closer together.
13742
11:24:16,400 --> 11:24:19,920
They have a meaning that is closer to each other.
13743
11:24:19,920 --> 11:24:24,400
Another interesting thing we might do is calculate the closest words.
13744
11:24:24,400 --> 11:24:28,200
We might say, what are the closest words, according to Word2Vec,
13745
11:24:28,200 --> 11:24:29,840
to the word book?
13746
11:24:29,840 --> 11:24:32,120
And let's say, let's get the 10 closest words.
13747
11:24:32,120 --> 11:24:35,960
What are the 10 closest vectors to the vector representation
13748
11:24:35,960 --> 11:24:37,680
for the word book?
13749
11:24:37,680 --> 11:24:40,920
And when we perform that analysis, we get this list of words.
13750
11:24:40,920 --> 11:24:44,760
The closest one is book itself, but we also have books plural,
13751
11:24:44,760 --> 11:24:48,760
and then essay, memoir, essays, novella, anthology, and so on.
13752
11:24:48,760 --> 11:24:52,240
All of these words mean something similar to the word book,
13753
11:24:52,240 --> 11:24:54,320
according to Word2Vec, at least, because they
13754
11:24:54,320 --> 11:24:56,560
have a similar vector representation.
13755
11:24:56,560 --> 11:24:59,240
So it seems like we've done a pretty good job of trying
13756
11:24:59,240 --> 11:25:03,920
to capture this kind of vector representation of word meaning.
13757
11:25:03,920 --> 11:25:06,720
One other interesting side effect of Word2Vec
13758
11:25:06,720 --> 11:25:10,160
is that it's also able to capture something about the relationships
13759
11:25:10,160 --> 11:25:12,080
between words as well.
13760
11:25:12,080 --> 11:25:13,480
Let's take a look at an example.
13761
11:25:13,480 --> 11:25:16,880
Here, for instance, are two words, man and king.
13762
11:25:16,880 --> 11:25:20,480
And these are each represented by Word2Vec as vectors.
13763
11:25:20,480 --> 11:25:23,960
So what might happen if I subtracted one from the other,
13764
11:25:23,960 --> 11:25:27,360
calculated the value king minus man?
13765
11:25:27,360 --> 11:25:31,040
Well, that will be the vector that will take us from man to king,
13766
11:25:31,040 --> 11:25:35,000
somehow represent this relationship between the vector representation
13767
11:25:35,000 --> 11:25:38,960
of the word man and the vector representation of the word king.
13768
11:25:38,960 --> 11:25:42,520
And that's what this value, king minus man, represents.
13769
11:25:42,520 --> 11:25:45,920
So what would happen if I took the vector representation of the word
13770
11:25:45,920 --> 11:25:51,200
woman and added that same value, king minus man, to it?
13771
11:25:51,200 --> 11:25:54,720
What would we get as the closest word to that, for example?
13772
11:25:54,720 --> 11:25:55,680
Well, we could try it.
13773
11:25:55,680 --> 11:25:59,880
Let's go ahead and go back to our Python interpreter and give this a try.
13774
11:25:59,880 --> 11:26:03,680
I could say, what is the closest word to the vector representation
13775
11:26:03,680 --> 11:26:07,440
of the word king minus the representation of the word man
13776
11:26:07,440 --> 11:26:11,440
plus the representation of the word woman?
13777
11:26:11,440 --> 11:26:14,320
And we see that the closest word is the word queen.
13778
11:26:14,320 --> 11:26:17,760
We've somehow been able to capture the relationship between king and man.
13779
11:26:17,760 --> 11:26:19,920
And then when we apply it to the word woman,
13780
11:26:19,920 --> 11:26:24,120
we get, as the result, the word queen.
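The king minus man plus woman arithmetic can be sketched with hypothetical two-dimensional toy embeddings, deliberately laid out so the analogy works; real Word2Vec vectors have hundreds of dimensions and learn this structure from data:

```python
# Hypothetical 2-D embeddings laid out so that the
# gender and royalty directions are consistent.
words = {
    "man":   [1.0, 1.0],
    "woman": [1.0, 2.0],
    "king":  [3.0, 1.0],
    "queen": [3.0, 2.0],
}

def add(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def subtract(v1, v2):
    return [a - b for a, b in zip(v1, v2)]

def closest_word(target):
    # Nearest word by squared Euclidean distance
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min(words, key=lambda w: dist(words[w]))

result = closest_word(add(subtract(words["king"], words["man"]), words["woman"]))
print(result)  # queen
```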
13781
11:26:24,120 --> 11:26:27,320
So Word2Vec has been able to capture not just the words
13782
11:26:27,320 --> 11:26:29,720
and how they're similar to each other, but also something
13783
11:26:29,720 --> 11:26:33,400
about the relationships between words and how those words are connected
13784
11:26:33,400 --> 11:26:34,760
to each other.
13785
11:26:34,760 --> 11:26:37,280
So now that we have this vector representation of words,
13786
11:26:37,280 --> 11:26:38,600
what can we now do with it?
13787
11:26:38,600 --> 11:26:40,680
Now we can represent words as numbers.
13788
11:26:40,680 --> 11:26:43,480
And so we might try to pass those words as input
13789
11:26:43,480 --> 11:26:45,080
to, say, a neural network.
13790
11:26:45,080 --> 11:26:47,200
Neural networks we've seen are very powerful tools
13791
11:26:47,200 --> 11:26:50,640
for identifying patterns and making predictions.
13792
11:26:50,640 --> 11:26:53,800
Recall that a neural network you can think of as all of these units.
13793
11:26:53,800 --> 11:26:55,720
But really what the neural network is doing
13794
11:26:55,720 --> 11:26:58,720
is taking some input, passing it into the network,
13795
11:26:58,720 --> 11:27:00,360
and then producing some output.
13796
11:27:00,360 --> 11:27:02,800
And by providing the neural network with training data,
13797
11:27:02,800 --> 11:27:05,600
we're able to update the weights inside of the network
13798
11:27:05,600 --> 11:27:09,160
so that the neural network can do a more accurate job of translating
13799
11:27:09,160 --> 11:27:11,760
those inputs into those outputs.
13800
11:27:11,760 --> 11:27:14,560
And now that we can represent words as numbers that
13801
11:27:14,560 --> 11:27:18,280
could be the input or output, you could imagine passing a word in
13802
11:27:18,280 --> 11:27:21,720
as input to a neural network and getting a word as output.
13803
11:27:21,720 --> 11:27:23,320
And so when might that be useful?
13804
11:27:23,320 --> 11:27:26,840
One common use for neural networks is in machine translation,
13805
11:27:26,840 --> 11:27:29,960
when we want to translate text from one language into another,
13806
11:27:29,960 --> 11:27:33,760
say translate English into French by passing English into the neural
13807
11:27:33,760 --> 11:27:36,000
network and getting some French output.
13808
11:27:36,000 --> 11:27:39,720
You might imagine, for instance, that we could take the English word for lamp,
13809
11:27:39,720 --> 11:27:43,760
pass it into the neural network, get the French word for lamp as output.
13810
11:27:43,760 --> 11:27:48,000
But in practice, when we're translating text from one language to another,
13811
11:27:48,000 --> 11:27:50,200
we're usually not just interested in translating
13812
11:27:50,200 --> 11:27:53,800
a single word from one language to another, but a sequence,
13813
11:27:53,800 --> 11:27:56,240
say a sentence or a paragraph of words.
13814
11:27:56,240 --> 11:27:58,440
Here, for example, is another paragraph, again taken
13815
11:27:58,440 --> 11:28:00,640
from Sherlock Holmes, written in English.
13816
11:28:00,640 --> 11:28:03,960
And what I might want to do is take that entire sentence,
13817
11:28:03,960 --> 11:28:08,300
pass it into the neural network, and get as output a French translation
13818
11:28:08,300 --> 11:28:10,120
of the same sentence.
13819
11:28:10,120 --> 11:28:12,680
But recall that a neural network's input and output
13820
11:28:12,680 --> 11:28:14,880
needs to be of some fixed size.
13821
11:28:14,880 --> 11:28:16,480
And a sentence is not a fixed size.
13822
11:28:16,480 --> 11:28:17,080
It's variable.
13823
11:28:17,080 --> 11:28:20,640
You might have shorter sentences, and you might have longer sentences.
13824
11:28:20,640 --> 11:28:23,480
So somehow, we need to solve the problem of translating
13825
11:28:23,480 --> 11:28:27,680
a sequence into another sequence by means of a neural network.
13826
11:28:27,680 --> 11:28:30,520
And that's going to be true not only for machine translation,
13827
11:28:30,520 --> 11:28:33,960
but also for other problems, problems like question answering.
13828
11:28:33,960 --> 11:28:36,360
If I want to pass as input a question, something
13829
11:28:36,360 --> 11:28:38,960
like what is the capital of Massachusetts,
13830
11:28:38,960 --> 11:28:41,280
feed that as input into the neural network,
13831
11:28:41,280 --> 11:28:43,160
I would hope that what I would get as output
13832
11:28:43,160 --> 11:28:46,360
is a sentence like the capital is Boston, again,
13833
11:28:46,360 --> 11:28:50,080
translating some sequence into some other sequence.
13834
11:28:50,080 --> 11:28:53,480
And if you've ever had a conversation with an AI chatbot,
13835
11:28:53,480 --> 11:28:55,960
or have ever asked your phone a question,
13836
11:28:55,960 --> 11:28:57,400
it needs to do something like this.
13837
11:28:57,400 --> 11:29:00,680
It needs to understand the sequence of words that you, the human,
13838
11:29:00,680 --> 11:29:02,000
provided as input.
13839
11:29:02,000 --> 11:29:06,160
And then the computer needs to generate some sequence of words as output.
13840
11:29:06,160 --> 11:29:07,520
So how can we do this?
13841
11:29:07,520 --> 11:29:10,880
Well, one tool that we can use is the recurrent neural network, which
13842
11:29:10,880 --> 11:29:13,280
we took a look at last time, which is a way for us
13843
11:29:13,280 --> 11:29:16,120
to provide a sequence of values to a neural network
13844
11:29:16,120 --> 11:29:18,640
by running the neural network multiple times.
13845
11:29:18,640 --> 11:29:22,280
And each time we run the neural network, what we're going to do
13846
11:29:22,280 --> 11:29:25,040
is we're going to keep track of some hidden state.
13847
11:29:25,040 --> 11:29:26,880
And that hidden state is going to be passed
13848
11:29:26,880 --> 11:29:30,200
from one run of the neural network to the next run of the neural network,
13849
11:29:30,200 --> 11:29:33,240
keeping track of all of the relevant information.
13850
11:29:33,240 --> 11:29:35,320
And so let's take a look at how we can apply that
13851
11:29:35,320 --> 11:29:36,440
to something like this.
13852
11:29:36,440 --> 11:29:39,280
And in particular, we're going to look at an architecture known
13853
11:29:39,280 --> 11:29:41,960
as an encoder-decoder architecture, where
13854
11:29:41,960 --> 11:29:46,320
we're going to encode this question into some kind of hidden state,
13855
11:29:46,320 --> 11:29:50,320
and then use a decoder to decode that hidden state into the output
13856
11:29:50,320 --> 11:29:52,080
that we're interested in.
13857
11:29:52,080 --> 11:29:53,560
So what's that going to look like?
13858
11:29:53,560 --> 11:29:55,760
We'll start with the first word, the word what.
13859
11:29:55,760 --> 11:29:58,040
That goes into our neural network, and it's
13860
11:29:58,040 --> 11:30:00,720
going to produce some hidden state.
13861
11:30:00,720 --> 11:30:04,760
This is some information about the word what that our neural network is
13862
11:30:04,760 --> 11:30:06,720
going to need to keep track of.
13863
11:30:06,720 --> 11:30:09,280
Then when the second word comes along, we're
13864
11:30:09,280 --> 11:30:12,360
going to feed it into that same encoder neural network,
13865
11:30:12,360 --> 11:30:15,920
but it's going to get as input that hidden state as well.
13866
11:30:15,920 --> 11:30:17,440
So we pass in the second word.
13867
11:30:17,440 --> 11:30:19,960
We also get the information about the hidden state,
13868
11:30:19,960 --> 11:30:23,360
and that's going to continue for the other words in the input.
13869
11:30:23,360 --> 11:30:25,520
This is going to produce a new hidden state.
13870
11:30:25,520 --> 11:30:30,200
And so then when we get to the third word, the, that goes into the encoder.
13871
11:30:30,200 --> 11:30:32,840
It also gets access to the hidden state, and then it
13872
11:30:32,840 --> 11:30:35,720
produces a new hidden state that gets passed into the next run
13873
11:30:35,720 --> 11:30:37,160
when we use the word capital.
13874
11:30:37,160 --> 11:30:39,720
And the same thing is going to repeat for the other words
13875
11:30:39,720 --> 11:30:41,520
that appear in the input.
13876
11:30:41,520 --> 11:30:47,320
So of Massachusetts, that produces one final piece of hidden state.
13877
11:30:47,320 --> 11:30:50,040
Now somehow, we need to signal the fact that we're done.
13878
11:30:50,040 --> 11:30:51,640
There's nothing left in the input.
13879
11:30:51,640 --> 11:30:54,440
And we typically do this by passing some kind of special token,
13880
11:30:54,440 --> 11:30:57,400
say an end token, into the neural network.
13881
11:30:57,400 --> 11:31:00,480
And now the decoding process is going to start.
13882
11:31:00,480 --> 11:31:03,320
We're going to generate the word the.
13883
11:31:03,320 --> 11:31:06,120
But in addition to generating the word the,
13884
11:31:06,120 --> 11:31:11,160
this decoder network is also going to generate some kind of hidden state.
13885
11:31:11,160 --> 11:31:13,160
And so what happens the next time?
13886
11:31:13,160 --> 11:31:15,200
Well, to generate the next word, it might
13887
11:31:15,200 --> 11:31:18,520
be helpful to know what the first word was.
13888
11:31:18,520 --> 11:31:22,840
So we might pass the first word the back into the decoder network.
13889
11:31:22,840 --> 11:31:24,880
It's going to get as input this hidden state,
13890
11:31:24,880 --> 11:31:27,640
and it's going to generate the next word capital.
13891
11:31:27,640 --> 11:31:30,040
And that's also going to generate some hidden state.
13892
11:31:30,040 --> 11:31:32,280
And we'll repeat that, passing capital into the network
13893
11:31:32,280 --> 11:31:35,400
to generate the third word is, and then one more time
13894
11:31:35,400 --> 11:31:38,040
in order to get the fourth word Boston.
13895
11:31:38,040 --> 11:31:39,400
And at that point, we're done.
13896
11:31:39,400 --> 11:31:40,840
But how do we know we're done?
13897
11:31:40,840 --> 11:31:45,560
Usually, we'll do this one more time, pass Boston into the decoder network,
13898
11:31:45,560 --> 11:31:50,720
and get as output some end token to indicate that that is the end of our output.
13899
11:31:50,720 --> 11:31:53,640
And so this then is how we could use a recurrent neural network
13900
11:31:53,640 --> 11:31:57,140
to take some input, encode it into some hidden state,
13901
11:31:57,140 --> 11:32:01,160
and then use that hidden state to decode it into the output we're interested in.
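The control flow of that encoder-decoder loop can be sketched as below. The "networks" here are hypothetical stand-ins (the encoder just accumulates words, and the decoder is a fixed lookup) so that the data flow, not the learning, is what's illustrated:

```python
def encoder_step(word, hidden):
    # A real encoder updates its hidden state with learned weights;
    # here we just accumulate words to illustrate the data flow.
    return hidden + [word]

def decode(hidden, next_token):
    # The decoder emits one token at a time until it produces <end>
    output, token = [], None
    while token != "<end>":
        token = next_token(hidden, output)
        output.append(token)
    return output[:-1]  # drop the <end> token

question = ["what", "is", "the", "capital", "of", "massachusetts"]
hidden = []
for word in question + ["<end>"]:  # <end> signals the input is done
    hidden = encoder_step(word, hidden)

def toy_decoder(hidden, generated):
    # Hypothetical lookup standing in for a trained decoder network
    answer = ["the", "capital", "is", "boston", "<end>"]
    return answer[len(generated)]

print(decode(hidden, toy_decoder))  # ['the', 'capital', 'is', 'boston']
```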
13902
11:32:01,160 --> 11:32:04,560
To visualize it in a slightly different way, we have some input sequence.
13903
11:32:04,560 --> 11:32:06,740
This is just some sequence of words.
13904
11:32:06,740 --> 11:32:10,280
That input sequence goes into the encoder, which in this case
13905
11:32:10,280 --> 11:32:14,160
is a recurrent neural network generating these hidden states along the way
13906
11:32:14,160 --> 11:32:17,320
until we generate some final hidden state, at which point
13907
11:32:17,320 --> 11:32:19,120
we start the decoding process.
13908
11:32:19,120 --> 11:32:21,360
Again, using a recurrent neural network, that's
13909
11:32:21,360 --> 11:32:23,960
going to generate the output sequence as well.
13910
11:32:23,960 --> 11:32:26,960
So we've got the encoder, which is encoding the information
13911
11:32:26,960 --> 11:32:29,560
about the input sequence into this hidden state,
13912
11:32:29,560 --> 11:32:32,360
and then the decoder, which takes that hidden state
13913
11:32:32,360 --> 11:32:36,320
and uses it in order to generate the output sequence.
13914
11:32:36,320 --> 11:32:37,640
But there are some problems.
13915
11:32:37,640 --> 11:32:39,840
And for many years, this was the state of the art.
13916
11:32:39,840 --> 11:32:42,360
The recurrent neural network and variants on this approach
13917
11:32:42,360 --> 11:32:44,480
were some of the best ways we knew in order
13918
11:32:44,480 --> 11:32:46,620
to perform tasks in natural language processing.
13919
11:32:46,620 --> 11:32:49,280
But there are some problems that we might want to try to deal with
13920
11:32:49,280 --> 11:32:51,320
and that have been dealt with over the years
13921
11:32:51,320 --> 11:32:54,460
to try and improve upon this kind of model.
13922
11:32:54,460 --> 11:32:58,240
And one problem you might notice happens in this encoder stage.
13923
11:32:58,240 --> 11:33:01,040
We've taken this input sequence, the sequence of words,
13924
11:33:01,040 --> 11:33:05,480
and encoded it all into this final piece of hidden state.
13925
11:33:05,480 --> 11:33:07,440
And that final piece of hidden state needs
13926
11:33:07,440 --> 11:33:10,560
to contain all of the information from the input sequence
13927
11:33:10,560 --> 11:33:14,800
that we need in order to generate the output sequence.
13928
11:33:14,800 --> 11:33:18,080
And while that's possible, it becomes increasingly difficult
13929
11:33:18,080 --> 11:33:20,260
as the sequence gets larger and larger.
13930
11:33:20,260 --> 11:33:22,720
For larger and larger input sequences, it's
13931
11:33:22,720 --> 11:33:24,800
going to become more and more difficult to store
13932
11:33:24,800 --> 11:33:27,180
all of the information we need about the input
13933
11:33:27,180 --> 11:33:30,600
inside this single hidden state piece of context.
13934
11:33:30,600 --> 11:33:33,720
That's a lot of information to pack into just a single value.
13935
11:33:33,720 --> 11:33:36,840
It might be useful for us, when generating output,
13936
11:33:36,840 --> 11:33:40,460
to not just refer to this one value, but to all
13937
11:33:40,460 --> 11:33:44,620
of the previous hidden values that have been generated by the encoder.
13938
11:33:44,620 --> 11:33:46,880
And so that might be useful, but how could we do that?
13939
11:33:46,880 --> 11:33:48,380
We've got a lot of different values.
13940
11:33:48,380 --> 11:33:50,080
We need to combine them somehow.
13941
11:33:50,080 --> 11:33:52,320
So you could imagine adding them together,
13942
11:33:52,320 --> 11:33:54,440
taking the average of them, for example.
13943
11:33:54,440 --> 11:33:57,960
But doing that would assume that all of these pieces of hidden state
13944
11:33:57,960 --> 11:33:59,680
are equally important.
13945
11:33:59,680 --> 11:34:01,280
But that's not necessarily true either.
13946
11:34:01,280 --> 11:34:03,480
Some of these pieces of hidden state are going
13947
11:34:03,480 --> 11:34:05,680
to be more important than others, depending
13948
11:34:05,680 --> 11:34:08,520
on what word they most closely correspond to.
13949
11:34:08,520 --> 11:34:11,040
This piece of hidden state very closely corresponds
13950
11:34:11,040 --> 11:34:13,040
to the first word of the input sequence.
13951
11:34:13,040 --> 11:34:16,600
This one very closely corresponds to the second word of the input sequence,
13952
11:34:16,600 --> 11:34:17,800
for example.
13953
11:34:17,800 --> 11:34:21,200
And some of those are going to be more important than others.
13954
11:34:21,200 --> 11:34:23,400
To make matters more complicated, depending
13955
11:34:23,400 --> 11:34:26,520
on which word of the output sequence we're generating,
13956
11:34:26,520 --> 11:34:30,000
different input words might be more or less important.
13957
11:34:30,000 --> 11:34:33,520
And so what we really want is some way to decide for ourselves
13958
11:34:33,520 --> 11:34:37,040
which of the input values are worth paying attention to,
13959
11:34:37,040 --> 11:34:38,640
at what point in time.
13960
11:34:38,640 --> 11:34:42,160
And this is the key idea behind a mechanism known as attention.
13961
11:34:42,160 --> 11:34:45,760
Attention is all about letting us decide which values
13962
11:34:45,760 --> 11:34:49,120
are important to pay attention to, when generating, in this case,
13963
11:34:49,120 --> 11:34:51,880
the next word in our sequence.
13964
11:34:51,880 --> 11:34:54,160
So let's take a look at an example of that.
13965
11:34:54,160 --> 11:34:55,200
Here's a sentence.
13966
11:34:55,200 --> 11:34:57,520
What is the capital of Massachusetts?
13967
11:34:57,520 --> 11:34:59,080
Same sentence as before.
13968
11:34:59,080 --> 11:35:02,120
And let's imagine that we were trying to answer that question
13969
11:35:02,120 --> 11:35:04,200
by generating tokens of output.
13970
11:35:04,200 --> 11:35:05,800
So what would the output look like?
13971
11:35:05,800 --> 11:35:09,080
Well, it's going to look like something like the capital is.
13972
11:35:09,080 --> 11:35:12,520
And let's say we're now trying to generate this last word here.
13973
11:35:12,520 --> 11:35:13,800
What is that last word?
13974
11:35:13,800 --> 11:35:16,680
How is the computer going to figure it out?
13975
11:35:16,680 --> 11:35:19,440
Well, what it's going to need to do is decide
13976
11:35:19,440 --> 11:35:22,320
which values it's going to pay attention to.
13977
11:35:22,320 --> 11:35:24,480
And so the attention mechanism will allow
13978
11:35:24,480 --> 11:35:28,120
us to calculate some attention scores for each word,
13979
11:35:28,120 --> 11:35:32,480
some value corresponding to each word, determining how relevant
13980
11:35:32,480 --> 11:35:36,320
it is for us to pay attention to that word right now.
13981
11:35:36,320 --> 11:35:39,880
And in this case, when generating the fourth word of the output sequence,
13982
11:35:39,880 --> 11:35:42,240
the most important words to pay attention to
13983
11:35:42,240 --> 11:35:46,240
might be capital and Massachusetts, for example.
13984
11:35:46,240 --> 11:35:49,000
That those words are going to be particularly relevant.
13985
11:35:49,000 --> 11:35:50,920
And there are a number of different mechanisms
13986
11:35:50,920 --> 11:35:53,760
that have been used in order to calculate these attention scores.
13987
11:35:53,760 --> 11:35:56,400
It could be something as simple as a dot product
13988
11:35:56,400 --> 11:35:58,600
to see how similar two vectors are, or we
13989
11:35:58,600 --> 11:36:02,000
could train an entire neural network to calculate these attention scores.
13990
11:36:02,000 --> 11:36:06,000
But the key idea is that during the training process for our neural network,
13991
11:36:06,000 --> 11:36:09,400
we're going to learn how to calculate these attention scores.
13992
11:36:09,400 --> 11:36:12,640
Our model is going to learn what is important to pay attention
13993
11:36:12,640 --> 11:36:17,120
to in order to decide what the next word should be.
13994
11:36:17,120 --> 11:36:20,360
So the result of all of this, calculating these attention scores,
13995
11:36:20,360 --> 11:36:24,520
is that we can calculate some value, some value for each input word,
13996
11:36:24,520 --> 11:36:28,080
determining how important it is for us to pay attention
13997
11:36:28,080 --> 11:36:29,880
to that particular value.
13998
11:36:29,880 --> 11:36:32,000
And recall that each of these input words
13999
11:36:32,000 --> 11:36:36,400
is also associated with one of these hidden state context vectors,
14000
11:36:36,400 --> 11:36:39,600
capturing information about the sentence up to that point,
14001
11:36:39,600 --> 11:36:43,560
but primarily focused on that word in particular.
14002
11:36:43,560 --> 11:36:46,440
And so what we can now do is if we have all of these vectors
14003
11:36:46,440 --> 11:36:49,560
and we have values representing how important it is for us
14004
11:36:49,560 --> 11:36:52,320
to pay attention to those particular vectors,
14005
11:36:52,320 --> 11:36:54,320
we can take a weighted average.
14006
11:36:54,320 --> 11:36:58,560
We can take all of these vectors, multiply them by their attention scores,
14007
11:36:58,560 --> 11:37:01,600
and add them up to get some new vector value, which
14008
11:37:01,600 --> 11:37:04,160
is going to represent the context from the input,
14009
11:37:04,160 --> 11:37:07,000
but specifically paying attention to the words
14010
11:37:07,000 --> 11:37:09,520
that we think are most important.
14011
11:37:09,520 --> 11:37:12,400
And once we've done that, that context vector
14012
11:37:12,400 --> 11:37:14,840
can be fed into our decoder in order to say
14013
11:37:14,840 --> 11:37:18,640
that the word should be, in this case, Boston.
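The weighted-average idea described above can be sketched in a few lines of Python. This is a minimal illustration only: the hidden-state vectors and the query are made-up toy numbers, not values from any trained model, and the scoring here is the simple dot product the lecture mentions as one option.

```python
import math

def softmax(scores):
    # Normalize raw attention scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, hidden_states):
    # Score each hidden state by its dot product with the query,
    # then return the attention-weighted average of the hidden states.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context

# Toy hidden-state vectors for "What", "is", "the", "capital", "of",
# "Massachusetts" -- made-up numbers purely for illustration.
states = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0],
          [0.9, 0.2], [0.1, 0.1], [0.8, 0.9]]
query = [1.0, 1.0]  # stand-in for the decoder's state while picking the next word
weights, context = attend(query, states)
```

With these toy numbers, "capital" and "Massachusetts" receive the largest weights, so the resulting context vector is dominated by their hidden states.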
14014
11:37:18,640 --> 11:37:21,600
So attention is this very powerful tool that
14015
11:37:21,600 --> 11:37:24,280
allows any word when we're trying to decode it
14016
11:37:24,280 --> 11:37:28,400
to decide which words from the input should we pay attention to in order
14017
11:37:28,400 --> 11:37:33,440
to determine what's important for generating the next word of the output.
14018
11:37:33,440 --> 11:37:35,640
And one of the first places this was really used
14019
11:37:35,640 --> 11:37:37,920
was in the field of machine translation.
14020
11:37:37,920 --> 11:37:39,960
Here's an example of a diagram from the paper
14021
11:37:39,960 --> 11:37:42,160
that introduced this idea, which was focused
14022
11:37:42,160 --> 11:37:45,760
on trying to translate English sentences into French sentences.
14023
11:37:45,760 --> 11:37:48,560
So we have an input English sentence up along the top,
14024
11:37:48,560 --> 11:37:51,120
and then along the left side, the output French equivalent
14025
11:37:51,120 --> 11:37:52,680
of that same sentence.
14026
11:37:52,680 --> 11:37:56,280
And what you see in all of these squares are the attention scores
14027
11:37:56,280 --> 11:38:01,280
visualized, where a lighter square indicates a higher attention score.
14028
11:38:01,280 --> 11:38:04,200
And what you'll notice is that there's a strong correspondence
14029
11:38:04,200 --> 11:38:07,360
between the French word and the equivalent English word,
14030
11:38:07,360 --> 11:38:10,040
that the French word for agreement is really
14031
11:38:10,040 --> 11:38:12,600
paying attention to the English word for agreement
14032
11:38:12,600 --> 11:38:16,320
in order to decide what French word should be generated at that point
14033
11:38:16,320 --> 11:38:17,080
in time.
14034
11:38:17,080 --> 11:38:19,280
And sometimes you might pay attention to multiple words
14035
11:38:19,280 --> 11:38:22,280
if you look at the French word for economic.
14036
11:38:22,280 --> 11:38:25,800
That's primarily paying attention to the English word for economic,
14037
11:38:25,800 --> 11:38:30,440
but also paying attention to the English word for European in this case too.
14038
11:38:30,440 --> 11:38:33,460
And so attention scores are very easy to visualize
14039
11:38:33,460 --> 11:38:37,040
to get a sense for what our machine learning model is really
14040
11:38:37,040 --> 11:38:40,200
paying attention to, what information is it using in order
14041
11:38:40,200 --> 11:38:42,960
to determine what's important and what's not in order
14042
11:38:42,960 --> 11:38:46,800
to determine what the ultimate output token should be.
14043
11:38:46,800 --> 11:38:49,160
And so when we combine the attention mechanism
14044
11:38:49,160 --> 11:38:52,880
with a recurrent neural network, we can get very powerful and useful results
14045
11:38:52,880 --> 11:38:56,400
where we're able to generate an output sequence by paying attention
14046
11:38:56,400 --> 11:38:58,080
to the input sequence too.
14047
11:38:58,080 --> 11:39:00,080
But there are other problems with this approach
14048
11:39:00,080 --> 11:39:02,400
of using a recurrent neural network as well.
14049
11:39:02,400 --> 11:39:05,440
In particular, notice that every run of the neural network
14050
11:39:05,440 --> 11:39:07,760
depends on the output of the previous step.
14051
11:39:07,760 --> 11:39:09,520
And that was important for getting a sense
14052
11:39:09,520 --> 11:39:12,800
for the sequence of words and the ordering of those particular words.
14053
11:39:12,800 --> 11:39:15,880
But we can't run this unit of the neural network
14054
11:39:15,880 --> 11:39:19,680
until after we've calculated the hidden state from the run before it
14055
11:39:19,680 --> 11:39:21,600
from the previous input token.
14056
11:39:21,600 --> 11:39:25,800
And what that means is that it's very difficult to parallelize this process.
14057
11:39:25,800 --> 11:39:28,480
That as the input sequence gets longer and longer,
14058
11:39:28,480 --> 11:39:31,280
we might want to use parallelism to try and speed up
14059
11:39:31,280 --> 11:39:33,400
this process of training the neural network
14060
11:39:33,400 --> 11:39:35,600
and making sense of all of this language data.
14061
11:39:35,600 --> 11:39:36,840
But it's difficult to do that.
14062
11:39:36,840 --> 11:39:39,320
And it's slow to do that with a recurrent neural network
14063
11:39:39,320 --> 11:39:42,480
because all of it needs to be performed in sequence.
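That sequential bottleneck is visible even in a tiny sketch. This toy uses a scalar hidden state and made-up weights; a real recurrent network uses vectors and learned weight matrices, but the dependency structure is the same: step t cannot run until step t-1 has finished.

```python
import math

def rnn_step(prev_hidden, x, w_h=0.5, w_x=1.0):
    # One recurrent step: the new hidden state is a function of the
    # previous hidden state, which is what forces sequential execution.
    return math.tanh(w_h * prev_hidden + w_x * x)

def encode(sequence):
    hidden = 0.0
    for x in sequence:  # strictly sequential -- nothing here can run in parallel
        hidden = rnn_step(hidden, x)
    return hidden
```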
14064
11:39:42,480 --> 11:39:45,040
And that's become an increasing challenge as we've
14065
11:39:45,040 --> 11:39:47,840
started to get larger and larger language models.
14066
11:39:47,840 --> 11:39:50,120
The more language data that we have available to us
14067
11:39:50,120 --> 11:39:52,480
to use to train our machine learning models,
14068
11:39:52,480 --> 11:39:55,640
the more accurate it can be, the better representation of language
14069
11:39:55,640 --> 11:39:58,000
it can have, the better understanding it can have,
14070
11:39:58,000 --> 11:40:00,160
and the better results that we can see.
14071
11:40:00,160 --> 11:40:02,880
And so we've seen this growth of large language models
14072
11:40:02,880 --> 11:40:05,120
that are using larger and larger data sets.
14073
11:40:05,120 --> 11:40:08,080
But as a result, they take longer and longer to train.
14074
11:40:08,080 --> 11:40:10,680
And so the fact that recurrent neural networks
14075
11:40:10,680 --> 11:40:15,120
are not easy to parallelize has become an increasing problem.
14076
11:40:15,120 --> 11:40:18,000
And as a result of that, that was one of the main motivations
14077
11:40:18,000 --> 11:40:20,640
for a different architecture, for thinking about how
14078
11:40:20,640 --> 11:40:22,600
to deal with natural language.
14079
11:40:22,600 --> 11:40:25,200
And that's known as the transformer architecture.
14080
11:40:25,200 --> 11:40:28,480
And this has been a significant milestone in the world of natural language
14081
11:40:28,480 --> 11:40:32,000
processing for really increasing how well we can perform
14082
11:40:32,000 --> 11:40:34,400
these kinds of natural language processing tasks,
14083
11:40:34,400 --> 11:40:37,760
as well as how quickly we can train a machine learning model to be
14084
11:40:37,760 --> 11:40:39,880
able to produce effective results.
14085
11:40:39,880 --> 11:40:42,080
There are a number of different types of transformers
14086
11:40:42,080 --> 11:40:43,280
in terms of how they work.
14087
11:40:43,280 --> 11:40:45,000
But what we're going to take a look at here
14088
11:40:45,000 --> 11:40:48,760
is the basic architecture for how one might work with a transformer
14089
11:40:48,760 --> 11:40:52,080
to get a sense for what's involved and what we're doing.
14090
11:40:52,080 --> 11:40:54,820
So let's start with the model we were looking at before,
14091
11:40:54,820 --> 11:40:59,040
specifically at this encoder part of our encoder-decoder architecture,
14092
11:40:59,040 --> 11:41:01,880
where we used a recurrent neural network to take this input
14093
11:41:01,880 --> 11:41:06,160
sequence and capture all of this information about the hidden state
14094
11:41:06,160 --> 11:41:09,520
and the information we need to know about that input sequence.
14095
11:41:09,520 --> 11:41:13,200
Right now, it all needs to happen in this linear progression.
14096
11:41:13,200 --> 11:41:15,600
But what the transformer is going to allow us to do
14097
11:41:15,600 --> 11:41:18,920
is process each of the words independently in a way that's
14098
11:41:18,920 --> 11:41:22,640
easy to parallelize, rather than have each word wait for some other word.
14099
11:41:22,640 --> 11:41:26,000
Each word is going to go through this same neural network
14100
11:41:26,000 --> 11:41:29,440
and produce some kind of encoded representation
14101
11:41:29,440 --> 11:41:31,160
of that particular input word.
14102
11:41:31,160 --> 11:41:33,800
And all of this is going to happen in parallel.
14103
11:41:33,800 --> 11:41:35,800
Now, it's happening for all of the words at once,
14104
11:41:35,800 --> 11:41:37,160
but we're really just going to focus on what's
14105
11:41:37,160 --> 11:41:39,240
happening for one word to make it clear.
14106
11:41:39,240 --> 11:41:41,880
But know that whatever you're seeing happen for this one word
14107
11:41:41,880 --> 11:41:45,680
is going to happen for all of the other input words, too.
14108
11:41:45,680 --> 11:41:47,280
So what's going on here?
14109
11:41:47,280 --> 11:41:49,800
Well, we start with some input word.
14110
11:41:49,800 --> 11:41:52,160
That input word goes into the neural network.
14111
11:41:52,160 --> 11:41:57,100
And the output is hopefully some encoded representation of the input word,
14112
11:41:57,100 --> 11:41:59,840
the information we need to know about the input word that's
14113
11:41:59,840 --> 11:42:03,320
going to be relevant to us as we're generating the output.
14114
11:42:03,320 --> 11:42:06,040
And because we're doing this each word independently,
14115
11:42:06,040 --> 11:42:07,200
it's easy to parallelize.
14116
11:42:07,200 --> 11:42:09,360
We don't have to wait for the previous word
14117
11:42:09,360 --> 11:42:12,800
before we run this word through the neural network.
14118
11:42:12,800 --> 11:42:16,800
But what did we lose in this process by trying to parallelize this whole thing?
14119
11:42:16,800 --> 11:42:19,640
Well, we've lost all notion of word ordering.
14120
11:42:19,640 --> 11:42:21,400
The order of words is important.
14121
11:42:21,400 --> 11:42:24,280
The sentence, Sherlock Holmes gave the book to Watson,
14122
11:42:24,280 --> 11:42:27,520
has a different meaning than Watson gave the book to Sherlock Holmes.
14123
11:42:27,520 --> 11:42:31,360
And so we want to keep track of that information about word position.
14124
11:42:31,360 --> 11:42:34,120
In the recurrent neural network, that happened for us automatically
14125
11:42:34,120 --> 11:42:37,640
because we could run each word one at a time through the neural network,
14126
11:42:37,640 --> 11:42:41,600
get the hidden state, pass it on to the next run of the neural network.
14127
11:42:41,600 --> 11:42:44,040
But that's not the case here with the transformer,
14128
11:42:44,040 --> 11:42:49,080
where each word is being processed independent of all of the other ones.
14129
11:42:49,080 --> 11:42:51,520
So what are we going to do to try to solve that problem?
14130
11:42:51,520 --> 11:42:57,040
One thing we can do is add some kind of positional encoding to the input word.
14131
11:42:57,040 --> 11:42:59,440
The positional encoding is some vector that
14132
11:42:59,440 --> 11:43:02,280
represents the position of the word in the sentence.
14133
11:43:02,280 --> 11:43:05,240
This is the first word, the second word, the third word, and so forth.
14134
11:43:05,240 --> 11:43:08,080
We're going to add that to the input word.
14135
11:43:08,080 --> 11:43:10,400
And the result of that is going to be a vector
14136
11:43:10,400 --> 11:43:12,840
that captures multiple pieces of information.
14137
11:43:12,840 --> 11:43:17,400
It captures the input word itself as well as where in the sentence it appears.
14138
11:43:17,400 --> 11:43:20,440
The result of that is we can pass the output of that addition,
14139
11:43:20,440 --> 11:43:23,760
the addition of the input word and the positional encoding
14140
11:43:23,760 --> 11:43:24,920
into the neural network.
14141
11:43:24,920 --> 11:43:27,440
That way, the neural network knows the word and where
14142
11:43:27,440 --> 11:43:31,320
it appears in the sentence and can use both of those pieces of information
14143
11:43:31,320 --> 11:43:34,720
to determine how best to represent the meaning of that word
14144
11:43:34,720 --> 11:43:38,240
in the encoded representation at the end of it.
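The lecture only says the positional encoding is "some vector that represents the position"; one widely used concrete choice, taken from the original transformer paper rather than specified here, is the sinusoidal encoding sketched below, added element-wise to the word's embedding.

```python
import math

def positional_encoding(position, dim):
    # Sinusoidal positional encoding: even indices use sine, odd indices
    # use cosine, at wavelengths that vary across the vector's dimensions.
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def add_position(word_embedding, position):
    # Element-wise sum: the result carries both the word's meaning
    # and where in the sentence it appears.
    pos = positional_encoding(position, len(word_embedding))
    return [e + p for e, p in zip(word_embedding, pos)]
```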
14145
11:43:38,240 --> 11:43:40,160
In addition to what we have here, in addition
14146
11:43:40,160 --> 11:43:43,880
to the positional encoding and this feed forward neural network,
14147
11:43:43,880 --> 11:43:47,200
we're also going to add one additional component, which
14148
11:43:47,200 --> 11:43:49,920
is going to be a self-attention step.
14149
11:43:49,920 --> 11:43:52,440
This is going to be attention where we're paying attention
14150
11:43:52,440 --> 11:43:54,560
to the other input words.
14151
11:43:54,560 --> 11:43:57,240
Because the meaning or interpretation of an input word
14152
11:43:57,240 --> 11:44:00,880
might vary depending on the other words in the input as well.
14153
11:44:00,880 --> 11:44:03,520
And so we're going to allow each word in the input
14154
11:44:03,520 --> 11:44:06,800
to decide what other words in the input it should pay attention
14155
11:44:06,800 --> 11:44:10,800
to in order to decide on its encoded representation.
14156
11:44:10,800 --> 11:44:13,960
And that's going to allow us to get a better encoded representation
14157
11:44:13,960 --> 11:44:16,920
for each word because words are defined by their context,
14158
11:44:16,920 --> 11:44:21,400
by the words around them and how they're used in that particular context.
14159
11:44:21,400 --> 11:44:24,280
This kind of self-attention is so valuable, in fact,
14160
11:44:24,280 --> 11:44:28,560
that oftentimes the transformer will use multiple different self-attention
14161
11:44:28,560 --> 11:44:31,800
layers at the same time to allow for this model
14162
11:44:31,800 --> 11:44:36,400
to be able to pay attention to multiple facets of the input at the same time.
14163
11:44:36,400 --> 11:44:40,360
And we call this multi-headed attention, where each attention head can pay
14164
11:44:40,360 --> 11:44:41,880
attention to something different.
14165
11:44:41,880 --> 11:44:45,000
And as a result, this network can learn to pay attention
14166
11:44:45,000 --> 11:44:49,600
to many different parts of the input for this input word all at the same time.
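A minimal sketch of self-attention with multiple heads follows. One simplifying assumption to flag: real transformers compute queries, keys, and values with learned linear projections per head; here each head simply works on its own slice of the embedding, which keeps the idea visible in a few lines.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Every word attends to every word, itself included: its own vector
    # acts as the query, and all vectors act as the keys and the values.
    outputs = []
    scale = math.sqrt(len(vectors[0]))
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(query))])
    return outputs

def multi_head_attention(vectors, num_heads=2):
    # Toy multi-headed attention: each head runs self-attention over its
    # own slice of the embedding, and the heads' outputs are concatenated.
    dim = len(vectors[0]) // num_heads
    heads = [self_attention([v[h * dim:(h + 1) * dim] for v in vectors])
             for h in range(num_heads)]
    return [sum((head[i] for head in heads), []) for i in range(len(vectors))]
```

Because the slices differ, each head can end up weighting the words differently, which is the point of having multiple heads.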
14167
11:44:49,600 --> 11:44:52,160
And in the spirit of deep learning, these two steps,
14168
11:44:52,160 --> 11:44:56,120
this multi-headed self-attention layer and this neural network layer,
14169
11:44:56,120 --> 11:44:59,160
that itself can be repeated multiple times, too,
14170
11:44:59,160 --> 11:45:01,600
in order to get a deeper representation, in order
14171
11:45:01,600 --> 11:45:04,280
to learn deeper patterns within the input text
14172
11:45:04,280 --> 11:45:07,360
and ultimately get a better representation of language
14173
11:45:07,360 --> 11:45:11,620
in order to get useful encoded representations of all of the input
14174
11:45:11,620 --> 11:45:12,840
words.
14175
11:45:12,840 --> 11:45:15,520
And so this is the process that a transformer might
14176
11:45:15,520 --> 11:45:20,080
use in order to take an input word and get its encoded representation.
14177
11:45:20,080 --> 11:45:23,760
And the key idea is to really rely on this attention step
14178
11:45:23,760 --> 11:45:26,280
in order to get information that's useful in order
14179
11:45:26,280 --> 11:45:29,000
to determine how to encode that word.
14180
11:45:29,000 --> 11:45:32,640
And that process is going to repeat for all of the input words that
14181
11:45:32,640 --> 11:45:33,760
are in the input sequence.
14182
11:45:33,760 --> 11:45:35,760
We're going to take all of the input words,
14183
11:45:35,760 --> 11:45:38,840
encode them with some kind of positional encoding,
14184
11:45:38,840 --> 11:45:42,480
feed those into these self-attention and feed-forward neural networks
14185
11:45:42,480 --> 11:45:46,920
in order to ultimately get these encoded representations of the words.
14186
11:45:46,920 --> 11:45:48,600
That's the result of the encoder.
14187
11:45:48,600 --> 11:45:51,680
We get all of these encoded representations
14188
11:45:51,680 --> 11:45:53,860
that will be useful to us when it comes time
14189
11:45:53,860 --> 11:45:57,080
then to try to decode all of this information
14190
11:45:57,080 --> 11:45:59,560
into the output sequence we're interested in.
14191
11:45:59,560 --> 11:46:02,920
And again, this might take place in the context of machine translation,
14192
11:46:02,920 --> 11:46:06,560
where the output is going to be the same sentence in a different language,
14193
11:46:06,560 --> 11:46:10,160
or it might be an answer to a question in the case of an AI chatbot,
14194
11:46:10,160 --> 11:46:11,240
for example.
14195
11:46:11,240 --> 11:46:15,960
And so now let's take a look at how that decoder is going to work.
14196
11:46:15,960 --> 11:46:19,040
Ultimately, it's going to have a very similar structure.
14197
11:46:19,040 --> 11:46:21,960
Any time we're trying to generate the next output word,
14198
11:46:21,960 --> 11:46:25,120
we need to know what the previous output word is,
14199
11:46:25,120 --> 11:46:27,000
as well as its positional encoding.
14200
11:46:27,000 --> 11:46:29,360
Where in the output sequence are we?
14201
11:46:29,360 --> 11:46:32,760
And we're going to have these same steps, self-attention,
14202
11:46:32,760 --> 11:46:34,640
because we might want an output word to be
14203
11:46:34,640 --> 11:46:37,880
able to pay attention to other words in that same output,
14204
11:46:37,880 --> 11:46:39,560
as well as a neural network.
14205
11:46:39,560 --> 11:46:42,440
And that might itself repeat multiple times.
14206
11:46:42,440 --> 11:46:45,840
But in this decoder, we're going to add one additional step.
14207
11:46:45,840 --> 11:46:48,600
We're going to add an additional attention step, where
14208
11:46:48,600 --> 11:46:51,200
instead of self-attention, where the output word is going
14209
11:46:51,200 --> 11:46:55,000
to pay attention to other output words, in this step,
14210
11:46:55,000 --> 11:46:58,080
we're going to allow the output word to pay attention
14211
11:46:58,080 --> 11:47:00,360
to the encoded representations.
14212
11:47:00,360 --> 11:47:04,160
So recall that the encoder is taking all of the input words
14213
11:47:04,160 --> 11:47:07,280
and transforming them into these encoded representations
14214
11:47:07,280 --> 11:47:08,760
of all of the input words.
14215
11:47:08,760 --> 11:47:11,560
But it's going to be important for us to be able to decide which
14216
11:47:11,560 --> 11:47:14,120
of those encoded representations we want to pay attention
14217
11:47:14,120 --> 11:47:18,640
to when generating any particular token in the output sequence.
14218
11:47:18,640 --> 11:47:22,520
And that's what this additional attention step is going to allow us to do.
14219
11:47:22,520 --> 11:47:26,160
It's saying that every time we're generating a word of the output,
14220
11:47:26,160 --> 11:47:28,600
we can pay attention to the other words in the output,
14221
11:47:28,600 --> 11:47:32,080
because we might want to know, what are the words we've generated previously?
14222
11:47:32,080 --> 11:47:33,920
And we want to pay attention to some of them
14223
11:47:33,920 --> 11:47:37,520
to decide what word is going to be next in the sequence.
14224
11:47:37,520 --> 11:47:41,080
But we also care about paying attention to the input words, too.
14225
11:47:41,080 --> 11:47:44,920
And we want the ability to decide which of these encoded representations
14226
11:47:44,920 --> 11:47:47,280
of the input words are going to be relevant in order
14227
11:47:47,280 --> 11:47:49,760
for us to generate the next step.
14228
11:47:49,760 --> 11:47:51,680
And so these two pieces combine together.
14229
11:47:51,680 --> 11:47:55,080
We have this encoder that takes all of the input words
14230
11:47:55,080 --> 11:47:57,640
and produces this encoded representation.
14231
11:47:57,640 --> 11:48:01,480
And we have this decoder that is able to take the previous output word,
14232
11:48:01,480 --> 11:48:06,280
pay attention to that encoded input, and then generate the next output word.
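One decoder step can be sketched as two attention passes: self-attention over the output words generated so far, then cross-attention over the encoder's representations. This is a bare illustration under stated assumptions: it omits the positional encodings, feed-forward layers, and learned projections a real decoder would use.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Generic attention: score the query against every key, softmax the
    # scores into weights, and return the weighted average of the values.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def decoder_step(outputs_so_far, encoder_states):
    # Self-attention: the newest output word attends to all outputs so far.
    query = outputs_so_far[-1]
    self_context = attend(query, outputs_so_far, outputs_so_far)
    # Cross-attention: that result then attends to the encoder's
    # encoded representations of the input words.
    return attend(self_context, encoder_states, encoder_states)
```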
14233
11:48:06,280 --> 11:48:08,640
And this is one of the possible architectures
14234
11:48:08,640 --> 11:48:12,120
we could use for a transformer, with the key idea being
14235
11:48:12,120 --> 11:48:16,280
these attention steps that allow words to pay attention to each other.
14236
11:48:16,280 --> 11:48:20,240
During the training process here, we can now much more easily parallelize this,
14237
11:48:20,240 --> 11:48:23,440
because we don't have to wait for all of the words to happen in sequence.
14238
11:48:23,440 --> 11:48:26,960
And we can learn how we should perform these attention steps.
14239
11:48:26,960 --> 11:48:30,600
The model is able to learn what is important to pay attention to,
14240
11:48:30,600 --> 11:48:32,640
what things do I need to pay attention to,
14241
11:48:32,640 --> 11:48:37,240
in order to be more accurate at predicting what the output word is.
14242
11:48:37,240 --> 11:48:39,920
And this has proved to be a tremendously effective model
14243
11:48:39,920 --> 11:48:44,280
for conversational AI agents, for building machine translation systems.
14244
11:48:44,280 --> 11:48:47,000
And there have been many variants proposed on this model, too.
14245
11:48:47,000 --> 11:48:49,400
Some transformers only use an encoder.
14246
11:48:49,400 --> 11:48:51,080
Some only use a decoder.
14247
11:48:51,080 --> 11:48:54,720
Some use some other combination of these different particular features.
14248
11:48:54,720 --> 11:48:57,880
But the key ideas ultimately remain the same,
14249
11:48:57,880 --> 11:49:01,960
this real focus on trying to pay attention to what is most important.
14250
11:49:01,960 --> 11:49:04,080
And the world of natural language processing
14251
11:49:04,080 --> 11:49:06,080
is fast growing and fast evolving.
14252
11:49:06,080 --> 11:49:08,640
Year after year, we keep coming up with new models
14253
11:49:08,640 --> 11:49:11,760
that allow us to do an even better job of performing
14254
11:49:11,760 --> 11:49:14,600
these natural language-related tasks, all in the service
14255
11:49:14,600 --> 11:49:18,000
of solving the tricky problem, which is our own natural language.
14256
11:49:18,000 --> 11:49:21,800
We've seen how the syntax and semantics of our language are ambiguous,
14257
11:49:21,800 --> 11:49:24,000
and it introduces all of these new challenges
14258
11:49:24,000 --> 11:49:26,040
that we need to think about, if we're going
14259
11:49:26,040 --> 11:49:29,680
to be able to design AI agents that are able to work with language
14260
11:49:29,680 --> 11:49:30,800
effectively.
14261
11:49:30,800 --> 11:49:33,080
So as we think about where we've been in this class,
14262
11:49:33,080 --> 11:49:36,200
all of the different types of artificial intelligence we've considered,
14263
11:49:36,200 --> 11:49:38,960
we've looked at artificial intelligence in a wide variety
14264
11:49:38,960 --> 11:49:40,240
of different forms now.
14265
11:49:40,240 --> 11:49:42,880
We started by taking a look at search problems,
14266
11:49:42,880 --> 11:49:46,320
where we looked at how AI can search for solutions, play games,
14267
11:49:46,320 --> 11:49:48,680
and find the optimal decision to make.
14268
11:49:48,680 --> 11:49:53,080
We talked about knowledge, how AI can represent information that it knows
14269
11:49:53,080 --> 11:49:57,040
and use that information to generate new knowledge as well.
14270
11:49:57,040 --> 11:49:59,840
Then we looked at what AI can do when it's less certain,
14271
11:49:59,840 --> 11:50:01,760
when it doesn't know things for sure, and we
14272
11:50:01,760 --> 11:50:04,360
have to represent things in terms of probability.
14273
11:50:04,360 --> 11:50:06,360
We then took a look at optimization problems.
14274
11:50:06,360 --> 11:50:09,240
We saw how a lot of problems in AI can be boiled down
14275
11:50:09,240 --> 11:50:12,920
to trying to maximize or minimize some function.
14276
11:50:12,920 --> 11:50:15,040
And we looked at strategies that AI can use
14277
11:50:15,040 --> 11:50:18,240
in order to do that kind of maximizing and minimizing.
14278
11:50:18,240 --> 11:50:20,240
We then looked at the world of machine learning,
14279
11:50:20,240 --> 11:50:23,120
learning from data in order to figure out some patterns
14280
11:50:23,120 --> 11:50:26,600
and identify how to perform a task by looking at the training data
14281
11:50:26,600 --> 11:50:28,320
that we have available to it.
14282
11:50:28,320 --> 11:50:31,360
And one of the most powerful tools there was the neural network,
14283
11:50:31,360 --> 11:50:34,520
the sequence of units whose weights can be trained in order
14284
11:50:34,520 --> 11:50:37,680
to allow us to really effectively go from input to output
14285
11:50:37,680 --> 11:50:41,760
and predict how to get there by learning these underlying patterns.
14286
11:50:41,760 --> 11:50:44,240
And then today, we took a look at language itself,
14287
11:50:44,240 --> 11:50:47,080
trying to understand how can we train the computer to be
14288
11:50:47,080 --> 11:50:49,080
able to understand our natural language, to be
14289
11:50:49,080 --> 11:50:53,160
able to understand syntax and semantics, make sense of and generate
14290
11:50:53,160 --> 11:50:57,080
natural language, which introduces a number of interesting problems too.
14291
11:50:57,080 --> 11:51:00,120
And we've really just scratched the surface of artificial intelligence.
14292
11:51:00,120 --> 11:51:03,400
There is so much interesting research and interesting new techniques
14293
11:51:03,400 --> 11:51:05,480
and algorithms and ideas being introduced
14294
11:51:05,480 --> 11:51:07,800
to try to solve these types of problems.
14295
11:51:07,800 --> 11:51:10,160
So I hope you enjoyed this exploration into the world
14296
11:51:10,160 --> 11:51:11,480
of artificial intelligence.
14297
11:51:11,480 --> 11:51:14,520
A huge thanks to all of the course's teaching staff and production team
14298
11:51:14,520 --> 11:51:15,960
for making the class possible.
14299
11:51:15,960 --> 11:51:30,640
This was an introduction to artificial intelligence with Python.