1
00:00:00,000 --> 00:00:07,040
This course from Harvard University explores the concepts and algorithms at the foundation of modern
2
00:00:07,040 --> 00:00:13,920
artificial intelligence, diving into the ideas that give rise to technologies like game-playing
3
00:00:13,920 --> 00:00:20,560
engines, handwriting recognition, and machine translation. You'll gain exposure to the theory
4
00:00:20,560 --> 00:00:26,880
behind graph search algorithms, classification, optimization, reinforcement learning,
5
00:00:26,880 --> 00:00:33,200
and other topics in artificial intelligence and machine learning. Brian Yu teaches this course.
6
00:00:33,200 --> 00:00:56,640
Hello, world. This is CS50, and this is an introduction to artificial intelligence with
7
00:00:56,640 --> 00:01:02,960
Python with CS50's own Brian Yu. This course picks up where CS50 itself leaves off and explores the
8
00:01:02,960 --> 00:01:06,480
concepts and algorithms at the foundation of modern AI.
9
00:01:06,480 --> 00:01:10,080
We'll start with a look at how AI can search for solutions to problems,
10
00:01:10,080 --> 00:01:12,800
whether those problems are learning how to play a game or trying
11
00:01:12,800 --> 00:01:15,120
to find driving directions to a destination.
12
00:01:15,120 --> 00:01:19,040
We'll then look at how AI can represent information, not only knowledge that our AI
13
00:01:19,040 --> 00:01:23,440
is certain about, but also information and events about which our AI might be uncertain,
14
00:01:23,440 --> 00:01:26,240
learning how to represent that information, but more importantly,
15
00:01:26,240 --> 00:01:30,240
how to use that information to draw inferences and new conclusions as well.
16
00:01:30,240 --> 00:01:33,760
We'll explore how AI can solve various types of optimization problems,
17
00:01:33,760 --> 00:01:38,400
trying to maximize profits or minimize costs or satisfy some other constraints
18
00:01:38,400 --> 00:01:41,840
before turning our attention to the fast-growing field of machine learning,
19
00:01:41,840 --> 00:01:45,440
where we won't tell our AI exactly how to solve a problem, but instead,
20
00:01:45,440 --> 00:01:48,080
give our AI access to data and experiences
21
00:01:48,080 --> 00:01:52,000
so that our AI can learn on its own how to perform these tasks.
22
00:01:52,000 --> 00:01:55,440
In particular, we'll look at neural networks, one of the most popular tools
23
00:01:55,440 --> 00:01:59,760
in modern machine learning, inspired by the way that human brains learn and reason as well
24
00:01:59,760 --> 00:02:03,200
before finally taking a look at the world of natural language processing
25
00:02:03,200 --> 00:02:06,800
so that it's not just us humans learning how artificial intelligence is
26
00:02:06,800 --> 00:02:11,840
able to speak, but also AI learning how to understand and interpret human language as well.
27
00:02:11,840 --> 00:02:14,720
We'll explore these ideas and algorithms, and along the way,
28
00:02:14,720 --> 00:02:19,520
give you the opportunity to build your own AI programs to implement all of this and more.
29
00:02:19,520 --> 00:02:44,000
This is CS50.
30
00:02:44,000 --> 00:02:44,560
All right.
31
00:02:44,560 --> 00:02:48,080
Welcome, everyone, to an introduction to artificial intelligence with Python.
32
00:02:48,080 --> 00:02:50,240
My name is Brian Yu, and in this class, we'll
33
00:02:50,240 --> 00:02:53,200
explore some of the ideas and techniques and algorithms
34
00:02:53,200 --> 00:02:56,240
that are at the foundation of artificial intelligence.
35
00:02:56,240 --> 00:03:00,000
Now, artificial intelligence covers a wide variety of types of techniques.
36
00:03:00,000 --> 00:03:01,920
Anytime you see a computer do something that
37
00:03:01,920 --> 00:03:05,120
appears to be intelligent or rational in some way,
38
00:03:05,120 --> 00:03:07,360
like recognizing someone's face in a photo,
39
00:03:07,360 --> 00:03:09,840
or being able to play a game better than people can,
40
00:03:09,840 --> 00:03:12,960
or being able to understand human language when we talk to our phones
41
00:03:12,960 --> 00:03:16,080
and they understand what we mean and are able to respond back to us,
42
00:03:16,080 --> 00:03:19,760
these are all examples of AI, or artificial intelligence.
43
00:03:19,760 --> 00:03:24,320
And in this class, we'll explore some of the ideas that make that AI possible.
44
00:03:24,320 --> 00:03:28,000
So we'll begin our conversations with search: the problem where we have an AI,
45
00:03:28,000 --> 00:03:32,160
and we would like the AI to be able to search for solutions to some kind of problem,
46
00:03:32,160 --> 00:03:33,520
no matter what that problem might be.
47
00:03:33,520 --> 00:03:37,040
Whether it's trying to get driving directions from point A to point B,
48
00:03:37,040 --> 00:03:40,080
or trying to figure out how to play a game, given a tic-tac-toe game,
49
00:03:40,080 --> 00:03:43,040
for example, figuring out what move it ought to make.
50
00:03:43,040 --> 00:03:45,520
After that, we'll take a look at knowledge.
51
00:03:45,520 --> 00:03:48,560
Ideally, we want our AI to be able to know information,
52
00:03:48,560 --> 00:03:51,440
to be able to represent that information, and more importantly,
53
00:03:51,440 --> 00:03:53,680
to be able to draw inferences from that information,
54
00:03:53,680 --> 00:03:57,760
to be able to use the information it knows and draw additional conclusions.
55
00:03:57,760 --> 00:04:02,240
So we'll talk about how AI can be programmed in order to do just that.
56
00:04:02,240 --> 00:04:04,400
Then we'll explore the topic of uncertainty,
57
00:04:04,400 --> 00:04:08,160
talking about ideas of what happens if a computer isn't sure about a fact,
58
00:04:08,160 --> 00:04:11,120
but maybe is only sure with a certain probability.
59
00:04:11,120 --> 00:04:13,520
So we'll talk about some of the ideas behind probability,
60
00:04:13,520 --> 00:04:16,640
and how computers can begin to deal with uncertain events
61
00:04:16,640 --> 00:04:20,960
in order to be a little bit more intelligent in that sense as well.
62
00:04:20,960 --> 00:04:23,680
After that, we'll turn our attention to optimization,
63
00:04:23,680 --> 00:04:27,280
problems where the computer is trying to optimize for some sort of goal,
64
00:04:27,280 --> 00:04:29,840
especially in a situation where there might be multiple ways
65
00:04:29,840 --> 00:04:33,120
that a computer might solve a problem, but we're looking for a better way,
66
00:04:33,120 --> 00:04:36,640
or potentially the best way, if that's at all possible.
67
00:04:36,640 --> 00:04:39,680
Then we'll take a look at machine learning, or learning more generally,
68
00:04:39,680 --> 00:04:41,920
and looking at how, when we have access to data,
69
00:04:41,920 --> 00:04:45,360
our computers can be programmed to be quite intelligent by learning from data
70
00:04:45,360 --> 00:04:48,880
and learning from experience, being able to perform a task better and better
71
00:04:48,880 --> 00:04:50,880
based on greater access to data.
72
00:04:50,880 --> 00:04:54,080
So your email, for example, where your email inbox somehow knows
73
00:04:54,080 --> 00:04:57,360
which of your emails are good emails and which of your emails are spam.
74
00:04:57,360 --> 00:05:01,920
These are all examples of computers being able to learn from past experiences
75
00:05:01,920 --> 00:05:03,600
and past data.
76
00:05:03,600 --> 00:05:05,520
We'll take a look, too, at how computers are
77
00:05:05,520 --> 00:05:08,320
able to draw inspiration from human intelligence,
78
00:05:08,320 --> 00:05:10,160
looking at the structure of the human brain,
79
00:05:10,160 --> 00:05:13,920
and how neural networks can be a computer analog to that sort of idea,
80
00:05:13,920 --> 00:05:17,680
and how, by taking advantage of a certain type of structure of a computer program,
81
00:05:17,680 --> 00:05:21,040
we can write neural networks that are able to perform tasks very, very
82
00:05:21,040 --> 00:05:22,240
effectively.
83
00:05:22,240 --> 00:05:25,440
And then finally, we'll turn our attention to language, not programming
84
00:05:25,440 --> 00:05:28,320
languages, but human languages that we speak every day.
85
00:05:28,320 --> 00:05:31,280
And taking a look at the challenges that come about as a computer tries
86
00:05:31,280 --> 00:05:35,280
to understand natural language, and how some of the natural language
87
00:05:35,280 --> 00:05:39,120
processing that occurs in modern artificial intelligence actually
88
00:05:39,120 --> 00:05:40,400
works.
89
00:05:40,400 --> 00:05:43,520
But today, we'll begin our conversation with search, this problem
90
00:05:43,520 --> 00:05:47,040
of trying to figure out what to do when we have some sort of situation
91
00:05:47,040 --> 00:05:50,320
that the computer is in, some sort of environment that an agent is in,
92
00:05:50,320 --> 00:05:52,560
so to speak, and we would like for that agent
93
00:05:52,560 --> 00:05:56,480
to be able to somehow look for a solution to that problem.
94
00:05:56,480 --> 00:05:59,680
Now, these problems can come in any number of different types of formats.
95
00:05:59,680 --> 00:06:01,600
One example, for instance, might be something
96
00:06:01,600 --> 00:06:04,880
like this classic 15 puzzle with the sliding tiles that you might have seen,
97
00:06:04,880 --> 00:06:06,640
where you're trying to slide the tiles in order
98
00:06:06,640 --> 00:06:09,120
to make sure that all the numbers line up in order.
99
00:06:09,120 --> 00:06:12,000
This is an example of what you might call a search problem.
100
00:06:12,000 --> 00:06:15,600
The 15 puzzle begins in an initially mixed up state,
101
00:06:15,600 --> 00:06:18,400
and we need some way of finding moves to make in order
102
00:06:18,400 --> 00:06:20,960
to return the puzzle to its solved state.
103
00:06:20,960 --> 00:06:23,440
But there are similar problems that you can frame in other ways.
104
00:06:23,440 --> 00:06:25,600
Trying to find your way through a maze, for example,
105
00:06:25,600 --> 00:06:27,440
is another example of a search problem.
106
00:06:27,440 --> 00:06:31,040
You begin in one place, you have some goal of where you're trying to get to,
107
00:06:31,040 --> 00:06:34,320
and you need to figure out the correct sequence of actions that will take you
108
00:06:34,320 --> 00:06:36,880
from that initial state to the goal.
109
00:06:36,880 --> 00:06:38,880
And while this is a little bit abstract, any time
110
00:06:38,880 --> 00:06:40,880
we talk about maze solving in this class,
111
00:06:40,880 --> 00:06:43,440
you can translate it to something a little more real world.
112
00:06:43,440 --> 00:06:45,280
Something like driving directions.
113
00:06:45,280 --> 00:06:48,640
If you ever wonder how Google Maps is able to figure out what is the best way
114
00:06:48,640 --> 00:06:52,400
for you to get from point A to point B, and what turns to make at what time,
115
00:06:52,400 --> 00:06:56,720
depending on traffic, for example, it's often some sort of search algorithm.
116
00:06:56,720 --> 00:06:59,840
You have an AI that is trying to get from an initial position
117
00:06:59,840 --> 00:07:03,520
to some sort of goal by taking some sequence of actions.
118
00:07:03,520 --> 00:07:06,160
So we'll start our conversations today by thinking
119
00:07:06,160 --> 00:07:08,080
about these types of search problems and what
120
00:07:08,080 --> 00:07:11,680
goes into solving a search problem like this in order for an AI
121
00:07:11,680 --> 00:07:14,160
to be able to find a good solution.
122
00:07:14,160 --> 00:07:15,600
In order to do so, though, we're going to need
123
00:07:15,600 --> 00:07:19,120
to introduce a little bit of terminology, some of which I've already used.
124
00:07:19,120 --> 00:07:22,080
But the first term we'll need to think about is an agent.
125
00:07:22,080 --> 00:07:25,360
An agent is just some entity that perceives its environment.
126
00:07:25,360 --> 00:07:27,520
It somehow is able to perceive the things around it
127
00:07:27,520 --> 00:07:30,080
and act on that environment in some way.
128
00:07:30,080 --> 00:07:31,600
So in the case of the driving directions,
129
00:07:31,600 --> 00:07:34,400
your agent might be some representation of a car that
130
00:07:34,400 --> 00:07:36,640
is trying to figure out what actions to take in order
131
00:07:36,640 --> 00:07:38,160
to arrive at a destination.
132
00:07:38,160 --> 00:07:40,880
In the case of the 15 puzzle with the sliding tiles,
133
00:07:40,880 --> 00:07:43,280
the agent might be the AI or the person that
134
00:07:43,280 --> 00:07:46,720
is trying to solve that puzzle to try and figure out what tiles to move
135
00:07:46,720 --> 00:07:49,520
in order to get to that solution.
136
00:07:49,520 --> 00:07:52,160
Next, we introduce the idea of a state.
137
00:07:52,160 --> 00:07:56,560
A state is just some configuration of the agent in its environment.
138
00:07:56,560 --> 00:08:00,320
So in the 15 puzzle, for example, a state might be any one of these three.
139
00:08:00,320 --> 00:08:03,760
A state is just some configuration of the tiles.
140
00:08:03,760 --> 00:08:05,680
And each of these states is different and is
141
00:08:05,680 --> 00:08:08,240
going to require a slightly different solution.
142
00:08:08,240 --> 00:08:11,440
A different sequence of actions will be needed in each one of these
143
00:08:11,440 --> 00:08:15,120
in order to get from this initial state to the goal, which
144
00:08:15,120 --> 00:08:16,880
is where we're trying to get.
145
00:08:16,880 --> 00:08:18,640
So the initial state, then, what is that?
146
00:08:18,640 --> 00:08:21,520
The initial state is just the state where the agent begins.
147
00:08:21,520 --> 00:08:24,320
It is one such state where we're going to start from.
148
00:08:24,320 --> 00:08:27,440
And this is going to be the starting point for our search algorithm,
149
00:08:27,440 --> 00:08:28,160
so to speak.
150
00:08:28,160 --> 00:08:29,840
We're going to begin with this initial state
151
00:08:29,840 --> 00:08:32,960
and then start to reason about it, to think about what actions might we
152
00:08:32,960 --> 00:08:37,120
apply to that initial state in order to figure out how to get from the beginning
153
00:08:37,120 --> 00:08:42,080
to the end, from the initial position to whatever our goal happens to be.
154
00:08:42,080 --> 00:08:44,880
And how do we make our way from that initial position to the goal?
155
00:08:44,880 --> 00:08:47,440
Well, ultimately, it's via taking actions.
156
00:08:47,440 --> 00:08:50,880
Actions are just choices that we can make in any given state.
157
00:08:50,880 --> 00:08:54,400
And in AI, we're always going to try to formalize these ideas a little bit
158
00:08:54,400 --> 00:08:57,280
more precisely, such that we could program them a little bit more
159
00:08:57,280 --> 00:08:58,800
mathematically, so to speak.
160
00:08:58,800 --> 00:09:00,480
So this will be a recurring theme.
161
00:09:00,480 --> 00:09:04,240
And we can more precisely define actions as a function.
162
00:09:04,240 --> 00:09:07,680
We're going to effectively define a function called actions that takes an
163
00:09:07,680 --> 00:09:12,960
input, s, where s is going to be some state that exists inside of our environment.
164
00:09:12,960 --> 00:09:17,600
And actions of s is going to take the state as input and return as output
165
00:09:17,600 --> 00:09:22,000
the set of all actions that can be executed in that state.
166
00:09:22,000 --> 00:09:25,600
And so it's possible that some actions are only valid in certain states
167
00:09:25,600 --> 00:09:27,040
and not in other states.
168
00:09:27,040 --> 00:09:29,840
And we'll see examples of that soon, too.
169
00:09:29,840 --> 00:09:31,920
So in the case of the 15 puzzle, for example,
170
00:09:31,920 --> 00:09:35,600
there are generally going to be four possible actions that we can do most of
171
00:09:35,600 --> 00:09:36,160
the time.
172
00:09:36,160 --> 00:09:39,680
We can slide a tile to the right, slide a tile to the left, slide a tile up,
173
00:09:39,680 --> 00:09:41,680
or slide a tile down, for example.
174
00:09:41,680 --> 00:09:45,280
And those are going to be the actions that are available to us.
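As a concrete sketch of the ACTIONS function for the 15 puzzle: assuming a hypothetical encoding where the state is a tuple of 16 numbers in row-major order with 0 marking the blank square, the set of legal moves depends on where the blank happens to be:

```python
# A minimal sketch of ACTIONS(s) for the 15 puzzle. Hypothetical encoding:
# the state is a tuple of 16 numbers (row-major), with 0 standing in for the
# blank square. An "action" names the direction a neighboring tile slides.
def actions(state):
    """Return the set of actions that can be executed in `state`."""
    blank = state.index(0)
    row, col = divmod(blank, 4)
    moves = set()
    if row < 3:
        moves.add("up")     # tile below the blank slides up
    if row > 0:
        moves.add("down")   # tile above the blank slides down
    if col < 3:
        moves.add("left")   # tile right of the blank slides left
    if col > 0:
        moves.add("right")  # tile left of the blank slides right
    return moves
```

Note how a state with the blank in a corner permits only two of the four moves — exactly the point that some actions are only valid in certain states.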
175
00:09:45,280 --> 00:09:48,400
So somehow our AI, our program, needs some encoding
176
00:09:48,400 --> 00:09:51,600
of the state, which is often going to be in some numerical format,
177
00:09:51,600 --> 00:09:53,520
and some encoding of these actions.
178
00:09:53,520 --> 00:09:56,640
But it also needs some encoding of the relationship between these things.
179
00:09:56,640 --> 00:10:00,080
How do the states and actions relate to one another?
180
00:10:00,080 --> 00:10:04,000
And in order to do that, we'll introduce to our AI a transition model, which
181
00:10:04,000 --> 00:10:08,240
will be a description of what state we get after we perform some available
182
00:10:08,240 --> 00:10:10,800
action in some other state.
183
00:10:10,800 --> 00:10:12,960
And again, we can be a little bit more precise about this,
184
00:10:12,960 --> 00:10:17,200
define this transition model a little bit more formally, again, as a function.
185
00:10:17,200 --> 00:10:20,720
The function is going to be a function called result that this time takes two
186
00:10:20,720 --> 00:10:21,600
inputs.
187
00:10:21,600 --> 00:10:24,560
Input number one is s, some state.
188
00:10:24,560 --> 00:10:27,680
And input number two is a, some action.
189
00:10:27,680 --> 00:10:30,080
And the output of this result function
190
00:10:30,080 --> 00:10:36,320
is going to give us the state that we get after we perform action a in state s.
191
00:10:36,320 --> 00:10:39,840
So let's take a look at an example to see more precisely what this actually means.
192
00:10:39,840 --> 00:10:43,280
Here is an example of a state, of the 15 puzzle, for example.
193
00:10:43,280 --> 00:10:46,880
And here is an example of an action, sliding a tile to the right.
194
00:10:46,880 --> 00:10:50,160
What happens if we pass these as inputs to the result function?
195
00:10:50,160 --> 00:10:54,720
Again, the result function takes this board, this state, as its first input.
196
00:10:54,720 --> 00:10:57,120
And it takes an action as a second input.
197
00:10:57,120 --> 00:10:59,360
And of course, here, I'm describing things visually
198
00:10:59,360 --> 00:11:02,320
so that you can see visually what the state is and what the action is.
199
00:11:02,320 --> 00:11:04,720
In a computer, you might represent one of these actions
200
00:11:04,720 --> 00:11:06,960
as just some number that represents the action.
201
00:11:06,960 --> 00:11:08,720
Or if you're familiar with enums that allow
202
00:11:08,720 --> 00:11:10,400
you to enumerate multiple possibilities,
203
00:11:10,400 --> 00:11:11,760
it might be something like that.
204
00:11:11,760 --> 00:11:13,760
And this state might just be represented
205
00:11:13,760 --> 00:11:17,760
as an array or two-dimensional array of all of these numbers that exist.
206
00:11:17,760 --> 00:11:20,880
But here, we're going to show it visually just so you can see it.
207
00:11:20,880 --> 00:11:23,360
But when we take this state and this action,
208
00:11:23,360 --> 00:11:26,800
pass it into the result function, the output is a new state.
209
00:11:26,800 --> 00:11:30,080
The state we get after we take a tile and slide it to the right,
210
00:11:30,080 --> 00:11:32,000
and this is the state we get as a result.
211
00:11:32,000 --> 00:11:35,200
If we had a different action and a different state, for example,
212
00:11:35,200 --> 00:11:37,120
and pass that into the result function, we'd
213
00:11:37,120 --> 00:11:38,960
get a different answer altogether.
214
00:11:38,960 --> 00:11:41,280
So the result function needs to take care
215
00:11:41,280 --> 00:11:45,600
of figuring out how to take a state and take an action and get what results.
216
00:11:45,600 --> 00:11:48,320
And this is going to be our transition model that
217
00:11:48,320 --> 00:11:52,800
describes how it is that states and actions are related to each other.
218
00:11:52,800 --> 00:11:55,760
If we take this transition model and think about it more generally
219
00:11:55,760 --> 00:12:00,320
and across the entire problem, we can form what we might call a state space.
220
00:12:00,320 --> 00:12:03,520
The set of all of the states we can get from the initial state
221
00:12:03,520 --> 00:12:08,160
via any sequence of actions, by taking 0 or 1 or 2 or more actions.
222
00:12:08,160 --> 00:12:12,160
So we could draw a diagram that looks something like this, where
223
00:12:12,160 --> 00:12:15,920
every state is represented here by a game board, and there are arrows
224
00:12:15,920 --> 00:12:20,240
that connect every state to every other state we can get to from that state.
225
00:12:20,240 --> 00:12:23,280
And the state space is much larger than what you see just here.
226
00:12:23,280 --> 00:12:27,600
This is just a sample of what the state space might actually look like.
227
00:12:27,600 --> 00:12:29,840
And in general, across many search problems,
228
00:12:29,840 --> 00:12:33,680
whether they're this particular 15 puzzle or driving directions or something else,
229
00:12:33,680 --> 00:12:36,080
the state space is going to look something like this.
230
00:12:36,080 --> 00:12:40,480
We have individual states and arrows that are connecting them.
231
00:12:40,480 --> 00:12:42,560
And oftentimes, just for simplicity, we'll
232
00:12:42,560 --> 00:12:47,120
simplify our representation of this entire thing as a graph, some sequence
233
00:12:47,120 --> 00:12:50,080
of nodes and edges that connect nodes.
234
00:12:50,080 --> 00:12:52,640
But you can think of this more abstract representation
235
00:12:52,640 --> 00:12:54,160
as the exact same idea.
236
00:12:54,160 --> 00:12:56,320
Each of these little circles or nodes is going
237
00:12:56,320 --> 00:12:59,360
to represent one of the states inside of our problem.
238
00:12:59,360 --> 00:13:01,440
And the arrows here represent the actions
239
00:13:01,440 --> 00:13:04,320
that we can take in any particular state, taking us
240
00:13:04,320 --> 00:13:09,680
from one particular state to another state, for example.
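The state-space idea — every state reachable from the initial state, with arrows for the actions connecting them — can be built programmatically. Below is a sketch using a deliberately tiny toy problem (hypothetical integer states with "+1" and "*2" actions, not the 15 puzzle) so that the whole graph fits in a few lines:

```python
# A sketch of building a state space as a graph: nodes are states, edges are
# actions. Toy problem (hypothetical): states are integers, the actions are
# "+1" and "*2", and any result above 4 is out of bounds (not a valid action).
def result(s, a):
    nxt = s + 1 if a == "+1" else s * 2
    return nxt if nxt <= 4 else None

def actions(s):
    return [a for a in ("+1", "*2") if result(s, a) is not None]

def state_space(initial):
    """Return {state: {action: next_state}} for every reachable state."""
    graph, frontier = {}, [initial]
    while frontier:
        s = frontier.pop()
        if s in graph:
            continue                                   # already explored
        graph[s] = {a: result(s, a) for a in actions(s)}
        frontier.extend(graph[s].values())             # explore neighbors too
    return graph
```

For a real problem like the 15 puzzle the same construction applies, but the graph is far too large to enumerate fully — which is why search algorithms explore it lazily instead.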
241
00:13:09,680 --> 00:13:10,560
All right.
242
00:13:10,560 --> 00:13:14,320
So now we have this idea of nodes that are representing these states,
243
00:13:14,320 --> 00:13:16,560
actions that can take us from one state to another,
244
00:13:16,560 --> 00:13:19,520
and a transition model that defines what happens after we
245
00:13:19,520 --> 00:13:21,120
take a particular action.
246
00:13:21,120 --> 00:13:23,280
So the next step we need to figure out is how
247
00:13:23,280 --> 00:13:26,400
we know when the AI is done solving the problem.
248
00:13:26,400 --> 00:13:30,720
The AI needs some way to know, when it gets to the goal, that it's found the goal.
249
00:13:30,720 --> 00:13:33,920
So the next thing we'll need to encode into our artificial intelligence
250
00:13:33,920 --> 00:13:39,200
is a goal test, some way to determine whether a given state is a goal state.
251
00:13:39,200 --> 00:13:42,560
In the case of something like driving directions, it might be pretty easy.
252
00:13:42,560 --> 00:13:45,600
If you're in a state that corresponds to whatever the user typed
253
00:13:45,600 --> 00:13:48,960
in as their intended destination, well, then you know you're in a goal state.
254
00:13:48,960 --> 00:13:51,200
In the 15 puzzle, it might be checking the numbers
255
00:13:51,200 --> 00:13:52,880
to make sure they're all in ascending order.
256
00:13:52,880 --> 00:13:55,760
But the AI needs some way to encode whether or not
257
00:13:55,760 --> 00:13:58,160
any state they happen to be in is a goal.
258
00:13:58,160 --> 00:14:00,480
And some problems might have one goal, like a maze
259
00:14:00,480 --> 00:14:03,120
where you have one initial position and one ending position,
260
00:14:03,120 --> 00:14:04,240
and that's the goal.
261
00:14:04,240 --> 00:14:06,560
In other more complex problems, you might imagine
262
00:14:06,560 --> 00:14:08,240
that there are multiple possible goals.
263
00:14:08,240 --> 00:14:10,880
That there are multiple ways to solve a problem,
264
00:14:10,880 --> 00:14:13,680
and we might not care which one the computer finds,
265
00:14:13,680 --> 00:14:17,200
as long as it does find a particular goal.
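For the 15 puzzle, the goal test just described — checking that the numbers are all in ascending order — can be as simple as comparing against the solved board (again assuming a hypothetical encoding of the state as a tuple of 16 numbers with 0 as the blank):

```python
# A minimal sketch of a goal test for the 15 puzzle, assuming the state is a
# tuple of 16 numbers (row-major) with 0 standing in for the blank square.
GOAL = tuple(range(1, 16)) + (0,)  # tiles 1-15 in ascending order, blank last

def goal_test(state):
    """Return True if `state` is a goal state."""
    return state == GOAL
```

A problem with multiple acceptable goals would instead check membership in a set of goal states, or test a property of the state rather than compare against one fixed board.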
266
00:14:17,200 --> 00:14:20,800
However, sometimes the computer doesn't just care about finding a goal,
267
00:14:20,800 --> 00:14:23,840
but finding a goal well, or one with a low cost.
268
00:14:23,840 --> 00:14:26,160
And it's for that reason that the last piece of terminology
269
00:14:26,160 --> 00:14:28,240
that we'll use to define these search problems
270
00:14:28,240 --> 00:14:30,560
is something called a path cost.
271
00:14:30,560 --> 00:14:33,040
You might imagine that in the case of driving directions,
272
00:14:33,040 --> 00:14:36,560
it would be pretty annoying if I said I wanted directions from point A
273
00:14:36,560 --> 00:14:38,960
to point B, and the route that Google Maps gave me
274
00:14:38,960 --> 00:14:42,640
was a long route with lots of detours that were unnecessary that took longer
275
00:14:42,640 --> 00:14:45,360
than it should have for me to get to that destination.
276
00:14:45,360 --> 00:14:48,240
And it's for that reason that when we're formulating search problems,
277
00:14:48,240 --> 00:14:51,920
we'll often give every path some sort of numerical cost,
278
00:14:51,920 --> 00:14:56,480
some number telling us how expensive it is to take this particular option,
279
00:14:56,480 --> 00:14:59,440
and then tell our AI that instead of just finding
280
00:14:59,440 --> 00:15:02,800
a solution, some way of getting from the initial state to the goal,
281
00:15:02,800 --> 00:15:06,480
we'd really like to find one that minimizes this path cost.
282
00:15:06,480 --> 00:15:09,200
That is, one that is less expensive, or takes less time,
283
00:15:09,200 --> 00:15:12,320
or minimizes some other numerical value.
284
00:15:12,320 --> 00:15:15,520
We can represent this graphically if we take a look at this graph again,
285
00:15:15,520 --> 00:15:18,560
and imagine that each of these arrows, each of these actions
286
00:15:18,560 --> 00:15:21,360
that we can take from one state to another state,
287
00:15:21,360 --> 00:15:23,520
has some sort of number associated with it.
288
00:15:23,520 --> 00:15:26,800
That number being the path cost of this particular action,
289
00:15:26,800 --> 00:15:29,280
where the cost for one particular action
290
00:15:29,280 --> 00:15:33,280
might be more expensive than the cost for some other action, for example.
291
00:15:33,280 --> 00:15:35,920
Although this will only happen in some sorts of problems.
292
00:15:35,920 --> 00:15:38,320
In other problems, we can simplify the diagram
293
00:15:38,320 --> 00:15:42,400
and just assume that the cost of any particular action is the same.
294
00:15:42,400 --> 00:15:45,280
And this is probably the case in something like the 15 puzzle,
295
00:15:45,280 --> 00:15:47,840
for example, where it doesn't really make a difference
296
00:15:47,840 --> 00:15:49,680
whether I'm moving right or moving left.
297
00:15:49,680 --> 00:15:52,240
The only thing that matters is the total number
298
00:15:52,240 --> 00:15:56,080
of steps that I have to take to get from point A to point B.
299
00:15:56,080 --> 00:15:58,720
And each of those steps is of equal cost.
300
00:15:58,720 --> 00:16:03,040
We can just assume it's of some constant cost like one.
301
00:16:03,040 --> 00:16:07,520
And so this now forms the basis for what we might consider to be a search problem.
302
00:16:07,520 --> 00:16:11,680
A search problem has some sort of initial state, some place where we begin,
303
00:16:11,680 --> 00:16:14,160
some sort of action that we can take or multiple actions
304
00:16:14,160 --> 00:16:16,080
that we can take in any given state.
305
00:16:16,080 --> 00:16:17,680
And it has a transition model.
306
00:16:17,680 --> 00:16:21,120
Some way of defining what happens when we go from one state
307
00:16:21,120 --> 00:16:24,960
and take one action, what state do we end up with as a result.
308
00:16:24,960 --> 00:16:26,960
In addition to that, we need some goal test
309
00:16:26,960 --> 00:16:29,440
to know whether or not we've reached a goal.
310
00:16:29,440 --> 00:16:31,840
And then we need a path cost function that
311
00:16:31,840 --> 00:16:35,760
tells us for any particular path, by following some sequence of actions,
312
00:16:35,760 --> 00:16:37,520
how expensive is that path.
313
00:16:37,520 --> 00:16:41,280
What it costs in terms of money or time or some other resource
314
00:16:41,280 --> 00:16:44,160
that we are trying to minimize our usage of.
315
00:16:44,160 --> 00:16:46,880
And the goal ultimately is to find a solution.
316
00:16:46,880 --> 00:16:50,000
Where a solution in this case is just some sequence of actions
317
00:16:50,000 --> 00:16:52,960
that will take us from the initial state to the goal state.
318
00:16:52,960 --> 00:16:55,920
And ideally, we'd like to find not just any solution
319
00:16:55,920 --> 00:16:58,800
but the optimal solution, which is a solution that
320
00:16:58,800 --> 00:17:02,800
has the lowest path cost among all of the possible solutions.
321
00:17:02,800 --> 00:17:05,440
And in some cases, there might be multiple optimal solutions.
322
00:17:05,440 --> 00:17:07,440
But an optimal solution just means that there
323
00:17:07,440 --> 00:17:12,160
is no way that we could have done better in terms of finding that solution.
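Putting the pieces together, the full search-problem definition just given — initial state, actions, transition model, goal test, and path cost — can be bundled into one structure. This is an illustrative sketch; the field and function names here are my own, not from the course's distribution code:

```python
# A sketch collecting the components of a search problem into one structure.
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class SearchProblem:
    initial_state: Any
    actions: Callable[[Any], Iterable]       # ACTIONS(s): valid actions in s
    result: Callable[[Any, Any], Any]        # RESULT(s, a): transition model
    goal_test: Callable[[Any], bool]         # is s a goal state?
    step_cost: Callable[[Any, Any], float] = lambda s, a: 1  # constant cost

    def path_cost(self, steps):
        """Total cost of a path given as (state, action) pairs."""
        return sum(self.step_cost(s, a) for s, a in steps)
```

With the constant step cost of 1 (as in the 15 puzzle), the path cost is simply the number of actions taken, so the optimal solution is the shortest one.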
324
00:17:12,160 --> 00:17:13,760
So now we've defined the problem.
325
00:17:13,760 --> 00:17:15,920
And now we need to begin to figure out how it
326
00:17:15,920 --> 00:17:18,800
is that we're going to solve this kind of search problem.
327
00:17:18,800 --> 00:17:21,120
And in order to do so, you'll probably imagine
328
00:17:21,120 --> 00:17:24,640
that our computer is going to need to represent a whole bunch of data
329
00:17:24,640 --> 00:17:26,000
about this particular problem.
330
00:17:26,000 --> 00:17:28,880
We need to represent data about where we are in the problem.
331
00:17:28,880 --> 00:17:32,320
And we might need to be considering multiple different options at once.
332
00:17:32,320 --> 00:17:35,280
And oftentimes, when we're trying to package a whole bunch of data
333
00:17:35,280 --> 00:17:38,640
related to a state together, we'll do so using a data structure
334
00:17:38,640 --> 00:17:40,480
that we're going to call a node.
335
00:17:40,480 --> 00:17:42,400
A node is a data structure that is just going
336
00:17:42,400 --> 00:17:44,960
to keep track of a variety of different values.
337
00:17:44,960 --> 00:17:47,280
And specifically, in the case of a search problem,
338
00:17:47,280 --> 00:17:50,480
it's going to keep track of these four values in particular.
339
00:17:50,480 --> 00:17:54,400
Every node is going to keep track of a state, the state we're currently on.
340
00:17:54,400 --> 00:17:57,360
And every node is also going to keep track of a parent.
341
00:17:57,360 --> 00:18:00,320
A parent being the state before us or the node
342
00:18:00,320 --> 00:18:03,440
that we used in order to get to this current state.
343
00:18:03,440 --> 00:18:07,120
And this is going to be relevant because eventually, once we reach the goal node,
344
00:18:07,120 --> 00:18:10,720
once we get to the end, we want to know what sequence of actions
345
00:18:10,720 --> 00:18:12,880
we use in order to get to that goal.
346
00:18:12,880 --> 00:18:16,000
And the way we'll know that is by looking at these parents
347
00:18:16,000 --> 00:18:19,680
to keep track of what led us to the goal and what led us to that state
348
00:18:19,680 --> 00:18:22,560
and what led us to the state before that, so on and so forth,
349
00:18:22,560 --> 00:18:25,200
backtracking our way to the beginning so that we
350
00:18:25,200 --> 00:18:27,840
know the entire sequence of actions we needed in order
351
00:18:27,840 --> 00:18:30,560
to get from the beginning to the end.
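[The backtracking described here can be sketched in Python. This is a minimal illustration, not the course's distribution code; the `Node` fields and names are assumptions based on the description above.]

```python
from collections import namedtuple

# Minimal stand-in for a search node (hypothetical field names),
# keeping a parent pointer and the action that led here.
Node = namedtuple("Node", ["state", "parent", "action"])

def reconstruct_actions(goal_node):
    """Follow parent pointers from the goal back to the start, then reverse."""
    actions = []
    node = goal_node
    while node.parent is not None:  # the initial node has no parent
        actions.append(node.action)
        node = node.parent
    actions.reverse()               # we collected them goal-to-start
    return actions

# Tiny chain: start -> mid -> goal
start = Node("A", None, None)
mid = Node("B", start, "go_B")
goal = Node("C", mid, "go_C")
print(reconstruct_actions(goal))  # ['go_B', 'go_C']
```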
352
00:18:30,560 --> 00:18:33,440
The node is also going to keep track of what action we took in order
353
00:18:33,440 --> 00:18:35,920
to get from the parent to the current state.
354
00:18:35,920 --> 00:18:39,360
And the node is also going to keep track of a path cost.
355
00:18:39,360 --> 00:18:41,920
In other words, it's going to keep track of the number
356
00:18:41,920 --> 00:18:45,440
that represents how long it took to get from the initial state
357
00:18:45,440 --> 00:18:47,920
to the state that we currently happen to be at.
358
00:18:47,920 --> 00:18:49,760
And we'll see why this is relevant as we
359
00:18:49,760 --> 00:18:51,600
start to talk about some of the optimizations
360
00:18:51,600 --> 00:18:55,360
that we can make in terms of these search problems more generally.
361
00:18:55,360 --> 00:18:57,920
So this is the data structure that we're going to use in order to solve
362
00:18:57,920 --> 00:18:58,800
the problem.
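[The node data structure just described, with its four values, might look like this in Python. A rough sketch; the attribute names are assumptions, not necessarily the ones the course's code uses.]

```python
class Node:
    """Bundle the four values a search node keeps track of."""
    def __init__(self, state, parent=None, action=None, path_cost=0):
        self.state = state          # the state this node represents
        self.parent = parent        # the node we expanded to reach this one
        self.action = action        # the action taken from parent to here
        self.path_cost = path_cost  # cost from the initial state to here

# The initial node has no parent; each child extends the path cost.
start = Node("A")
child = Node("B", parent=start, action="move", path_cost=start.path_cost + 1)
```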
363
00:18:58,800 --> 00:19:00,480
And now let's talk about the approach.
364
00:19:00,480 --> 00:19:03,840
How might we actually begin to solve the problem?
365
00:19:03,840 --> 00:19:05,840
Well, as you might imagine, what we're going to do
366
00:19:05,840 --> 00:19:08,000
is we're going to start at one particular state,
367
00:19:08,000 --> 00:19:10,560
and we're just going to explore from there.
368
00:19:10,560 --> 00:19:12,560
The intuition is that from a given state,
369
00:19:12,560 --> 00:19:14,560
we have multiple options that we could take,
370
00:19:14,560 --> 00:19:16,640
and we're going to explore those options.
371
00:19:16,640 --> 00:19:18,640
And once we explore those options, we'll
372
00:19:18,640 --> 00:19:22,160
find that even more options are going to make themselves available.
373
00:19:22,160 --> 00:19:24,960
And we're going to consider all of the available options
374
00:19:24,960 --> 00:19:29,120
to be stored inside of a single data structure that we'll call the frontier.
375
00:19:29,120 --> 00:19:31,600
The frontier is going to represent all of the things
376
00:19:31,600 --> 00:19:36,640
that we could explore next that we haven't yet explored or visited.
377
00:19:36,640 --> 00:19:39,200
So in our approach, we're going to begin the search algorithm
378
00:19:39,200 --> 00:19:42,800
by starting with a frontier that just contains one state.
379
00:19:42,800 --> 00:19:45,280
The frontier is going to contain the initial state,
380
00:19:45,280 --> 00:19:47,840
because at the beginning, that's the only state we know about.
381
00:19:47,840 --> 00:19:50,160
That is the only state that exists.
382
00:19:50,160 --> 00:19:53,600
And then our search algorithm is effectively going to follow a loop.
383
00:19:53,600 --> 00:19:57,200
We're going to repeat some process again and again and again.
384
00:19:57,200 --> 00:20:01,040
The first thing we're going to do is if the frontier is empty,
385
00:20:01,040 --> 00:20:02,320
then there's no solution.
386
00:20:02,320 --> 00:20:05,120
And we can report that there is no way to get to the goal.
387
00:20:05,120 --> 00:20:06,400
And that's certainly possible.
388
00:20:06,400 --> 00:20:09,680
There are certain types of problems that an AI might try to explore
389
00:20:09,680 --> 00:20:12,640
and realize that there is no way to solve that problem.
390
00:20:12,640 --> 00:20:15,360
And that's useful information for humans to know as well.
391
00:20:15,360 --> 00:20:19,360
So if ever the frontier is empty, that means there's nothing left to explore.
392
00:20:19,360 --> 00:20:22,960
And we haven't yet found a solution, so there is no solution.
393
00:20:22,960 --> 00:20:24,960
There's nothing left to explore.
394
00:20:24,960 --> 00:20:28,720
Otherwise, what we'll do is we'll remove a node from the frontier.
395
00:20:28,720 --> 00:20:32,000
So right now at the beginning, the frontier just contains one node
396
00:20:32,000 --> 00:20:33,680
representing the initial state.
397
00:20:33,680 --> 00:20:35,360
But over time, the frontier might grow.
398
00:20:35,360 --> 00:20:36,880
It might contain multiple states.
399
00:20:36,880 --> 00:20:41,520
And so here, we're just going to remove a single node from that frontier.
400
00:20:41,520 --> 00:20:44,800
If that node happens to be a goal, then we found a solution.
401
00:20:44,800 --> 00:20:48,240
So we remove a node from the frontier and ask ourselves, is this the goal?
402
00:20:48,240 --> 00:20:51,360
And we do that by applying the goal test that we talked about earlier,
403
00:20:51,360 --> 00:20:53,120
asking if we're at the destination.
404
00:20:53,120 --> 00:20:56,960
Or asking if all the numbers of the 15 puzzle happen to be in order.
405
00:20:56,960 --> 00:20:59,760
So if the node contains the goal, we found a solution.
406
00:20:59,760 --> 00:21:00,240
Great.
407
00:21:00,240 --> 00:21:01,680
We're done.
408
00:21:01,680 --> 00:21:06,480
And otherwise, what we'll need to do is we'll need to expand the node.
409
00:21:06,480 --> 00:21:08,800
And this is a term of art in artificial intelligence.
410
00:21:08,800 --> 00:21:12,720
To expand the node just means to look at all of the neighbors of that node.
411
00:21:12,720 --> 00:21:15,440
In other words, consider all of the possible actions
412
00:21:15,440 --> 00:21:18,640
that I could take from the state that this node is representing
413
00:21:18,640 --> 00:21:21,120
and what nodes could I get to from there.
414
00:21:21,120 --> 00:21:23,360
We're going to take all of those nodes, the next nodes
415
00:21:23,360 --> 00:21:26,000
that I can get to from this current one I'm looking at,
416
00:21:26,000 --> 00:21:28,080
and add those to the frontier.
417
00:21:28,080 --> 00:21:30,240
And then we'll repeat this process.
418
00:21:30,240 --> 00:21:32,640
So at a very high level, the idea is we start
419
00:21:32,640 --> 00:21:35,200
with a frontier that contains the initial state.
420
00:21:35,200 --> 00:21:38,000
And we're constantly removing a node from the frontier,
421
00:21:38,000 --> 00:21:41,920
looking at where we can get to next and adding those nodes to the frontier,
422
00:21:41,920 --> 00:21:44,720
repeating this process over and over until either we
423
00:21:44,720 --> 00:21:47,440
remove a node from the frontier and it contains a goal,
424
00:21:47,440 --> 00:21:50,800
meaning we've solved the problem, or we run into a situation
425
00:21:50,800 --> 00:21:55,280
where the frontier is empty, at which point we're left with no solution.
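[The loop just described can be sketched in Python against the sample graph used below (A connected to B, B to C and D, C to E, D to F). The adjacency map and function names are assumptions for illustration; to keep the sketch short it tracks plain states rather than full nodes.]

```python
# Hypothetical adjacency map matching the lecture's sample graph.
GRAPH = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["F"], "E": [], "F": []}

def search(start, goal):
    frontier = [start]                 # frontier starts with the initial state
    while True:
        if not frontier:               # empty frontier: no solution exists
            return None
        state = frontier.pop()         # remove a node (order unspecified here)
        if state == goal:              # goal test
            return state
        frontier.extend(GRAPH[state])  # expand: add neighbors to the frontier

print(search("A", "E"))  # E
```

Note that this basic version can loop forever if the graph contains cycles, which is exactly the problem the lecture turns to next.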
426
00:21:55,280 --> 00:21:57,440
So let's actually try and take the pseudocode,
427
00:21:57,440 --> 00:22:02,160
put it into practice by taking a look at an example of a sample search problem.
428
00:22:02,160 --> 00:22:04,080
So right here, I have a sample graph.
429
00:22:04,080 --> 00:22:06,240
A is connected to B via this action.
430
00:22:06,240 --> 00:22:10,640
B is connected to nodes C and D. C is connected to E. D is connected to F.
431
00:22:10,640 --> 00:22:16,400
And what I'd like to do is have my AI find a path from A to E.
432
00:22:16,400 --> 00:22:20,800
We want to get from this initial state to this goal state.
433
00:22:20,800 --> 00:22:22,320
So how are we going to do that?
434
00:22:22,320 --> 00:22:25,520
Well, we're going to start with a frontier that contains the initial state.
435
00:22:25,520 --> 00:22:27,360
This is going to represent our frontier.
436
00:22:27,360 --> 00:22:29,360
So our frontier initially will just contain
437
00:22:29,360 --> 00:22:32,400
A, that initial state where we're going to begin.
438
00:22:32,400 --> 00:22:34,240
And now we'll repeat this process.
439
00:22:34,240 --> 00:22:36,240
If the frontier is empty, no solution.
440
00:22:36,240 --> 00:22:38,720
That's not a problem, because the frontier is not empty.
441
00:22:38,720 --> 00:22:42,880
So we'll remove a node from the frontier as the one to consider next.
442
00:22:42,880 --> 00:22:44,480
There's only one node in the frontier.
443
00:22:44,480 --> 00:22:46,640
So we'll go ahead and remove it from the frontier.
444
00:22:46,640 --> 00:22:51,280
But now A, this initial node, this is the node we're currently considering.
445
00:22:51,280 --> 00:22:52,400
We follow the next step.
446
00:22:52,400 --> 00:22:55,040
We ask ourselves, is this node the goal?
447
00:22:55,040 --> 00:22:55,760
No, it's not.
448
00:22:55,760 --> 00:22:56,640
A is not the goal.
449
00:22:56,640 --> 00:22:57,920
E is the goal.
450
00:22:57,920 --> 00:22:59,600
So we don't return the solution.
451
00:22:59,600 --> 00:23:02,960
So instead, we go to this last step, expand the node,
452
00:23:02,960 --> 00:23:05,760
and add the resulting nodes to the frontier.
453
00:23:05,760 --> 00:23:06,720
What does that mean?
454
00:23:06,720 --> 00:23:10,800
Well, it means take this state A and consider where we could get to next.
455
00:23:10,800 --> 00:23:14,000
And after A, what we could get to next is only B.
456
00:23:14,000 --> 00:23:16,880
So that's what we get when we expand A. We find B.
457
00:23:16,880 --> 00:23:18,800
And we add B to the frontier.
458
00:23:18,800 --> 00:23:20,400
And now B is in the frontier.
459
00:23:20,400 --> 00:23:22,080
And we repeat the process again.
460
00:23:22,080 --> 00:23:24,080
We say, all right, the frontier is not empty.
461
00:23:24,080 --> 00:23:26,240
So let's remove B from the frontier.
462
00:23:26,240 --> 00:23:28,080
B is now the node that we're considering.
463
00:23:28,080 --> 00:23:29,920
We ask ourselves, is B the goal?
464
00:23:29,920 --> 00:23:30,880
No, it's not.
465
00:23:30,880 --> 00:23:35,760
So we go ahead and expand B and add its resulting nodes to the frontier.
466
00:23:35,760 --> 00:23:37,440
What happens when we expand B?
467
00:23:37,440 --> 00:23:40,480
In other words, what nodes can we get to from B?
468
00:23:40,480 --> 00:23:43,760
Well, we can get to C and D. So we'll go ahead and add C and D
469
00:23:43,760 --> 00:23:44,800
to the frontier.
470
00:23:44,800 --> 00:23:47,200
And now we have two nodes in the frontier, C and D.
471
00:23:47,200 --> 00:23:48,880
And we repeat the process again.
472
00:23:48,880 --> 00:23:50,560
We remove a node from the frontier.
473
00:23:50,560 --> 00:23:52,960
For now, I'll do so arbitrarily just by picking C.
474
00:23:52,960 --> 00:23:56,320
We'll see why later, how choosing which node you remove from the frontier
475
00:23:56,320 --> 00:23:58,560
is actually quite an important part of the algorithm.
476
00:23:58,560 --> 00:24:02,000
But for now, I'll arbitrarily remove C, say it's not the goal.
477
00:24:02,000 --> 00:24:05,040
So we'll add E, the next one, to the frontier.
478
00:24:05,040 --> 00:24:07,200
Then let's say I remove E from the frontier.
479
00:24:07,200 --> 00:24:11,440
And now I check I'm currently looking at state E. Is it a goal state?
480
00:24:11,440 --> 00:24:15,600
It is, because I'm trying to find a path from A to E. So I would return the goal.
481
00:24:15,600 --> 00:24:19,760
And that now would be the solution, that I'm now able to return the solution.
482
00:24:19,760 --> 00:24:23,120
And I have found a path from A to E.
483
00:24:23,120 --> 00:24:26,560
So this is the general idea, the general approach of this search algorithm,
484
00:24:26,560 --> 00:24:30,080
to follow these steps, constantly removing nodes from the frontier,
485
00:24:30,080 --> 00:24:31,600
until we're able to find a solution.
486
00:24:31,600 --> 00:24:35,600
So the next question you might reasonably ask is, what could go wrong here?
487
00:24:35,600 --> 00:24:39,040
What are the potential problems with an approach like this?
488
00:24:39,040 --> 00:24:42,960
And here's one example of a problem that could arise from this sort of approach.
489
00:24:42,960 --> 00:24:47,040
Imagine this same graph, same as before, with one change.
490
00:24:47,040 --> 00:24:50,160
The change being now, instead of just an arrow from A to B,
491
00:24:50,160 --> 00:24:54,240
we also have an arrow from B to A, meaning we can go in both directions.
492
00:24:54,240 --> 00:24:57,600
And this is true in something like the 15 puzzle, where when I slide a tile
493
00:24:57,600 --> 00:25:00,640
to the right, I could then slide a tile to the left
494
00:25:00,640 --> 00:25:02,320
to get back to the original position.
495
00:25:02,320 --> 00:25:04,800
I could go back and forth between A and B.
496
00:25:04,800 --> 00:25:06,880
And that's what these double arrows symbolize,
497
00:25:06,880 --> 00:25:10,640
the idea that from one state, I can get to another, and then I can get back.
498
00:25:10,640 --> 00:25:12,880
And that's true in many search problems.
499
00:25:12,880 --> 00:25:16,240
What's going to happen if I try to apply the same approach now?
500
00:25:16,240 --> 00:25:18,480
Well, I'll begin with A, same as before.
501
00:25:18,480 --> 00:25:20,480
And I'll remove A from the frontier.
502
00:25:20,480 --> 00:25:23,200
And then I'll consider where I can get to from A.
503
00:25:23,200 --> 00:25:28,160
And after A, the only place I can get to is B. So B goes into the frontier.
504
00:25:28,160 --> 00:25:29,760
Then I'll say, all right, let's take a look at B.
505
00:25:29,760 --> 00:25:31,600
That's the only thing left in the frontier.
506
00:25:31,600 --> 00:25:33,600
Where can I get to from B?
507
00:25:33,600 --> 00:25:37,840
Before, it was just C and D. But now, because of that reverse arrow,
508
00:25:37,840 --> 00:25:43,360
I can get to A or C or D. So all three, A, C, and D, all of those
509
00:25:43,360 --> 00:25:44,560
now go into the frontier.
510
00:25:44,560 --> 00:25:48,800
They are places I can get to from B. And now I remove one from the frontier.
511
00:25:48,800 --> 00:25:53,200
And maybe I'm unlucky, and maybe I pick A. And now I'm looking at A again.
512
00:25:53,200 --> 00:25:54,880
And I consider, where can I get to from A?
513
00:25:54,880 --> 00:25:58,320
And from A, well, I can get to B. And now we start to see the problem.
514
00:25:58,320 --> 00:26:02,560
But if I'm not careful, I go from A to B, and then back to A, and then to B again.
515
00:26:02,560 --> 00:26:05,920
And I could be going in this infinite loop, where I never make any progress,
516
00:26:05,920 --> 00:26:09,200
because I'm constantly just going back and forth between two states
517
00:26:09,200 --> 00:26:10,880
that I've already seen.
518
00:26:10,880 --> 00:26:12,160
So what is the solution to this?
519
00:26:12,160 --> 00:26:14,480
We need some way to deal with this problem.
520
00:26:14,480 --> 00:26:16,320
And the way that we can deal with this problem
521
00:26:16,320 --> 00:26:20,000
is by somehow keeping track of what we've already explored.
522
00:26:20,000 --> 00:26:23,440
And the logic is going to be, well, if we've already explored the state,
523
00:26:23,440 --> 00:26:25,040
there's no reason to go back to it.
524
00:26:25,040 --> 00:26:27,120
Once we've explored a state, don't go back to it.
525
00:26:27,120 --> 00:26:29,360
Don't bother adding it to the frontier.
526
00:26:29,360 --> 00:26:31,040
There's no need to.
527
00:26:31,040 --> 00:26:33,920
So here's going to be our revised approach, a better way
528
00:26:33,920 --> 00:26:35,920
to approach this sort of search problem.
529
00:26:35,920 --> 00:26:39,520
And it's going to look very similar, just with a couple of modifications.
530
00:26:39,520 --> 00:26:43,600
We'll start with a frontier that contains the initial state, same as before.
531
00:26:43,600 --> 00:26:46,960
But now we'll start with another data structure, which
532
00:26:46,960 --> 00:26:49,840
will just be a set of nodes that we've already explored.
533
00:26:49,840 --> 00:26:51,360
So what are the states we've explored?
534
00:26:51,360 --> 00:26:52,720
Initially, it's empty.
535
00:26:52,720 --> 00:26:55,200
We have an empty explored set.
536
00:26:55,200 --> 00:26:57,040
And now we repeat.
537
00:26:57,040 --> 00:27:00,080
If the frontier is empty, no solution, same as before.
538
00:27:00,080 --> 00:27:02,000
We remove a node from the frontier.
539
00:27:02,000 --> 00:27:04,240
We check to see if it's a goal state, return the solution.
540
00:27:04,240 --> 00:27:06,400
None of this is any different so far.
541
00:27:06,400 --> 00:27:09,760
But now what we're going to do is we're going to add the node
542
00:27:09,760 --> 00:27:11,520
to the explored set.
543
00:27:11,520 --> 00:27:15,440
So if it happens to be the case that we remove a node from the frontier
544
00:27:15,440 --> 00:27:18,400
and it's not the goal, we'll add it to the explored set
545
00:27:18,400 --> 00:27:19,920
so that we know we've already explored it.
546
00:27:19,920 --> 00:27:23,680
We don't need to go back to it again if it happens to come up later.
547
00:27:23,680 --> 00:27:26,160
And then the final step, we expand the node
548
00:27:26,160 --> 00:27:28,880
and we add the resulting nodes to the frontier.
549
00:27:28,880 --> 00:27:31,680
But before, we just always added the resulting nodes to the frontier.
550
00:27:31,680 --> 00:27:34,000
We're going to be a little clever about it this time.
551
00:27:34,000 --> 00:27:36,640
We're only going to add the nodes to the frontier
552
00:27:36,640 --> 00:27:40,880
if they aren't already in the frontier and if they aren't already
553
00:27:40,880 --> 00:27:42,640
in the explored set.
554
00:27:42,640 --> 00:27:45,040
So we'll check both the frontier and the explored set,
555
00:27:45,040 --> 00:27:48,240
make sure that the node isn't already in one of those two.
556
00:27:48,240 --> 00:27:51,440
And so long as it isn't, then we'll go ahead and add it to the frontier,
557
00:27:51,440 --> 00:27:53,280
but not otherwise.
558
00:27:53,280 --> 00:27:55,120
And so that revised approach is ultimately
559
00:27:55,120 --> 00:27:58,640
what's going to help make sure that we don't go back and forth between two
560
00:27:58,640 --> 00:28:00,160
nodes.
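[The revised approach, with the explored set and the double membership check, might be sketched like this. The adjacency map (which now includes the reverse arrow from B back to A) and the names are assumptions; a real implementation would track full nodes rather than bare states.]

```python
# Hypothetical adjacency map with arrows in both directions between A and B.
GRAPH = {"A": ["B"], "B": ["A", "C", "D"], "C": ["E"], "D": ["F"], "E": [], "F": []}

def search(start, goal):
    frontier = [start]       # frontier starts with the initial state
    explored = set()         # the states we have already explored
    while frontier:          # empty frontier means there is no solution
        state = frontier.pop()
        if state == goal:
            return state
        explored.add(state)  # remember that we've explored this state
        for nxt in GRAPH[state]:
            # only add nodes that aren't already in the frontier
            # and aren't already in the explored set
            if nxt not in frontier and nxt not in explored:
                frontier.append(nxt)
    return None

print(search("A", "E"))  # E -- terminates despite the A <-> B cycle
```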
561
00:28:00,160 --> 00:28:02,800
Now, the one point that I've kind of glossed over here so far
562
00:28:02,800 --> 00:28:06,480
is this step here, removing a node from the frontier.
563
00:28:06,480 --> 00:28:08,080
Before, I just chose arbitrarily.
564
00:28:08,080 --> 00:28:10,400
Like, let's just remove a node and that's it.
565
00:28:10,400 --> 00:28:12,800
But it turns out it's actually quite important how
566
00:28:12,800 --> 00:28:17,520
we decide to structure our frontier, how we add and how we remove our nodes.
567
00:28:17,520 --> 00:28:19,440
The frontier is a data structure and we need
568
00:28:19,440 --> 00:28:21,760
to make a choice about in what order are we
569
00:28:21,760 --> 00:28:23,760
going to be removing elements.
570
00:28:23,760 --> 00:28:27,280
And one of the simplest data structures for adding and removing elements
571
00:28:27,280 --> 00:28:28,800
is something called a stack.
572
00:28:28,800 --> 00:28:33,760
And a stack is a data structure that is a last in, first out data type, which
573
00:28:33,760 --> 00:28:36,560
means the last thing that I add to the frontier
574
00:28:36,560 --> 00:28:40,400
is going to be the first thing that I remove from the frontier.
575
00:28:40,400 --> 00:28:44,320
So the most recent thing to go into the stack or the frontier in this case
576
00:28:44,320 --> 00:28:47,280
is going to be the node that I explore.
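[A stack-based frontier along these lines might look like this. A minimal sketch in the spirit of the frontier the lecture describes; the class and method names are assumptions.]

```python
class StackFrontier:
    """Last-in, first-out frontier: the most recently added node comes out first."""
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)

    def empty(self):
        return len(self.nodes) == 0

    def remove(self):
        return self.nodes.pop()  # pop from the end: last in, first out

f = StackFrontier()
for s in ["A", "B", "C"]:
    f.add(s)
print(f.remove())  # C -- the last node added is the first one removed
```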
577
00:28:47,280 --> 00:28:51,280
So let's see what happens if I apply this stack-based approach to something
578
00:28:51,280 --> 00:28:56,480
like this problem, finding a path from A to E. What's going to happen?
579
00:28:56,480 --> 00:28:58,960
Well, again, we'll start with A and we'll say, all right,
580
00:28:58,960 --> 00:29:00,640
let's go ahead and look at A first.
581
00:29:00,640 --> 00:29:04,720
And then notice this time, we've added A to the explored set.
582
00:29:04,720 --> 00:29:06,240
A is something we've now explored.
583
00:29:06,240 --> 00:29:09,040
We have this data structure that's keeping track.
584
00:29:09,040 --> 00:29:13,680
We then say from A, we can get to B. And all right, from B, what can we do?
585
00:29:13,680 --> 00:29:17,840
Well, from B, we can explore B and get to both C and D.
586
00:29:17,840 --> 00:29:21,200
So we added C and then D. So now,
587
00:29:21,200 --> 00:29:24,400
when we explore a node, we're going to treat the frontier as a stack,
588
00:29:24,400 --> 00:29:26,000
last in, first out.
589
00:29:26,000 --> 00:29:27,760
D was the last one to come in.
590
00:29:27,760 --> 00:29:30,560
So we'll go ahead and explore that next and say, all right,
591
00:29:30,560 --> 00:29:32,000
where can we get to from D?
592
00:29:32,000 --> 00:29:36,720
Well, we can get to F. And so all right, we'll put F into the frontier.
593
00:29:36,720 --> 00:29:39,040
And now, because the frontier is a stack,
594
00:29:39,040 --> 00:29:42,080
F is the most recent thing that's gone in the stack.
595
00:29:42,080 --> 00:29:43,600
So F is what we'll explore next.
596
00:29:43,600 --> 00:29:47,200
We'll explore F and say, all right, where can we get to from F?
597
00:29:47,200 --> 00:29:50,400
Well, we can't get anywhere, so nothing gets added to the frontier.
598
00:29:50,400 --> 00:29:53,280
So now, what was the new most recent thing added to the frontier?
599
00:29:53,280 --> 00:29:55,920
Well, it's now C, the only thing left in the frontier.
600
00:29:55,920 --> 00:29:59,600
We'll explore that from which we can see, all right, from C, we can get to E.
601
00:29:59,600 --> 00:30:01,280
So E goes into the frontier.
602
00:30:01,280 --> 00:30:04,560
And then we say, all right, let's look at E. And E is now the solution.
603
00:30:04,560 --> 00:30:07,120
And now, we've solved the problem.
604
00:30:07,120 --> 00:30:10,080
So when we treat the frontier like a stack, a last in,
605
00:30:10,080 --> 00:30:13,120
first out data structure, that's the result we get.
606
00:30:13,120 --> 00:30:18,880
We go from A to B to D to F. And then we sort of backed up and went down to C
607
00:30:18,880 --> 00:30:19,760
and then E.
608
00:30:19,760 --> 00:30:23,200
And it's important to get a visual sense for how this algorithm is working.
609
00:30:23,200 --> 00:30:25,840
We went very deep in this search tree, so to speak,
610
00:30:25,840 --> 00:30:28,480
all the way until the bottom where we hit a dead end.
611
00:30:28,480 --> 00:30:32,080
And then we effectively backed up and explored this other route
612
00:30:32,080 --> 00:30:33,520
that we didn't try before.
613
00:30:33,520 --> 00:30:36,400
And it's this going-very-deep-in-the-search-tree idea,
614
00:36:36,400 --> 00:36:39,920
the way the algorithm ends up working when we use a stack,
615
00:36:39,920 --> 00:36:44,000
that gives this version of the algorithm its name: depth first search.
616
00:30:44,000 --> 00:30:46,160
Depth first search is the search algorithm
617
00:30:46,160 --> 00:30:49,680
where we always explore the deepest node in the frontier.
618
00:30:49,680 --> 00:30:52,800
We keep going deeper and deeper through our search tree.
619
00:30:52,800 --> 00:30:57,520
And then if we hit a dead end, we back up and we try something else instead.
620
00:30:57,520 --> 00:31:00,560
But depth first search is just one of the possible search options
621
00:31:00,560 --> 00:31:01,600
that we could use.
622
00:31:01,600 --> 00:31:05,200
It turns out that there's another algorithm called breadth first search,
623
00:31:05,200 --> 00:31:08,880
which behaves very similarly to depth first search with one difference.
624
00:31:08,880 --> 00:31:12,400
Instead of always exploring the deepest node in the search tree,
625
00:31:12,400 --> 00:31:14,800
the way the depth first search does, breadth first search
626
00:31:14,800 --> 00:31:19,040
is always going to explore the shallowest node in the frontier.
627
00:31:19,040 --> 00:31:20,080
So what does that mean?
628
00:31:20,080 --> 00:31:24,480
Well, it means that instead of using a stack which depth first search or DFS
629
00:31:24,480 --> 00:31:27,520
used, where the most recent item added to the frontier
630
00:31:27,520 --> 00:31:32,160
is the one we'll explore next, in breadth first search or BFS,
631
00:31:32,160 --> 00:31:37,440
we'll instead use a queue, where a queue is a first in, first out data type,
632
00:31:37,440 --> 00:31:39,760
where the very first thing we add to the frontier
633
00:31:39,760 --> 00:31:43,840
is the first one we'll explore and they effectively form a line or a queue,
634
00:31:43,840 --> 00:31:49,040
where the earlier you arrive in the frontier, the earlier you get explored.
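[A queue-based frontier for BFS can be sketched the same way, swapping the removal order. Again a minimal illustration with assumed names; `collections.deque` gives an efficient pop from the front.]

```python
from collections import deque

class QueueFrontier:
    """First-in, first-out frontier: the earliest added node comes out first."""
    def __init__(self):
        self.nodes = deque()

    def add(self, node):
        self.nodes.append(node)

    def empty(self):
        return len(self.nodes) == 0

    def remove(self):
        return self.nodes.popleft()  # pop from the front: first in, first out

f = QueueFrontier()
for s in ["A", "B", "C"]:
    f.add(s)
print(f.remove())  # A -- the earliest node added is the first one removed
```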
635
00:31:49,040 --> 00:31:51,440
So what would that mean for the same exact problem,
636
00:31:51,440 --> 00:31:53,760
finding a path from A to E?
637
00:31:53,760 --> 00:31:57,680
Well, we start with A, same as before, then we'll go ahead and have explored A
638
00:31:57,680 --> 00:31:59,200
and say, where can we get to from A?
639
00:31:59,200 --> 00:32:01,920
Well, from A, we can get to B, same as before.
640
00:32:01,920 --> 00:32:04,480
From B, same as before, we can get to C and D.
641
00:32:04,480 --> 00:32:06,800
So C and D get added to the frontier.
642
00:32:06,800 --> 00:32:10,480
This time, though, we added C to the frontier before D.
643
00:32:10,480 --> 00:32:12,480
So we'll explore C first.
644
00:32:12,480 --> 00:32:14,160
So C gets explored.
645
00:32:14,160 --> 00:32:16,000
And from C, where can we get to?
646
00:32:16,000 --> 00:32:19,520
Well, we can get to E. So E gets added to the frontier.
647
00:32:19,520 --> 00:32:24,080
But because D was added to the frontier before E, we'll look at D next.
648
00:32:24,080 --> 00:32:26,400
So we'll explore D and say, where can we get to from D?
649
00:32:26,400 --> 00:32:31,440
We can get to F. And only then will we say, all right, now we can get to E.
650
00:32:31,440 --> 00:32:35,360
And so what breadth first search or BFS did is we started here,
651
00:32:35,360 --> 00:32:39,440
we looked at both C and D, and then we looked at E.
652
00:32:39,440 --> 00:32:42,640
Effectively, we're looking at things one away from the initial state,
653
00:32:42,640 --> 00:32:45,680
then two away from the initial state, and only then,
654
00:32:45,680 --> 00:32:49,760
things that are three away from the initial state, unlike depth first search,
655
00:32:49,760 --> 00:32:53,040
which just went as deep as possible into the search tree
656
00:32:53,040 --> 00:32:56,000
until it hit a dead end and then ultimately had to back up.
657
00:32:56,720 --> 00:32:59,200
So these now are two different search algorithms
658
00:32:59,200 --> 00:33:01,760
that we could apply in order to try and solve a problem.
659
00:33:01,760 --> 00:33:05,040
And let's take a look at how these would actually work in practice
660
00:33:05,040 --> 00:33:07,600
with something like maze solving, for example.
661
00:33:07,600 --> 00:33:09,200
So here's an example of a maze.
662
00:33:09,200 --> 00:33:12,400
These empty cells represent places where our agent can move.
663
00:33:12,400 --> 00:33:16,880
These darkened gray cells represent walls that the agent can't pass through.
664
00:33:16,880 --> 00:33:20,320
And ultimately, our agent, our AI, is going to try to find a way
665
00:33:20,320 --> 00:33:25,120
to get from position A to position B via some sequence of actions,
666
00:33:25,120 --> 00:33:28,000
where those actions are left, right, up, and down.
667
00:33:28,800 --> 00:33:31,200
What will depth first search do in this case?
668
00:33:31,200 --> 00:33:34,080
Well, depth first search will just follow one path.
669
00:33:34,080 --> 00:33:37,440
If it reaches a fork in the road where it has multiple different options,
670
00:33:37,440 --> 00:33:40,000
depth first search is just, in this case, going to choose one.
671
00:33:40,000 --> 00:33:41,360
It doesn't have a real preference.
672
00:33:41,360 --> 00:33:45,040
But it's going to keep following one until it hits a dead end.
673
00:33:45,040 --> 00:33:48,480
And when it hits a dead end, depth first search effectively
674
00:33:48,480 --> 00:33:52,240
goes back to the last decision point and tries the other path,
675
00:33:52,240 --> 00:33:54,240
fully exhausting this entire path.
676
00:33:54,240 --> 00:33:56,720
And when it realizes that, OK, the goal is not here,
677
00:33:56,720 --> 00:33:58,560
then it turns its attention to this path.
678
00:33:58,560 --> 00:34:00,400
It goes as deep as possible.
679
00:34:00,400 --> 00:34:04,000
When it hits a dead end, it backs up and then tries this other path,
680
00:34:04,000 --> 00:34:07,120
keeps going as deep as possible down one particular path.
681
00:34:07,120 --> 00:34:10,480
And when it realizes that that's a dead end, then it'll back up,
682
00:34:10,480 --> 00:34:13,200
and then ultimately find its way to the goal.
683
00:34:13,200 --> 00:34:16,800
And maybe you got lucky, and maybe you made a different choice earlier on.
684
00:34:16,800 --> 00:34:19,680
But ultimately, this is how depth first search is going to work.
685
00:34:19,680 --> 00:34:22,000
It's going to keep following until it hits a dead end.
686
00:34:22,000 --> 00:34:26,160
And when it hits a dead end, it backs up and looks for a different solution.
687
00:34:26,160 --> 00:34:28,160
And so one thing you might reasonably ask is,
688
00:34:28,160 --> 00:34:30,160
is this algorithm always going to work?
689
00:34:30,160 --> 00:34:33,440
Will it always actually find a way to get from the initial state
690
00:34:33,440 --> 00:34:34,480
to the goal?
691
00:34:34,480 --> 00:34:37,600
And it turns out that as long as our maze is finite,
692
00:34:37,600 --> 00:34:40,720
as long as there are only finitely many spaces where we can travel,
693
00:34:40,720 --> 00:34:44,000
then, yes, depth first search is going to find a solution.
694
00:34:44,000 --> 00:34:46,480
Because eventually, it'll just explore everything.
695
00:34:46,480 --> 00:34:49,600
If the maze happens to be infinite and there's an infinite state space,
696
00:34:49,600 --> 00:34:51,840
which does exist in certain types of problems,
697
00:34:51,840 --> 00:34:53,440
then it's a slightly different story.
698
00:34:53,440 --> 00:34:56,160
But as long as our maze has finitely many squares,
699
00:34:56,160 --> 00:34:58,400
we're going to find a solution.
700
00:34:58,400 --> 00:35:00,800
The next question, though, that we want to ask is,
701
00:35:00,800 --> 00:35:02,400
is it going to be a good solution?
702
00:35:02,400 --> 00:35:05,200
Is it the optimal solution that we can find?
703
00:35:05,200 --> 00:35:07,680
And the answer there is not necessarily.
704
00:35:07,680 --> 00:35:09,520
And let's take a look at an example of that.
705
00:35:09,520 --> 00:35:14,320
In this maze, for example, we're again trying to find our way from A to B.
706
00:35:14,320 --> 00:35:16,960
And you notice here there are multiple possible solutions.
707
00:35:16,960 --> 00:35:21,680
We could go this way or we could go up in order to make our way from A to B.
708
00:35:21,680 --> 00:35:25,600
Now, if we're lucky, depth first search will choose this way and get to B.
709
00:35:25,600 --> 00:35:28,080
But there's no reason necessarily why depth first search
710
00:35:28,080 --> 00:35:30,880
would choose between going up or going to the right.
711
00:35:30,880 --> 00:35:33,680
It's sort of an arbitrary decision point because both
712
00:35:33,680 --> 00:35:35,840
are going to be added to the frontier.
713
00:35:35,840 --> 00:35:38,720
And ultimately, if we get unlucky, depth first search
714
00:35:38,720 --> 00:35:42,000
might choose to explore this path first because it's just a random choice
715
00:35:42,000 --> 00:35:42,880
at this point.
716
00:35:42,880 --> 00:35:45,280
It'll explore, explore, explore.
717
00:35:45,280 --> 00:35:48,560
And it'll eventually find the goal, this particular path,
718
00:35:48,560 --> 00:35:50,480
when in actuality there was a better path.
719
00:35:50,480 --> 00:35:54,400
There was a more optimal solution that used fewer steps,
720
00:35:54,400 --> 00:35:58,000
assuming we're measuring the cost of a solution based on the number of steps
721
00:35:58,000 --> 00:35:59,280
that we need to take.
722
00:35:59,280 --> 00:36:01,520
So depth first search, if we're unlucky,
723
00:36:01,520 --> 00:36:05,360
might end up not finding the best solution when a better solution is
724
00:36:05,360 --> 00:36:07,200
available.
725
00:36:07,200 --> 00:36:09,680
So that's DFS, depth first search.
726
00:36:09,680 --> 00:36:12,720
How does BFS, or breadth first search, compare?
727
00:36:12,720 --> 00:36:14,960
How would it work in this particular situation?
728
00:36:14,960 --> 00:36:17,920
Well, the algorithm is going to look very different visually
729
00:36:17,920 --> 00:36:20,160
in terms of how BFS explores.
730
00:36:20,160 --> 00:36:24,640
Because BFS looks at shallower nodes first, the idea is going to be,
731
00:36:24,640 --> 00:36:29,600
BFS will first look at all of the nodes that are one away from the initial state.
732
00:36:29,600 --> 00:36:31,680
Look here and look here, for example, just
733
00:36:31,680 --> 00:36:36,000
at the two nodes that are immediately next to this initial state.
734
00:36:36,000 --> 00:36:37,840
Then it'll explore nodes that are two away,
735
00:36:37,840 --> 00:36:40,480
looking at this state and that state, for example.
736
00:36:40,480 --> 00:36:43,520
Then it'll explore nodes that are three away, this state and that state.
737
00:36:43,520 --> 00:36:47,600
Whereas depth first search just picked one path and kept following it,
738
00:36:47,600 --> 00:36:49,440
breadth first search, on the other hand,
739
00:36:49,440 --> 00:36:52,960
is exploring all of the possible paths kind of at the same time,
740
00:36:52,960 --> 00:36:56,160
bouncing back and forth between them,
741
00:36:56,160 --> 00:36:58,720
looking deeper and deeper at each one, but making sure
742
00:36:58,720 --> 00:37:01,360
to explore the shallower ones or the ones that
743
00:37:01,360 --> 00:37:04,080
are closer to the initial state earlier.
744
00:37:04,080 --> 00:37:07,200
So we'll keep following this pattern, looking at things that are four away,
745
00:37:07,200 --> 00:37:10,720
looking at things that are five away, looking at things that are six away,
746
00:37:10,720 --> 00:37:14,160
until eventually we make our way to the goal.
747
00:37:14,160 --> 00:37:17,520
And in this case, it's true we had to explore some states that ultimately
748
00:37:17,520 --> 00:37:20,960
didn't lead us anywhere, but the path that we found to the goal
749
00:37:20,960 --> 00:37:22,200
was the optimal path.
750
00:37:22,200 --> 00:37:25,880
This is the shortest way that we could get to the goal.
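The guarantee being described here can be seen in a small sketch (this is illustrative code, not the lecture's; the graph and names are made up): because breadth-first search uses a first-in-first-out queue, paths come off the frontier in order of length, so the first path that reaches the goal is a shortest one.

```python
from collections import deque

def bfs_path(graph, start, goal):
    """Breadth-first search over an adjacency-list graph.

    The frontier is a FIFO queue of whole paths, so paths are explored
    shortest-first, and the first path reaching the goal is optimal.
    """
    frontier = deque([[start]])   # each entry is a full path from start
    explored = set()
    while frontier:
        path = frontier.popleft()             # first in, first out
        node = path[-1]
        if node == goal:
            return path
        if node in explored:
            continue
        explored.add(node)
        for neighbor in graph.get(node, []):
            frontier.append(path + [neighbor])
    return None                               # no path exists

# Two routes from A to B: a short one via C, a longer one via D and E.
graph = {"A": ["D", "C"], "C": ["B"], "D": ["E"], "E": ["B"]}
print(bfs_path(graph, "A", "B"))  # ['A', 'C', 'B'] -- the shorter route
```

Depth-first search on the same graph could just as easily commit to the A-D-E-B route first, which is exactly the non-optimality discussed above.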
751
00:37:25,880 --> 00:37:28,880
And so what might happen then in a larger maze?
752
00:37:28,880 --> 00:37:30,840
Well, let's take a look at something like this
753
00:37:30,840 --> 00:37:32,880
and how breadth first search is going to behave.
754
00:37:32,880 --> 00:37:35,800
Well, breadth first search, again, will just keep following the states
755
00:37:35,800 --> 00:37:37,280
until it reaches a decision point.
756
00:37:37,280 --> 00:37:39,480
It could go either left or right.
757
00:37:39,480 --> 00:37:44,600
And while DFS just picked one and kept following that until it hit a dead end,
758
00:37:44,600 --> 00:37:47,580
BFS, on the other hand, will explore both.
759
00:37:47,580 --> 00:37:50,000
It'll say look at this node, then this node,
760
00:37:50,000 --> 00:37:52,080
and it'll look at this node, then that node.
761
00:37:52,080 --> 00:37:53,440
So on and so forth.
762
00:37:53,440 --> 00:37:57,280
And when it hits a decision point here, rather than pick one, left or
763
00:37:57,280 --> 00:38:01,040
right, and explore that path, it'll again explore both,
764
00:38:01,040 --> 00:38:03,280
alternating between them, going deeper and deeper.
765
00:38:03,280 --> 00:38:07,600
We'll explore here, and then maybe here and here, and then keep going.
766
00:38:07,600 --> 00:38:10,800
Explore here and slowly make our way, you can visually
767
00:38:10,800 --> 00:38:12,840
see, further and further out.
768
00:38:12,840 --> 00:38:16,520
Once we get to this decision point, we'll explore both up and down
769
00:38:16,520 --> 00:38:21,600
until ultimately we make our way to the goal.
770
00:38:21,600 --> 00:38:24,240
And what you'll notice is, yes, breadth first search
771
00:38:24,240 --> 00:38:28,640
did find our way from A to B by following this particular path,
772
00:38:28,640 --> 00:38:32,320
but it needed to explore a lot of states in order to do so.
773
00:38:32,320 --> 00:38:35,440
And so we see some trade offs here between DFS and BFS,
774
00:38:35,440 --> 00:38:39,440
that in DFS, there may be some cases where there are memory savings
775
00:38:39,440 --> 00:38:43,760
as compared to a breadth first approach, where breadth first search in this case
776
00:38:43,760 --> 00:38:45,240
had to explore a lot of states.
777
00:38:45,240 --> 00:38:48,480
But maybe that won't always be the case.
778
00:38:48,480 --> 00:38:51,280
So now let's actually turn our attention to some code
779
00:38:51,280 --> 00:38:52,940
and look at the code that we could actually
780
00:38:52,940 --> 00:38:56,400
write in order to implement something like depth first search or breadth
781
00:38:56,400 --> 00:39:01,000
first search in the context of solving a maze, for example.
782
00:39:01,000 --> 00:39:03,360
So I'll go ahead and go into my terminal.
783
00:39:03,360 --> 00:39:07,280
And what I have here inside of maze.py is an implementation
784
00:39:07,280 --> 00:39:09,640
of this same idea of maze solving.
785
00:39:09,640 --> 00:39:12,680
I've defined a class called node that in this case
786
00:39:12,680 --> 00:39:15,520
is keeping track of the state, the parent, in other words,
787
00:39:15,520 --> 00:39:17,960
the state before the state, and the action.
788
00:39:17,960 --> 00:39:20,120
In this case, we're not keeping track of the path cost
789
00:39:20,120 --> 00:39:22,800
because we can calculate the cost of the path at the end
790
00:39:22,800 --> 00:39:26,920
after we found our way from the initial state to the goal.
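The node class as just described is small; here's a sketch consistent with that description (the attribute names are my reading of the lecture's code, not guaranteed verbatim):

```python
class Node:
    """One node in the search tree: a state, the parent node that
    produced it, and the action taken to get here. No path cost is
    stored; it can be recovered by walking parents back to the start."""

    def __init__(self, state, parent, action):
        self.state = state
        self.parent = parent
        self.action = action

# Example: the start node has no parent and no action.
start = Node(state=(0, 0), parent=None, action=None)
child = Node(state=(0, 1), parent=start, action="right")
print(child.parent.state)  # (0, 0)
```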
791
00:39:26,920 --> 00:39:31,560
In addition to this, I've defined a class called a stack frontier.
792
00:39:31,560 --> 00:39:34,800
And if you're unfamiliar with classes, a class is a way for me
793
00:39:34,800 --> 00:39:37,960
to define a way to generate objects in Python.
794
00:39:37,960 --> 00:39:42,080
It refers to an idea of object oriented programming, where the idea here
795
00:39:42,080 --> 00:39:44,760
is that I would like to create an object that is
796
00:39:44,760 --> 00:39:46,960
able to store all of my frontier data.
797
00:39:46,960 --> 00:39:49,040
And I would like to have functions, otherwise known
798
00:39:49,040 --> 00:39:53,400
as methods, on that object that I can use to manipulate the object.
799
00:39:53,400 --> 00:39:57,120
And so what's going on here, if you're unfamiliar with the syntax,
800
00:39:57,120 --> 00:40:00,680
is I have a function that initially creates a frontier that I'm
801
00:40:00,680 --> 00:40:02,400
going to represent using a list.
802
00:40:02,400 --> 00:40:05,800
And initially, my frontier is represented by the empty list.
803
00:40:05,800 --> 00:40:08,840
There's nothing in my frontier to begin with.
804
00:40:08,840 --> 00:40:12,000
I have an add function that adds something to the frontier
805
00:40:12,000 --> 00:40:15,240
by appending it to the end of the list.
806
00:40:15,240 --> 00:40:17,880
I have a function that checks if the frontier contains
807
00:40:17,880 --> 00:40:19,400
a particular state.
808
00:40:19,400 --> 00:40:22,240
I have an empty function that checks if the frontier is empty.
809
00:40:22,240 --> 00:40:26,200
If the frontier is empty, that just means the length of the frontier is 0.
810
00:40:26,200 --> 00:40:29,560
And then I have a function for removing something from the frontier.
811
00:40:29,560 --> 00:40:32,240
I can't remove something from the frontier if the frontier is empty,
812
00:40:32,240 --> 00:40:33,800
so I check for that first.
813
00:40:33,800 --> 00:40:36,720
But otherwise, if the frontier isn't empty,
814
00:40:36,720 --> 00:40:41,680
recall that I'm implementing this frontier as a stack, a last in first
815
00:40:41,680 --> 00:40:45,640
out data structure, which means the last thing I add to the frontier,
816
00:40:45,640 --> 00:40:48,600
in other words, the last thing in the list, is the item
817
00:40:48,600 --> 00:40:51,880
that I should remove from this frontier.
818
00:40:51,880 --> 00:40:56,200
So what you'll see here is I remove the last item of the list.
819
00:40:56,200 --> 00:40:59,400
And if you index into a Python list with negative 1,
820
00:40:59,400 --> 00:41:01,080
that gets you the last item in the list.
821
00:41:01,080 --> 00:41:04,600
Since 0 is the first item, negative 1 kind of wraps around
822
00:41:04,600 --> 00:41:07,400
and gets you to the last item in the list.
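A quick illustration of that indexing behavior:

```python
items = ["first", "middle", "last"]
print(items[0])    # "first" -- index 0 is the first item
print(items[-1])   # "last"  -- -1 wraps around to the last item
print(items[:-1])  # ["first", "middle"] -- everything except the last
```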
823
00:41:07,400 --> 00:41:09,320
So we take that last item.
824
00:41:09,320 --> 00:41:10,360
We call it node.
825
00:41:10,360 --> 00:41:12,640
We update the frontier here on line 28 to say,
826
00:41:12,640 --> 00:41:16,040
go ahead and remove that node from the frontier.
827
00:41:16,040 --> 00:41:18,720
And then we return the node as a result.
828
00:41:18,720 --> 00:41:23,080
So this class here effectively implements the idea of a frontier.
829
00:41:23,080 --> 00:41:25,400
It gives me a way to add something to a frontier
830
00:41:25,400 --> 00:41:29,440
and a way to remove something from the frontier as a stack.
831
00:41:29,440 --> 00:41:31,960
I've also, just for good measure, implemented
832
00:41:31,960 --> 00:41:36,000
an alternative version of the same thing called a queue frontier, which
833
00:41:36,000 --> 00:41:39,200
in parentheses you'll see here, it inherits from a stack frontier,
834
00:41:39,200 --> 00:41:42,680
meaning it's going to do all the same things that the stack frontier did,
835
00:41:42,680 --> 00:41:45,560
except the way we remove a node from the frontier
836
00:41:45,560 --> 00:41:47,000
is going to be slightly different.
837
00:41:47,000 --> 00:41:50,480
Instead of removing from the end of the list the way we would in a stack,
838
00:41:50,480 --> 00:41:53,160
we're instead going to remove from the beginning of the list.
839
00:41:53,160 --> 00:41:58,080
self.frontier[0] will get me the first node in the frontier, the first one
840
00:41:58,080 --> 00:42:00,440
that was added, and that is going to be the one
841
00:42:00,440 --> 00:42:03,440
that we return in the case of a queue.
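Putting the two frontier classes together in one sketch (close to what's described here, though not guaranteed to match the lecture's file line for line):

```python
class StackFrontier:
    """A frontier backed by a Python list, removing from the end:
    last in, first out, which is what depth-first search uses."""

    def __init__(self):
        self.frontier = []          # empty to begin with

    def add(self, node):
        self.frontier.append(node)

    def contains_state(self, state):
        return any(node.state == state for node in self.frontier)

    def empty(self):
        return len(self.frontier) == 0

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        node = self.frontier[-1]    # -1 wraps around to the last item
        self.frontier = self.frontier[:-1]
        return node


class QueueFrontier(StackFrontier):
    """Inherits everything from StackFrontier, but removes from the
    beginning of the list: first in, first out, for breadth-first search."""

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        node = self.frontier[0]     # the first node that was added
        self.frontier = self.frontier[1:]
        return node


# A minimal stand-in node with just a state attribute, for demonstration.
class _Node:
    def __init__(self, state):
        self.state = state

stack, queue = StackFrontier(), QueueFrontier()
for s in (1, 2, 3):
    stack.add(_Node(s))
    queue.add(_Node(s))
print(stack.remove().state)  # 3 -- last in, first out
print(queue.remove().state)  # 1 -- first in, first out
```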
842
00:42:03,440 --> 00:42:06,360
Then under here, I have a definition of a class called maze.
843
00:42:06,360 --> 00:42:11,080
This is going to handle the process of taking a maze-like text
844
00:42:11,080 --> 00:42:13,360
file, and figuring out how to solve it.
845
00:42:13,360 --> 00:42:16,960
So it will take as input a text file that looks something like this,
846
00:42:16,960 --> 00:42:20,720
for example, where we see hash marks that are here representing walls,
847
00:42:20,720 --> 00:42:23,880
and I have the character A representing the starting position
848
00:42:23,880 --> 00:42:27,840
and the character B representing the ending position.
849
00:42:27,840 --> 00:42:30,840
And you can take a look at the code for parsing this text file right now.
850
00:42:30,840 --> 00:42:32,360
That's the less interesting part.
851
00:42:32,360 --> 00:42:35,440
The more interesting part is this solve function here,
852
00:42:35,440 --> 00:42:37,440
which is going to figure out
853
00:42:37,440 --> 00:42:41,160
how to actually get from point A to point B.
854
00:42:41,160 --> 00:42:44,160
And here we see an implementation of the exact same idea
855
00:42:44,160 --> 00:42:45,800
we saw from a moment ago.
856
00:42:45,800 --> 00:42:48,240
We're going to keep track of how many states we've explored,
857
00:42:48,240 --> 00:42:50,440
just so we can report that data later.
858
00:42:50,440 --> 00:42:55,680
But I start with a node that represents just the start state.
859
00:42:55,680 --> 00:43:00,000
And I start with a frontier that, in this case, is a stack frontier.
860
00:43:00,000 --> 00:43:02,000
And given that I'm treating my frontier as a stack,
861
00:43:02,000 --> 00:43:06,160
you might imagine that the algorithm I'm using here is now depth-first search,
862
00:43:06,160 --> 00:43:11,120
because depth-first search, or DFS, uses a stack as its data structure.
863
00:43:11,120 --> 00:43:16,320
And initially, this frontier is just going to contain the start state.
864
00:43:16,320 --> 00:43:19,280
We initialize an explored set that initially is empty.
865
00:43:19,280 --> 00:43:21,320
There's nothing we've explored so far.
866
00:43:21,320 --> 00:43:25,920
And now here's our loop, that notion of repeating something again and again.
867
00:43:25,920 --> 00:43:29,560
First, we check if the frontier is empty by calling that empty function
868
00:43:29,560 --> 00:43:31,800
that we saw the implementation of a moment ago.
869
00:43:31,800 --> 00:43:34,080
And if the frontier is indeed empty, we'll
870
00:43:34,080 --> 00:43:37,040
go ahead and raise an exception, or a Python error, to say,
871
00:43:37,040 --> 00:43:41,040
sorry, there is no solution to this problem.
872
00:43:41,040 --> 00:43:44,520
Otherwise, we'll go ahead and remove a node from the frontier
873
00:43:44,520 --> 00:43:48,920
by calling frontier.remove, and update the number of states we've explored,
874
00:43:48,920 --> 00:43:51,400
because now we've explored one additional state.
875
00:43:51,400 --> 00:43:55,240
So we say self.num_explored plus equals 1, adding 1
876
00:43:55,240 --> 00:43:57,800
to the number of states we've explored.
877
00:43:57,800 --> 00:44:00,080
Once we remove a node from the frontier,
878
00:44:00,080 --> 00:44:02,360
recall that the next step is to see whether or not
879
00:44:02,360 --> 00:44:04,320
it's the goal, the goal test.
880
00:44:04,320 --> 00:44:06,840
And in the case of the maze, the goal is pretty easy.
881
00:44:06,840 --> 00:44:11,080
I check to see whether the state of the node is equal to the goal.
882
00:44:11,080 --> 00:44:13,080
Initially, when I set up the maze, I set up
883
00:44:13,080 --> 00:44:15,760
this value called goal, which is a property of the maze,
884
00:44:15,760 --> 00:44:19,280
so I can just check to see if the node is actually the goal.
885
00:44:19,280 --> 00:44:22,040
And if it is the goal, then what I want to do
886
00:44:22,040 --> 00:44:26,400
is backtrack my way towards figuring out what actions I took in order
887
00:44:26,400 --> 00:44:28,360
to get to this goal.
888
00:44:28,360 --> 00:44:29,440
And how do I do that?
889
00:44:29,440 --> 00:44:33,400
Well, recall that every node stores its parent, the node that came before it
890
00:44:33,400 --> 00:44:37,000
that we used to get to this node, and also the action used in order to get
891
00:44:37,000 --> 00:44:37,680
there.
892
00:44:37,680 --> 00:44:40,800
So I can create this loop where I'm constantly just looking
893
00:44:40,800 --> 00:44:44,480
at the parent of every node and keeping track for all of the parents
894
00:44:44,480 --> 00:44:47,920
what action I took to get from the parent to this current node.
895
00:44:47,920 --> 00:44:50,280
So this loop is going to keep repeating this process
896
00:44:50,280 --> 00:44:52,400
of looking through all of the parent nodes
897
00:44:52,400 --> 00:44:54,680
until we get back to the initial state, which
898
00:44:54,680 --> 00:44:59,080
has no parent, where node.parent is going to be equal to none.
899
00:44:59,080 --> 00:45:01,960
As I do so, I'm going to be building up the list of all of the actions
900
00:45:01,960 --> 00:45:05,600
that I'm following and the list of all the cells that are part of the solution.
901
00:45:05,600 --> 00:45:08,240
But I'll reverse them, because when I build it up,
902
00:45:08,240 --> 00:45:10,960
I'm going from the goal back to the initial state,
903
00:45:10,960 --> 00:45:14,040
building the sequence of actions from the goal to the initial state,
904
00:45:14,040 --> 00:45:16,920
and I want to reverse them in order to get the sequence of actions
905
00:45:16,920 --> 00:45:19,640
from the initial state to the goal.
906
00:45:19,640 --> 00:45:23,400
And that is ultimately going to be the solution.
907
00:45:23,400 --> 00:45:27,320
So all of that happens if the current state is equal to the goal.
908
00:45:27,320 --> 00:45:29,320
And otherwise, if it's not the goal, well,
909
00:45:29,320 --> 00:45:32,920
then I'll go ahead and add this state to the explored set to say,
910
00:45:32,920 --> 00:45:34,280
I've explored this state now.
911
00:45:34,280 --> 00:45:37,520
No need to go back to it if I come across it in the future.
912
00:45:37,520 --> 00:45:42,840
And then this logic here implements the idea of adding neighbors to the frontier.
913
00:45:42,840 --> 00:45:44,840
I'm saying, look at all of my neighbors, and I
914
00:45:44,840 --> 00:45:47,560
implemented a function called neighbors that you can take a look at.
915
00:45:47,560 --> 00:45:49,720
And for each of those neighbors, I'm going to check,
916
00:45:49,720 --> 00:45:51,880
is the state already in the frontier?
917
00:45:51,880 --> 00:45:54,440
Is the state already in the explored set?
918
00:45:54,440 --> 00:45:58,600
And if it's not in either of those, then I'll go ahead and add this new child
919
00:45:58,600 --> 00:46:01,320
node, this new node, to the frontier.
920
00:46:01,320 --> 00:46:03,040
So there's a fair amount of syntax here,
921
00:46:03,040 --> 00:46:05,960
but the key here is not to understand all the nuances of the syntax.
922
00:46:05,960 --> 00:46:08,760
So feel free to take a closer look at this file on your own
923
00:46:08,760 --> 00:46:10,640
to get a sense for how it is working.
924
00:46:10,640 --> 00:46:13,120
But the key is to see how this is an implementation
925
00:46:13,120 --> 00:46:16,960
of the same pseudocode, the same idea that we were describing a moment ago
926
00:46:16,960 --> 00:46:19,880
on the screen when we were looking at the steps
927
00:46:19,880 --> 00:46:23,640
that we might follow in order to solve this kind of search problem.
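The whole loop just described can be condensed into a sketch (a simplification, not maze.py verbatim: a deque stands in for the frontier classes, nodes are plain tuples, and the frontier-membership check is folded into the explored-set check):

```python
from collections import deque

def solve(start, goal, neighbors, use_queue=True):
    """Generic search loop: use_queue=True removes from the front (BFS),
    False removes from the back (DFS). neighbors(state) yields
    (action, next_state) pairs. Returns the list of actions taken."""
    frontier = deque([(start, None, None)])   # (state, parent_node, action)
    explored = set()
    while frontier:
        node = frontier.popleft() if use_queue else frontier.pop()
        state = node[0]
        if state == goal:
            # Backtrack through parents, then reverse: start -> goal.
            actions = []
            while node[1] is not None:
                actions.append(node[2])
                node = node[1]
            actions.reverse()
            return actions
        if state in explored:
            continue
        explored.add(state)
        for action, next_state in neighbors(state):
            if next_state not in explored:
                frontier.append((next_state, node, action))
    raise Exception("no solution")

# Tiny example: a short route (right, right) and a longer one (up, up, right).
edges = {"A": [("right", "C"), ("up", "D")],
         "C": [("right", "B")],
         "D": [("up", "E")],
         "E": [("right", "B")]}
print(solve("A", "B", lambda s: edges.get(s, []), use_queue=True))
```

Flipping use_queue to False turns the same loop into depth-first search, which on this example commits to the longer route first; that one-flag swap mirrors swapping StackFrontier for QueueFrontier in the lecture's code.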
928
00:46:23,640 --> 00:46:25,560
So now let's actually see this in action.
929
00:46:25,560 --> 00:46:31,560
I'll go ahead and run maze.py on maze1.txt, for example.
930
00:46:31,560 --> 00:46:34,200
And what we'll see is here, we have a printout
931
00:46:34,200 --> 00:46:36,400
of what the maze initially looked like.
932
00:46:36,400 --> 00:46:39,040
And then here down below is after we've solved it.
933
00:46:39,040 --> 00:46:41,480
We had to explore 11 states in order to do it,
934
00:46:41,480 --> 00:46:45,040
and we found a path from A to B. And in this program,
935
00:46:45,040 --> 00:46:48,160
I just happened to generate a graphical representation of this as well.
936
00:46:48,160 --> 00:46:50,440
So I can open up maze.png, which is generated
937
00:46:50,440 --> 00:46:54,840
by this program, that shows you where in the darker color here are the walls,
938
00:46:54,840 --> 00:46:56,880
red is the initial state, green is the goal,
939
00:46:56,880 --> 00:46:58,960
and yellow is the path that was followed.
940
00:46:58,960 --> 00:47:03,240
We found a path from the initial state to the goal.
941
00:47:03,240 --> 00:47:06,080
But now let's take a look at a more sophisticated maze
942
00:47:06,080 --> 00:47:08,160
to see what might happen instead.
943
00:47:08,160 --> 00:47:10,880
Let's look now at maze2.txt.
944
00:47:10,880 --> 00:47:11,760
We're now here.
945
00:47:11,760 --> 00:47:13,040
We have a much larger maze.
946
00:47:13,040 --> 00:47:16,320
Again, we're trying to find our way from point A to point B.
947
00:47:16,320 --> 00:47:19,560
But now you'll imagine that depth-first search might not be so lucky.
948
00:47:19,560 --> 00:47:22,040
It might not get the goal on the first try.
949
00:47:22,040 --> 00:47:26,000
It might have to follow one path, then backtrack and explore something else
950
00:47:26,000 --> 00:47:28,120
a little bit later.
951
00:47:28,120 --> 00:47:29,240
So let's try this.
952
00:47:29,240 --> 00:47:34,960
We'll run python maze.py on maze2.txt, this time trying this other maze.
953
00:47:34,960 --> 00:47:38,160
And now, depth-first search is able to find a solution.
954
00:47:38,160 --> 00:47:42,080
Here, as indicated by the stars, is a way to get from A to B.
955
00:47:42,080 --> 00:47:45,480
And we can represent this visually by opening up this maze.
956
00:47:45,480 --> 00:47:48,040
Here's what that maze looks like, and highlighted in yellow
957
00:47:48,040 --> 00:47:52,320
is the path that was found from the initial state to the goal.
958
00:47:52,320 --> 00:47:57,360
But how many states did we have to explore before we found that path?
959
00:47:57,360 --> 00:47:59,440
Well, recall that in my program, I was keeping
960
00:47:59,440 --> 00:48:02,560
track of the number of states that we've explored so far.
961
00:48:02,560 --> 00:48:05,920
And so I can go back to the terminal and see that, all right,
962
00:48:05,920 --> 00:48:12,040
in order to solve this problem, we had to explore 399 different states.
963
00:48:12,040 --> 00:48:14,880
And in fact, if I make one small modification of the program
964
00:48:14,880 --> 00:48:17,960
and tell the program at the end when we output this image,
965
00:48:17,960 --> 00:48:21,200
I added an argument called show_explored.
966
00:48:21,200 --> 00:48:26,000
And if I set show_explored equal to True and rerun this program,
967
00:48:26,000 --> 00:48:30,680
python maze.py, running it on maze2, and then I open the maze, what you'll see
968
00:48:30,680 --> 00:48:33,520
here is highlighted in red are all of the states
969
00:48:33,520 --> 00:48:37,560
that had to be explored to get from the initial state to the goal.
970
00:48:37,560 --> 00:48:41,200
Depth-first search, or DFS, didn't find its way to the goal right away.
971
00:48:41,200 --> 00:48:44,200
It made a choice to first explore this direction.
972
00:48:44,200 --> 00:48:46,040
And when it explored this direction, it had
973
00:48:46,040 --> 00:48:49,040
to follow every conceivable path all the way to the very end,
974
00:48:49,040 --> 00:48:52,400
even this long and winding one, in order to realize that, you know what?
975
00:48:52,400 --> 00:48:53,480
That's a dead end.
976
00:48:53,480 --> 00:48:55,720
And instead, the program needed to backtrack.
977
00:48:55,720 --> 00:48:58,680
After going this direction, it must have gone this direction.
978
00:48:58,680 --> 00:49:01,440
It got lucky here by just not choosing this path,
979
00:49:01,440 --> 00:49:05,360
but it got unlucky here, exploring this direction, exploring a bunch of states
980
00:49:05,360 --> 00:49:07,680
it didn't need to, and then likewise exploring
981
00:49:07,680 --> 00:49:10,000
all of this top part of the graph when it probably
982
00:49:10,000 --> 00:49:12,240
didn't need to do that either.
983
00:49:12,240 --> 00:49:16,720
So all in all, depth-first search here was really not performing optimally,
984
00:49:16,720 --> 00:49:19,000
probably exploring more states than it needed to.
985
00:49:19,000 --> 00:49:22,640
It finds an optimal solution, the best path to the goal,
986
00:49:22,640 --> 00:49:25,520
but the number of states needed to explore in order to do so,
987
00:49:25,520 --> 00:49:29,080
the number of steps I had to take, that was much higher.
988
00:49:29,080 --> 00:49:30,080
So let's compare.
989
00:49:30,080 --> 00:49:35,160
How would breadth-first search, or BFS, do on this exact same maze instead?
990
00:49:35,160 --> 00:49:37,640
And in order to do so, it's a very easy change.
991
00:49:37,640 --> 00:49:42,560
The algorithm for DFS and BFS is identical with the exception
992
00:49:42,560 --> 00:49:47,000
of what data structure we use to represent the frontier,
993
00:49:47,000 --> 00:49:51,560
that in DFS, I used a stack frontier, last in, first out,
994
00:49:51,560 --> 00:49:57,320
whereas in BFS, I'm going to use a queue frontier, first in, first out,
995
00:49:57,320 --> 00:50:00,640
where the first thing I add to the frontier is the first thing that I
996
00:50:00,640 --> 00:50:01,600
remove.
997
00:50:01,600 --> 00:50:06,680
So I'll go back to the terminal, rerun this program on the same maze,
998
00:50:06,680 --> 00:50:08,800
and now you'll see that the number of states
999
00:50:08,800 --> 00:50:13,200
we had to explore was only 77 as compared to almost 400
1000
00:50:13,200 --> 00:50:15,040
when we used depth-first search.
1001
00:50:15,040 --> 00:50:16,360
And we can see exactly why.
1002
00:50:16,360 --> 00:50:21,000
We can see what happened if we open up maze.png now and take a look.
1003
00:50:21,000 --> 00:50:25,560
Again, yellow highlight is the solution that breadth-first search found,
1004
00:50:25,560 --> 00:50:29,040
which incidentally is the same solution that depth-first search found.
1005
00:50:29,040 --> 00:50:31,360
They're both finding the best solution.
1006
00:50:31,360 --> 00:50:33,640
But notice all the white unexplored cells.
1007
00:50:33,640 --> 00:50:37,000
There were far fewer states that needed to be explored in order
1008
00:50:37,000 --> 00:50:41,000
to make our way to the goal because breadth-first search operates
1009
00:50:41,000 --> 00:50:42,000
a little more shallowly.
1010
00:50:42,000 --> 00:50:45,080
It's exploring things that are close to the initial state
1011
00:50:45,080 --> 00:50:48,160
without exploring things that are further away.
1012
00:50:48,160 --> 00:50:51,240
So if the goal is not too far away, then breadth-first search
1013
00:50:51,240 --> 00:50:53,960
can actually behave quite effectively on a maze that
1014
00:50:53,960 --> 00:50:56,760
looks a little something like this.
1015
00:50:56,760 --> 00:51:01,760
Now, in this case, both BFS and DFS ended up finding the same solution,
1016
00:51:01,760 --> 00:51:03,560
but that won't always be the case.
1017
00:51:03,560 --> 00:51:06,320
And in fact, let's take a look at one more example.
1018
00:51:06,320 --> 00:51:09,400
For instance, maze3.txt.
1019
00:51:09,400 --> 00:51:12,980
In maze3.txt, notice that here there are multiple ways
1020
00:51:12,980 --> 00:51:16,440
that you could get from A to B. It's a relatively small maze,
1021
00:51:16,440 --> 00:51:18,080
but let's look at what happens.
1022
00:51:18,080 --> 00:51:21,560
And I'll go ahead and turn off show_explored
1023
00:51:21,560 --> 00:51:24,320
so we just see the solution.
1024
00:51:24,320 --> 00:51:30,640
If I use BFS, breadth-first search, to solve maze3.txt,
1025
00:51:30,640 --> 00:51:33,840
well, then we find a solution, and if I open up the maze,
1026
00:51:33,840 --> 00:51:35,560
here is the solution that we found.
1027
00:51:35,560 --> 00:51:36,640
It is the optimal one.
1028
00:51:36,640 --> 00:51:39,720
With just four steps, we can get from the initial state
1029
00:51:39,720 --> 00:51:43,080
to what the goal happens to be.
1030
00:51:43,080 --> 00:51:47,920
But what happens if we tried to use depth-first search or DFS instead?
1031
00:51:47,920 --> 00:51:52,560
Well, again, I'll go back up to my queue frontier, where queue frontier means
1032
00:51:52,560 --> 00:51:57,320
that we're using breadth-first search, and I'll change it to a stack frontier,
1033
00:51:57,320 --> 00:52:00,880
which means that now we'll be using depth-first search.
1034
00:52:00,880 --> 00:52:06,520
I'll rerun python maze.py, and now you'll see that we find the solution,
1035
00:52:06,520 --> 00:52:09,000
but it is not the optimal solution.
1036
00:52:09,000 --> 00:52:11,760
This instead is what our algorithm finds,
1037
00:52:11,760 --> 00:52:14,160
and maybe depth-first search would have found the solution.
1038
00:52:14,160 --> 00:52:17,400
It's possible, but it's not guaranteed. If we just
1039
00:52:17,400 --> 00:52:21,320
happen to be unlucky and choose this state instead of that state,
1040
00:52:21,320 --> 00:52:24,000
then depth-first search might find a longer route
1041
00:52:24,000 --> 00:52:27,280
to get from the initial state to the goal.
1042
00:52:27,280 --> 00:52:30,320
So we do see some trade-offs here, where depth-first search might not
1043
00:52:30,320 --> 00:52:32,360
find the optimal solution.
1044
00:52:32,360 --> 00:52:35,120
So at that point, it seems like breadth-first search is pretty good.
1045
00:52:35,120 --> 00:52:38,960
Is that the best we can do, where it's going to find us the optimal solution,
1046
00:52:38,960 --> 00:52:41,360
and we don't have to worry about situations
1047
00:52:41,360 --> 00:52:44,560
where we might end up finding a longer path to the solution
1048
00:52:44,560 --> 00:52:46,440
than what actually exists?
1049
00:52:46,440 --> 00:52:49,320
But in a case like this, where the goal is far away from the initial state,
1050
00:52:49,320 --> 00:52:51,520
and we might have to take lots of steps in order
1051
00:52:51,520 --> 00:52:55,000
to get from the initial state to the goal, what ended up happening
1052
00:52:55,000 --> 00:52:59,560
is that this algorithm, BFS, ended up exploring basically the entire graph,
1053
00:52:59,560 --> 00:53:01,920
having to go through the entire maze in order
1054
00:53:01,920 --> 00:53:05,960
to find its way from the initial state to the goal state.
1055
00:53:05,960 --> 00:53:08,120
What we'd ultimately like is for our algorithm
1056
00:53:08,120 --> 00:53:10,800
to be a little bit more intelligent.
1057
00:53:10,800 --> 00:53:13,800
And now what would it mean for our algorithm to be a little bit more
1058
00:53:13,800 --> 00:53:16,000
intelligent in this case?
1059
00:53:16,000 --> 00:53:18,680
Well, let's look back to where breadth-first search might
1060
00:53:18,680 --> 00:53:20,440
have been able to make a different decision
1061
00:53:20,440 --> 00:53:23,880
and consider human intuition in this process as well.
1062
00:53:23,880 --> 00:53:26,280
What might a human do when solving this maze
1063
00:53:26,280 --> 00:53:30,640
that is different than what BFS ultimately chose to do?
1064
00:53:30,640 --> 00:53:35,160
Well, the very first decision point that BFS made was right here,
1065
00:53:35,160 --> 00:53:38,400
when it made five steps and ended up in a position
1066
00:53:38,400 --> 00:53:39,680
where it had a fork in the road.
1067
00:53:39,680 --> 00:53:41,880
It could either go left or it could go right.
1068
00:53:41,880 --> 00:53:44,000
In these initial couple steps, there was no choice.
1069
00:53:44,000 --> 00:53:46,840
There was only one action that could be taken from each of those states.
1070
00:53:46,840 --> 00:53:49,200
And so the search algorithm did the only thing
1071
00:53:49,200 --> 00:53:53,000
that any search algorithm could do, which is keep following one state
1072
00:53:53,000 --> 00:53:54,560
after the next.
1073
00:53:54,560 --> 00:53:57,840
But this decision point is where things get a little bit interesting.
1074
00:53:57,840 --> 00:54:01,000
Depth-first search, that very first search algorithm we looked at,
1075
00:54:01,000 --> 00:54:04,880
chose to say, let's pick one path and exhaust that path.
1076
00:54:04,880 --> 00:54:07,520
See if anything that way has the goal.
1077
00:54:07,520 --> 00:54:09,920
And if not, then let's try the other way.
1078
00:54:09,920 --> 00:54:12,720
Breadth-first search took the alternative approach of saying,
1079
00:54:12,720 --> 00:54:16,480
you know what, let's explore things that are shallow, close to us first.
1080
00:54:16,480 --> 00:54:20,080
Look left and right, then back left and back right, so on and so forth,
1081
00:54:20,080 --> 00:54:24,800
alternating between our options in the hopes of finding something nearby.
1082
00:54:24,800 --> 00:54:27,520
But ultimately, what might a human do if confronted
1083
00:54:27,520 --> 00:54:30,400
with a situation like this of go left or go right?
1084
00:54:30,400 --> 00:54:33,280
Well, a human might visually see that, all right, I'm
1085
00:54:33,280 --> 00:54:36,080
trying to get to state b, which is way up there,
1086
00:54:36,080 --> 00:54:39,600
and going right just feels like it's closer to the goal.
1087
00:54:39,600 --> 00:54:42,240
It feels like going right should be better than going left
1088
00:54:42,240 --> 00:54:45,440
because I'm making progress towards getting to that goal.
1089
00:54:45,440 --> 00:54:48,480
Now, of course, there are a couple of assumptions that I'm making here.
1090
00:54:48,480 --> 00:54:51,840
I'm making the assumption that we can represent this grid
1091
00:54:51,840 --> 00:54:55,040
as like a two-dimensional grid where I know the coordinates of everything.
1092
00:54:55,040 --> 00:55:00,120
I know that a is in coordinate 0, 0, and b is in some other coordinate pair,
1093
00:55:00,120 --> 00:55:01,640
and I know what coordinate I'm at now.
1094
00:55:01,640 --> 00:55:05,640
So I can calculate that, yeah, going this way, that is closer to the goal.
1095
00:55:05,640 --> 00:55:08,840
And that might be a reasonable assumption for some types of search problems,
1096
00:55:08,840 --> 00:55:10,240
but maybe not in others.
1097
00:55:10,240 --> 00:55:12,840
But for now, we'll go ahead and assume that,
1098
00:55:12,840 --> 00:55:15,480
that I know what my current coordinate pair is,
1099
00:55:15,480 --> 00:55:19,840
and I know the coordinate, x, y, of the goal that I'm trying to get to.
1100
00:55:19,840 --> 00:55:22,520
And in this situation, I'd like an algorithm
1101
00:55:22,520 --> 00:55:25,240
that is a little bit more intelligent, that somehow knows
1102
00:55:25,240 --> 00:55:28,320
that I should be making progress towards the goal,
1103
00:55:28,320 --> 00:55:31,880
and this is probably the way to do that because in a maze,
1104
00:55:31,880 --> 00:55:34,640
moving in the coordinate direction of the goal
1105
00:55:34,640 --> 00:55:37,920
is usually, though not always, a good thing.
1106
00:55:37,920 --> 00:55:40,640
And so here we draw a distinction between two different types
1107
00:55:40,640 --> 00:55:45,200
of search algorithms, uninformed search and informed search.
1108
00:55:45,200 --> 00:55:49,480
Uninformed search algorithms are algorithms like DFS and BFS,
1109
00:55:49,480 --> 00:55:51,440
the two algorithms that we just looked at, which
1110
00:55:51,440 --> 00:55:55,720
are search strategies that don't use any problem-specific knowledge
1111
00:55:55,720 --> 00:55:57,560
to be able to solve the problem.
1112
00:55:57,560 --> 00:56:01,280
DFS and BFS didn't really care about the structure of the maze
1113
00:56:01,280 --> 00:56:05,400
or anything about the nature of a maze in order to solve the problem.
1114
00:56:05,400 --> 00:56:08,720
They just look at the actions available and choose from those actions,
1115
00:56:08,720 --> 00:56:11,480
and it doesn't matter whether it's a maze or some other problem,
1116
00:56:11,480 --> 00:56:14,200
the solution or the way that it tries to solve the problem
1117
00:56:14,200 --> 00:56:17,520
is really fundamentally going to be the same.
1118
00:56:17,520 --> 00:56:19,920
What we're going to take a look at now is an improvement
1119
00:56:19,920 --> 00:56:21,520
upon uninformed search.
1120
00:56:21,520 --> 00:56:24,000
We're going to take a look at informed search.
1121
00:56:24,000 --> 00:56:26,440
Informed search algorithms are going to be search strategies
1122
00:56:26,440 --> 00:56:29,680
that use knowledge specific to the problem
1123
00:56:29,680 --> 00:56:31,960
to be able to better find a solution.
1124
00:56:31,960 --> 00:56:35,440
And in the case of a maze, this problem-specific knowledge
1125
00:56:35,440 --> 00:56:40,400
is something like if I'm in a square that is geographically closer to the goal,
1126
00:56:40,400 --> 00:56:45,880
that is better than being in a square that is geographically further away.
1127
00:56:45,880 --> 00:56:49,400
And this is something we can only know by thinking about this problem
1128
00:56:49,400 --> 00:56:54,000
and reasoning about what knowledge might be helpful for our AI agent
1129
00:56:54,000 --> 00:56:56,360
to know a little something about.
1130
00:56:56,360 --> 00:56:58,600
There are a number of different types of informed search.
1131
00:56:58,600 --> 00:57:01,600
Specifically, first, we're going to look at a particular type of search
1132
00:57:01,600 --> 00:57:05,720
algorithm called greedy best-first search.
1133
00:57:05,720 --> 00:57:08,880
Greedy best-first search, often abbreviated G-BFS,
1134
00:57:08,880 --> 00:57:13,160
is a search algorithm that instead of expanding the deepest node like DFS
1135
00:57:13,160 --> 00:57:16,680
or the shallowest node like BFS, this algorithm
1136
00:57:16,680 --> 00:57:22,160
is always going to expand the node that it thinks is closest to the goal.
1137
00:57:22,160 --> 00:57:24,600
Now, the search algorithm isn't going to know for sure
1138
00:57:24,600 --> 00:57:27,040
whether it is the closest thing to the goal.
1139
00:57:27,040 --> 00:57:29,720
Because if we knew what was closest to the goal all the time,
1140
00:57:29,720 --> 00:57:31,600
then we would already have a solution.
1141
00:57:31,600 --> 00:57:33,360
With knowledge of what is close to the goal,
1142
00:57:33,360 --> 00:57:36,760
we could just follow those steps in order to get from the initial position
1143
00:57:36,760 --> 00:57:37,960
to the solution.
1144
00:57:37,960 --> 00:57:39,600
But if we don't know the solution, meaning
1145
00:57:39,600 --> 00:57:42,560
we don't know exactly what's closest to the goal,
1146
00:57:42,560 --> 00:57:46,200
instead we can use an estimate of what's closest to the goal,
1147
00:57:46,200 --> 00:57:50,680
otherwise known as a heuristic, just some way of estimating whether or not
1148
00:57:50,680 --> 00:57:51,960
we're close to the goal.
1149
00:57:51,960 --> 00:57:54,640
And we'll do so using a heuristic function conventionally
1150
00:57:54,640 --> 00:57:58,520
called h of n that takes a state as input and returns
1151
00:57:58,520 --> 00:58:03,000
our estimate of how close we are to the goal.
1152
00:58:03,000 --> 00:58:05,160
So what might this heuristic function actually
1153
00:58:05,160 --> 00:58:08,240
look like in the case of a maze solving algorithm?
1154
00:58:08,240 --> 00:58:11,600
When we're trying to solve a maze, what does the heuristic look like?
1155
00:58:11,600 --> 00:58:14,160
Well, the heuristic needs to answer a question
1156
00:58:14,160 --> 00:58:17,800
between these two cells, C and D, which one is better?
1157
00:58:17,800 --> 00:58:22,280
Which one would I rather be in if I'm trying to find my way to the goal?
1158
00:58:22,280 --> 00:58:24,440
Well, any human could probably look at this and tell you,
1159
00:58:24,440 --> 00:58:26,400
you know what, D looks like it's better.
1160
00:58:26,400 --> 00:58:29,680
Even if the maze is convoluted and you haven't thought about all the walls,
1161
00:58:29,680 --> 00:58:31,520
D is probably better.
1162
00:58:31,520 --> 00:58:32,760
And why is D better?
1163
00:58:32,760 --> 00:58:35,480
Well, because if you ignore the wall, so let's just pretend
1164
00:58:35,480 --> 00:58:40,440
the walls don't exist for a moment and relax the problem, so to speak,
1165
00:58:40,440 --> 00:58:44,800
D, just in terms of coordinate pairs, is closer to this goal.
1166
00:58:44,800 --> 00:58:49,080
It's fewer steps that I would need to take to get to the goal as compared to C,
1167
00:58:49,080 --> 00:58:50,320
even if you ignore the walls.
1168
00:58:50,320 --> 00:58:55,320
If you just know the xy-coordinate of C and the xy-coordinate of the goal,
1169
00:58:55,320 --> 00:58:57,600
and likewise you know the xy-coordinate of D,
1170
00:58:57,600 --> 00:59:00,520
you can calculate that D, just geographically,
1171
00:59:00,520 --> 00:59:03,320
ignoring the walls, looks like it's better.
1172
00:59:03,320 --> 00:59:05,820
And so this is the heuristic function that we're going to use.
1173
00:59:05,820 --> 00:59:08,160
And it's something called the Manhattan distance,
1174
00:59:08,160 --> 00:59:12,440
one specific type of heuristic, where the heuristic is how many squares
1175
00:59:12,440 --> 00:59:15,080
vertically and horizontally I would need to travel,
1176
00:59:15,080 --> 00:59:18,320
so not allowing myself to go diagonally, just either up or right
1177
00:59:18,320 --> 00:59:19,480
or left or down.
1178
00:59:19,480 --> 00:59:24,160
How many steps do I need to take to get from each of these cells to the goal?
1179
00:59:24,160 --> 00:59:27,040
Well, as it turns out, D is much closer.
1180
00:59:27,040 --> 00:59:28,040
There are fewer steps.
1181
00:59:28,040 --> 00:59:31,920
It only needs to take six steps in order to get to that goal.
1182
00:59:31,920 --> 00:59:33,760
Again, here, ignoring the walls.
1183
00:59:33,760 --> 00:59:35,960
We've relaxed the problem a little bit.
1184
00:59:35,960 --> 00:59:38,600
We're just concerned with if you do the math
1185
00:59:38,600 --> 00:59:41,880
to subtract the x values from each other and the y values from each other,
1186
00:59:41,880 --> 00:59:44,200
what is our estimate of how far we are away?
1187
00:59:44,200 --> 00:59:49,140
We can estimate that D is closer to the goal than C is.
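As a quick sketch in Python (this is illustrative, not the course's own distribution code), that relaxed Manhattan distance estimate is only a couple of lines:

```python
def manhattan_distance(state, goal):
    """Relaxed estimate h(n): ignore the walls and count how many
    squares vertically plus horizontally separate us from the goal."""
    (x1, y1), (x2, y2) = state, goal
    return abs(x1 - x2) + abs(y1 - y2)
```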
1188
00:59:49,140 --> 00:59:51,000
And so now we have an approach.
1189
00:59:51,000 --> 00:59:54,040
We have a way of picking which node to remove from the frontier.
1190
00:59:54,040 --> 00:59:56,080
And at each stage in our algorithm, we're
1191
00:59:56,080 --> 00:59:57,760
going to remove a node from the frontier.
1192
00:59:57,760 --> 01:00:00,920
We're going to explore the node that has the smallest
1193
01:00:00,920 --> 01:00:04,120
value for this heuristic function, the node that has the smallest
1194
01:00:04,120 --> 01:00:06,720
Manhattan distance to the goal.
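Sketched in Python, the frontier becomes a priority queue ordered by h(n) alone. This outline is an assumption about one reasonable implementation, not the course's source code; `neighbors` and `h` are assumed to be supplied by the caller:

```python
import heapq

def greedy_best_first_search(start, goal, neighbors, h):
    """Always expand the frontier node with the smallest heuristic
    value h(n). `neighbors(state)` yields reachable states; `h(state)`
    estimates the remaining distance to the goal."""
    frontier = [(h(start), start, [start])]  # (h value, state, path so far)
    explored = set()
    while frontier:
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in explored:
            continue
        explored.add(state)
        for nxt in neighbors(state):
            if nxt not in explored:
                heapq.heappush(frontier, (h(nxt), nxt, path + [nxt]))
    return None  # no solution
```

Note that the path this returns depends entirely on how good the estimates are, which is exactly why the algorithm is not guaranteed to be optimal.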
1195
01:00:06,720 --> 01:00:08,560
And so what would this actually look like?
1196
01:00:08,560 --> 01:00:11,440
Well, let me first label this graph, label this maze,
1197
01:00:11,440 --> 01:00:14,680
with a number representing the value of this heuristic function,
1198
01:00:14,680 --> 01:00:18,080
the value of the Manhattan distance from any of these cells.
1199
01:00:18,080 --> 01:00:21,200
So from this cell, for example, we're one away from the goal.
1200
01:00:21,200 --> 01:00:24,560
From this cell, we're two away from the goal, three away, four away.
1201
01:00:24,560 --> 01:00:27,120
Here, we're five away because we have to go one to the right
1202
01:00:27,120 --> 01:00:28,400
and then four up.
1203
01:00:28,400 --> 01:00:32,160
From somewhere like here, the Manhattan distance is two.
1204
01:00:32,160 --> 01:00:35,720
We're only two squares away from the goal geographically,
1205
01:00:35,720 --> 01:00:39,000
even though in practice, we're going to have to take a longer path.
1206
01:00:39,000 --> 01:00:40,080
But we don't know that yet.
1207
01:00:40,080 --> 01:00:42,920
The heuristic is just some easy way to estimate
1208
01:00:42,920 --> 01:00:44,560
how far we are away from the goal.
1209
01:00:44,560 --> 01:00:47,560
And maybe our heuristic is overly optimistic.
1210
01:00:47,560 --> 01:00:49,800
It thinks that, yeah, we're only two steps away.
1211
01:00:49,800 --> 01:00:53,680
When in practice, when you consider the walls, it might be more steps.
1212
01:00:53,680 --> 01:00:57,800
So the important thing here is that the heuristic isn't a guarantee of how
1213
01:00:57,800 --> 01:00:59,400
many steps it's going to take.
1214
01:00:59,400 --> 01:01:01,040
It is estimating.
1215
01:01:01,040 --> 01:01:03,040
It's an attempt at trying to approximate.
1216
01:01:03,040 --> 01:01:06,240
And it does seem generally the case that the squares that
1217
01:01:06,240 --> 01:01:10,120
look closer to the goal have smaller values for the heuristic function
1218
01:01:10,120 --> 01:01:13,120
than squares that are further away.
1219
01:01:13,120 --> 01:01:18,240
So now, using greedy best-first search, what might this algorithm actually do?
1220
01:01:18,240 --> 01:01:21,520
Well, again, for these first five steps, there's not much of a choice.
1221
01:01:21,520 --> 01:01:23,840
We start at this initial state a, and we say, all right,
1222
01:01:23,840 --> 01:01:26,440
we have to explore these five states.
1223
01:01:26,440 --> 01:01:28,040
But now we have a decision point.
1224
01:01:28,040 --> 01:01:30,760
Now we have a choice between going left and going right.
1225
01:01:30,760 --> 01:01:34,080
And before, when DFS and BFS would just pick arbitrarily,
1226
01:01:34,080 --> 01:01:37,760
because it just depends on the order you throw these two nodes into the frontier,
1227
01:01:37,760 --> 01:01:40,880
and we didn't specify what order you put them into the frontier,
1228
01:01:40,880 --> 01:01:45,520
only the order you take them out, here we can look at 13 and 11
1229
01:01:45,520 --> 01:01:50,800
and say that, all right, this square is a distance of 11 away from the goal
1230
01:01:50,800 --> 01:01:53,440
according to our heuristic, according to our estimate.
1231
01:01:53,440 --> 01:01:57,720
And this one, we estimate to be 13 away from the goal.
1232
01:01:57,720 --> 01:02:00,800
So between those two options, between these two choices,
1233
01:02:00,800 --> 01:02:02,280
I'd rather have the 11.
1234
01:02:02,280 --> 01:02:06,280
I'd rather be 11 steps away from the goal, so I'll go to the right.
1235
01:02:06,280 --> 01:02:09,800
We're able to make an informed decision, because we know a little something
1236
01:02:09,800 --> 01:02:11,840
more about this problem.
1237
01:02:11,840 --> 01:02:13,960
So then we keep following, 10, 9, 8.
1238
01:02:13,960 --> 01:02:17,920
Between the two 7s, we don't really have much of a way to decide between those.
1239
01:02:17,920 --> 01:02:20,040
So then we do just have to make an arbitrary choice.
1240
01:02:20,040 --> 01:02:21,840
And you know what, maybe we choose wrong.
1241
01:02:21,840 --> 01:02:26,240
But that's OK, because now we can still say, all right, let's try this 7.
1242
01:02:26,240 --> 01:02:29,280
We follow the 7, then the 6, and we have to make this choice,
1243
01:02:29,280 --> 01:02:31,800
even though it increases the value of the heuristic function.
1244
01:02:31,800 --> 01:02:36,440
But now we have another decision point, between 6 and 8, and between those two.
1245
01:02:36,440 --> 01:02:39,520
And really, we're also considering this 13, but that's much higher.
1246
01:02:39,520 --> 01:02:43,560
Between 6, 8, and 13, well, the 6 is the smallest value,
1247
01:02:43,560 --> 01:02:45,040
so we'd rather take the 6.
1248
01:02:45,040 --> 01:02:48,600
We're able to make an informed decision that going this way to the right
1249
01:02:48,600 --> 01:02:51,000
is probably better than going down.
1250
01:02:51,000 --> 01:02:53,000
So we turn this way, we go to 5.
1251
01:02:53,000 --> 01:02:55,320
And now we find a decision point where we'll actually
1252
01:02:55,320 --> 01:02:57,360
make a decision that we might not want to make,
1253
01:02:57,360 --> 01:03:00,440
but there's unfortunately not too much of a way around this.
1254
01:03:00,440 --> 01:03:01,800
We see 4 and 6.
1255
01:03:01,800 --> 01:03:03,760
4 looks closer to the goal, right?
1256
01:03:03,760 --> 01:03:06,320
It's going up, and the goal is further up.
1257
01:03:06,320 --> 01:03:09,840
So we end up taking that route, which ultimately leads us to a dead end.
1258
01:03:09,840 --> 01:03:13,120
But that's OK, because we can still say, all right, now let's try the 6.
1259
01:03:13,120 --> 01:03:17,400
And now follow this route that will ultimately lead us to the goal.
1260
01:03:17,400 --> 01:03:20,480
And so this now is how greedy best-first search
1261
01:03:20,480 --> 01:03:22,640
might try to approach this problem by saying,
1262
01:03:22,640 --> 01:03:26,240
whenever we have a decision between multiple nodes that we could explore,
1263
01:03:26,240 --> 01:03:30,480
let's explore the node that has the smallest value of h of n,
1264
01:03:30,480 --> 01:03:35,360
this heuristic function that is estimating how far I have to go.
1265
01:03:35,360 --> 01:03:37,560
And it just so happens that in this case, we end up
1266
01:03:37,560 --> 01:03:41,200
doing better in terms of the number of states we needed to explore
1267
01:03:41,200 --> 01:03:42,560
than BFS needed to.
1268
01:03:42,560 --> 01:03:46,120
BFS explored all of this section and all of that section,
1269
01:03:46,120 --> 01:03:49,640
but we were able to eliminate that by taking advantage of this heuristic,
1270
01:03:49,640 --> 01:03:56,360
this knowledge about how close we are to the goal or some estimate of that idea.
1271
01:03:56,360 --> 01:03:57,480
So this seems much better.
1272
01:03:57,480 --> 01:04:01,080
So wouldn't we always prefer an algorithm like this over an algorithm
1273
01:04:01,080 --> 01:04:03,040
like breadth-first search?
1274
01:04:03,040 --> 01:04:05,560
Well, maybe one thing to take into consideration
1275
01:04:05,560 --> 01:04:09,600
is that we need to come up with a good heuristic. How good the heuristic is
1276
01:04:09,600 --> 01:04:11,840
is going to affect how good this algorithm is.
1277
01:04:11,840 --> 01:04:16,000
And coming up with a good heuristic can oftentimes be challenging.
1278
01:04:16,000 --> 01:04:18,440
But the other thing to consider is to ask the question,
1279
01:04:18,440 --> 01:04:22,720
just as we did with the prior two algorithms, is this algorithm optimal?
1280
01:04:22,720 --> 01:04:28,400
Will it always find the shortest path from the initial state to the goal?
1281
01:04:28,400 --> 01:04:32,320
And to answer that question, let's take a look at this example for a moment.
1282
01:04:32,320 --> 01:04:33,600
Take a look at this example.
1283
01:04:33,600 --> 01:04:36,120
Again, we're trying to get from A to B. And again,
1284
01:04:36,120 --> 01:04:40,160
I've labeled each of the cells with their Manhattan distance from the goal:
1285
01:04:40,160 --> 01:04:42,480
the number of squares, up and to the right,
1286
01:04:42,480 --> 01:04:46,680
you would need to travel in order to get from that square to the goal.
1287
01:04:46,680 --> 01:04:49,560
And let's think about, would greedy best-first search
1288
01:04:49,560 --> 01:04:55,520
that always picks the smallest number end up finding the optimal solution?
1289
01:04:55,520 --> 01:04:57,080
What is the shortest solution?
1290
01:04:57,080 --> 01:04:59,560
And would this algorithm find it?
1291
01:04:59,560 --> 01:05:04,360
And the important thing to realize is that right here is the decision point.
1292
01:05:04,360 --> 01:05:06,840
We're estimated to be 12 away from the goal.
1293
01:05:06,840 --> 01:05:08,360
And we have two choices.
1294
01:05:08,360 --> 01:05:11,840
We can go to the left, which we estimate to be 13 away from the goal.
1295
01:05:11,840 --> 01:05:15,840
Or we can go up, where we estimate it to be 11 away from the goal.
1296
01:05:15,840 --> 01:05:18,720
And between those two, greedy best-first search
1297
01:05:18,720 --> 01:05:23,120
is going to say the 11 looks better than the 13.
1298
01:05:23,120 --> 01:05:26,040
And in doing so, greedy best-first search will end up
1299
01:05:26,040 --> 01:05:28,960
finding this path to the goal.
1300
01:05:28,960 --> 01:05:31,120
But it turns out this path is not optimal.
1301
01:05:31,120 --> 01:05:33,600
There is a way to get to the goal using fewer steps.
1302
01:05:33,600 --> 01:05:38,520
And it's actually this way, this way that ultimately involved fewer steps,
1303
01:05:38,520 --> 01:05:43,480
even though it meant at this moment choosing the worse option of the two,
1304
01:05:43,480 --> 01:05:47,280
or what we estimated to be the worse option based on the heuristic.
1305
01:05:47,280 --> 01:05:50,040
And so this is what we mean by this is a greedy algorithm.
1306
01:05:50,040 --> 01:05:52,600
It's making the best decision locally.
1307
01:05:52,600 --> 01:05:55,800
At this decision point, it looks like it's better to go here
1308
01:05:55,800 --> 01:05:57,480
than it is to go to the 13.
1309
01:05:57,480 --> 01:06:00,200
But in the big picture, it's not necessarily optimal.
1310
01:06:00,200 --> 01:06:03,200
That it might find a solution when in actuality,
1311
01:06:03,200 --> 01:06:06,200
there was a better solution available.
1312
01:06:06,200 --> 01:06:09,360
So we would like some way to solve this problem.
1313
01:06:09,360 --> 01:06:12,000
We like the idea of this heuristic, of being
1314
01:06:12,000 --> 01:06:16,280
able to estimate the path, the distance between us and the goal.
1315
01:06:16,280 --> 01:06:18,440
And that helps us to be able to make better decisions
1316
01:06:18,440 --> 01:06:23,160
and to eliminate having to search through entire parts of this state space.
1317
01:06:23,160 --> 01:06:27,080
But we would like to modify the algorithm so that we can achieve optimality,
1318
01:06:27,080 --> 01:06:28,760
so that it can be optimal.
1319
01:06:28,760 --> 01:06:30,120
And what is the way to do this?
1320
01:06:30,120 --> 01:06:31,960
What is the intuition here?
1321
01:06:31,960 --> 01:06:34,480
Well, let's take a look at this problem.
1322
01:06:34,480 --> 01:06:37,200
In this initial problem, greedy best-first search
1323
01:06:37,200 --> 01:06:40,240
found us this solution here, this long path.
1324
01:06:40,240 --> 01:06:43,440
And the reason why it wasn't great is because, yes, the heuristic numbers
1325
01:06:43,440 --> 01:06:44,960
went down pretty low.
1326
01:06:44,960 --> 01:06:47,320
But later on, they started to build back up.
1327
01:06:47,320 --> 01:06:52,000
They built back 8, 9, 10, 11, all the way up to 12 in this case.
1328
01:06:52,000 --> 01:06:55,440
And so how might we go about trying to improve this algorithm?
1329
01:06:55,440 --> 01:06:59,240
Well, one thing that we might realize is that if we go all the way
1330
01:06:59,240 --> 01:07:03,440
through this algorithm, through this path, and we end up going to the 12,
1331
01:07:03,440 --> 01:07:06,600
and we've had to take this many steps, who knows how many steps that is,
1332
01:07:06,600 --> 01:07:11,440
just to get to this 12, we could have also, as an alternative,
1333
01:07:11,440 --> 01:07:16,320
taken much fewer steps, just six steps, and ended up at this 13 here.
1334
01:07:16,320 --> 01:07:19,840
And yes, 13 is more than 12, so it looks like it's not as good.
1335
01:07:19,840 --> 01:07:22,120
But it required far fewer steps.
1336
01:07:22,120 --> 01:07:25,680
It only took six steps to get to this 13 versus many more steps
1337
01:07:25,680 --> 01:07:27,160
to get to this 12.
1338
01:07:27,160 --> 01:07:30,320
And while greedy best-first search says, oh, well, 12 is better than 13,
1339
01:07:30,320 --> 01:07:33,920
so pick the 12, we might more intelligently say,
1340
01:07:33,920 --> 01:07:37,240
I'd rather be somewhere that heuristically looks
1341
01:07:37,240 --> 01:07:42,160
like it takes slightly longer if I can get there much more quickly.
1342
01:07:42,160 --> 01:07:45,120
And we're going to encode that idea, this general idea,
1343
01:07:45,120 --> 01:07:49,200
into a more formal algorithm known as A star search.
1344
01:07:49,200 --> 01:07:51,280
A star search is going to solve this problem
1345
01:07:51,280 --> 01:07:54,120
by instead of just considering the heuristic,
1346
01:07:54,120 --> 01:07:58,800
also considering how long it took us to get to any particular state.
1347
01:07:58,800 --> 01:08:01,040
So the distinction is this: in greedy best-first search,
1348
01:08:01,040 --> 01:08:04,120
if I am in a state right now, the only thing I care about
1349
01:08:04,120 --> 01:08:07,240
is, what is the estimated distance, the heuristic value,
1350
01:08:07,240 --> 01:08:09,160
between me and the goal?
1351
01:08:09,160 --> 01:08:11,800
Whereas A star search will take into consideration
1352
01:08:11,800 --> 01:08:13,440
two pieces of information.
1353
01:08:13,440 --> 01:08:17,280
It'll take into consideration, how far do I estimate I am from the goal?
1354
01:08:17,280 --> 01:08:21,200
But also, how far did I have to travel in order to get here?
1355
01:08:21,200 --> 01:08:23,640
Because that is relevant, too.
1356
01:08:23,640 --> 01:08:26,160
So A star search will solve search problems by expanding the node
1357
01:08:26,160 --> 01:08:30,200
with the lowest value of g of n plus h of n.
1358
01:08:30,200 --> 01:08:33,800
h of n is that same heuristic that we were talking about a moment ago that's
1359
01:08:33,800 --> 01:08:35,720
going to vary based on the problem.
1360
01:08:35,720 --> 01:08:40,320
But g of n is going to be the cost to reach the node, how many steps
1361
01:08:40,320 --> 01:08:45,520
I had to take, in this case, to get to my current position.
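As a sketch (again an illustrative outline, not the course's distribution code), the only change from the greedy approach is that the priority queue is ordered by g(n) plus h(n), where g counts the steps taken so far; unit step costs are assumed here:

```python
import heapq

def a_star_search(start, goal, neighbors, h):
    """Expand the node with the smallest g(n) + h(n), where g(n) is
    the number of steps taken to reach n (unit step costs assumed)."""
    frontier = [(h(start), 0, start, [start])]  # (g + h, g, state, path)
    explored = set()
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in explored:
            continue
        explored.add(state)
        for nxt in neighbors(state):
            if nxt not in explored:
                heapq.heappush(frontier,
                               (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # no solution
```

With an admissible, consistent heuristic like Manhattan distance, the first path popped off the frontier at the goal is an optimal one.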
1362
01:08:45,520 --> 01:08:48,200
So what does that search algorithm look like in practice?
1363
01:08:48,200 --> 01:08:49,760
Well, let's take a look.
1364
01:08:49,760 --> 01:08:51,280
Again, we've got the same maze.
1365
01:08:51,280 --> 01:08:54,160
And again, I've labeled them with their Manhattan distance.
1366
01:08:54,160 --> 01:08:57,400
This value is the h of n value, the heuristic
1367
01:08:57,400 --> 01:09:02,400
estimate of how far each of these squares is away from the goal.
1368
01:09:02,400 --> 01:09:04,520
But now, as we begin to explore states, we
1369
01:09:04,520 --> 01:09:08,560
care not just about this heuristic value, but also about g of n,
1370
01:09:08,560 --> 01:09:11,680
the number of steps I had to take in order to get there.
1371
01:09:11,680 --> 01:09:14,280
And I care about summing those two numbers together.
1372
01:09:14,280 --> 01:09:15,400
So what does that look like?
1373
01:09:15,400 --> 01:09:19,000
On this very first step, I have taken one step.
1374
01:09:19,000 --> 01:09:22,280
And now I am estimated to be 16 steps away from the goal.
1375
01:09:22,280 --> 01:09:25,400
So the total value here is 17.
1376
01:09:25,400 --> 01:09:26,520
Then I take one more step.
1377
01:09:26,520 --> 01:09:28,160
I've now taken two steps.
1378
01:09:28,160 --> 01:09:32,800
And I estimate myself to be 15 away from the goal, again, a total value of 17.
1379
01:09:32,800 --> 01:09:34,360
Now I've taken three steps.
1380
01:09:34,360 --> 01:09:37,600
And I'm estimated to be 14 away from the goal, so on and so forth.
1381
01:09:37,600 --> 01:09:39,880
Four steps, an estimate of 13.
1382
01:09:39,880 --> 01:09:41,960
Five steps, estimate of 12.
1383
01:09:41,960 --> 01:09:44,120
And now here's a decision point.
1384
01:09:44,120 --> 01:09:48,880
I could either have taken six steps, with a heuristic of 13,
1385
01:09:48,880 --> 01:09:52,600
for a total of 19, or I could have taken six steps
1386
01:09:52,600 --> 01:09:57,840
with a heuristic of 11, for a total of 17.
1387
01:09:57,840 --> 01:10:03,200
So between 19 and 17, I'd rather take the 17, the 6 plus 11.
1388
01:10:03,200 --> 01:10:05,200
So so far, no different than what we saw before.
1389
01:10:05,200 --> 01:10:08,280
We're still taking this option because it appears to be better.
1390
01:10:08,280 --> 01:10:11,280
And I keep taking this option because it appears to be better.
1391
01:10:11,280 --> 01:10:15,720
But it's right about here that things get a little bit different.
1392
01:10:15,720 --> 01:10:21,760
Now I could have taken 15 steps, with an estimated distance of 6 from the goal.
1393
01:10:21,760 --> 01:10:24,880
So 15 plus 6, total value of 21.
1394
01:10:24,880 --> 01:10:28,000
Alternatively, I could have taken six steps,
1395
01:10:28,000 --> 01:10:30,800
because this square is five steps in, so this one is six steps in,
1396
01:10:30,800 --> 01:10:33,480
with a heuristic value of 13 as my estimate.
1397
01:10:33,480 --> 01:10:36,320
So 6 plus 13, that's 19.
1398
01:10:36,320 --> 01:10:41,720
So here, we would evaluate g of n plus h of n to be 19, 6 plus 13.
1399
01:10:41,720 --> 01:10:46,560
Whereas here, we would be 15 plus 6, or 21.
1400
01:10:46,560 --> 01:10:49,840
And so the intuition is 19 less than 21, pick here.
1401
01:10:49,840 --> 01:10:55,360
But the idea is ultimately I'd rather have taken fewer steps and be at a 13,
1402
01:10:55,360 --> 01:10:59,160
than have taken 15 steps and be at a 6, because it
1403
01:10:59,160 --> 01:11:01,560
means I've had to take more steps in order to get there.
1404
01:11:01,560 --> 01:11:04,640
Maybe there's a better path this way.
1405
01:11:04,640 --> 01:11:07,200
So instead, we'll explore this route.
1406
01:11:07,200 --> 01:11:11,040
Now if we go one more, this is seven steps plus 14 is 21.
1407
01:11:11,040 --> 01:11:12,960
So between those two, it's sort of a toss-up.
1408
01:11:12,960 --> 01:11:15,120
We might end up exploring that one anyways.
1409
01:11:15,120 --> 01:11:19,280
But after that, as the totals down that path start to get bigger,
1410
01:11:19,280 --> 01:11:21,720
and the heuristic values down this path start to get smaller,
1411
01:11:21,720 --> 01:11:25,240
you'll find that we'll actually keep exploring down this path.
1412
01:11:25,240 --> 01:11:28,400
And you can do the math to see that at every decision point,
1413
01:11:28,400 --> 01:11:31,240
A star search is going to make a choice based
1414
01:11:31,240 --> 01:11:35,200
on the sum of how many steps it took me to get to my current position,
1415
01:11:35,200 --> 01:11:39,320
and then how far I estimate I am from the goal.
1416
01:11:39,320 --> 01:11:41,920
So while we did have to explore some of these states,
1417
01:11:41,920 --> 01:11:46,640
the ultimate solution we found was, in fact, an optimal solution.
1418
01:11:46,640 --> 01:11:50,960
It did find us the quickest possible way to get from the initial state
1419
01:11:50,960 --> 01:11:51,960
to the goal.
1420
01:11:51,960 --> 01:11:55,240
And it turns out that A star is an optimal search algorithm
1421
01:11:55,240 --> 01:11:57,440
under certain conditions.
1422
01:11:57,440 --> 01:12:02,160
So the first condition is that h of n, my heuristic, needs to be admissible.
1423
01:12:02,160 --> 01:12:04,120
What does it mean for a heuristic to be admissible?
1424
01:12:04,120 --> 01:12:08,840
Well, a heuristic is admissible if it never overestimates the true cost.
1425
01:12:08,840 --> 01:12:12,560
h of n always needs to either get it exactly right
1426
01:12:12,560 --> 01:12:16,680
in terms of how far away I am, or it needs to underestimate.
1427
01:12:16,680 --> 01:12:20,800
So we saw an example from before where the heuristic value was much smaller
1428
01:12:20,800 --> 01:12:22,520
than the actual cost it would take.
1429
01:12:22,520 --> 01:12:26,280
That's totally fine, but the heuristic value should never overestimate.
1430
01:12:26,280 --> 01:12:30,720
It should never think that I'm further away from the goal than I actually am.
1431
01:12:30,720 --> 01:12:34,840
And meanwhile, to make a stronger statement, h of n also needs to be
1432
01:12:34,840 --> 01:12:36,160
consistent.
1433
01:12:36,160 --> 01:12:37,960
And what does it mean for it to be consistent?
1434
01:12:37,960 --> 01:12:41,760
Mathematically, it means that for every node, which we'll call n,
1435
01:12:41,760 --> 01:12:43,840
and successor, the node after me, that I'll
1436
01:12:43,840 --> 01:12:48,920
call n prime, where it takes a cost of c to make that step,
1437
01:12:48,920 --> 01:12:52,720
the heuristic value of n needs to be less than or equal to the heuristic
1438
01:12:52,720 --> 01:12:55,240
value of n prime plus the cost.
1439
01:12:55,240 --> 01:12:58,160
So it's a lot of math, but in words what that ultimately means
1440
01:12:58,160 --> 01:13:01,040
is that if I am here at this state right now,
1441
01:13:01,040 --> 01:13:03,640
the heuristic value from me to the goal shouldn't
1442
01:13:03,640 --> 01:13:07,080
be more than the heuristic value of my successor,
1443
01:13:07,080 --> 01:13:10,200
the next place I could go to, plus however much
1444
01:13:10,200 --> 01:13:14,600
it would cost me to just make that step from one step to the next step.
1445
01:13:14,600 --> 01:13:18,600
And so this is just making sure that my heuristic is consistent between all
1446
01:13:18,600 --> 01:13:20,240
of these steps that I might take.
1447
01:13:20,240 --> 01:13:22,680
So as long as this is true, then A star search
1448
01:13:22,680 --> 01:13:25,600
is going to find me an optimal solution.
1449
01:13:25,600 --> 01:13:28,760
And this is where much of the challenge of solving these search problems
1450
01:13:28,760 --> 01:13:32,120
can sometimes come in, that A star search is an algorithm that is known
1451
01:13:32,120 --> 01:13:34,120
and you could write the code fairly easily,
1452
01:13:34,120 --> 01:13:35,800
but it's choosing the heuristic
1453
01:13:35,800 --> 01:13:37,400
that can be the interesting challenge.
1454
01:13:37,400 --> 01:13:39,680
The better the heuristic is, the better I'll
1455
01:13:39,680 --> 01:13:43,000
be able to solve the problem, and the fewer states I'll have to explore.
1456
01:13:43,000 --> 01:13:46,320
And I need to make sure that the heuristic satisfies
1457
01:13:46,320 --> 01:13:48,680
these particular constraints.
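The two constraints just described can be checked concretely. As a minimal sketch (not the course's distribution code), consider the Manhattan-distance heuristic for a grid maze where every step costs 1: it never overestimates the true cost, and moving one square changes the estimate by at most the step cost, which is exactly the consistency condition h(n) ≤ h(n') + c.

```python
def manhattan(state, goal):
    """Heuristic h(n): straight-line grid distance, ignoring walls.
    With unit step costs this never overestimates the true cost."""
    (r1, c1), (r2, c2) = state, goal
    return abs(r1 - r2) + abs(c1 - c2)

goal = (0, 0)
n = (3, 4)        # current node (hypothetical maze coordinates)
n_prime = (2, 4)  # a successor one step away, so step cost c = 1

# Consistency check for this pair: h(n) <= h(n') + c
assert manhattan(n, goal) <= manhattan(n_prime, goal) + 1
```

The check holds for every neighbor pair, because one grid step changes the Manhattan distance by exactly 1 in the best case.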
1458
01:13:48,680 --> 01:13:52,040
So all in all, these are some of the examples of search algorithms
1459
01:13:52,040 --> 01:13:55,200
that might work, and certainly there are many more than just this.
1460
01:13:55,200 --> 01:13:58,720
A star, for example, does have a tendency to use quite a bit of memory.
1461
01:13:58,720 --> 01:14:01,560
So there are alternative approaches to A star
1462
01:14:01,560 --> 01:14:04,680
that ultimately use less memory than this version of A star
1463
01:14:04,680 --> 01:14:07,960
happens to use, and there are other search algorithms
1464
01:14:07,960 --> 01:14:11,640
that are optimized for other cases as well.
1465
01:14:11,640 --> 01:14:14,600
But now so far, we've only been looking at search algorithms
1466
01:14:14,600 --> 01:14:17,040
where there is one agent.
1467
01:14:17,040 --> 01:14:19,640
I am trying to find a solution to a problem.
1468
01:14:19,640 --> 01:14:22,080
I am trying to navigate my way through a maze.
1469
01:14:22,080 --> 01:14:24,080
I am trying to solve a 15 puzzle.
1470
01:14:24,080 --> 01:14:28,360
I am trying to find driving directions from point A to point B.
1471
01:14:28,360 --> 01:14:30,600
Sometimes in search situations, though, we'll
1472
01:14:30,600 --> 01:14:34,560
enter an adversarial situation, where I am an agent trying
1473
01:14:34,560 --> 01:14:36,120
to make intelligent decisions.
1474
01:14:36,120 --> 01:14:39,240
And there's someone else who is fighting against me, so to speak,
1475
01:14:39,240 --> 01:14:41,320
that has opposite objectives, someone where
1476
01:14:41,320 --> 01:14:45,320
I am trying to succeed, someone else that wants me to fail.
1477
01:14:45,320 --> 01:14:49,840
And this is most popular in something like a game, a game like Tic Tac Toe,
1478
01:14:49,840 --> 01:14:53,520
where we've got this 3 by 3 grid, and x and o take turns,
1479
01:14:53,520 --> 01:14:56,280
either writing an x or an o in any one of these squares.
1480
01:14:56,280 --> 01:14:59,760
And the goal is to get three x's in a row if you're the x player,
1481
01:14:59,760 --> 01:15:02,760
or three o's in a row if you're the o player.
1482
01:15:02,760 --> 01:15:05,320
And computers have gotten quite good at playing games,
1483
01:15:05,320 --> 01:15:08,520
Tic Tac Toe very easily, but even more complex games.
1484
01:15:08,520 --> 01:15:12,480
And so you might imagine, what does an intelligent decision in a game
1485
01:15:12,480 --> 01:15:13,440
look like?
1486
01:15:13,440 --> 01:15:17,280
So maybe x makes an initial move in the middle, and o plays up here.
1487
01:15:17,280 --> 01:15:20,480
What does an intelligent move for x now become?
1488
01:15:20,480 --> 01:15:22,520
Where should you move if you were x?
1489
01:15:22,520 --> 01:15:24,840
And it turns out there are a couple of possibilities.
1490
01:15:24,840 --> 01:15:27,240
But if an AI is playing this game optimally,
1491
01:15:27,240 --> 01:15:30,160
then the AI might play somewhere like the upper right,
1492
01:15:30,160 --> 01:15:34,200
where in this situation, o has the opposite objective of x.
1493
01:15:34,200 --> 01:15:37,920
x is trying to win the game to get three in a row diagonally here.
1494
01:15:37,920 --> 01:15:41,440
And o, having the opposite objective, is trying to stop that.
1495
01:15:41,440 --> 01:15:44,000
And so o is going to place here to try to block.
1496
01:15:44,000 --> 01:15:46,400
But now, x has a pretty clever move.
1497
01:15:46,400 --> 01:15:51,000
x can make a move like this, where now x has two possible ways
1498
01:15:51,000 --> 01:15:52,200
that x can win the game.
1499
01:15:52,200 --> 01:15:55,200
x could win the game by getting three in a row across here.
1500
01:15:55,200 --> 01:15:58,520
Or x could win the game by getting three in a row vertically this way.
1501
01:15:58,520 --> 01:16:00,680
So it doesn't matter where o makes their next move.
1502
01:16:00,680 --> 01:16:04,360
o could play here, for example, blocking the three in a row horizontally.
1503
01:16:04,360 --> 01:16:09,360
But then x is going to win the game by getting a three in a row vertically.
1504
01:16:09,360 --> 01:16:11,360
And so there's a fair amount of reasoning that's
1505
01:16:11,360 --> 01:16:14,400
going on here in order for the computer to be able to solve a problem.
1506
01:16:14,400 --> 01:16:17,720
And it's similar in spirit to the problems we've looked at so far.
1507
01:16:17,720 --> 01:16:19,280
There are actions.
1508
01:16:19,280 --> 01:16:21,680
There's some sort of state of the board and some transition
1509
01:16:21,680 --> 01:16:23,360
from one action to the next.
1510
01:16:23,360 --> 01:16:25,640
But it's different in the sense that this is now
1511
01:16:25,640 --> 01:16:29,440
not just a classical search problem, but an adversarial search problem.
1512
01:16:29,440 --> 01:16:32,960
That I am the x player trying to find the best moves to make,
1513
01:16:32,960 --> 01:16:36,680
but I know that there is some adversary that is trying to stop me.
1514
01:16:36,680 --> 01:16:41,280
So we need some sort of algorithm to deal with these adversarial type of search
1515
01:16:41,280 --> 01:16:42,560
situations.
1516
01:16:42,560 --> 01:16:44,520
And the algorithm we're going to take a look at
1517
01:16:44,520 --> 01:16:47,780
is an algorithm called Minimax, which works very well
1518
01:16:47,780 --> 01:16:51,000
for these deterministic games where there are two players.
1519
01:16:51,000 --> 01:16:52,800
It can work for other types of games as well.
1520
01:16:52,800 --> 01:16:55,440
But we'll look right now at games where I make a move,
1521
01:16:55,440 --> 01:16:56,880
then my opponent makes a move.
1522
01:16:56,880 --> 01:17:00,400
And I am trying to win, and my opponent is trying to win also.
1523
01:17:00,400 --> 01:17:04,120
Or in other words, my opponent is trying to get me to lose.
1524
01:17:04,120 --> 01:17:07,120
And so what do we need in order to make this algorithm work?
1525
01:17:07,120 --> 01:17:10,960
Well, any time we try and translate this human concept of playing a game,
1526
01:17:10,960 --> 01:17:14,100
winning and losing to a computer, we want to translate it
1527
01:17:14,100 --> 01:17:16,360
in terms that the computer can understand.
1528
01:17:16,360 --> 01:17:19,880
And ultimately, the computer really just understands the numbers.
1529
01:17:19,880 --> 01:17:23,920
And so we want some way of translating a game of x's and o's on a grid
1530
01:17:23,920 --> 01:17:26,640
to something numerical, something the computer can understand.
1531
01:17:26,640 --> 01:17:30,480
The computer doesn't normally understand notions of win or lose.
1532
01:17:30,480 --> 01:17:34,560
But it does understand the concept of bigger and smaller.
1533
01:17:34,560 --> 01:17:38,240
And so what we might do is we might take each of the possible ways
1534
01:17:38,240 --> 01:17:43,280
that a tic-tac-toe game can unfold and assign a value or a utility
1535
01:17:43,280 --> 01:17:45,240
to each one of those possible ways.
1536
01:17:45,240 --> 01:17:47,960
And in a tic-tac-toe game, and in many types of games,
1537
01:17:47,960 --> 01:17:49,960
there are three possible outcomes.
1538
01:17:49,960 --> 01:17:54,360
The outcomes are o wins, x wins, or nobody wins.
1539
01:17:54,360 --> 01:17:58,560
So player one wins, player two wins, or nobody wins.
1540
01:17:58,560 --> 01:18:02,840
And for now, let's go ahead and assign each of these possible outcomes
1541
01:18:02,840 --> 01:18:04,040
a different value.
1542
01:18:04,040 --> 01:18:07,400
We'll say o winning, that'll have a value of negative 1.
1543
01:18:07,400 --> 01:18:09,800
Nobody winning, that'll have a value of 0.
1544
01:18:09,800 --> 01:18:13,000
And x winning, that will have a value of 1.
1545
01:18:13,000 --> 01:18:17,000
So we've just assigned numbers to each of these three possible outcomes.
1546
01:18:17,000 --> 01:18:22,440
And now we have two players, we have the x player and the o player.
1547
01:18:22,440 --> 01:18:26,360
And we're going to go ahead and call the x player the max player.
1548
01:18:26,360 --> 01:18:29,160
And we'll call the o player the min player.
1549
01:18:29,160 --> 01:18:32,080
And the reason why is because in the minimax algorithm,
1550
01:18:32,080 --> 01:18:37,520
the max player, which in this case is x, is aiming to maximize the score.
1551
01:18:37,520 --> 01:18:40,880
These are the possible options for the score, negative 1, 0, and 1.
1552
01:18:40,880 --> 01:18:44,560
x wants to maximize the score, meaning if at all possible,
1553
01:18:44,560 --> 01:18:48,040
x would like this situation, where x wins the game,
1554
01:18:48,040 --> 01:18:49,760
and we give it a score of 1.
1555
01:18:49,760 --> 01:18:54,000
But if this isn't possible, if x needs to choose between these two options,
1556
01:18:54,000 --> 01:18:58,080
negative 1, meaning o winning, or 0, meaning nobody winning,
1557
01:18:58,080 --> 01:19:01,720
x would rather that nobody wins, score of 0,
1558
01:19:01,720 --> 01:19:04,400
than a score of negative 1, o winning.
1559
01:19:04,400 --> 01:19:07,240
So this notion of winning and losing and tying
1560
01:19:07,240 --> 01:19:12,240
has been reduced mathematically to just this idea of trying to maximize the score.
1561
01:19:12,240 --> 01:19:16,080
The x player always wants the score to be bigger.
1562
01:19:16,080 --> 01:19:19,040
And on the flip side, the min player, in this case o,
1563
01:19:19,040 --> 01:19:20,760
is aiming to minimize the score.
1564
01:19:20,760 --> 01:19:25,640
The o player wants the score to be as small as possible.
1565
01:19:25,640 --> 01:19:29,000
So now we've taken this game of x's and o's and winning and losing
1566
01:19:29,000 --> 01:19:30,760
and turned it into something mathematical,
1567
01:19:30,760 --> 01:19:33,480
something where x is trying to maximize the score,
1568
01:19:33,480 --> 01:19:35,640
o is trying to minimize the score.
1569
01:19:35,640 --> 01:19:37,760
Let's now look at all of the parts of the game
1570
01:19:37,760 --> 01:19:40,800
that we need in order to encode it in an AI
1571
01:19:40,800 --> 01:19:44,880
so that an AI can play a game like tic-tac-toe.
1572
01:19:44,880 --> 01:19:46,920
So the game is going to need a couple of things.
1573
01:19:46,920 --> 01:19:50,680
We'll need some sort of initial state that will, in this case, call s0,
1574
01:19:50,680 --> 01:19:54,880
which is how the game begins, like an empty tic-tac-toe board, for example.
1575
01:19:54,880 --> 01:20:00,080
We'll also need a function called player, where the player function
1576
01:20:00,080 --> 01:20:04,280
is going to take as input a state here represented by s.
1577
01:20:04,280 --> 01:20:09,600
And the output of the player function is going to be which player's turn is it.
1578
01:20:09,600 --> 01:20:12,520
We need to be able to give a tic-tac-toe board to the computer,
1579
01:20:12,520 --> 01:20:16,600
run it through a function, and that function tells us whose turn it is.
1580
01:20:16,600 --> 01:20:19,040
We'll need some notion of actions that we can take.
1581
01:20:19,040 --> 01:20:21,120
We'll see examples of that in just a moment.
1582
01:20:21,120 --> 01:20:24,080
We need some notion of a transition model, same as before.
1583
01:20:24,080 --> 01:20:26,320
If I have a state and I take an action, I
1584
01:20:26,320 --> 01:20:29,200
need to know what results as a consequence of it.
1585
01:20:29,200 --> 01:20:31,960
I need some way of knowing when the game is over.
1586
01:20:31,960 --> 01:20:34,040
So this is equivalent to kind of like a goal test,
1587
01:20:34,040 --> 01:20:36,480
but I need some terminal test, some way to check
1588
01:20:36,480 --> 01:20:40,760
to see if a state is a terminal state, where a terminal state means the game is
1589
01:20:40,760 --> 01:20:41,440
over.
1590
01:20:41,440 --> 01:20:44,960
In a classic game of tic-tac-toe, a terminal state
1591
01:20:44,960 --> 01:20:47,480
means either someone has gotten three in a row
1592
01:20:47,480 --> 01:20:50,200
or all of the squares of the tic-tac-toe board are filled.
1593
01:20:50,200 --> 01:20:52,920
Either of those conditions make it a terminal state.
1594
01:20:52,920 --> 01:20:55,840
In a game of chess, it might be something like when there is checkmate
1595
01:20:55,840 --> 01:21:00,440
or if checkmate is no longer possible, that becomes a terminal state.
1596
01:21:00,440 --> 01:21:04,560
And then finally, we'll need a utility function, a function that takes a state
1597
01:21:04,560 --> 01:21:08,040
and gives us a numerical value for that terminal state, some way of saying
1598
01:21:08,040 --> 01:21:10,680
if x wins the game, that has a value of 1.
1599
01:21:10,680 --> 01:21:13,200
If o has won the game, that has a value of negative 1.
1600
01:21:13,200 --> 01:21:16,520
If nobody has won the game, that has a value of 0.
1601
01:21:16,520 --> 01:21:18,840
So let's take a look at each of these in turn.
1602
01:21:18,840 --> 01:21:23,240
The initial state, we can just represent in tic-tac-toe as the empty game board.
1603
01:21:23,240 --> 01:21:24,480
This is where we begin.
1604
01:21:24,480 --> 01:21:27,200
It's the place from which we begin this search.
1605
01:21:27,200 --> 01:21:29,600
And again, I'll be representing these things visually,
1606
01:21:29,600 --> 01:21:32,120
but you can imagine this really just being like an array
1607
01:21:32,120 --> 01:21:36,240
or a two-dimensional array of all of these possible squares.
1608
01:21:36,240 --> 01:21:39,640
Then we need the player function that, again, takes a state
1609
01:21:39,640 --> 01:21:41,360
and tells us whose turn it is.
1610
01:21:41,360 --> 01:21:44,800
Assuming x makes the first move, if I have an empty game board,
1611
01:21:44,800 --> 01:21:47,640
then my player function is going to return x.
1612
01:21:47,640 --> 01:21:49,840
And if I have a game board where x has made a move,
1613
01:21:49,840 --> 01:21:52,520
then my player function is going to return o.
1614
01:21:52,520 --> 01:21:54,960
The player function takes a tic-tac-toe game board
1615
01:21:54,960 --> 01:21:58,320
and tells us whose turn it is.
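The player function just described can be sketched in a few lines. This is a hedged illustration, not the course's distribution code: it assumes the board is a 3x3 list of lists holding "X", "O", or None, and that x always moves first, as stated above.

```python
def player(board):
    """Return which player ("X" or "O") moves next in this state."""
    x_count = sum(row.count("X") for row in board)
    o_count = sum(row.count("O") for row in board)
    # X moves first, so X is to move whenever the counts are equal.
    return "X" if x_count == o_count else "O"

empty = [[None] * 3 for _ in range(3)]
assert player(empty) == "X"   # empty board: x moves first
```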
1616
01:21:58,320 --> 01:22:01,000
Next up, we'll consider the actions function.
1617
01:22:01,000 --> 01:22:04,080
The actions function, much like it did in classical search,
1618
01:22:04,080 --> 01:22:08,000
takes a state and gives us the set of all of the possible actions
1619
01:22:08,000 --> 01:22:10,520
we can take in that state.
1620
01:22:10,520 --> 01:22:15,480
So let's imagine it's o's turn to move in a game board that looks like this.
1621
01:22:15,480 --> 01:22:18,240
What happens when we pass it into the actions function?
1622
01:22:18,240 --> 01:22:22,120
So the actions function takes this state of the game as input,
1623
01:22:22,120 --> 01:22:25,000
and the output is a set of possible actions.
1624
01:22:25,000 --> 01:22:27,560
It's a set: I could move in the upper left
1625
01:22:27,560 --> 01:22:29,720
or I could move in the bottom middle.
1626
01:22:29,720 --> 01:22:31,720
So those are the two possible action choices
1627
01:22:31,720 --> 01:22:36,320
that I have when I begin in this particular state.
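As a sketch of the actions function under the same assumed representation (a 3x3 list of lists with "X", "O", or None), the set of possible actions is just the set of empty squares as (row, column) pairs:

```python
def actions(board):
    """Return the set of all possible moves (i, j) in this state."""
    return {(i, j)
            for i in range(3)
            for j in range(3)
            if board[i][j] is None}

# A board with only two open squares yields exactly two actions.
board = [["X", "O", "X"],
         ["X", "O", None],
         [None, "X", "O"]]
assert actions(board) == {(1, 2), (2, 0)}
```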
1628
01:22:36,320 --> 01:22:39,240
Now, just as before, when we had states and actions,
1629
01:22:39,240 --> 01:22:41,600
we need some sort of transition model to tell us
1630
01:22:41,600 --> 01:22:45,520
when we take this action in the state, what is the new state that we get.
1631
01:22:45,520 --> 01:22:48,200
And here, we define that using the result function
1632
01:22:48,200 --> 01:22:51,600
that takes a state as input as well as an action.
1633
01:22:51,600 --> 01:22:54,640
And when we apply the result function to this state,
1634
01:22:54,640 --> 01:22:58,040
saying that let's let o move in this upper left corner,
1635
01:22:58,040 --> 01:23:01,480
the new state we get is this resulting state where o is in the upper left
1636
01:23:01,480 --> 01:23:02,040
corner.
1637
01:23:02,040 --> 01:23:04,800
And now, this seems obvious to someone who knows how to play tic-tac-toe.
1638
01:23:04,800 --> 01:23:06,840
Of course, you play in the upper left corner.
1639
01:23:06,840 --> 01:23:07,960
That's the board you get.
1640
01:23:07,960 --> 01:23:11,360
But all of this information needs to be encoded into the AI.
1641
01:23:11,360 --> 01:23:14,120
The AI doesn't know how to play tic-tac-toe until you
1642
01:23:14,120 --> 01:23:17,280
tell the AI how the rules of tic-tac-toe work.
1643
01:23:17,280 --> 01:23:19,760
And this function, defining this function here,
1644
01:23:19,760 --> 01:23:23,200
allows us to tell the AI how this game actually works
1645
01:23:23,200 --> 01:23:27,320
and how actions actually affect the outcome of the game.
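A sketch of that transition model follows, again assuming the 3x3 list-of-lists board. One simplification to flag: the lecture's result(s, a) takes only a state and an action and would derive whose mark to place via the player function; passing the mark explicitly here keeps the sketch self-contained. Copying the board rather than mutating it matters, because search needs the original state intact to explore other branches.

```python
import copy

def result(board, action, mark):
    """Transition model: the new board that results from placing
    `mark` ("X" or "O") at `action`, a (row, column) pair."""
    i, j = action
    if board[i][j] is not None:
        raise ValueError("square is already taken")
    new_board = copy.deepcopy(board)  # leave the original state untouched
    new_board[i][j] = mark
    return new_board
```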
1646
01:23:27,320 --> 01:23:29,720
So the AI needs to know how the game works.
1647
01:23:29,720 --> 01:23:32,240
The AI also needs to know when the game is over,
1648
01:23:32,240 --> 01:23:36,640
and we do that by defining a function called terminal that takes as input a state s,
1649
01:23:36,640 --> 01:23:39,360
such that if we take a game that is not yet over,
1650
01:23:39,360 --> 01:23:42,280
pass it into the terminal function, the output is false.
1651
01:23:42,280 --> 01:23:43,680
The game is not over.
1652
01:23:43,680 --> 01:23:47,320
But if we take a game that is over because x has gotten three in a row
1653
01:23:47,320 --> 01:23:50,400
along that diagonal, pass that into the terminal function,
1654
01:23:50,400 --> 01:23:55,040
then the output is going to be true because the game now is, in fact, over.
1655
01:23:55,040 --> 01:23:58,160
And finally, we've told the AI how the game works
1656
01:23:58,160 --> 01:24:01,320
in terms of what moves can be made and what happens when you make those moves.
1657
01:24:01,320 --> 01:24:03,320
We've told the AI when the game is over.
1658
01:24:03,320 --> 01:24:07,400
Now we need to tell the AI what the value of each of those states is.
1659
01:24:07,400 --> 01:24:11,320
And we do that by defining this utility function that takes a state s
1660
01:24:11,320 --> 01:24:14,880
and tells us the score or the utility of that state.
1661
01:24:14,880 --> 01:24:18,880
So again, we said that if x wins the game, that utility is a value of 1,
1662
01:24:18,880 --> 01:24:23,480
whereas if o wins the game, then the utility of that is negative 1.
1663
01:24:23,480 --> 01:24:26,360
And the AI needs to know, for each of these terminal states
1664
01:24:26,360 --> 01:24:30,840
where the game is over, what is the utility of that state?
1665
01:24:30,840 --> 01:24:34,560
So if I give you a game board like this where the game is, in fact, over,
1666
01:24:34,560 --> 01:24:38,840
and I ask the AI to tell me what the value of that state is, it could do so.
1667
01:24:38,840 --> 01:24:42,000
The value of the state is 1.
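The terminal and utility functions just described can be sketched together, assuming the same 3x3 list-of-lists board; the winner helper is an invented convenience for the sketch, not something the lecture names.

```python
def winner(board):
    """Return "X" or "O" if someone has three in a row, else None."""
    lines = [[(i, 0), (i, 1), (i, 2)] for i in range(3)]      # rows
    lines += [[(0, j), (1, j), (2, j)] for j in range(3)]     # columns
    lines += [[(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)]]
    for line in lines:
        marks = {board[i][j] for i, j in line}
        if len(marks) == 1 and None not in marks:
            return marks.pop()
    return None

def terminal(board):
    """True if the game is over: someone won, or every square is filled."""
    if winner(board) is not None:
        return True
    return all(square is not None for row in board for square in row)

def utility(board):
    """Score of a terminal board: 1 if X won, -1 if O won, 0 otherwise."""
    w = winner(board)
    return 1 if w == "X" else -1 if w == "O" else 0
```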
1668
01:24:42,000 --> 01:24:46,400
Where things get interesting, though, is if the game is not yet over.
1669
01:24:46,400 --> 01:24:49,480
Let's imagine a game board like this, where in the middle of the game,
1670
01:24:49,480 --> 01:24:52,000
it's o's turn to make a move.
1671
01:24:52,000 --> 01:24:54,120
So how do we know it's o's turn to make a move?
1672
01:24:54,120 --> 01:24:56,480
We can calculate that using the player function.
1673
01:24:56,480 --> 01:25:00,120
We can say player of s, pass in the state, o is the answer.
1674
01:25:00,120 --> 01:25:02,280
So we know it's o's turn to move.
1675
01:25:02,280 --> 01:25:06,800
And now, what is the value of this board and what action should o take?
1676
01:25:06,800 --> 01:25:08,080
Well, that's going to depend.
1677
01:25:08,080 --> 01:25:09,880
We have to do some calculation here.
1678
01:25:09,880 --> 01:25:13,600
And this is where the minimax algorithm really comes in.
1679
01:25:13,600 --> 01:25:16,720
Recall that x is trying to maximize the score, which
1680
01:25:16,720 --> 01:25:19,840
means that o is trying to minimize the score.
1681
01:25:19,840 --> 01:25:22,800
So o would like to minimize the total value
1682
01:25:22,800 --> 01:25:25,000
that we get at the end of the game.
1683
01:25:25,000 --> 01:25:27,440
And because this game isn't over yet, we don't really
1684
01:25:27,440 --> 01:25:30,960
know just yet what the value of this game board is.
1685
01:25:30,960 --> 01:25:34,320
We have to do some calculation in order to figure that out.
1686
01:25:34,320 --> 01:25:36,560
And so how do we do that kind of calculation?
1687
01:25:36,560 --> 01:25:39,160
Well, in order to do so, we're going to consider,
1688
01:25:39,160 --> 01:25:41,800
just as we might in a classical search situation,
1689
01:25:41,800 --> 01:25:46,160
what actions could happen next and what states will that take us to.
1690
01:25:46,160 --> 01:25:50,120
And it turns out that in this position, there are only two open squares,
1691
01:25:50,120 --> 01:25:54,760
which means there are only two open places where o can make a move.
1692
01:25:54,760 --> 01:25:57,200
o could either make a move in the upper left
1693
01:25:57,200 --> 01:26:00,280
or o can make a move in the bottom middle.
1694
01:26:00,280 --> 01:26:03,080
And minimax doesn't know right out of the box which of those moves
1695
01:26:03,080 --> 01:26:04,360
is going to be better.
1696
01:26:04,360 --> 01:26:06,640
So it's going to consider both.
1697
01:26:06,640 --> 01:26:08,560
But now, we sort of run into the same situation.
1698
01:26:08,560 --> 01:26:11,280
Now, I have two more game boards, neither of which is over.
1699
01:26:11,280 --> 01:26:12,720
What happens next?
1700
01:26:12,720 --> 01:26:14,520
And now, it's in this sense that minimax is
1701
01:26:14,520 --> 01:26:16,800
what we'll call a recursive algorithm.
1702
01:26:16,800 --> 01:26:20,480
It's going to now repeat the exact same process,
1703
01:26:20,480 --> 01:26:23,760
although now considering it from the opposite perspective.
1704
01:26:23,760 --> 01:26:27,400
It's as if I am now going to put myself, if I am the o player,
1705
01:26:27,400 --> 01:26:31,680
I'm going to put myself in my opponent's shoes, my opponent as the x player,
1706
01:26:31,680 --> 01:26:36,160
and consider what would my opponent do if they were in this position?
1707
01:26:36,160 --> 01:26:40,200
What would my opponent do, the x player, if they were in that position?
1708
01:26:40,200 --> 01:26:41,600
And what would then happen?
1709
01:26:41,600 --> 01:26:44,400
Well, the other player, my opponent, the x player,
1710
01:26:44,400 --> 01:26:46,920
is trying to maximize the score, whereas I
1711
01:26:46,920 --> 01:26:49,400
am trying to minimize the score as the o player.
1712
01:26:49,400 --> 01:26:53,680
So x is trying to find the maximum possible value that they can get.
1713
01:26:53,680 --> 01:26:55,520
And so what's going to happen?
1714
01:26:55,520 --> 01:26:58,920
Well, from this board position, x only has one choice.
1715
01:26:58,920 --> 01:27:01,720
x is going to play here, and they're going to get three in a row.
1716
01:27:01,720 --> 01:27:05,680
And we know that that board, x winning, that has a value of 1.
1717
01:27:05,680 --> 01:27:09,360
If x wins the game, the value of that game board is 1.
1718
01:27:09,360 --> 01:27:14,120
And so from this position, if this state can only ever
1719
01:27:14,120 --> 01:27:16,720
lead to this state, it's the only possible option,
1720
01:27:16,720 --> 01:27:21,120
and this state has a value of 1, then the maximum possible value
1721
01:27:21,120 --> 01:27:24,560
that the x player can get from this game board is also 1.
1722
01:27:24,560 --> 01:27:27,800
From here, the only place we can get is to a game with a value of 1,
1723
01:27:27,800 --> 01:27:31,400
so this game board also has a value of 1.
1724
01:27:31,400 --> 01:27:33,680
Now we consider this one over here.
1725
01:27:33,680 --> 01:27:34,960
What's going to happen now?
1726
01:27:34,960 --> 01:27:36,480
Well, x needs to make a move.
1727
01:27:36,480 --> 01:27:39,680
The only move x can make is in the upper left, so x will go there.
1728
01:27:39,680 --> 01:27:41,400
And in this game, no one wins the game.
1729
01:27:41,400 --> 01:27:42,960
Nobody has three in a row.
1730
01:27:42,960 --> 01:27:45,760
And so the value of that game board is 0.
1731
01:27:45,760 --> 01:27:47,040
Nobody has won.
1732
01:27:47,040 --> 01:27:50,280
And so again, by the same logic, if from this board position
1733
01:27:50,280 --> 01:27:53,920
the only place we can get to is a board where the value is 0,
1734
01:27:53,920 --> 01:27:57,440
then this state must also have a value of 0.
1735
01:27:57,440 --> 01:28:01,520
And now here comes the choice part, the idea of trying to minimize.
1736
01:28:01,520 --> 01:28:05,760
I, as the o player, now know that if I make this choice moving in the upper
1737
01:28:05,760 --> 01:28:09,320
left, that is going to result in a game with a value of 1,
1738
01:28:09,320 --> 01:28:11,400
assuming everyone plays optimally.
1739
01:28:11,400 --> 01:28:13,200
And if I instead play in the lower middle,
1740
01:28:13,200 --> 01:28:15,480
choose this fork in the road, that is going
1741
01:28:15,480 --> 01:28:17,640
to result in a game board with a value of 0.
1742
01:28:17,640 --> 01:28:18,760
I have two options.
1743
01:28:18,760 --> 01:28:22,400
I have a 1 and a 0 to choose from, and I need to pick.
1744
01:28:22,400 --> 01:28:25,200
And as the min player, I would rather choose
1745
01:28:25,200 --> 01:28:27,000
the option with the minimum value.
1746
01:28:27,000 --> 01:28:29,220
So whenever a player has multiple choices,
1747
01:28:29,220 --> 01:28:32,160
the min player will choose the option with the smallest value.
1748
01:28:32,160 --> 01:28:34,880
The max player will choose the option with the largest value.
1749
01:28:34,880 --> 01:28:37,520
Between the 1 and the 0, the 0 is smaller,
1750
01:28:37,520 --> 01:28:40,760
meaning I'd rather tie the game than lose the game.
1751
01:28:40,760 --> 01:28:44,200
And so this game board will say also has a value of 0,
1752
01:28:44,200 --> 01:28:48,400
because if I am playing optimally, I will pick this fork in the road.
1753
01:28:48,400 --> 01:28:53,000
I'll place my o here to block x's 3 in a row, x will move in the upper left,
1754
01:28:53,000 --> 01:28:56,440
and the game will be over, and no one will have won the game.
1755
01:28:56,440 --> 01:29:00,400
So this is now the logic of minimax, to consider all of the possible options
1756
01:29:00,400 --> 01:29:03,280
that I can take, all of the actions that I can take,
1757
01:29:03,280 --> 01:29:05,540
and then to put myself in my opponent's shoes.
1758
01:29:05,540 --> 01:29:08,680
I decide what move I'm going to make now by considering
1759
01:29:08,680 --> 01:29:11,000
what move my opponent will make on the next turn.
1760
01:29:11,000 --> 01:29:14,360
And to do that, I consider what move I would make on the turn after that,
1761
01:29:14,360 --> 01:29:17,240
so on and so forth, until I get all the way down
1762
01:29:17,240 --> 01:29:21,200
to the end of the game, to one of these so-called terminal states.
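The shape of that recursion, which checks for a terminal state, otherwise recurses over every action, and lets the mover take the max or min, can be shown on a toy game far smaller than tic-tac-toe. This is an invented stand-in, not from the course: players alternately take 1 or 2 stones, and whoever takes the last stone wins.

```python
def minimax_value(stones, max_to_move):
    """Value of the toy game under optimal play by both sides.
    Returns 1 if the MAX player wins, -1 if the MIN player wins."""
    if stones == 0:
        # Terminal state: the previous player took the last stone and
        # won, so the player now to move has lost.
        return -1 if max_to_move else 1
    values = [minimax_value(stones - take, not max_to_move)
              for take in (1, 2) if take <= stones]
    # The max player picks the largest value, the min player the smallest.
    return max(values) if max_to_move else min(values)
```

With 4 stones the first player can force a win by taking 1, leaving the opponent a losing position of 3; this is the same backward reasoning from terminal states that minimax applies to the much larger tic-tac-toe tree.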
1763
01:29:21,200 --> 01:29:25,280
In fact, this very decision point, where I am trying to decide as the o player
1764
01:29:25,280 --> 01:29:27,360
what to make a decision about, might have just
1765
01:29:27,360 --> 01:29:31,640
been a part of the logic that the x player, my opponent, was using,
1766
01:29:31,640 --> 01:29:32,520
the move before me.
1767
01:29:32,520 --> 01:29:35,320
This might be part of some larger tree, where
1768
01:29:35,320 --> 01:29:37,720
x is trying to make a move in this situation,
1769
01:29:37,720 --> 01:29:40,300
and needs to pick between three different options in order
1770
01:29:40,300 --> 01:29:42,560
to make a decision about what to happen.
1771
01:29:42,560 --> 01:29:45,400
And the further and further away we are from the end of the game,
1772
01:29:45,400 --> 01:29:47,240
the deeper this tree has to go.
1773
01:29:47,240 --> 01:29:51,760
Because every level in this tree is going to correspond to one move,
1774
01:29:51,760 --> 01:29:55,040
one move or action that I take, one move or action
1775
01:29:55,040 --> 01:29:58,480
that my opponent takes, in order to decide what happens.
1776
01:29:58,480 --> 01:30:02,120
And in fact, it turns out that if I am the x player in this position,
1777
01:30:02,120 --> 01:30:05,480
and I recursively do the logic, I see that I have three choices,
1778
01:30:05,480 --> 01:30:08,240
one of which leads to a value of 0.
1779
01:30:08,240 --> 01:30:12,040
If I play here, and if everyone plays optimally, the game will be a tie.
1780
01:30:12,040 --> 01:30:17,120
If I play here, then o is going to win, and I'll lose playing optimally.
1781
01:30:17,120 --> 01:30:21,520
Or here, where I, the x player, can win, well between a score of 0,
1782
01:30:21,520 --> 01:30:25,200
and negative 1, and 1, I'd rather pick the board with a value of 1,
1783
01:30:25,200 --> 01:30:27,320
because that's the maximum value I can get.
1784
01:30:27,320 --> 01:30:31,680
And so this board would also have a maximum value of 1.
1785
01:30:31,680 --> 01:30:35,160
And so this tree can get very, very deep, especially as the game
1786
01:30:35,160 --> 01:30:37,520
starts to have more and more moves.
1787
01:30:37,520 --> 01:30:39,600
And this logic works not just for tic-tac-toe,
1788
01:30:39,600 --> 01:30:41,880
but any of these sorts of games, where I make a move,
1789
01:30:41,880 --> 01:30:44,120
my opponent makes a move, and ultimately, we
1790
01:30:44,120 --> 01:30:46,600
have these adversarial objectives.
1791
01:30:46,600 --> 01:30:50,480
And we can simplify the diagram into a diagram that looks like this.
1792
01:30:50,480 --> 01:30:53,360
This is a more abstract version of the minimax tree,
1793
01:30:53,360 --> 01:30:56,040
where these are each states, but I'm no longer representing them
1794
01:30:56,040 --> 01:30:57,960
as exactly like tic-tac-toe boards.
1795
01:30:57,960 --> 01:31:01,840
This is just representing some generic game that might be tic-tac-toe,
1796
01:31:01,840 --> 01:31:04,280
might be some other game altogether.
1797
01:31:04,280 --> 01:31:06,720
Any of these green arrows that are pointing up,
1798
01:31:06,720 --> 01:31:08,720
that represents a maximizing state.
1799
01:31:08,720 --> 01:31:11,440
I would like the score to be as big as possible.
1800
01:31:11,440 --> 01:31:13,560
And any of these red arrows pointing down,
1801
01:31:13,560 --> 01:31:16,720
those are minimizing states, where the player is the min player,
1802
01:31:16,720 --> 01:31:20,320
and they are trying to make the score as small as possible.
1803
01:31:20,320 --> 01:31:24,320
So if you imagine in this situation, I am the maximizing player, this player
1804
01:31:24,320 --> 01:31:26,600
here, and I have three choices.
1805
01:31:26,600 --> 01:31:30,320
One choice gives me a score of 5, one choice gives me a score of 3,
1806
01:31:30,320 --> 01:31:32,360
and one choice gives me a score of 9.
1807
01:31:32,360 --> 01:31:36,200
Well, then between those three choices, my best option
1808
01:31:36,200 --> 01:31:38,920
is to choose this 9 over here, the score that
1809
01:31:38,920 --> 01:31:42,120
maximizes my options out of all the three options.
1810
01:31:42,120 --> 01:31:46,720
And so I can give this state a value of 9, because among my three options,
1811
01:31:46,720 --> 01:31:50,480
that is the best choice that I have available to me.
1812
01:31:50,480 --> 01:31:51,960
So that's my decision now.
1813
01:31:51,960 --> 01:31:55,640
You imagine it's like one move away from the end of the game.
1814
01:31:55,640 --> 01:31:57,800
But then you could also ask a reasonable question,
1815
01:31:57,800 --> 01:32:01,480
what might my opponent do two moves away from the end of the game?
1816
01:32:01,480 --> 01:32:03,160
My opponent is the minimizing player.
1817
01:32:03,160 --> 01:32:05,840
They are trying to make the score as small as possible.
1818
01:32:05,840 --> 01:32:09,840
Imagine what would have happened if they had to pick which choice to make.
1819
01:32:09,840 --> 01:32:13,520
One choice leads us to this state, where I, the maximizing player,
1820
01:32:13,520 --> 01:32:16,960
am going to opt for 9, the biggest score that I can get.
1821
01:32:16,960 --> 01:32:21,280
And 1 leads to this state, where I, the maximizing player,
1822
01:32:21,280 --> 01:32:25,040
would choose 8, which is then the largest score that I can get.
1823
01:32:25,040 --> 01:32:28,920
Now the minimizing player, forced to choose between a 9 or an 8,
1824
01:32:28,920 --> 01:32:31,200
is going to choose the smallest possible score,
1825
01:32:31,200 --> 01:32:33,240
which in this case is an 8.
1826
01:32:33,240 --> 01:32:35,480
And that is then how this process would unfold,
1827
01:32:35,480 --> 01:32:39,160
that the minimizing player in this case considers both of their options,
1828
01:32:39,160 --> 01:32:43,720
and then all of the options that would happen as a result of that.
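That two-ply reasoning can be checked in a couple of lines of Python. The 5, 3, 9 scores and the 8 come from the example above; the second branch's other leaves aren't stated in the walkthrough, so only its maximum, 8, is used here:

```python
# The max player's value for each of the min player's two successor
# states, taken from the example: max(5, 3, 9) = 9 in one branch, and
# 8 in the other (only that branch's maximum is given above).
branch_values = [max([5, 3, 9]), 8]

# The minimizing player, forced to choose between a 9 and an 8,
# takes the smaller one.
forced_outcome = min(branch_values)
```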
1829
01:32:43,720 --> 01:32:47,560
So this now is a general picture of what the minimax algorithm looks like.
1830
01:32:47,560 --> 01:32:50,760
Let's now try to formalize it using a little bit of pseudocode.
1831
01:32:50,760 --> 01:32:53,880
So what exactly is happening in the minimax algorithm?
1832
01:32:53,880 --> 01:32:57,640
Well, given a state s, we need to decide what ought to happen.
1833
01:32:57,640 --> 01:33:00,960
If it's the max player's turn,
1834
01:33:00,960 --> 01:33:05,360
then max is going to pick an action a in actions of s.
1835
01:33:05,360 --> 01:33:08,240
Recall that actions is a function that takes a state
1836
01:33:08,240 --> 01:33:11,080
and gives me back all of the possible actions that I can take.
1837
01:33:11,080 --> 01:33:15,000
It tells me all of the moves that are possible.
1838
01:33:15,000 --> 01:33:19,560
The max player is going to specifically pick an action a in this set of actions
1839
01:33:19,560 --> 01:33:26,360
that gives me the highest value of min value of result of s and a.
1840
01:33:26,360 --> 01:33:27,480
So what does that mean?
1841
01:33:27,480 --> 01:33:30,120
Well, it means that I want to make the option that
1842
01:33:30,120 --> 01:33:34,000
gives me the highest score of all of the actions a.
1843
01:33:34,000 --> 01:33:35,760
But what score is that going to have?
1844
01:33:35,760 --> 01:33:38,920
To calculate that, I need to know what my opponent, the min player,
1845
01:33:38,920 --> 01:33:44,520
is going to do if they try to minimize the value of the state that results.
1846
01:33:44,520 --> 01:33:48,400
So we say, what state results after I take this action?
1847
01:33:48,400 --> 01:33:53,720
And what happens when the min player tries to minimize the value of that state?
1848
01:33:53,720 --> 01:33:56,440
I consider that for all of my possible options.
1849
01:33:56,440 --> 01:33:58,960
And after I've considered that for all of my possible options,
1850
01:33:58,960 --> 01:34:02,800
I pick the action a that has the highest value.
1851
01:34:02,800 --> 01:34:06,240
Likewise, the min player is going to do the same thing but backwards.
1852
01:34:06,240 --> 01:34:09,320
They're also going to consider what are all of the possible actions they
1853
01:34:09,320 --> 01:34:10,960
can take if it's their turn.
1854
01:34:10,960 --> 01:34:12,880
And they're going to pick the action a that
1855
01:34:12,880 --> 01:34:16,160
has the smallest possible value of all the options.
1856
01:34:16,160 --> 01:34:19,120
And the way they know what the smallest possible value of all the options
1857
01:34:19,120 --> 01:34:24,480
is, is by considering what the max player is going to do, by saying,
1858
01:34:24,480 --> 01:34:27,880
what's the result of applying this action to the current state?
1859
01:34:27,880 --> 01:34:29,800
And then what would the max player try to do?
1860
01:34:29,800 --> 01:34:34,040
What value would the max player calculate for that particular state?
1861
01:34:34,040 --> 01:34:36,400
So everyone makes their decision based on trying
1862
01:34:36,400 --> 01:34:39,920
to estimate what the other person would do.
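As a sketch in Python, that decision rule might look like the following. The dictionary game and the helper definitions are illustrative stand-ins for the lecture's actions/result interface, in a game so short that the min player's reply ends it:

```python
# Toy game: max picks a branch, then min picks a leaf score inside it.
# The numbers here are illustrative, not from a real game.
game = {"left": [4, 8, 5], "middle": [9, 3, 7], "right": [2, 4, 6]}

def actions(state):
    # All moves available in this state.
    return list(state.keys())

def result(state, action):
    # The state that results from taking `action` in `state`.
    return state[action]

def min_value(leaves):
    # One move from the end, so min simply takes the smallest leaf.
    return min(leaves)

def minimax_decision(state):
    # Max picks the action a that maximizes min_value(result(state, a)).
    return max(actions(state), key=lambda a: min_value(result(state, a)))
```

Here min would hold max to 4, 3, or 2 in the three branches, so the maximizer's best choice is the branch worth 4.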
1863
01:34:39,920 --> 01:34:43,680
And now we need to turn our attention to these two functions, max value
1864
01:34:43,680 --> 01:34:44,680
and min value.
1865
01:34:44,680 --> 01:34:47,840
How do you actually calculate the value of a state
1866
01:34:47,840 --> 01:34:50,000
if you're trying to maximize its value?
1867
01:34:50,000 --> 01:34:52,080
And how do you calculate the value of a state
1868
01:34:52,080 --> 01:34:53,840
if you're trying to minimize the value?
1869
01:34:53,840 --> 01:34:56,880
If you can do that, then we have an entire implementation
1870
01:34:56,880 --> 01:34:58,960
of this minimax algorithm.
1871
01:34:58,960 --> 01:34:59,640
So let's try it.
1872
01:34:59,640 --> 01:35:03,960
Let's try and implement this max value function that takes a state
1873
01:35:03,960 --> 01:35:06,680
and returns as output the value of that state
1874
01:35:06,680 --> 01:35:10,000
if I'm trying to maximize the value of the state.
1875
01:35:10,000 --> 01:35:13,000
Well, the first thing I can check for is to see if the game is over.
1876
01:35:13,000 --> 01:35:14,920
Because if the game is over, in other words,
1877
01:35:14,920 --> 01:35:18,120
if the state is a terminal state, then this is easy.
1878
01:35:18,120 --> 01:35:21,080
I already have this utility function that tells me
1879
01:35:21,080 --> 01:35:22,440
what the value of the board is.
1880
01:35:22,440 --> 01:35:26,280
If the game is over, I just check, did x win, did o win, is it a tie?
1881
01:35:26,280 --> 01:35:30,320
And this utility function just knows what the value of the state is.
1882
01:35:30,320 --> 01:35:32,480
What's trickier is if the game isn't over.
1883
01:35:32,480 --> 01:35:35,360
Because then I need to do this recursive reasoning about thinking,
1884
01:35:35,360 --> 01:35:39,000
what is my opponent going to do on the next move?
1885
01:35:39,000 --> 01:35:41,960
And I want to calculate the value of this state.
1886
01:35:41,960 --> 01:35:45,120
And I want the value of the state to be as high as possible.
1887
01:35:45,120 --> 01:35:48,280
And I'll keep track of that value in a variable called v.
1888
01:35:48,280 --> 01:35:50,720
And if I want the value to be as high as possible,
1889
01:35:50,720 --> 01:35:53,480
I need to give v an initial value.
1890
01:35:53,480 --> 01:35:57,640
And initially, I'll just go ahead and set it to be as low as possible.
1891
01:35:57,640 --> 01:36:00,800
Because I don't know what options are available to me yet.
1892
01:36:00,800 --> 01:36:04,760
So initially, I'll set v equal to negative infinity, which
1893
01:36:04,760 --> 01:36:06,040
seems a little bit strange.
1894
01:36:06,040 --> 01:36:08,040
But the idea here is I want the value initially
1895
01:36:08,040 --> 01:36:09,840
to be as low as possible.
1896
01:36:09,840 --> 01:36:12,360
Because as I consider my actions, I'm always
1897
01:36:12,360 --> 01:36:16,360
going to try and do better than v. And if I set v to negative infinity,
1898
01:36:16,360 --> 01:36:19,000
I know I can always do better than that.
1899
01:36:19,000 --> 01:36:21,280
So now I consider my actions.
1900
01:36:21,280 --> 01:36:22,880
And this is going to be some kind of loop
1901
01:36:22,880 --> 01:36:26,240
where for every action in actions of state,
1902
01:36:26,240 --> 01:36:29,000
recall actions as a function that takes my state
1903
01:36:29,000 --> 01:36:32,720
and gives me all the possible actions that I can use in that state.
1904
01:36:32,720 --> 01:36:37,600
So for each one of those actions, I want to compare it to v and say,
1905
01:36:37,600 --> 01:36:44,360
all right, v is going to be equal to the maximum of v and this expression.
1906
01:36:44,360 --> 01:36:46,160
So what is this expression?
1907
01:36:46,160 --> 01:36:50,800
Well, first it is get the result of taking the action in the state
1908
01:36:50,800 --> 01:36:54,320
and then get the min value of that.
1909
01:36:54,320 --> 01:36:58,240
In other words, let's say I want to find out from that state
1910
01:36:58,240 --> 01:37:00,760
what is the best that the min player can do because they're
1911
01:37:00,760 --> 01:37:02,560
going to try and minimize the score.
1912
01:37:02,560 --> 01:37:06,360
So whatever the resulting score is of the min value of that state,
1913
01:37:06,360 --> 01:37:10,040
compare it to my current best value and just pick the maximum of those two
1914
01:37:10,040 --> 01:37:12,640
because I am trying to maximize the value.
1915
01:37:12,640 --> 01:37:14,720
In short, what these three lines of code are doing
1916
01:37:14,720 --> 01:37:18,520
is going through all of my possible actions and asking the question,
1917
01:37:18,520 --> 01:37:24,040
how do I maximize the score given what my opponent is going to try to do?
1918
01:37:24,040 --> 01:37:26,800
After this entire loop, I can just return v
1919
01:37:26,800 --> 01:37:30,280
and that is now the value of that particular state.
1920
01:37:30,280 --> 01:37:32,800
And for the min player, it's the exact opposite of this,
1921
01:37:32,800 --> 01:37:35,080
the same logic just backwards.
1922
01:37:35,080 --> 01:37:37,080
To calculate the minimum value of a state,
1923
01:37:37,080 --> 01:37:38,920
first we check if it's a terminal state.
1924
01:37:38,920 --> 01:37:41,120
If it is, we return its utility.
1925
01:37:41,120 --> 01:37:45,440
Otherwise, we're going to now try to minimize the value of the state
1926
01:37:45,440 --> 01:37:47,440
given all of my possible actions.
1927
01:37:47,440 --> 01:37:50,920
So I need an initial value for v, the value of the state.
1928
01:37:50,920 --> 01:37:53,800
And initially, I'll set it to infinity because I
1929
01:37:53,800 --> 01:37:56,440
know I can always get something less than infinity.
1930
01:37:56,440 --> 01:38:00,040
So by starting with v equals infinity, I make sure that the very first action
1931
01:38:00,040 --> 01:38:03,680
I find, that will be less than this value of v.
1932
01:38:03,680 --> 01:38:07,200
And then I do the same thing, loop over all of my possible actions.
1933
01:38:07,200 --> 01:38:10,760
And for each of the results that we could get when the max player makes
1934
01:38:10,760 --> 01:38:15,280
their decision, let's take the minimum of that and the current value of v.
1935
01:38:15,280 --> 01:38:19,360
So after all is said and done, I get the smallest possible value of v
1936
01:38:19,360 --> 01:38:22,520
that I then return back to the user.
1937
01:38:22,520 --> 01:38:25,160
So that, in effect, is the pseudocode for Minimax.
1938
01:38:25,160 --> 01:38:28,120
That is how we take a game and figure out what the best move to make
1939
01:38:28,120 --> 01:38:32,480
is by recursively using these max value and min value functions,
1940
01:38:32,480 --> 01:38:36,920
where max value calls min value, min value calls max value back and forth,
1941
01:38:36,920 --> 01:38:39,680
all the way until we reach a terminal state, at which point
1942
01:38:39,680 --> 01:38:45,080
our algorithm can simply return the utility of that particular state.
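Put together in Python, the two mutually recursive functions might look like this sketch. The list-based stand-in game at the bottom (a terminal state is just a number equal to its own utility) is an assumption for demonstration, not the course's tic-tac-toe code:

```python
import math

def max_value(state):
    """Value of `state` when the maximizing player moves next."""
    if terminal(state):
        return utility(state)
    v = -math.inf  # start as low as possible; any real option beats this
    for action in actions(state):
        v = max(v, min_value(result(state, action)))
    return v

def min_value(state):
    """Value of `state` when the minimizing player moves next."""
    if terminal(state):
        return utility(state)
    v = math.inf  # start as high as possible
    for action in actions(state):
        v = min(v, max_value(result(state, action)))
    return v

# Stand-in game interface: a state is either a number (terminal, and
# its own utility) or a list of successor states.
def terminal(state):
    return not isinstance(state, list)

def utility(state):
    return state

def actions(state):
    return range(len(state))

def result(state, action):
    return state[action]
```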
1943
01:38:45,080 --> 01:38:48,760
So what you might imagine is that this is going to start to be a long process,
1944
01:38:48,760 --> 01:38:51,160
especially as games start to get more complex,
1945
01:38:51,160 --> 01:38:54,720
as we start to add more moves and more possible options and games that
1946
01:38:54,720 --> 01:38:56,840
might last quite a bit longer.
1947
01:38:56,840 --> 01:39:00,520
So the next question to ask is, what sort of optimizations can we make here?
1948
01:39:00,520 --> 01:39:05,360
How can we do better in order to use less space or take less time
1949
01:39:05,360 --> 01:39:08,120
to be able to solve this kind of problem?
1950
01:39:08,120 --> 01:39:10,880
And we'll take a look at a couple of possible optimizations.
1951
01:39:10,880 --> 01:39:13,360
But for one, we'll take a look at this example.
1952
01:39:13,360 --> 01:39:15,880
Again, returning to these up arrows and down arrows,
1953
01:39:15,880 --> 01:39:20,200
let's imagine that I now am the max player, this green arrow.
1954
01:39:20,200 --> 01:39:23,400
I am trying to make this score as high as possible.
1955
01:39:23,400 --> 01:39:26,480
And this is an easy game where there are just two moves.
1956
01:39:26,480 --> 01:39:29,120
I make a move, one of these three options.
1957
01:39:29,120 --> 01:39:32,040
And then my opponent makes a move, one of these three options,
1958
01:39:32,040 --> 01:39:33,440
based on what move I make.
1959
01:39:33,440 --> 01:39:36,480
And as a result, we get some value.
1960
01:39:36,480 --> 01:39:39,600
Let's look at the order in which I do these calculations
1961
01:39:39,600 --> 01:39:41,760
and figure out if there are any optimizations I
1962
01:39:41,760 --> 01:39:44,600
might be able to make to this calculation process.
1963
01:39:44,600 --> 01:39:47,240
I'm going to have to look at these states one at a time.
1964
01:39:47,240 --> 01:39:49,600
So let's say I start here on the left and say, all right,
1965
01:39:49,600 --> 01:39:52,960
now I'm going to consider, what will the min player, my opponent,
1966
01:39:52,960 --> 01:39:54,400
try to do here?
1967
01:39:54,400 --> 01:39:57,960
Well, the min player is going to look at all three of their possible actions
1968
01:39:57,960 --> 01:40:00,400
and look at their value, because these are terminal states.
1969
01:40:00,400 --> 01:40:01,560
They're the end of the game.
1970
01:40:01,560 --> 01:40:04,980
And so they'll see, all right, this node is a value of four, value of eight,
1971
01:40:04,980 --> 01:40:06,560
value of five.
1972
01:40:06,560 --> 01:40:08,940
And the min player is going to say, well, all right,
1973
01:40:08,940 --> 01:40:13,160
between these three options, four, eight, and five, I'll take the smallest one.
1974
01:40:13,160 --> 01:40:14,200
I'll take the four.
1975
01:40:14,200 --> 01:40:16,760
So this state now has a value of four.
1976
01:40:16,760 --> 01:40:20,200
Then I, as the max player, say, all right, if I take this action,
1977
01:40:20,200 --> 01:40:21,320
it will have a value of four.
1978
01:40:21,320 --> 01:40:23,400
That's the best that I can do, because min player
1979
01:40:23,400 --> 01:40:25,920
is going to try and minimize my score.
1980
01:40:25,920 --> 01:40:27,400
So now what if I take this option?
1981
01:40:27,400 --> 01:40:28,760
We'll explore this next.
1982
01:40:28,760 --> 01:40:32,360
And now explore what the min player would do if I choose this action.
1983
01:40:32,360 --> 01:40:35,400
And the min player is going to say, all right, what are the three options?
1984
01:40:35,400 --> 01:40:39,800
The min player has options between nine, three, and seven.
1985
01:40:39,800 --> 01:40:42,660
And so three is the smallest among nine, three, and seven.
1986
01:40:42,660 --> 01:40:45,720
So we'll go ahead and say this state has a value of three.
1987
01:40:45,720 --> 01:40:49,520
So now I, as the max player, have explored two of my three options.
1988
01:40:49,520 --> 01:40:53,560
I know that one of my options will guarantee me a score of four, at least.
1989
01:40:53,560 --> 01:40:57,240
And one of my options will guarantee me a score of three.
1990
01:40:57,240 --> 01:41:00,320
And now I consider my third option and say, all right, what happens here?
1991
01:41:00,320 --> 01:41:01,240
Same exact logic.
1992
01:41:01,240 --> 01:41:04,280
The min player is going to look at these three states, two, four, and six.
1993
01:41:04,280 --> 01:41:06,400
I'll say the minimum possible option is two.
1994
01:41:06,400 --> 01:41:08,640
So the min player wants the two.
1995
01:41:08,640 --> 01:41:11,920
Now I, as the max player, have calculated all of the information
1996
01:41:11,920 --> 01:41:15,400
by looking two layers deep, by looking at all of these nodes.
1997
01:41:15,400 --> 01:41:18,920
And I can now say, between the four, the three, and the two, you know what?
1998
01:41:18,920 --> 01:41:20,840
I'd rather take the four.
1999
01:41:20,840 --> 01:41:24,360
Because if I choose this option, if my opponent plays optimally,
2000
01:41:24,360 --> 01:41:26,400
they will try and get me to the four.
2001
01:41:26,400 --> 01:41:27,760
But that's the best I can do.
2002
01:41:27,760 --> 01:41:29,960
I can't guarantee a higher score.
2003
01:41:29,960 --> 01:41:32,840
Because if I pick either of these two options, I might get a three
2004
01:41:32,840 --> 01:41:33,920
or I might get a two.
2005
01:41:33,920 --> 01:41:36,440
And it's true that down here is a nine.
2006
01:41:36,440 --> 01:41:38,760
And that's the highest score out of any of the scores.
2007
01:41:38,760 --> 01:41:40,760
So I might be tempted to say, you know what?
2008
01:41:40,760 --> 01:41:43,520
Maybe I should take this option because I might get the nine.
2009
01:41:43,520 --> 01:41:46,120
But if the min player is playing intelligently,
2010
01:41:46,120 --> 01:41:48,800
if they're making the best moves at each possible option
2011
01:41:48,800 --> 01:41:52,520
they have when they get to make a choice, I'll be left with a three.
2012
01:41:52,520 --> 01:41:54,640
Whereas I could, playing optimally,
2013
01:41:54,640 --> 01:41:58,040
have guaranteed that I would get the four.
2014
01:41:58,040 --> 01:42:01,560
So that is, in effect, the logic that I would use as a min and max player
2015
01:42:01,560 --> 01:42:05,040
trying to maximize my score from that node there.
2016
01:42:05,040 --> 01:42:08,360
But it turns out it took quite a bit of computation for me to figure that out.
2017
01:42:08,360 --> 01:42:10,240
I had to reason through all of these nodes
2018
01:42:10,240 --> 01:42:11,840
in order to draw this conclusion.
2019
01:42:11,840 --> 01:42:14,920
And this is for a pretty simple game where I have three choices,
2020
01:42:14,920 --> 01:42:18,400
my opponent has three choices, and then the game's over.
2021
01:42:18,400 --> 01:42:21,160
So what I'd like to do is come up with some way to optimize this.
2022
01:42:21,160 --> 01:42:24,520
Maybe I don't need to do all of this calculation
2023
01:42:24,520 --> 01:42:28,080
to still reach the conclusion that, you know what, this action to the left,
2024
01:42:28,080 --> 01:42:29,960
that's the best that I could do.
2025
01:42:29,960 --> 01:42:33,840
Let's go ahead and try again and try to be a little more intelligent about how
2026
01:42:33,840 --> 01:42:36,200
I go about doing this.
2027
01:42:36,200 --> 01:42:38,600
So first, I start the exact same way.
2028
01:42:38,600 --> 01:42:40,320
I don't know what to do initially, so I just
2029
01:42:40,320 --> 01:42:45,080
have to consider one of the options and consider what the min player might do.
2030
01:42:45,080 --> 01:42:47,720
Min has three options, four, eight, and five.
2031
01:42:47,720 --> 01:42:51,640
And between those three options, min says four is the best they can do
2032
01:42:51,640 --> 01:42:54,520
because they want to try to minimize the score.
2033
01:42:54,520 --> 01:42:58,120
Now I, the max player, will consider my second option,
2034
01:42:58,120 --> 01:43:02,880
making this move here, and considering what my opponent would do in response.
2035
01:43:02,880 --> 01:43:04,600
What will the min player do?
2036
01:43:04,600 --> 01:43:07,720
Well, the min player is going to, from that state, look at their options.
2037
01:43:07,720 --> 01:43:12,040
And I would say, all right, nine is an option, three is an option.
2038
01:43:12,040 --> 01:43:14,360
And if I am doing the math from this initial state,
2039
01:43:14,360 --> 01:43:17,560
doing all this calculation, when I see a three,
2040
01:43:17,560 --> 01:43:20,400
that should immediately be a red flag for me.
2041
01:43:20,400 --> 01:43:23,040
Because when I see a three down here at this state,
2042
01:43:23,040 --> 01:43:28,000
I know that the value of this state is going to be at most three.
2043
01:43:28,000 --> 01:43:30,760
It's going to be three or something less than three,
2044
01:43:30,760 --> 01:43:34,180
even though I haven't yet looked at this last action or even further actions
2045
01:43:34,180 --> 01:43:37,000
if there were more actions that could be taken here.
2046
01:43:37,000 --> 01:43:37,920
How do I know that?
2047
01:43:37,920 --> 01:43:42,120
Well, I know that the min player is going to try to minimize my score.
2048
01:43:42,120 --> 01:43:44,640
And if they see a three, the only way this
2049
01:43:44,640 --> 01:43:47,640
could be something other than a three is if this remaining thing
2050
01:43:47,640 --> 01:43:50,840
that I haven't yet looked at is less than three, which
2051
01:43:50,840 --> 01:43:54,960
means there is no way for this value to be anything more than three
2052
01:43:54,960 --> 01:43:57,520
because the min player can already guarantee a three
2053
01:43:57,520 --> 01:44:01,080
and they are trying to minimize my score.
2054
01:44:01,080 --> 01:44:02,400
So what does that tell me?
2055
01:44:02,400 --> 01:44:04,880
Well, it tells me that if I choose this action,
2056
01:44:04,880 --> 01:44:09,400
my score is going to be three or maybe even less than three if I'm unlucky.
2057
01:44:09,400 --> 01:44:13,720
But I already know that this action will guarantee me a four.
2058
01:44:13,720 --> 01:44:17,360
And so given that I know that this action guarantees me a score of four
2059
01:44:17,360 --> 01:44:20,280
and this action means I can't do better than three,
2060
01:44:20,280 --> 01:44:22,440
if I'm trying to maximize my options, there
2061
01:44:22,440 --> 01:44:25,440
is no need for me to consider this triangle here.
2062
01:44:25,440 --> 01:44:28,120
There is no value, no number that could go here
2063
01:44:28,120 --> 01:44:30,880
that would change my mind between these two options.
2064
01:44:30,880 --> 01:44:34,600
I'm always going to opt for this path that gets me a four as opposed
2065
01:44:34,600 --> 01:44:39,880
to this path where the best I can do is a three if my opponent plays optimally.
2066
01:44:39,880 --> 01:44:43,080
And this is going to be true for all the future states that I look at too.
2067
01:44:43,080 --> 01:44:45,600
That if I look over here at what min player might do over here,
2068
01:44:45,600 --> 01:44:50,640
if I see that this state is a two, I know that this state is at most a two
2069
01:44:50,640 --> 01:44:54,640
because the only way this value could be something other than two
2070
01:44:54,640 --> 01:44:57,960
is if one of these remaining states is less than a two
2071
01:44:57,960 --> 01:45:00,640
and so the min player would opt for that instead.
2072
01:45:00,640 --> 01:45:03,600
So even without looking at these remaining states,
2073
01:45:03,600 --> 01:45:08,760
I as the maximizing player can know that choosing this path to the left
2074
01:45:08,760 --> 01:45:13,400
is going to be better than choosing either of those two paths to the right
2075
01:45:13,400 --> 01:45:16,080
because this one can't be better than three.
2076
01:45:16,080 --> 01:45:17,960
This one can't be better than two.
2077
01:45:17,960 --> 01:45:21,440
And so four in this case is the best that I can do.
2078
01:45:21,440 --> 01:45:23,360
So I can make this cut, and I can say now
2079
01:45:23,360 --> 01:45:25,720
that this state has a value of four.
2080
01:45:25,720 --> 01:45:27,840
So in order to do this type of calculation,
2081
01:45:27,840 --> 01:45:31,120
I was doing a little bit more bookkeeping, keeping track of things,
2082
01:45:31,120 --> 01:45:34,680
keeping track all the time of what is the best that I can do,
2083
01:45:34,680 --> 01:45:37,280
what is the worst that I can do, and for each of these states
2084
01:45:37,280 --> 01:45:41,440
saying, all right, well, if I already know that I can get a four,
2085
01:45:41,440 --> 01:45:44,200
then if the best I can do at this state is a three,
2086
01:45:44,200 --> 01:45:48,440
no reason for me to consider it, I can effectively prune this leaf
2087
01:45:48,440 --> 01:45:51,160
and anything below it from the tree.
2088
01:45:51,160 --> 01:45:54,560
And it's for that reason this approach, this optimization to minimax,
2089
01:45:54,560 --> 01:45:56,640
is called alpha-beta pruning.
2090
01:45:56,640 --> 01:45:58,600
Alpha and beta stand for these two values
2091
01:45:58,600 --> 01:46:01,100
that you'll have to keep track of: the best you can do so far
2092
01:46:01,100 --> 01:46:02,720
and the worst you can do so far.
2093
01:46:02,720 --> 01:46:07,200
And pruning is the idea of if I have a big, long, deep search tree,
2094
01:46:07,200 --> 01:46:09,280
I might be able to search it more efficiently
2095
01:46:09,280 --> 01:46:11,200
if I don't need to search through everything,
2096
01:46:11,200 --> 01:46:15,240
if I can remove some of the nodes to try and optimize the way that I
2097
01:46:15,240 --> 01:46:18,320
look through this entire search space.
2098
01:46:18,320 --> 01:46:21,640
So alpha-beta pruning can definitely save us a lot of time
2099
01:46:21,640 --> 01:46:25,600
as we go about the search process by making our searches more efficient.
2100
01:46:25,600 --> 01:46:29,880
But even then, it's still not great as games get more complex.
2101
01:46:29,880 --> 01:46:33,120
Tic-tac-toe, fortunately, is a relatively simple game.
2102
01:46:33,120 --> 01:46:35,880
And we might reasonably ask a question like,
2103
01:46:35,880 --> 01:46:39,560
how many total possible tic-tac-toe games are there?
2104
01:46:39,560 --> 01:46:40,640
You can think about it.
2105
01:46:40,640 --> 01:46:43,760
You can try and estimate how many moves are there at any given point,
2106
01:46:43,760 --> 01:46:45,640
how many moves long can the game last.
2107
01:46:45,640 --> 01:46:52,280
It turns out there are about 255,000 possible tic-tac-toe games
2108
01:46:52,280 --> 01:46:53,920
that can be played.
2109
01:46:53,920 --> 01:46:56,360
But compare that to a more complex game, something
2110
01:46:56,360 --> 01:46:58,200
like a game of chess, for example.
2111
01:46:58,200 --> 01:47:01,960
Far more pieces, far more moves, games that last much longer.
2112
01:47:01,960 --> 01:47:05,040
How many total possible chess games could there be?
2113
01:47:05,040 --> 01:47:08,760
It turns out that after just four moves each, four moves by the white player,
2114
01:47:08,760 --> 01:47:10,600
four moves by the black player, that there are
2115
01:47:10,600 --> 01:47:15,960
288 billion possible chess games that can result from that situation,
2116
01:47:15,960 --> 01:47:17,440
after just four moves each.
2117
01:47:17,440 --> 01:47:20,080
And going even further, if you look at entire chess games
2118
01:47:20,080 --> 01:47:23,520
and how many possible chess games there could be as a result there,
2119
01:47:23,520 --> 01:47:27,560
there are more than 10 to the 29,000 possible chess games,
2120
01:47:27,560 --> 01:47:30,560
far more chess games than could ever be considered.
2121
01:47:30,560 --> 01:47:33,400
And this is a pretty big problem for the Minimax algorithm,
2122
01:47:33,400 --> 01:47:36,440
because the Minimax algorithm starts with an initial state,
2123
01:47:36,440 --> 01:47:39,520
considers all the possible actions, and all the possible actions
2124
01:47:39,520 --> 01:47:44,660
after that, all the way until we get to the end of the game.
2125
01:47:44,660 --> 01:47:46,920
And that's going to be a problem if the computer is going
2126
01:47:46,920 --> 01:47:51,000
to need to look through this many states, which is far more than any computer
2127
01:47:51,000 --> 01:47:54,920
could ever do in any reasonable amount of time.
2128
01:47:54,920 --> 01:47:57,040
So what do we do in order to solve this problem?
2129
01:47:57,040 --> 01:47:59,120
Instead of looking through all these states which
2130
01:47:59,120 --> 01:48:02,560
is totally intractable for a computer, we need some better approach.
2131
01:48:02,560 --> 01:48:05,760
And it turns out that better approach generally takes the form of something
2132
01:48:05,760 --> 01:48:09,240
called depth-limited Minimax, where normally Minimax
2133
01:48:09,240 --> 01:48:10,760
is depth-unlimited.
2134
01:48:10,760 --> 01:48:13,360
We just keep going layer after layer, move after move,
2135
01:48:13,360 --> 01:48:15,080
until we get to the end of the game.
2136
01:48:15,080 --> 01:48:17,800
Depth-limited Minimax is instead going to say,
2137
01:48:17,800 --> 01:48:21,040
you know what, after a certain number of moves, maybe I'll look 10 moves ahead,
2138
01:48:21,040 --> 01:48:23,540
maybe I'll look 12 moves ahead, but after that point,
2139
01:48:23,540 --> 01:48:26,680
I'm going to stop and not consider additional moves that
2140
01:48:26,680 --> 01:48:30,400
might come after that, just because it would be computationally intractable
2141
01:48:30,400 --> 01:48:34,080
to consider all of those possible options.
2142
01:48:34,080 --> 01:48:36,880
But what do we do after we get 10 or 12 moves deep
2143
01:48:36,880 --> 01:48:40,120
when we arrive at a situation where the game's not over?
2144
01:48:40,120 --> 01:48:43,640
Minimax still needs a way to assign a score to that game board or game
2145
01:48:43,640 --> 01:48:47,280
state to figure out what its current value is, which is easy to do
2146
01:48:47,280 --> 01:48:51,720
if the game is over, but not so easy to do if the game is not yet over.
2147
01:48:51,720 --> 01:48:54,120
So in order to do that, we need to add one additional feature
2148
01:48:54,120 --> 01:48:57,760
to depth-limited Minimax called an evaluation function, which
2149
01:48:57,760 --> 01:49:01,920
is just some function that is going to estimate the expected utility
2150
01:49:01,920 --> 01:49:04,200
of a game from a given state.
2151
01:49:04,200 --> 01:49:07,160
So in a game like chess, if you imagine that a game value of 1
2152
01:49:07,160 --> 01:49:12,120
means white wins, negative 1 means black wins, 0 means it's a draw,
2153
01:49:12,120 --> 01:49:15,440
then you might imagine that a score of 0.8
2154
01:49:15,440 --> 01:49:19,160
means white is very likely to win, though certainly not guaranteed.
2155
01:49:19,160 --> 01:49:21,440
And you would have an evaluation function
2156
01:49:21,440 --> 01:49:25,640
that estimates how good the game state happens to be.
2157
01:49:25,640 --> 01:49:28,880
And depending on how good that evaluation function is,
2158
01:49:28,880 --> 01:49:32,240
that is ultimately what's going to constrain how good the AI is.
2159
01:49:32,240 --> 01:49:36,120
The better the AI is at estimating how good or how bad
2160
01:49:36,120 --> 01:49:38,600
any particular game state is, the better the AI
2161
01:49:38,600 --> 01:49:40,840
is going to be able to play that game.
2162
01:49:40,840 --> 01:49:44,160
If the evaluation function is worse, not as good at estimating
2163
01:49:44,160 --> 01:49:47,840
what the expected utility is, then it's going to be a whole lot harder.
2164
01:49:47,840 --> 01:49:51,280
And you can imagine trying to come up with these evaluation functions.
2165
01:49:51,280 --> 01:49:54,040
In chess, for example, you might write an evaluation function
2166
01:49:54,040 --> 01:49:56,360
based on how many pieces you have as compared
2167
01:49:56,360 --> 01:49:59,640
to how many pieces your opponent has, because each one has a value.
2168
01:49:59,640 --> 01:50:02,240
And your evaluation function probably needs
2169
01:50:02,240 --> 01:50:04,280
to be a little bit more complicated than that
2170
01:50:04,280 --> 01:50:08,160
to consider other possible situations that might arise as well.
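The material-count idea described above can be sketched in a few lines of Python. This is only an illustration, not code from the course; the piece weights are the conventional chess values, which is an assumption on my part:

```python
# A minimal sketch of a chess evaluation function based on material count.
# Positive scores favor White, negative favor Black. The piece weights are
# the conventional ones (an assumption, not taken from the lecture).
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def evaluate(board):
    """Estimate the utility of a board given as a list of piece codes,
    e.g. ["P", "p", "Q", "r"]; uppercase = White, lowercase = Black."""
    score = 0
    for piece in board:
        value = PIECE_VALUES.get(piece.upper(), 0)  # kings contribute 0 here
        score += value if piece.isupper() else -value
    return score

print(evaluate(["P", "P", "Q", "p", "r"]))  # prints 5
```

A real evaluation function would, as the lecture notes, need to weigh far more than raw material, but this shows the shape of the idea: a function from game state to an estimated utility.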
2171
01:50:08,160 --> 01:50:11,640
And there are many other variants on Minimax that add additional features
2172
01:50:11,640 --> 01:50:15,840
in order to help it perform better under these larger, more computationally
2173
01:50:15,840 --> 01:50:18,400
intractable situations where we couldn't possibly
2174
01:50:18,400 --> 01:50:20,760
explore all of the possible moves.
2175
01:50:20,760 --> 01:50:25,240
So we need to figure out how to use evaluation functions and other techniques
2176
01:50:25,240 --> 01:50:28,520
to be able to play these games ultimately better.
2177
01:50:28,520 --> 01:50:31,600
But this now was a look at this kind of adversarial search, these search
2178
01:50:31,600 --> 01:50:35,000
problems where we have situations where I am trying
2179
01:50:35,000 --> 01:50:37,480
to play against some sort of opponent.
2180
01:50:37,480 --> 01:50:40,000
And these search problems show up all over the place
2181
01:50:40,000 --> 01:50:41,720
throughout artificial intelligence.
2182
01:50:41,720 --> 01:50:44,840
We've been talking a lot today about more classical search problems,
2183
01:50:44,840 --> 01:50:48,120
like trying to find directions from one location to another.
2184
01:50:48,120 --> 01:50:51,480
But any time an AI is faced with trying to make a decision,
2185
01:50:51,480 --> 01:50:54,600
like what do I do now in order to do something that is rational,
2186
01:50:54,600 --> 01:50:57,360
or do something that is intelligent, or trying to play a game,
2187
01:50:57,360 --> 01:51:00,000
like figuring out what move to make, these sort of algorithms
2188
01:51:00,000 --> 01:51:01,760
can really come in handy.
2189
01:51:01,760 --> 01:51:04,560
It turns out that for tic-tac-toe, the solution is pretty simple
2190
01:51:04,560 --> 01:51:05,760
because it's a small game.
2191
01:51:05,760 --> 01:51:08,600
XKCD has famously put together a web comic
2192
01:51:08,600 --> 01:51:12,120
where he will tell you exactly what move to make as the optimal move to make
2193
01:51:12,120 --> 01:51:14,440
no matter what your opponent happens to do.
2194
01:51:14,440 --> 01:51:17,680
This type of thing is not quite as possible for a much larger game
2195
01:51:17,680 --> 01:51:21,520
like Checkers or Chess, for example, where chess is totally computationally
2196
01:51:21,520 --> 01:51:25,480
intractable for most computers to be able to explore all the possible states.
2197
01:51:25,480 --> 01:51:29,800
So we really need our AI to be far more intelligent about how
2198
01:51:29,800 --> 01:51:31,880
they go about trying to deal with these problems
2199
01:51:31,880 --> 01:51:35,560
and how they go about taking this environment that they find themselves in
2200
01:51:35,560 --> 01:51:38,880
and ultimately searching for one of these solutions.
2201
01:51:38,880 --> 01:51:41,840
So this, then, was a look at search in artificial intelligence.
2202
01:51:41,840 --> 01:51:43,760
Next time, we'll take a look at knowledge,
2203
01:51:43,760 --> 01:51:47,360
thinking about how it is that our AIs are able to know information, reason
2204
01:51:47,360 --> 01:51:51,320
about that information, and draw conclusions, all in our look at AI
2205
01:51:51,320 --> 01:51:52,880
and the principles behind it.
2206
01:51:52,880 --> 01:51:55,840
We'll see you next time.
2207
01:51:55,840 --> 01:51:58,800
["INTRO MUSIC"]
2208
01:52:13,800 --> 01:52:16,000
All right, welcome back, everyone, to an introduction
2209
01:52:16,000 --> 01:52:18,160
to artificial intelligence with Python.
2210
01:52:18,160 --> 01:52:20,840
Last time, we took a look at search problems, in particular,
2211
01:52:20,840 --> 01:52:24,280
where we have AI agents that are trying to solve some sort of problem
2212
01:52:24,280 --> 01:52:26,680
by taking actions in some sort of environment,
2213
01:52:26,680 --> 01:52:30,720
whether that environment is trying to take actions by playing moves in a game
2214
01:52:30,720 --> 01:52:32,760
or whether those actions are something like trying
2215
01:52:32,760 --> 01:52:35,680
to figure out where to make turns in order to get driving directions
2216
01:52:35,680 --> 01:52:38,400
from point A to point B. This time, we're
2217
01:52:38,400 --> 01:52:42,160
going to turn our attention more generally to just this idea of knowledge,
2218
01:52:42,160 --> 01:52:44,920
the idea that a lot of intelligence is based on knowledge,
2219
01:52:44,920 --> 01:52:47,200
especially if we think about human intelligence.
2220
01:52:47,200 --> 01:52:48,840
People know information.
2221
01:52:48,840 --> 01:52:50,600
We know facts about the world.
2222
01:52:50,600 --> 01:52:52,720
And using that information that we know, we're
2223
01:52:52,720 --> 01:52:55,520
able to draw conclusions, reason about the information
2224
01:52:55,520 --> 01:52:58,440
that we know in order to figure out how to do something
2225
01:52:58,440 --> 01:53:00,680
or figure out some other piece of information
2226
01:53:00,680 --> 01:53:05,080
that we conclude based on the information we already have available to us.
2227
01:53:05,080 --> 01:53:07,360
What we'd like to focus on now is the ability
2228
01:53:07,360 --> 01:53:11,360
to take this idea of knowledge and being able to reason based on knowledge
2229
01:53:11,360 --> 01:53:14,280
and apply those ideas to artificial intelligence.
2230
01:53:14,280 --> 01:53:16,200
In particular, we're going to be building what
2231
01:53:16,200 --> 01:53:19,200
are known as knowledge-based agents, agents that
2232
01:53:19,200 --> 01:53:23,040
are able to reason and act by representing knowledge internally.
2233
01:53:23,040 --> 01:53:25,960
Somehow inside of our AI, they have some understanding
2234
01:53:25,960 --> 01:53:27,960
of what it means to know something.
2235
01:53:27,960 --> 01:53:30,600
And ideally, they have some algorithms or some techniques
2236
01:53:30,600 --> 01:53:34,440
they can use based on that knowledge that they know in order to figure out
2237
01:53:34,440 --> 01:53:38,560
the solution to a problem or figure out some additional piece of information
2238
01:53:38,560 --> 01:53:40,800
that can be helpful in some sense.
2239
01:53:40,800 --> 01:53:43,120
So what do we mean by reasoning based on knowledge
2240
01:53:43,120 --> 01:53:44,680
to be able to draw conclusions?
2241
01:53:44,680 --> 01:53:47,960
Well, let's look at a simple example drawn from the world of Harry Potter.
2242
01:53:47,960 --> 01:53:50,600
We take one sentence that we know to be true.
2243
01:53:50,600 --> 01:53:55,080
Imagine if it didn't rain, then Harry visited Hagrid today.
2244
01:53:55,080 --> 01:53:57,840
So one fact that we might know about the world.
2245
01:53:57,840 --> 01:53:59,160
And then we take another fact.
2246
01:53:59,160 --> 01:54:02,960
Harry visited Hagrid or Dumbledore today, but not both.
2247
01:54:02,960 --> 01:54:05,560
So it tells us something about the world, that Harry either visited
2248
01:54:05,560 --> 01:54:09,600
Hagrid but not Dumbledore, or Harry visited Dumbledore but not Hagrid.
2249
01:54:09,600 --> 01:54:12,120
And now we have a third piece of information about the world
2250
01:54:12,120 --> 01:54:14,720
that Harry visited Dumbledore today.
2251
01:54:14,720 --> 01:54:17,920
So we now have three pieces of information, three facts.
2252
01:54:17,920 --> 01:54:21,600
Inside of a knowledge base, so to speak, information that we know.
2253
01:54:21,600 --> 01:54:23,760
And now we, as humans, can try and reason about this
2254
01:54:23,760 --> 01:54:27,640
and figure out, based on this information, what additional information
2255
01:54:27,640 --> 01:54:29,280
can we begin to conclude?
2256
01:54:29,280 --> 01:54:31,440
And well, looking at these last two statements,
2257
01:54:31,440 --> 01:54:35,240
Harry either visited Hagrid or Dumbledore but not both,
2258
01:54:35,240 --> 01:54:38,140
and we know that Harry visited Dumbledore today, well,
2259
01:54:38,140 --> 01:54:40,680
then it's pretty reasonable that we could draw the conclusion that,
2260
01:54:40,680 --> 01:54:43,800
you know what, Harry must not have visited Hagrid today.
2261
01:54:43,800 --> 01:54:46,520
Because based on a combination of these two statements,
2262
01:54:46,520 --> 01:54:50,560
we can draw this inference, so to speak, a conclusion that Harry did not
2263
01:54:50,560 --> 01:54:52,120
visit Hagrid today.
2264
01:54:52,120 --> 01:54:54,560
But it turns out we can even do a little bit better than that,
2265
01:54:54,560 --> 01:54:57,720
get some more information by taking a look at this first statement
2266
01:54:57,720 --> 01:54:59,180
and reasoning about that.
2267
01:54:59,180 --> 01:55:01,920
This first statement says, if it didn't rain,
2268
01:55:01,920 --> 01:55:04,200
then Harry visited Hagrid today.
2269
01:55:04,200 --> 01:55:05,080
So what does that mean?
2270
01:55:05,080 --> 01:55:09,080
In all cases where it didn't rain, then we know that Harry visited Hagrid.
2271
01:55:09,080 --> 01:55:12,680
But if we also know now that Harry did not visit Hagrid,
2272
01:55:12,680 --> 01:55:15,540
then that tells us something about our initial premise
2273
01:55:15,540 --> 01:55:16,760
that we were thinking about.
2274
01:55:16,760 --> 01:55:21,200
In particular, it tells us that it did rain today, because we can reason,
2275
01:55:21,200 --> 01:55:24,240
if it didn't rain, that Harry would have visited Hagrid.
2276
01:55:24,240 --> 01:55:28,840
But we know for a fact that Harry did not visit Hagrid today.
2277
01:55:28,840 --> 01:55:31,760
So it's this kind of reasoning, this sort of logical reasoning,
2278
01:55:31,760 --> 01:55:33,960
where we use logic based on the information
2279
01:55:33,960 --> 01:55:38,040
that we know in order to take information and reach conclusions that
2280
01:55:38,040 --> 01:55:40,880
is going to be the focus of what we're going to be talking about today.
2281
01:55:40,880 --> 01:55:43,600
How can we make our artificial intelligence
2282
01:55:43,600 --> 01:55:47,220
logical so that they can perform the same kinds of deduction,
2283
01:55:47,220 --> 01:55:50,640
the same kinds of reasoning that we've been doing so far?
2284
01:55:50,640 --> 01:55:53,200
Of course, humans reason about logic generally
2285
01:55:53,200 --> 01:55:54,760
in terms of human language.
2286
01:55:54,760 --> 01:55:58,640
That I just now was speaking in English, talking in English about these
2287
01:55:58,640 --> 01:56:01,080
sentences and trying to reason through how it
2288
01:56:01,080 --> 01:56:02,600
is that they relate to one another.
2289
01:56:02,600 --> 01:56:05,000
We're going to need to be a little bit more formal when
2290
01:56:05,000 --> 01:56:07,440
we turn our attention to computers and being
2291
01:56:07,440 --> 01:56:11,200
able to encode this notion of logic and truthhood and falsehood
2292
01:56:11,200 --> 01:56:12,640
inside of a machine.
2293
01:56:12,640 --> 01:56:16,040
So we're going to need to introduce a few more terms and a few symbols that
2294
01:56:16,040 --> 01:56:18,440
will help us reason through this idea of logic
2295
01:56:18,440 --> 01:56:20,480
inside of an artificial intelligence.
2296
01:56:20,480 --> 01:56:22,840
And we'll begin with the idea of a sentence.
2297
01:56:22,840 --> 01:56:24,880
Now, a sentence in a natural language like English
2298
01:56:24,880 --> 01:56:28,040
is just something that I'm saying, like what I'm saying right now.
2299
01:56:28,040 --> 01:56:32,920
In the context of AI, though, a sentence is just an assertion about the world
2300
01:56:32,920 --> 01:56:36,740
in what we're going to call a knowledge representation language,
2301
01:56:36,740 --> 01:56:40,940
some way of representing knowledge inside of our computers.
2302
01:56:40,940 --> 01:56:44,680
And the way that we're going to spend most of today reasoning about knowledge
2303
01:56:44,680 --> 01:56:47,600
is through a type of logic known as propositional logic.
2304
01:56:47,600 --> 01:56:50,800
There are a number of different types of logic, some of which we'll touch on.
2305
01:56:50,800 --> 01:56:54,680
But propositional logic is based on a logic of propositions,
2306
01:56:54,680 --> 01:56:56,640
or just statements about the world.
2307
01:56:56,640 --> 01:57:01,040
And so we begin in propositional logic with a notion of propositional symbols.
2308
01:57:01,040 --> 01:57:04,080
We will have certain symbols that are oftentimes just letters,
2309
01:57:04,080 --> 01:57:07,760
something like P or Q or R, where each of those symbols
2310
01:57:07,760 --> 01:57:11,840
is going to represent some fact or sentence about the world.
2311
01:57:11,840 --> 01:57:15,800
So P, for example, might represent the fact that it is raining.
2312
01:57:15,800 --> 01:57:19,200
And so P is going to be a symbol that represents that idea.
2313
01:57:19,200 --> 01:57:22,960
And Q, for example, might represent Harry visited Hagrid today.
2314
01:57:22,960 --> 01:57:26,600
Each of these propositional symbols represents some sentence
2315
01:57:26,600 --> 01:57:29,320
or some fact about the world.
2316
01:57:29,320 --> 01:57:32,400
But in addition to just having individual facts about the world,
2317
01:57:32,400 --> 01:57:36,040
we want some way to connect these propositional symbols together
2318
01:57:36,040 --> 01:57:39,520
in order to reason more complexly about other facts that
2319
01:57:39,520 --> 01:57:42,200
might exist inside of the world in which we're reasoning.
2320
01:57:42,200 --> 01:57:45,240
So in order to do that, we'll need to introduce some additional symbols
2321
01:57:45,240 --> 01:57:47,600
that are known as logical connectives.
2322
01:57:47,600 --> 01:57:49,840
Now, there are a number of these logical connectives.
2323
01:57:49,840 --> 01:57:52,920
But five of the most important, and the ones we're going to focus on today,
2324
01:57:52,920 --> 01:57:56,520
are these five up here, each represented by a logical symbol.
2325
01:57:56,520 --> 01:58:00,520
Not is represented by this symbol here. And is represented
2326
02:00:00,520 --> 02:00:04,600
as sort of an upside-down V. Or is represented by a V shape.
2327
01:58:04,600 --> 01:58:07,600
Implication, and we'll talk about what that means in just a moment,
2328
01:58:07,600 --> 01:58:09,320
is represented by an arrow.
2329
01:58:09,320 --> 01:58:12,520
And biconditional, again, we'll talk about what that means in a moment,
2330
01:58:12,520 --> 01:58:14,560
is represented by these double arrows.
2331
01:58:14,560 --> 01:58:17,200
But these five logical connectives are the main ones
2332
01:58:17,200 --> 01:58:20,280
we're going to be focusing on in terms of thinking about how
2333
01:58:20,280 --> 01:58:22,920
it is that a computer can reason about facts
2334
01:58:22,920 --> 01:58:26,560
and draw conclusions based on the facts that it knows.
2335
01:58:26,560 --> 01:58:28,200
But in order to get there, we need to take
2336
01:58:28,200 --> 01:58:30,360
a look at each of these logical connectives
2337
01:58:30,360 --> 01:58:34,040
and build up an understanding for what it is that they actually mean.
2338
01:58:34,040 --> 01:58:38,200
So let's go ahead and begin with the not symbol, so this not symbol here.
2339
01:58:38,200 --> 01:58:41,160
And what we're going to show for each of these logical connectives
2340
01:58:41,160 --> 01:58:43,880
is what we're going to call a truth table, a table that
2341
01:58:43,880 --> 01:58:47,640
demonstrates what this word not means when we attach it
2342
01:58:47,640 --> 01:58:52,560
to a propositional symbol or any sentence inside of our logical language.
2343
01:58:52,560 --> 01:58:56,880
And so the truth table for not is shown right here.
2344
01:58:56,880 --> 01:59:01,560
If P, some propositional symbol, or some other sentence even, is false,
2345
01:59:01,560 --> 01:59:04,600
then not P is true.
2346
01:59:04,600 --> 01:59:08,960
And if P is true, then not P is false.
2347
01:59:08,960 --> 01:59:11,200
So you can imagine that placing this not symbol
2348
01:59:11,200 --> 01:59:14,080
in front of some sentence of propositional logic
2349
01:59:14,080 --> 01:59:16,200
just says the opposite of that.
2350
01:59:16,200 --> 01:59:19,840
So if, for example, P represented it is raining,
2351
01:59:19,840 --> 01:59:23,880
then not P would represent the idea that it is not raining.
2352
01:59:23,880 --> 01:59:27,560
And as you might expect, if P is false, meaning if the sentence,
2353
01:59:27,560 --> 01:59:32,920
it is raining, is false, well then the sentence not P must be true.
2354
01:59:32,920 --> 01:59:36,240
The sentence that it is not raining is therefore true.
2355
01:59:36,240 --> 01:59:40,000
So not, you can imagine, just takes whatever is in P and it inverts it.
2356
01:59:40,000 --> 01:59:43,440
It turns false into true and true into false,
2357
01:59:43,440 --> 01:59:46,520
much analogously to what the English word not means,
2358
01:59:46,520 --> 01:59:51,200
just taking whatever comes after it and inverting it to mean the opposite.
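This truth table maps directly onto Python's own `not` operator; the following is a quick sketch of my own, not code from the course:

```python
# Truth table for logical "not": it simply inverts a truth value,
# turning False into True and True into False.
for p in (False, True):
    print(f"P={p!s:5}  not P={(not p)!s}")
```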
2359
01:59:51,200 --> 01:59:53,760
Next up, and also very English-like, is this idea
2360
01:59:53,760 --> 01:59:58,160
of and represented by this upside-down V shape or this point shape.
2361
01:59:58,160 --> 02:00:01,440
And as opposed to just taking a single argument the way not does,
2362
02:00:01,440 --> 02:00:07,040
we have P and we have not P. And, by contrast, is going to combine two different sentences
2363
02:00:07,040 --> 02:00:09,120
in propositional logic together.
2364
02:00:09,120 --> 02:00:12,480
So I might have one sentence P and another sentence Q,
2365
02:00:12,480 --> 02:00:16,800
and I want to combine them together to say P and Q.
2366
02:00:16,800 --> 02:00:19,760
And the general logic for what P and Q means
2367
02:00:19,760 --> 02:00:22,520
is it means that both of its operands are true.
2368
02:00:22,520 --> 02:00:26,600
P is true and also Q is true.
2369
02:00:26,600 --> 02:00:29,160
And so here's what that truth table looks like.
2370
02:00:29,160 --> 02:00:33,800
This time we have two variables, P and Q. And when we have two variables, each
2371
02:00:33,800 --> 02:00:36,920
of which can be in two possible states, true or false,
2372
02:00:36,920 --> 02:00:41,320
that leads to two squared or four possible combinations
2373
02:00:41,320 --> 02:00:42,640
of truth and falsehood.
2374
02:00:42,640 --> 02:00:45,000
So we have P is false and Q is false.
2375
02:00:45,000 --> 02:00:47,040
We have P is false and Q is true.
2376
02:00:47,040 --> 02:00:48,680
P is true and Q is false.
2377
02:00:48,680 --> 02:00:51,080
And then P and Q both are true.
2378
02:00:51,080 --> 02:00:55,520
And those are the only four possibilities for what P and Q could mean.
2379
02:00:55,520 --> 02:00:59,400
And in each of those situations, this third column here, P and Q,
2380
02:00:59,400 --> 02:01:03,760
is telling us a little bit about what it actually means for P and Q to be true.
2381
02:01:03,760 --> 02:01:08,040
And we see that the only case where P and Q is true is in this fourth row
2382
02:01:08,040 --> 02:01:12,840
here, where P happens to be true, Q also happens to be true.
2383
02:01:12,840 --> 02:01:18,080
And in all other situations, P and Q is going to evaluate to false.
2384
02:01:18,080 --> 02:01:21,600
So this, again, is much in line with what our intuition of and might mean.
2385
02:01:21,600 --> 02:01:29,320
If I say P and Q, I probably mean that I expect both P and Q to be true.
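The four rows of that truth table can be generated with Python's `and` operator; this is an illustrative sketch of mine, not material from the course:

```python
from itertools import product

# Truth table for "and": the result is true only in the one row
# where both operands are true.
for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5}  Q={q!s:5}  P and Q={(p and q)!s}")
```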
2386
02:01:29,320 --> 02:01:32,320
Next up, also potentially consistent with what we mean,
2387
02:01:32,320 --> 02:01:37,720
is this word or, represented by this V shape, sort of an upside down and symbol.
2388
02:01:37,720 --> 02:01:41,560
And or, as the name might suggest, is true if either of its arguments
2389
02:01:41,560 --> 02:01:47,440
are true, as long as P is true or Q is true, then P or Q is going to be true.
2390
02:01:47,440 --> 02:01:50,960
Which means the only time that P or Q is false
2391
02:01:50,960 --> 02:01:53,440
is if both of its operands are false.
2392
02:01:53,440 --> 02:01:58,760
If P is false and Q is false, then P or Q is going to be false.
2393
02:01:58,760 --> 02:02:03,160
But in all other cases, at least one of the operands is true.
2394
02:02:03,160 --> 02:02:08,600
Maybe they're both true, in which case P or Q is going to evaluate to true.
2395
02:02:08,600 --> 02:02:10,880
Now, this is mostly consistent with the way
2396
02:02:10,880 --> 02:02:14,200
that most people might use the word or, in the sense of speaking the word
2397
02:02:14,200 --> 02:02:17,080
or in normal English, though there is sometimes when we might say
2398
02:02:17,080 --> 02:02:21,440
or, where we mean P or Q, but not both, where we mean, sort of,
2399
02:02:21,440 --> 02:02:23,480
it can only be one or the other.
2400
02:02:23,480 --> 02:02:26,560
It's important to note that this symbol here, this or,
2401
02:02:26,560 --> 02:02:30,360
means P or Q or both, that those are totally OK.
2402
02:02:30,360 --> 02:02:33,120
As long as either or both of them are true,
2403
02:02:33,120 --> 02:02:36,320
then the or is going to evaluate to be true, as well.
2404
02:02:36,320 --> 02:02:38,760
It's only in the case where all of the operands
2405
02:02:38,760 --> 02:02:43,320
are false that P or Q ultimately evaluates to false, as well.
2406
02:02:43,320 --> 02:02:46,760
In logic, there's another symbol known as the exclusive or,
2407
02:02:46,760 --> 02:02:51,160
which encodes this idea of exclusivity of one or the other, but not both.
2408
02:02:51,160 --> 02:02:53,080
But we're not going to be focusing on that today.
2409
02:02:53,080 --> 02:02:56,720
Whenever we talk about or, we're always talking about either or both,
2410
02:02:56,720 --> 02:03:01,520
in this case, as represented by this truth table here.
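The contrast between inclusive or and exclusive or is easy to see side by side in Python; a sketch of my own (for booleans, `!=` behaves as exclusive or):

```python
from itertools import product

# Inclusive "or" (true if either or both operands are true) versus
# exclusive or (true if exactly one operand is true).
for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5} Q={q!s:5}  P or Q={(p or q)!s:5}  P xor Q={(p != q)!s}")
```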
2411
02:03:01,520 --> 02:03:04,840
So that now is not an and an or.
2412
02:03:04,840 --> 02:03:07,280
And next up is what we might call implication,
2413
02:03:07,280 --> 02:03:09,280
as denoted by this arrow symbol.
2414
02:03:09,280 --> 02:03:13,200
So we have P and Q. And this sentence here we'll generally
2415
02:03:13,200 --> 02:03:16,240
read as P implies Q.
2416
02:03:16,240 --> 02:03:23,400
And what P implies Q means is that if P is true, then Q is also true.
2417
02:03:23,400 --> 02:03:27,760
So I might say something like, if it is raining, then I will be indoors.
2418
02:03:27,760 --> 02:03:31,840
Meaning, it is raining implies I will be indoors,
2419
02:03:31,840 --> 02:03:34,640
as the logical sentence that I'm saying there.
2420
02:03:34,640 --> 02:03:37,760
And the truth table for this can sometimes be a little bit tricky.
2421
02:03:37,760 --> 02:03:44,280
So obviously, if P is true and Q is true, then P implies Q. That's true.
2422
02:03:44,280 --> 02:03:46,120
That definitely makes sense.
2423
02:03:46,120 --> 02:03:50,640
And it should also stand to reason that when P is true and Q is false,
2424
02:03:50,640 --> 02:03:52,600
then P implies Q is false.
2425
02:03:52,600 --> 02:03:57,400
Because if I said to you, if it is raining, then I will be indoors.
2426
02:03:57,400 --> 02:04:01,000
And it is raining, but I'm not indoors?
2427
02:04:01,000 --> 02:04:04,680
Well, then it would seem to be that my original statement was not true.
2428
02:04:04,680 --> 02:04:09,360
P implies Q means that if P is true, then Q also needs to be true.
2429
02:04:09,360 --> 02:04:13,200
And if it's not, well, then the statement is false.
2430
02:04:13,200 --> 02:04:17,560
What's also worth noting, though, is what happens when P is false.
2431
02:04:17,560 --> 02:04:22,280
When P is false, the implication makes no claim at all.
2432
02:04:22,280 --> 02:04:26,680
If I say something like, if it is raining, then I will be indoors.
2433
02:04:26,680 --> 02:04:28,640
And it turns out it's not raining.
2434
02:04:28,640 --> 02:04:31,040
Then in that case, I am not making any statement
2435
02:04:31,040 --> 02:04:33,880
as to whether or not I will be indoors or not.
2436
02:04:33,880 --> 02:04:37,720
P implies Q just means that if P is true, Q must be true.
2437
02:04:37,720 --> 02:04:42,040
But if P is not true, then we make no claim about whether or not Q
2438
02:04:42,040 --> 02:04:43,040
is true at all.
2439
02:04:43,040 --> 02:04:46,840
So in either case, if P is false, it doesn't matter what Q is.
2440
02:04:46,840 --> 02:04:50,560
Whether it's false or true, we're not making any claim about Q whatsoever.
2441
02:04:50,560 --> 02:04:53,640
We can still evaluate the implication to true.
2442
02:04:53,640 --> 02:04:56,600
The only way that the implication is ever false
2443
02:04:56,600 --> 02:05:01,680
is if our premise, P, is true, but the conclusion that we're drawing Q
2444
02:05:01,680 --> 02:05:03,040
happens to be false.
2445
02:05:03,040 --> 02:05:09,400
So in that case, we would say P does not imply Q.
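One way to sanity-check this trickier table (a sketch of mine, not from the lecture) is to use the standard equivalence that P implies Q is the same as (not P) or Q, which makes the "vacuously true" rows where P is false come out true:

```python
from itertools import product

def implies(p, q):
    # P -> Q is logically equivalent to (not P) or Q: it is false only
    # when P is true and Q is false, and vacuously true whenever P is false.
    return (not p) or q

for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5} Q={q!s:5}  P -> Q = {implies(p, q)!s}")
```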
2446
02:05:09,400 --> 02:05:13,200
Finally, the last connective that we'll discuss is this bi-conditional.
2447
02:05:13,200 --> 02:05:15,400
You can think of a bi-conditional as a condition
2448
02:05:15,400 --> 02:05:17,480
that goes in both directions.
2449
02:05:17,480 --> 02:05:20,440
So originally, when I said something like, if it is raining,
2450
02:05:20,440 --> 02:05:22,520
then I will be indoors.
2451
02:05:22,520 --> 02:05:24,920
I didn't say what would happen if it wasn't raining.
2452
02:05:24,920 --> 02:05:27,360
Maybe I'll be indoors, maybe I'll be outdoors.
2453
02:05:27,360 --> 02:05:31,440
This bi-conditional, you can read as an if and only if.
2454
02:05:31,440 --> 02:05:36,960
So I can say, I will be indoors if and only if it is raining,
2455
02:05:36,960 --> 02:05:39,560
meaning if it is raining, then I will be indoors.
2456
02:05:39,560 --> 02:05:43,560
And if I am indoors, it's reasonable to conclude that it is also raining.
2457
02:05:43,560 --> 02:05:48,640
So this bi-conditional is only true when P and Q are the same.
2458
02:05:48,640 --> 02:05:53,440
So if P is true and Q is true, then this bi-conditional is also true.
2459
02:05:53,440 --> 02:05:56,000
P implies Q, but also the reverse is true.
2460
02:05:56,000 --> 02:06:01,160
Q also implies P. So if P and Q both happen to be false,
2461
02:06:01,160 --> 02:06:02,440
we would still say it's true.
2462
02:06:02,440 --> 02:06:04,640
But in any of these other two situations,
2463
02:06:04,640 --> 02:06:08,920
this P if and only if Q is going to ultimately evaluate to false.
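Since the biconditional is true exactly when P and Q agree, for boolean values it is just equality; a small sketch of my own:

```python
from itertools import product

# The biconditional P <-> Q is true exactly when P and Q have the same
# truth value, which for Python booleans is ordinary equality.
for p, q in product((False, True), repeat=2):
    print(f"P={p!s:5} Q={q!s:5}  P <-> Q = {(p == q)!s}")
```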
2464
02:06:08,920 --> 02:06:11,200
So a lot of trues and falses going on there,
2465
02:06:11,200 --> 02:06:13,840
but these five basic logical connectives
2466
02:06:13,840 --> 02:06:16,960
are going to form the core of the language of propositional logic,
2467
02:06:16,960 --> 02:06:20,040
the language that we're going to use in order to describe ideas,
2468
02:06:20,040 --> 02:06:21,960
and the language that we're going to use in order
2469
02:06:21,960 --> 02:06:26,520
to reason about those ideas in order to draw conclusions.
2470
02:06:26,520 --> 02:06:29,000
So let's now take a look at some of the additional terms
2471
02:06:29,000 --> 02:06:31,280
that we'll need to know about in order to go about trying
2472
02:06:31,280 --> 02:06:33,740
to form this language of propositional logic
2473
02:06:33,740 --> 02:06:37,600
and writing AI that's actually able to understand this sort of logic.
2474
02:06:37,600 --> 02:06:40,200
The next thing we're going to need is the notion of what
2475
02:06:40,200 --> 02:06:42,480
is actually true about the world.
2476
02:06:42,480 --> 02:06:46,880
We have a whole bunch of propositional symbols, P and Q and R and maybe others,
2477
02:06:46,880 --> 02:06:50,120
but we need some way of knowing what actually is true in the world.
2478
02:06:50,120 --> 02:06:51,200
Is P true or false?
2479
02:06:51,200 --> 02:06:52,580
Is Q true or false?
2480
02:06:52,580 --> 02:06:54,360
So on and so forth.
2481
02:06:54,360 --> 02:06:57,440
And to do that, we'll introduce the notion of a model.
2482
02:06:57,440 --> 02:07:02,320
A model just assigns a truth value, where a truth value is either true
2483
02:07:02,320 --> 02:07:05,680
or false, to every propositional symbol.
2484
02:07:05,680 --> 02:07:09,400
In other words, it's creating what we might call a possible world.
2485
02:07:09,400 --> 02:07:10,840
So let me give an example.
2486
02:07:10,840 --> 02:07:15,320
If, for example, I have two propositional symbols, P is it is raining
2487
02:07:15,320 --> 02:07:21,000
and Q is it is a Tuesday, a model just takes each of these two symbols
2488
02:07:21,000 --> 02:07:24,720
and assigns a truth value to them, either true or false.
2489
02:07:24,720 --> 02:07:26,040
So here's a sample model.
2490
02:07:26,040 --> 02:07:29,400
In this model, in other words, in this possible world,
2491
02:07:29,400 --> 02:07:33,920
it is possible that P is true, meaning it is raining, and Q is false,
2492
02:07:33,920 --> 02:07:36,000
meaning it is not a Tuesday.
2493
02:07:36,000 --> 02:07:39,240
But there are other possible worlds or other models as well.
2494
02:07:39,240 --> 02:07:41,920
There is some model where both of these variables are true,
2495
02:07:41,920 --> 02:07:44,320
some model where both of these variables are false.
2496
02:07:44,320 --> 02:07:48,320
In fact, if there are n variables that are propositional symbols like this
2497
02:07:48,320 --> 02:07:51,720
that are either true or false, then the number of possible models
2498
02:07:51,720 --> 02:07:55,600
is 2 to the n, because each of these possible
2499
02:07:55,600 --> 02:08:00,080
variables within my model could be set to either true or false
2500
02:08:00,080 --> 02:08:03,840
if I don't know any information about it.
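That 2 to the n count of possible models can be enumerated directly; here is a sketch of mine (the dictionary encoding of a model is my own choice, not the course's representation):

```python
from itertools import product

# Enumerate every possible model over n propositional symbols.
# Each model assigns True or False to every symbol, so there are 2**n.
symbols = ["P", "Q"]  # P: "it is raining", Q: "it is a Tuesday"

models = [dict(zip(symbols, values))
          for values in product((True, False), repeat=len(symbols))]

for model in models:
    print(model)
print(f"{len(models)} models = 2**{len(symbols)}")
```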
2501
02:08:03,840 --> 02:08:07,040
So now that I have the symbols and the connectives
2502
02:08:07,040 --> 02:08:11,080
that I'm going to need in order to construct these parts of knowledge,
2503
02:08:11,080 --> 02:08:13,400
we need some way to represent that knowledge.
2504
02:08:13,400 --> 02:08:15,880
And to do so, we're going to allow our AI access
2505
02:08:15,880 --> 02:08:18,400
to what we'll call a knowledge base.
2506
02:08:18,400 --> 02:08:21,880
And a knowledge base is really just a set of sentences
2507
02:08:21,880 --> 02:08:24,200
that our AI knows to be true.
2508
02:08:24,200 --> 02:08:27,160
Some set of sentences in propositional logic
2509
02:08:27,160 --> 02:08:30,960
that are things that our AI knows about the world.
2510
02:08:30,960 --> 02:08:35,360
And so we might tell our AI some information, information about a situation
2511
02:08:35,360 --> 02:08:38,200
that it finds itself in, or about a problem
2512
02:08:38,200 --> 02:08:39,880
that it happens to be trying to solve.
2513
02:08:39,880 --> 02:08:41,720
And we would give that information to the AI
2514
02:08:41,720 --> 02:08:44,920
that the AI would store inside of its knowledge base.
2515
02:08:44,920 --> 02:08:47,440
And what happens next is the AI would like
2516
02:08:47,440 --> 02:08:49,880
to use that information in the knowledge base
2517
02:08:49,880 --> 02:08:53,440
to be able to draw conclusions about the rest of the world.
2518
02:08:53,440 --> 02:08:55,200
And what do those conclusions look like?
2519
02:08:55,200 --> 02:08:56,960
Well, to understand those conclusions, we'll
2520
02:08:56,960 --> 02:08:59,840
need to introduce one more idea, one more symbol.
2521
02:08:59,840 --> 02:09:02,600
And that is the notion of entailment.
2522
02:09:02,600 --> 02:09:06,500
So this sentence here, with this double turnstile and these Greek letters,
2523
02:09:06,500 --> 02:09:08,960
this is the Greek letter alpha and the Greek letter beta.
2524
02:09:08,960 --> 02:09:12,920
And we read this as alpha entails beta.
2525
02:09:12,920 --> 02:09:17,320
And alpha and beta here are just sentences in propositional logic.
2526
02:09:17,320 --> 02:09:20,680
And what this means is that alpha entails beta
2527
02:09:20,680 --> 02:09:23,360
means that in every model, in other words,
2528
02:09:23,360 --> 02:09:28,960
in every possible world in which sentence alpha is true,
2529
02:09:28,960 --> 02:09:31,600
then sentence beta is also true.
2530
02:09:31,600 --> 02:09:35,520
So if something entails something else, if alpha entails beta,
2531
02:09:35,520 --> 02:09:40,520
it means that if I know alpha to be true, then beta must therefore also
2532
02:09:40,520 --> 02:09:41,320
be true.
2533
02:09:41,320 --> 02:09:47,840
So if my alpha is something like I know that it is a Tuesday in January,
2534
02:09:47,840 --> 02:09:52,600
then a reasonable beta might be something like I know that it is January.
2535
02:09:52,600 --> 02:09:55,520
Because in all worlds where it is a Tuesday in January,
2536
02:09:55,520 --> 02:09:59,200
I know for sure that it must be January, just by definition.
2537
02:09:59,200 --> 02:10:01,940
This first statement or sentence about the world
2538
02:10:01,940 --> 02:10:03,840
entails the second statement.
2539
02:10:03,840 --> 02:10:07,440
And we can reasonably use deduction based on that first sentence
2540
02:10:07,440 --> 02:10:12,340
to figure out that the second sentence is, in fact, true as well.
2541
02:10:12,340 --> 02:10:14,840
And ultimately, it's this idea of entailment
2542
02:10:14,840 --> 02:10:17,240
that we're going to try and encode into our computer.
2543
02:10:17,240 --> 02:10:20,200
We want our AI agent to be able to figure out
2544
02:10:20,200 --> 02:10:22,040
what the possible entailments are.
2545
02:10:22,040 --> 02:10:26,080
We want our AI to be able to take these three sentences, sentences like,
2546
02:10:26,080 --> 02:10:28,480
if it didn't rain, Harry visited Hagrid.
2547
02:10:28,480 --> 02:10:31,440
That Harry visited Hagrid or Dumbledore, but not both.
2548
02:10:31,440 --> 02:10:33,160
And that Harry visited Dumbledore.
2549
02:10:33,160 --> 02:10:36,040
And just using that information, we'd like our AI
2550
02:10:36,040 --> 02:10:41,040
to be able to infer or figure out that using these three sentences inside
2551
02:10:41,040 --> 02:10:44,080
of a knowledge base, we can draw some conclusions.
2552
02:10:44,080 --> 02:10:47,520
In particular, we can draw the conclusions here that, one,
2553
02:10:47,520 --> 02:10:49,520
Harry did not visit Hagrid today.
2554
02:10:49,520 --> 02:10:53,840
And we can draw the entailment, too, that it did, in fact, rain today.
2555
02:10:53,840 --> 02:10:56,120
And this process is known as inference.
2556
02:10:56,120 --> 02:10:58,320
And that's what we're going to be focusing on today,
2557
02:10:58,320 --> 02:11:01,920
this process of deriving new sentences from old ones,
2558
02:11:01,920 --> 02:11:04,240
that I give you these three sentences, you put them
2559
02:11:04,240 --> 02:11:06,480
in the knowledge base in, say, the AI.
2560
02:11:06,480 --> 02:11:09,680
And the AI is able to use some sort of inference algorithm
2561
02:11:09,680 --> 02:11:14,040
to figure out that these two sentences must also be true.
2562
02:11:14,040 --> 02:11:16,600
And that is how we define inference.
2563
02:11:16,600 --> 02:11:18,760
So let's take a look at an inference example
2564
02:11:18,760 --> 02:11:22,520
to see how we might actually go about inferring things in a human sense
2565
02:11:22,520 --> 02:11:24,360
before we take a more algorithmic approach
2566
02:11:24,360 --> 02:11:27,360
to see how we could encode this idea of inference in AI.
2567
02:11:27,360 --> 02:11:30,920
And we'll see there are a number of ways that we can actually achieve this.
2568
02:11:30,920 --> 02:11:33,920
So again, we'll deal with a couple of propositional symbols.
2569
02:11:33,920 --> 02:11:37,640
We'll deal with P, Q, and R. P is it is a Tuesday.
2570
02:11:37,640 --> 02:11:39,200
Q is it is raining.
2571
02:11:39,200 --> 02:11:42,720
And R is Harry will go for a run, three propositional symbols
2572
02:11:42,720 --> 02:11:44,600
that we are just defining to mean this.
2573
02:11:44,600 --> 02:11:47,400
We're not saying anything yet about whether they're true or false.
2574
02:11:47,400 --> 02:11:50,240
We're just defining what they are.
2575
02:11:50,240 --> 02:11:53,960
Now, we'll give ourselves or an AI access to a knowledge base,
2576
02:11:53,960 --> 02:11:57,600
abbreviated to KB, the knowledge that we know about the world.
2577
02:11:57,600 --> 02:11:59,440
We know this statement.
2578
02:11:59,440 --> 02:11:59,920
All right.
2579
02:11:59,920 --> 02:12:00,880
So let's try to parse it.
2580
02:12:00,880 --> 02:12:02,840
The parentheses here are just used for precedence,
2581
02:12:02,840 --> 02:12:05,280
so we can see what associates with what.
2582
02:12:05,280 --> 02:12:11,720
But you would read this as P and not Q implies R.
2583
02:12:11,720 --> 02:12:12,240
All right.
2584
02:12:12,240 --> 02:12:13,040
So what does that mean?
2585
02:12:13,040 --> 02:12:14,520
Let's put it piece by piece.
2586
02:12:14,520 --> 02:12:16,880
P is it is a Tuesday.
2587
02:12:16,880 --> 02:12:21,600
Q is it is raining, so not Q is it is not raining,
2588
02:12:21,600 --> 02:12:25,080
and implies R is Harry will go for a run.
2589
02:12:25,080 --> 02:12:28,080
So the way to read this entire sentence in human natural language
2590
02:12:28,080 --> 02:12:33,240
at least is if it is a Tuesday and it is not raining,
2591
02:12:33,240 --> 02:12:35,600
then Harry will go for a run.
2592
02:12:35,600 --> 02:12:37,800
So if it is a Tuesday and it is not raining,
2593
02:12:37,800 --> 02:12:39,520
then Harry will go for a run.
2594
02:12:39,520 --> 02:12:41,600
And that is now inside of our knowledge base.
2595
02:12:41,600 --> 02:12:43,600
And let's now imagine that our knowledge base has
2596
02:12:43,600 --> 02:12:45,520
two other pieces of information as well.
2597
02:12:45,520 --> 02:12:49,720
It has information that P is true, that it is a Tuesday.
2598
02:12:49,720 --> 02:12:53,880
And we also have the information not Q, that it is not raining,
2599
02:12:53,880 --> 02:12:57,120
that this sentence Q, it is raining, happens to be false.
2600
02:12:57,120 --> 02:12:59,800
And those are the three sentences that we have access to.
2601
02:12:59,800 --> 02:13:05,520
P and not Q implies R, P and not Q. Using that information,
2602
02:13:05,520 --> 02:13:08,000
we should be able to draw some inferences.
2603
02:13:08,000 --> 02:13:14,120
P and not Q is only true if both P and not Q are true.
2604
02:13:14,120 --> 02:13:18,120
All right, we know that P is true and we know that not Q is true.
2605
02:13:18,120 --> 02:13:20,600
So we know that this whole expression is true.
2606
02:13:20,600 --> 02:13:24,000
And the definition of implication is if this whole thing on the left
2607
02:13:24,000 --> 02:13:27,080
is true, then this thing on the right must also be true.
2608
02:13:27,080 --> 02:13:31,480
So if we know that P and not Q is true, then R must be true as well.
2609
02:13:31,480 --> 02:13:34,160
So the inference we should be able to draw from all of this
2610
02:13:34,160 --> 02:13:38,200
is that R is true and we know that Harry will go for a run
2611
02:13:38,200 --> 02:13:40,560
by taking this knowledge inside of our knowledge base
2612
02:13:40,560 --> 02:13:43,760
and being able to reason based on that idea.
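The reasoning just walked through can be checked mechanically. This sketch (my own encoding, not the lecture's code) fixes P true and Q false, then asks which values of R are consistent with the whole knowledge base:

```python
# Facts from the knowledge base: P is true (it is a Tuesday),
# Q is false (it is not raining).
P, Q = True, False

# Keep only the values of R consistent with every sentence:
# (P and not Q) implies R  --  written as (not premise) or R,
# together with P and not Q.
consistent = [R for R in (True, False)
              if ((not (P and not Q)) or R) and P and (not Q)]

print(consistent)  # [True] -- R is forced: Harry will go for a run
```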
2613
02:13:43,760 --> 02:13:46,480
And so this ultimately is the beginning of what
2614
02:13:46,480 --> 02:13:49,480
we might consider to be some sort of inference algorithm,
2615
02:13:49,480 --> 02:13:52,840
some process that we can use to try and figure out
2616
02:13:52,840 --> 02:13:55,040
whether or not we can draw some conclusion.
2617
02:13:55,040 --> 02:13:58,040
And ultimately, what these inference algorithms are going to answer
2618
02:13:58,040 --> 02:14:00,880
is the central question about entailment.
2619
02:14:00,880 --> 02:14:02,720
Given some query about the world, something
2620
02:14:02,720 --> 02:14:06,120
we're wondering about the world, and we'll call that query alpha,
2621
02:14:06,120 --> 02:14:09,120
the question we want to ask using these inference algorithms
2622
02:14:09,120 --> 02:14:14,680
is does KB, our knowledge base, entail alpha?
2623
02:14:14,680 --> 02:14:16,640
In other words, using only the information
2624
02:14:16,640 --> 02:14:20,200
we know inside of our knowledge base, the knowledge that we have access to,
2625
02:14:20,200 --> 02:14:24,200
can we conclude that this sentence alpha is true?
2626
02:14:24,200 --> 02:14:26,840
And that's ultimately what we would like to do.
2627
02:14:26,840 --> 02:14:28,040
So how can we do that?
2628
02:14:28,040 --> 02:14:30,200
How can we go about writing an algorithm that
2629
02:14:30,200 --> 02:14:33,920
can look at this knowledge base and figure out whether or not this query
2630
02:14:33,920 --> 02:14:35,720
alpha is actually true?
2631
02:14:35,720 --> 02:14:39,000
Well, it turns out there are a couple of different algorithms for doing so.
2632
02:14:39,000 --> 02:14:43,120
And one of the simplest, perhaps, is known as model checking.
2633
02:14:43,120 --> 02:14:45,640
Now, remember that a model is just some assignment
2634
02:14:45,640 --> 02:14:49,120
of all of the propositional symbols inside of our language to a truth
2635
02:14:49,120 --> 02:14:51,080
value, true or false.
2636
02:14:51,080 --> 02:14:53,560
And you can think of a model as a possible world,
2637
02:14:53,560 --> 02:14:55,980
that there are many possible worlds where different things might
2638
02:14:55,980 --> 02:14:59,080
be true or false, and we can enumerate all of them.
2639
02:14:59,080 --> 02:15:02,480
And the model checking algorithm does exactly that.
2640
02:15:02,480 --> 02:15:04,600
So what does our model checking algorithm do?
2641
02:15:04,600 --> 02:15:08,280
Well, if we wanted to determine if our knowledge base entails
2642
02:15:08,280 --> 02:15:13,000
some query alpha, then we are going to enumerate all possible models.
2643
02:15:13,000 --> 02:15:16,760
In other words, consider all possible values of true and false
2644
02:15:16,760 --> 02:15:21,240
for our variables, all possible states in which our world can be in.
2645
02:15:21,240 --> 02:15:25,760
And if in every model where our knowledge base is true,
2646
02:15:25,760 --> 02:15:30,480
alpha is also true, then we know that the knowledge base entails alpha.
2647
02:15:30,480 --> 02:15:32,320
So let's take a closer look at that sentence
2648
02:15:32,320 --> 02:15:34,120
and try and figure out what it actually means.
2649
02:15:34,120 --> 02:15:38,120
If we know that in every model, in other words, in every possible world,
2650
02:15:38,120 --> 02:15:41,520
no matter what assignment of true and false to variables you give,
2651
02:15:41,520 --> 02:15:44,440
if we know that whenever our knowledge is true, what
2652
02:15:44,440 --> 02:15:49,400
we know to be true is true, that this query alpha is also true,
2653
02:15:49,400 --> 02:15:52,960
well, then it stands to reason that as long as our knowledge base is true,
2654
02:15:52,960 --> 02:15:56,080
then alpha must also be true.
2655
02:15:56,080 --> 02:15:58,600
And so this is going to form the foundation of our model checking
2656
02:15:58,600 --> 02:15:59,280
algorithm.
2657
02:15:59,280 --> 02:16:01,720
We're going to enumerate all of the possible worlds
2658
02:16:01,720 --> 02:16:05,720
and ask ourselves whenever the knowledge base is true, is alpha true?
2659
02:16:05,720 --> 02:16:09,320
And if that's the case, then we know alpha to be true.
2660
02:16:09,320 --> 02:16:11,520
And otherwise, there is no entailment.
2661
02:16:11,520 --> 02:16:14,960
Our knowledge base does not entail alpha.
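That loop over possible worlds can be sketched directly in Python. In this version (hypothetical names; sentences are modeled as plain functions from a model to a truth value, which is simpler than the class-based library shown later), entailment fails as soon as we find one world where the knowledge base holds but the query does not:

```python
from itertools import product

def model_check(knowledge, query, symbols):
    """Return True if `knowledge` entails `query`: in every model
    (assignment of True/False to symbols) where the knowledge base
    is true, the query must be true as well."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if knowledge(model) and not query(model):
            return False  # a world where the KB holds but the query fails
    return True

# Example: KB = {(P and not Q) -> R, P, not Q}; query = R.
kb = lambda m: (((not (m["P"] and not m["Q"])) or m["R"])
                and m["P"] and not m["Q"])
query = lambda m: m["R"]

print(model_check(kb, query, ["P", "Q", "R"]))  # True
```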
2662
02:16:14,960 --> 02:16:15,440
All right.
2663
02:16:15,440 --> 02:16:17,300
So this is a little bit abstract, but let's
2664
02:16:17,300 --> 02:16:20,960
take a look at an example to try and put real propositional symbols
2665
02:16:20,960 --> 02:16:22,160
to this idea.
2666
02:16:22,160 --> 02:16:24,560
So again, we'll work with the same example.
2667
02:16:24,560 --> 02:16:29,200
P is it is a Tuesday, Q is it is raining, R is Harry will go for a run.
2668
02:16:29,200 --> 02:16:32,000
Our knowledge base contains these pieces of information.
2669
02:16:32,000 --> 02:16:35,280
P and not Q implies R. We also know P.
2670
02:16:35,280 --> 02:16:38,840
It is a Tuesday and not Q. It is not raining.
2671
02:16:38,840 --> 02:16:43,520
And our query, our alpha in this case, the thing we want to ask is R.
2672
02:16:43,520 --> 02:16:45,520
We want to know, is it guaranteed?
2673
02:16:45,520 --> 02:16:49,120
Is it entailed that Harry will go for a run?
2674
02:16:49,120 --> 02:16:52,480
So the first step is to enumerate all of the possible models.
2675
02:16:52,480 --> 02:16:55,800
We have three propositional symbols here, P, Q, and R,
2676
02:16:55,800 --> 02:16:59,800
which means we have 2 to the third power, or eight possible models.
2677
02:16:59,800 --> 02:17:04,680
All false; false, false, true; false, true, false; false, true, true; et cetera.
2678
02:17:04,680 --> 02:17:09,560
Eight possible ways you could assign true and false to all of these symbols.
2679
02:17:09,560 --> 02:17:13,920
And we might ask in each one of them, is the knowledge base true?
2680
02:17:13,920 --> 02:17:15,840
Here are the set of things that we know.
2681
02:17:15,840 --> 02:17:20,160
To which of these worlds could this knowledge base possibly apply?
2682
02:17:20,160 --> 02:17:22,960
In which world is this knowledge base true?
2683
02:17:22,960 --> 02:17:26,240
Well, in the knowledge base, for example, we know P.
2684
02:17:26,240 --> 02:17:31,680
We know it is a Tuesday, which means we know that these first four rows
2685
02:17:31,680 --> 02:17:35,080
where P is false, none of those are going to be true
2686
02:17:35,080 --> 02:17:37,680
or are going to work for this particular knowledge base.
2687
02:17:37,680 --> 02:17:40,840
Our knowledge base is not true in those worlds.
2688
02:17:40,840 --> 02:17:46,200
Likewise, we also know not Q. We know that it is not raining.
2689
02:17:46,200 --> 02:17:51,120
So any of these models where Q is true, like these two and these two here,
2690
02:17:51,120 --> 02:17:55,360
those aren't going to work either because we know that Q is not true.
2691
02:17:55,360 --> 02:18:00,240
And finally, we also know that P and not Q implies R,
2692
02:18:00,240 --> 02:18:04,680
which means that when P is true, as P is here, and Q is false,
2693
02:18:04,680 --> 02:18:08,800
Q is false in these two, then R must be true.
2694
02:18:08,800 --> 02:18:14,520
And if ever P is true, Q is false, but R is also false,
2695
02:18:14,520 --> 02:18:17,760
well, that doesn't satisfy this implication here.
2696
02:18:17,760 --> 02:18:21,840
That implication does not hold true under those situations.
2697
02:18:21,840 --> 02:18:24,080
So we could say that for our knowledge base,
2698
02:18:24,080 --> 02:18:27,240
we can conclude under which of these possible worlds
2699
02:18:27,240 --> 02:18:30,440
is our knowledge base true and under which of the possible worlds
2700
02:18:30,440 --> 02:18:31,880
is our knowledge base false.
2701
02:18:31,880 --> 02:18:35,040
And it turns out there is only one possible world
2702
02:18:35,040 --> 02:18:37,160
where our knowledge base is actually true.
2703
02:18:37,160 --> 02:18:39,280
In some cases, there might be multiple possible worlds
2704
02:18:39,280 --> 02:18:40,640
where the knowledge base is true.
2705
02:18:40,640 --> 02:18:44,880
But in this case, it just so happens that there's only one, one possible world
2706
02:18:44,880 --> 02:18:48,400
where we can definitively say something about our knowledge base.
2707
02:18:48,400 --> 02:18:50,920
And in this case, we would look at the query.
2708
02:18:50,920 --> 02:18:56,120
The query is R: is R true? It is, and so as a result,
2709
02:18:56,120 --> 02:18:58,840
we can draw that conclusion.
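The eight-row table can be reproduced with a short enumeration (a sketch of my own, not the lecture's code), confirming that exactly one model satisfies the knowledge base, and that R is true there:

```python
from itertools import product

satisfying = []
for P, Q, R in product([True, False], repeat=3):
    kb = (((not (P and not Q)) or R)  # (P and not Q) implies R
          and P                       # it is a Tuesday
          and not Q)                  # it is not raining
    if kb:
        satisfying.append((P, Q, R))

print(satisfying)  # only one model survives, and R is true in it
```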
2710
02:18:58,840 --> 02:19:01,120
And so this is the idea of model checking.
2711
02:19:01,120 --> 02:19:04,600
Enumerate all the possible models and look in those possible models
2712
02:19:04,600 --> 02:19:08,000
to see whether or not, if our knowledge base is true,
2713
02:19:08,000 --> 02:19:11,600
is the query in question true as well.
2714
02:19:11,600 --> 02:19:14,800
So let's now take a look at how we might actually go about writing this
2715
02:19:14,800 --> 02:19:16,360
in a programming language like Python.
2716
02:19:16,360 --> 02:19:18,280
Take a look at some actual code that would
2717
02:19:18,280 --> 02:19:21,400
encode this notion of propositional symbols and logic
2718
02:19:21,400 --> 02:19:25,560
and these connectives like and and or and not and implication and so forth
2719
02:19:25,560 --> 02:19:28,160
and see what that code might actually look like.
2720
02:19:28,160 --> 02:19:30,960
So I've written in advance a logic library that's
2721
02:19:30,960 --> 02:19:33,480
more detailed than we need to worry about entirely today.
2722
02:19:33,480 --> 02:19:37,480
But the important thing is that we have one class for every type
2723
02:19:37,480 --> 02:19:40,600
of logical symbol or connective that we might have.
2724
02:19:40,600 --> 02:19:44,040
So we just have one class for logical symbols, for example,
2725
02:19:44,040 --> 02:19:46,720
where every symbol is going to represent and store
2726
02:19:46,720 --> 02:19:49,360
some name for that particular symbol.
2727
02:19:49,360 --> 02:19:52,920
And we also have a class for not that takes an operand.
2728
02:19:52,920 --> 02:19:56,320
So we might say not one symbol to say something is not true
2729
02:19:56,320 --> 02:19:58,200
or some other sentence is not true.
2730
02:19:58,200 --> 02:20:02,000
We have one for and, one for or, so on and so forth.
2731
02:20:02,000 --> 02:20:03,760
And I'll just demonstrate how this works.
2732
02:20:03,760 --> 02:20:07,480
And you can take a look at the actual logic.py later on.
2733
02:20:07,480 --> 02:20:11,200
But I'll go ahead and call this file harry.py.
2734
02:20:11,200 --> 02:20:15,080
We're going to store information about this world of Harry Potter,
2735
02:20:15,080 --> 02:20:16,320
for example.
2736
02:20:16,320 --> 02:20:19,140
So I'll go ahead and import from my logic module.
2737
02:20:19,140 --> 02:20:20,640
I'll import everything.
2738
02:20:20,640 --> 02:20:25,520
And in this library, in order to create a symbol, you use capital-S Symbol.
2739
02:20:25,520 --> 02:20:30,720
And I'll create a symbol for rain, to mean it is raining, for example.
2740
02:20:30,720 --> 02:20:35,240
And I'll create a symbol for Hagrid, to mean Harry visited Hagrid,
2741
02:20:35,240 --> 02:20:36,920
is what this symbol is going to mean.
2742
02:20:36,920 --> 02:20:38,960
So this symbol means it is raining.
2743
02:20:38,960 --> 02:20:41,960
This symbol means Harry visited Hagrid.
2744
02:20:41,960 --> 02:20:49,760
And I'll add another symbol called Dumbledore for Harry visited Dumbledore.
2745
02:20:49,760 --> 02:20:52,400
Now, I'd like to save these symbols so that I can use them later
2746
02:20:52,400 --> 02:20:54,000
as I do some logical analysis.
2747
02:20:54,000 --> 02:20:56,760
So I'll go ahead and save each one of them inside of a variable.
2748
02:20:56,760 --> 02:21:02,560
So like rain, Hagrid, and Dumbledore, so you could call the variables anything.
2749
02:21:02,560 --> 02:21:04,360
And now that I have these logical symbols,
2750
02:21:04,360 --> 02:21:07,880
I can use logical connectives to combine them together.
2751
02:21:07,880 --> 02:21:14,680
So for example, if I have a sentence like And of rain and Hagrid,
2752
02:21:14,680 --> 02:21:18,560
for example, which is not necessarily true, but just for demonstration,
2753
02:21:18,560 --> 02:21:22,160
I can now try and print out sentence.formula, which
2754
02:21:22,160 --> 02:21:25,440
is a function I wrote that takes a sentence in propositional logic
2755
02:21:25,440 --> 02:21:27,560
and just prints it out so that we, the programmers,
2756
02:21:27,560 --> 02:21:32,000
can now see this in order to get an understanding for how it actually works.
2757
02:21:32,000 --> 02:21:36,280
So if I run python harry.py, what we'll see
2758
02:21:36,280 --> 02:21:40,040
is this sentence in propositional logic, rain and Hagrid.
2759
02:21:40,040 --> 02:21:44,360
This is the logical representation of what we have here in our Python program
2760
02:21:44,360 --> 02:21:48,040
of saying and whose arguments are rain and Hagrid.
2761
02:21:48,040 --> 02:21:51,800
So we're saying rain and Hagrid by encoding that idea.
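To make the idea concrete without the course's logic.py, here is a stripped-down sketch of the same object-oriented pattern (my own minimal classes, not the actual library): each connective is a class that stores its arguments and can print itself as a formula.

```python
class Symbol:
    """A propositional symbol with a human-readable name."""
    def __init__(self, name):
        self.name = name
    def formula(self):
        return self.name

class And:
    """Conjunction of one or more sentences."""
    def __init__(self, *conjuncts):
        self.conjuncts = conjuncts
    def formula(self):
        # Join the sub-formulas with the logical-and symbol.
        return " ∧ ".join(c.formula() for c in self.conjuncts)

rain = Symbol("rain")
hagrid = Symbol("hagrid")
sentence = And(rain, hagrid)
print(sentence.formula())  # rain ∧ hagrid
```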
2762
02:21:51,800 --> 02:21:54,680
And this is quite common in Python object-oriented programming,
2763
02:21:54,680 --> 02:21:56,680
where you have a number of different classes,
2764
02:21:56,680 --> 02:22:01,000
and you pass arguments into them in order to create a new And object,
2765
02:22:01,000 --> 02:22:03,800
for example, in order to represent this idea.
2766
02:22:03,800 --> 02:22:07,480
But now what I'd like to do is somehow encode the knowledge
2767
02:22:07,480 --> 02:22:09,600
that I have about the world in order to solve
2768
02:22:09,600 --> 02:22:11,560
that problem from the beginning of class, where
2769
02:22:11,560 --> 02:22:14,360
we talked about trying to figure out who Harry visited
2770
02:22:14,360 --> 02:22:17,240
and trying to figure out if it's raining or if it's not raining.
2771
02:22:17,240 --> 02:22:19,520
And so what knowledge do I have?
2772
02:22:19,520 --> 02:22:22,600
I'll go ahead and create a new variable called knowledge.
2773
02:22:22,600 --> 02:22:23,440
And what do I know?
2774
02:22:23,440 --> 02:22:25,920
Well, I know the very first sentence that we talked about
2775
02:22:25,920 --> 02:22:30,840
was the idea that if it is not raining, then Harry will visit Hagrid.
2776
02:22:30,840 --> 02:22:33,720
So all right, how do I encode the idea that it is not raining?
2777
02:22:33,720 --> 02:22:36,640
Well, I can use not and then the rain symbol.
2778
02:22:36,640 --> 02:22:39,240
So here's me saying that it is not raining.
2779
02:22:39,240 --> 02:22:42,400
And now the implication is that if it is not raining,
2780
02:22:42,400 --> 02:22:45,040
then Harry visited Hagrid.
2781
02:22:45,040 --> 02:22:48,520
So I'll wrap this inside of an implication to say,
2782
02:22:48,520 --> 02:22:52,240
if it is not raining, this first argument to the implication
2783
02:22:52,240 --> 02:22:56,640
well, then Harry visited Hagrid.
2784
02:22:56,640 --> 02:23:00,400
So I'm saying implication, the premise is that it's not raining.
2785
02:23:00,400 --> 02:23:04,000
And if it is not raining, then Harry visited Hagrid.
2786
02:23:04,000 --> 02:23:07,640
And I can print out knowledge.formula to see the logical formula
2787
02:23:07,640 --> 02:23:09,600
equivalent of that same idea.
2788
02:23:09,600 --> 02:23:11,760
So I run Python of harry.py.
2789
02:23:11,760 --> 02:23:13,840
And this is the logical formula that we see
2790
02:23:13,840 --> 02:23:16,080
as a result, which is a text-based version of what
2791
02:23:16,080 --> 02:23:18,760
we were looking at before, that if it is not raining,
2792
02:23:18,760 --> 02:23:23,160
then that implies that Harry visited Hagrid.
2793
02:23:23,160 --> 02:23:26,560
But there was additional information that we had access to as well.
2794
02:23:26,560 --> 02:23:31,640
In this case, we had access to the fact that Harry visited either Hagrid
2795
02:23:31,640 --> 02:23:32,920
or Dumbledore.
2796
02:23:32,920 --> 02:23:34,520
So how do I encode that?
2797
02:23:34,520 --> 02:23:36,520
Well, this means that in my knowledge, I've really
2798
02:23:36,520 --> 02:23:38,400
got multiple pieces of knowledge going on.
2799
02:23:38,400 --> 02:23:41,520
I know one thing and another thing and another thing.
2800
02:23:41,520 --> 02:23:44,920
So I'll go ahead and wrap all of my knowledge inside of an and.
2801
02:23:44,920 --> 02:23:47,480
And I'll move things on to new lines just for good measure.
2802
02:23:47,480 --> 02:23:49,120
But I know multiple things.
2803
02:23:49,120 --> 02:23:52,600
So I'm saying knowledge is an and of multiple different sentences.
2804
02:23:52,600 --> 02:23:55,800
I know multiple different sentences to be true.
2805
02:23:55,800 --> 02:23:59,280
One such sentence that I know to be true is this implication,
2806
02:23:59,280 --> 02:24:02,600
that if it is not raining, then Harry visited Hagrid.
2807
02:24:02,600 --> 02:24:08,640
Another such sentence that I know to be true is Or of Hagrid and Dumbledore.
2808
02:24:08,640 --> 02:24:12,400
In other words, Hagrid or Dumbledore is true,
2809
02:24:12,400 --> 02:24:16,440
because I know that Harry visited Hagrid or Dumbledore.
2810
02:24:16,440 --> 02:24:17,800
But I know more than that, actually.
2811
02:24:17,800 --> 02:24:22,000
That initial sentence from before said that Harry visited Hagrid or Dumbledore,
2812
02:24:22,000 --> 02:24:23,320
but not both.
2813
02:24:23,320 --> 02:24:26,560
So now I want a sentence that will encode the idea that Harry didn't
2814
02:24:26,560 --> 02:24:29,840
visit both Hagrid and Dumbledore.
2815
02:24:29,840 --> 02:24:33,120
Well, the notion of Harry visiting Hagrid and Dumbledore
2816
02:24:33,120 --> 02:24:38,400
would be represented like this: And of Hagrid and Dumbledore.
2817
02:24:38,400 --> 02:24:41,600
And if that is not true, if I want to say not that,
2818
02:24:41,600 --> 02:24:46,000
then I'll just wrap this whole thing inside of a not.
2819
02:24:46,000 --> 02:24:50,040
So now these three lines, line 8 says that if it is not raining,
2820
02:24:50,040 --> 02:24:51,720
then Harry visited Hagrid.
2821
02:24:51,720 --> 02:24:55,760
Line 9 says Harry visited Hagrid or Dumbledore.
2822
02:24:55,760 --> 02:25:01,000
And line 10 says Harry didn't visit both Hagrid and Dumbledore,
2823
02:25:01,000 --> 02:25:04,840
that it is not true that both the Hagrid symbol and the Dumbledore
2824
02:25:04,840 --> 02:25:05,920
symbol are true.
2825
02:25:05,920 --> 02:25:08,240
Only one of them can be true.
2826
02:25:08,240 --> 02:25:11,360
And finally, the last piece of information that I knew
2827
02:25:11,360 --> 02:25:15,280
was the fact that Harry visited Dumbledore.
2828
02:25:15,280 --> 02:25:18,800
So these now are the pieces of knowledge that I know, one sentence
2829
02:25:18,800 --> 02:25:21,400
and another sentence and another and another.
2830
02:25:21,400 --> 02:25:24,600
And I can print out what I know just to see it a little bit more visually.
2831
02:25:24,600 --> 02:25:28,640
And here now is a logical representation of the information
2832
02:25:28,640 --> 02:25:31,320
that my computer is now internally representing
2833
02:25:31,320 --> 02:25:33,720
using these various different Python objects.
2834
02:25:33,720 --> 02:25:37,120
And again, take a look at logic.py if you want to take a look at how exactly
2835
02:25:37,120 --> 02:25:40,240
it's implementing this, but no need to worry too much about all of the details
2836
02:25:40,240 --> 02:25:40,840
there.
2837
02:25:40,840 --> 02:25:44,800
We're here saying that if it is not raining, then Harry visited Hagrid.
2838
02:25:44,800 --> 02:25:47,880
We're saying that Hagrid or Dumbledore is true.
2839
02:25:47,880 --> 02:25:52,560
And we're saying it is not the case that Hagrid and Dumbledore is true,
2840
02:25:52,560 --> 02:25:54,120
that they're not both true.
2841
02:25:54,120 --> 02:25:57,280
And we also know that Dumbledore is true.
2842
02:25:57,280 --> 02:26:01,200
So this long logical sentence represents our knowledge base.
2843
02:26:01,200 --> 02:26:03,600
It is the thing that we know.
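Those four sentences can also be encoded without the course library. In this self-contained sketch (my own encoding, using a plain boolean function for the knowledge base rather than logic.py classes), enumerating all models shows that the knowledge base entails that it rained:

```python
from itertools import product

def kb(rain, hagrid, dumbledore):
    return (((not (not rain)) or hagrid)     # not rain -> Harry visited Hagrid
            and (hagrid or dumbledore)       # visited Hagrid or Dumbledore
            and not (hagrid and dumbledore)  # but not both
            and dumbledore)                  # visited Dumbledore

# Entailment: in every model where the KB is true, is rain true?
entails_rain = all(rain
                   for rain, hagrid, dumbledore
                   in product([True, False], repeat=3)
                   if kb(rain, hagrid, dumbledore))

print(entails_rain)  # True -- it must have been raining
```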
2844
02:26:03,600 --> 02:26:06,600
And now what we'd like to do is we'd like to use model checking
2845
02:26:06,600 --> 02:26:10,320
to ask a query, to ask a question like, based on this information,
2846
02:26:10,320 --> 02:26:12,160
do I know whether or not it's raining?
2847
02:26:12,160 --> 02:26:15,200
And we as humans were able to logic our way through it and figure out that,
2848
02:26:15,200 --> 02:26:18,040
all right, based on these sentences, we can conclude this and that
2849
02:26:18,040 --> 02:26:20,400
to figure out that, yes, it must have been raining.
2850
02:26:20,400 --> 02:26:23,640
But now we'd like for the computer to do that as well.
2851
02:26:23,640 --> 02:26:26,000
So let's take a look at the model checking algorithm
2852
02:26:26,000 --> 02:26:27,840
that is going to follow that same pattern
2853
02:26:27,840 --> 02:26:30,400
that we drew out in pseudocode a moment ago.
2854
02:26:30,400 --> 02:26:32,400
So I've defined a function here in logic.py
2855
02:26:32,400 --> 02:26:35,880
that you can take a look at called model check.
2856
02:26:35,880 --> 02:26:39,480
Model check takes two arguments, the knowledge that I already know,
2857
02:26:39,480 --> 02:26:41,000
and the query.
2858
02:26:41,000 --> 02:26:43,440
And the idea is, in order to do model checking,
2859
02:26:43,440 --> 02:26:46,160
I need to enumerate all of the possible models.
2860
02:26:46,160 --> 02:26:49,280
And for each of the possible models, I need to ask myself,
2861
02:26:49,280 --> 02:26:50,960
is the knowledge base true?
2862
02:26:50,960 --> 02:26:52,800
And is the query true?
2863
02:26:52,800 --> 02:26:54,560
So the first thing I need to do is somehow
2864
02:26:54,560 --> 02:26:57,040
enumerate all of the possible models, meaning
2865
02:26:57,040 --> 02:26:59,720
for all possible symbols that exist, I need
2866
02:26:59,720 --> 02:27:02,440
to assign true and false to each one of them
2867
02:27:02,440 --> 02:27:05,200
and see whether or not it's still true.
2868
02:27:05,200 --> 02:27:07,520
And so here is the way we're going to do that.
2869
02:27:07,520 --> 02:27:08,800
We're going to start.
2870
02:27:08,800 --> 02:27:10,840
So I've defined another helper function internally
2871
02:27:10,840 --> 02:27:12,400
that we'll get to in just a moment.
2872
02:27:12,400 --> 02:27:17,080
But this function starts by getting all of the symbols in both the knowledge
2873
02:27:17,080 --> 02:27:20,000
and the query, by figuring out what symbols am I dealing with.
2874
02:27:20,000 --> 02:27:24,080
In this case, the symbols I'm dealing with are rain and Hagrid and Dumbledore,
2875
02:27:24,080 --> 02:27:26,600
but there might be other symbols depending on the problem.
2876
02:27:26,600 --> 02:27:29,680
And we'll take a look soon at some examples of situations
2877
02:27:29,680 --> 02:27:32,880
where ultimately we're going to need some additional symbols in order
2878
02:27:32,880 --> 02:27:34,720
to represent the problem.
2879
02:27:34,720 --> 02:27:38,200
And then we're going to run this check all function, which
2880
02:27:38,200 --> 02:27:41,960
is a helper function that's basically going to recursively call itself
2881
02:27:41,960 --> 02:27:46,880
checking every possible configuration of propositional symbols.
2882
02:27:46,880 --> 02:27:51,080
So we start out by looking at this check all function.
2883
02:27:51,080 --> 02:27:52,320
And what do we do?
2884
02:27:52,320 --> 02:27:57,040
So if not symbols, that means we've finished assigning all of the symbols.
2885
02:27:57,040 --> 02:27:58,840
We've assigned every symbol a value.
2886
02:27:58,840 --> 02:28:03,280
So far we haven't done that, but if we ever do, then we check.
2887
02:28:03,280 --> 02:28:05,440
In this model, is the knowledge true?
2888
02:28:05,440 --> 02:28:06,720
That's what this line is saying.
2889
02:28:06,720 --> 02:28:10,000
If we evaluate the knowledge propositional logic formula
2890
02:28:10,000 --> 02:28:14,280
using the model's assignment of truth values, is the knowledge true?
2891
02:28:14,280 --> 02:28:19,480
If the knowledge is true, then we should return true only if the query is true.
2892
02:28:19,480 --> 02:28:22,080
Because if the knowledge is true, we want the query
2893
02:28:22,080 --> 02:28:25,400
to be true as well in order for there to be entailment.
2894
02:28:25,400 --> 02:28:29,040
Otherwise, there won't be an entailment
2895
02:28:29,040 --> 02:28:33,000
if there's ever a situation where what we know in our knowledge is true,
2896
02:28:33,000 --> 02:28:36,200
but the query, the thing we're asking, happens to be false.
2897
02:28:36,200 --> 02:28:38,120
So this line here is checking that same idea
2898
02:28:38,120 --> 02:28:44,000
that in all worlds where the knowledge is true, the query must also be true.
2899
02:28:44,000 --> 02:28:47,720
Otherwise, we can just return true because if the knowledge isn't true,
2900
02:28:47,720 --> 02:28:48,720
then we don't care.
2901
02:28:48,720 --> 02:28:50,520
This is equivalent to when we were enumerating
2902
02:28:50,520 --> 02:28:52,240
this table from a moment ago.
2903
02:28:52,240 --> 02:28:56,080
In all situations where the knowledge base wasn't true, all of these seven
2904
02:28:56,080 --> 02:29:00,160
rows here, we didn't care whether or not our query was true or not.
2905
02:29:00,160 --> 02:29:03,080
We only care to check whether the query is true
2906
02:29:03,080 --> 02:29:06,920
when the knowledge base is actually true, which was just this green highlighted
2907
02:29:06,920 --> 02:29:08,840
row right there.
2908
02:29:08,840 --> 02:29:12,560
So that logic is encoded using that statement there.
2909
02:29:12,560 --> 02:29:15,200
And otherwise, if we haven't assigned symbols yet,
2910
02:29:15,200 --> 02:29:18,200
which at the start we haven't, then the first thing we do
2911
02:29:18,200 --> 02:29:20,640
is pop one of the symbols.
2912
02:29:20,640 --> 02:29:23,520
I make a copy of the symbols first just to save an existing copy.
2913
02:29:23,520 --> 02:29:26,200
But I pop one symbol off of the remaining symbols
2914
02:29:26,200 --> 02:29:29,080
so that I just pick one symbol at random.
2915
02:29:29,080 --> 02:29:33,680
And I create one copy of the model where that symbol is true.
2916
02:29:33,680 --> 02:29:38,040
And I create a second copy of the model where that symbol is false.
2917
02:29:38,040 --> 02:29:41,480
So I now have two copies of the model, one where the symbol is true
2918
02:29:41,480 --> 02:29:43,200
and one where the symbol is false.
2919
02:29:43,200 --> 02:29:47,920
And I need to make sure that this entailment holds in both of those models.
2920
02:29:47,920 --> 02:29:52,200
So I recursively check all on the model where the statement is true
2921
02:29:52,200 --> 02:29:57,080
and check all on the model where the statement is false.
2922
02:29:57,080 --> 02:29:59,120
So again, you can take a look at that function
2923
02:29:59,120 --> 02:30:02,120
to try to get a sense for how exactly this logic is working.
2924
02:30:02,120 --> 02:30:03,960
But in effect, what it's doing is recursively
2925
02:30:03,960 --> 02:30:07,000
calling this check all function again and again and again.
2926
02:30:07,000 --> 02:30:09,160
And on every level of the recursion, we're
2927
02:30:09,160 --> 02:30:13,280
saying let's pick a new symbol that we haven't yet assigned,
2928
02:30:13,280 --> 02:30:16,000
assign it to true and assign it to false,
2929
02:30:16,000 --> 02:30:19,360
and then check to make sure that the entailment holds in both cases.
2930
02:30:19,360 --> 02:30:22,160
Because ultimately, I need to check every possible world.
2931
02:30:22,160 --> 02:30:24,360
I need to take every combination of symbols
2932
02:30:24,360 --> 02:30:27,520
and try every combination of true and false
2933
02:30:27,520 --> 02:30:31,960
in order to figure out whether the entailment relation actually holds.
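Here is a compact sketch of that recursion — not the course's actual logic.py implementation, just the same idea, with a "sentence" represented as a plain Python function that maps a model (a dict of symbol names to booleans) to True or False:

```python
# A sketch of model checking by recursive enumeration (assumed stand-in
# for the lecture's implementation, not the actual logic.py source).

def model_check(knowledge, query, symbols):
    """Return True iff `knowledge` entails `query` over `symbols`."""

    def check_all(symbols, model):
        # Base case: every symbol has been assigned a truth value.
        if not symbols:
            # In any world where the knowledge holds, the query must too.
            return query(model) if knowledge(model) else True
        # Recursive case: pick one unassigned symbol and branch on it.
        remaining = symbols.copy()
        p = remaining.pop()
        return (check_all(remaining, {**model, p: True})
                and check_all(remaining, {**model, p: False}))

    return check_all(list(symbols), {})

# P and (P implies Q) entails Q, but tells us nothing about R:
kb = lambda m: m["P"] and (not m["P"] or m["Q"])
print(model_check(kb, lambda m: m["Q"], ["P", "Q", "R"]))  # True
print(model_check(kb, lambda m: m["R"], ["P", "Q", "R"]))  # False
```

Each level of the recursion branches twice, so with n symbols the function ends up checking 2 to the power n models — which is why model checking gets expensive as the number of symbols grows.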
2934
02:30:31,960 --> 02:30:34,320
So that function we've written for you.
2935
02:30:34,320 --> 02:30:37,720
But in order to use that function inside of harry.py,
2936
02:30:37,720 --> 02:30:39,520
what I'll write is something like this.
2937
02:30:39,520 --> 02:30:43,300
I would like to model check based on the knowledge.
2938
02:30:43,300 --> 02:30:46,240
And then I provide as a second argument what the query is,
2939
02:30:46,240 --> 02:30:48,120
what the thing I want to ask is.
2940
02:30:48,120 --> 02:30:51,880
And what I want to ask in this case is, is it raining?
2941
02:30:51,880 --> 02:30:54,040
So model check again takes two arguments.
2942
02:30:54,040 --> 02:30:57,480
The first argument is the information that I know, this knowledge,
2943
02:30:57,480 --> 02:31:01,960
which in this case is this information that was given to me at the beginning.
2944
02:31:01,960 --> 02:31:06,800
And the second argument, rain, is encoding the idea of the query.
2945
02:31:06,800 --> 02:31:07,720
What am I asking?
2946
02:31:07,720 --> 02:31:10,120
I would like to ask, based on this knowledge,
2947
02:31:10,120 --> 02:31:13,360
do I know for sure that it is raining?
2948
02:31:13,360 --> 02:31:17,200
And I can try and print out the result of that.
2949
02:31:17,200 --> 02:31:20,680
And when I run this program, I see that the answer is true.
2950
02:31:20,680 --> 02:31:23,200
That based on this information, I can conclusively
2951
02:31:23,200 --> 02:31:26,800
say that it is raining, because using this model checking algorithm,
2952
02:31:26,800 --> 02:31:30,920
we were able to check that in every world where this knowledge is true,
2953
02:31:30,920 --> 02:31:31,720
it is raining.
2954
02:31:31,720 --> 02:31:35,240
In other words, there is no world where this knowledge is true,
2955
02:31:35,240 --> 02:31:36,680
and it is not raining.
2956
02:31:36,680 --> 02:31:41,200
So you can conclude that it is, in fact, raining.
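That model check call can be sketched end to end like this. The knowledge sentences come from earlier in the lecture and are an assumption here, written as plain boolean expressions rather than the logic.py classes:

```python
from itertools import product

# Assumed knowledge from earlier in the lecture:
#   If it didn't rain, Harry visited Hagrid.
#   Harry visited Hagrid or Dumbledore, but not both.
#   Harry visited Dumbledore.
symbols = ["rain", "hagrid", "dumbledore"]

def knowledge(m):
    return ((m["rain"] or m["hagrid"])                 # not rain -> hagrid
            and (m["hagrid"] or m["dumbledore"])       # one or the other
            and not (m["hagrid"] and m["dumbledore"])  # but not both
            and m["dumbledore"])

def model_check(knowledge, query):
    # Entailment: the query holds in every model where the knowledge holds.
    return all(query(dict(zip(symbols, values)))
               for values in product([True, False], repeat=len(symbols))
               if knowledge(dict(zip(symbols, values))))

print(model_check(knowledge, lambda m: m["rain"]))  # True: it must be raining
```

Only one of the eight models satisfies the knowledge (rain true, Hagrid false, Dumbledore true), and it is raining in that model, so the entailment holds.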
2957
02:31:41,200 --> 02:31:43,640
And this sort of logic can be applied to a number
2958
02:31:43,640 --> 02:31:47,200
of different types of problems, that if confronted with a problem where
2959
02:31:47,200 --> 02:31:50,880
some sort of logical deduction can be used in order to try to solve it,
2960
02:31:50,880 --> 02:31:54,080
you might try thinking about what propositional symbols you might
2961
02:31:54,080 --> 02:31:56,440
need in order to represent that information,
2962
02:31:56,440 --> 02:31:58,880
and what statements in propositional logic
2963
02:31:58,880 --> 02:32:03,400
you might use in order to encode that information which you know.
2964
02:32:03,400 --> 02:32:05,720
And this process of trying to take a problem
2965
02:32:05,720 --> 02:32:08,420
and figure out what propositional symbols to use in order
2966
02:32:08,420 --> 02:32:11,520
to encode that idea, or how to represent it logically,
2967
02:32:11,520 --> 02:32:13,640
is known as knowledge engineering.
2968
02:32:13,640 --> 02:32:16,520
That software engineers and AI engineers will take a problem
2969
02:32:16,520 --> 02:32:19,000
and try and figure out how to distill it down
2970
02:32:19,000 --> 02:32:22,640
into knowledge that is representable by a computer.
2971
02:32:22,640 --> 02:32:25,240
And if we can take any general purpose problem, some problem
2972
02:32:25,240 --> 02:32:27,200
that we find in the human world, and turn it
2973
02:32:27,200 --> 02:32:30,160
into a problem that computers know how to solve
2974
02:32:30,160 --> 02:32:32,600
by using any number of different variables, well,
2975
02:32:32,600 --> 02:32:35,320
then we can take a computer that is able to do something
2976
02:32:35,320 --> 02:32:37,960
like model checking or some other inference algorithm
2977
02:32:37,960 --> 02:32:41,960
and actually figure out how to solve that problem.
2978
02:32:41,960 --> 02:32:45,320
So now we'll take a look at two or three examples of knowledge engineering
2979
02:32:45,320 --> 02:32:47,960
and practice, of taking some problem and figuring out
2980
02:32:47,960 --> 02:32:51,760
how we can apply logical symbols and use logical formulas
2981
02:32:51,760 --> 02:32:53,960
to be able to encode that idea.
2982
02:32:53,960 --> 02:32:57,040
And we'll start with a very popular board game in the US and the UK
2983
02:32:57,040 --> 02:32:58,040
known as Clue.
2984
02:32:58,040 --> 02:33:00,880
Now, in the game of Clue, there's a number of different factors
2985
02:33:00,880 --> 02:33:01,720
that are going on.
2986
02:33:01,720 --> 02:33:04,360
But the basic premise of the game, if you've never played it before,
2987
02:33:04,360 --> 02:33:06,200
is that there are a number of different people.
2988
02:33:06,200 --> 02:33:09,120
For now, we'll just use three, Colonel Mustard, Professor Plum,
2989
02:33:09,120 --> 02:33:10,160
and Miss Scarlet.
2990
02:33:10,160 --> 02:33:12,800
There are a number of different rooms, like a ballroom, a kitchen,
2991
02:33:12,800 --> 02:33:13,480
and a library.
2992
02:33:13,480 --> 02:33:17,400
And there are a number of different weapons, a knife, a revolver, and a wrench.
2993
02:33:17,400 --> 02:33:21,880
And three of these, one person, one room, and one weapon,
2994
02:33:21,880 --> 02:33:26,720
is the solution to the mystery, the murderer and what room they were in
2995
02:33:26,720 --> 02:33:28,520
and what weapon they happened to use.
2996
02:33:28,520 --> 02:33:30,640
And what happens at the beginning of the game
2997
02:33:30,640 --> 02:33:32,800
is that all these cards are randomly shuffled together.
2998
02:33:32,800 --> 02:33:35,360
And three of them, one person, one room, and one weapon,
2999
02:33:35,360 --> 02:33:37,800
are placed into a sealed envelope that we don't get to see.
3000
02:33:37,800 --> 02:33:41,360
And we would like to figure out, using some sort of logical process,
3001
02:33:41,360 --> 02:33:45,560
what's inside the envelope, which person, which room, and which weapon.
3002
02:33:45,560 --> 02:33:50,000
And we do so by looking at some, but not all, of these cards here,
3003
02:33:50,000 --> 02:33:54,920
by looking at these cards to try and figure out what might be going on.
3004
02:33:54,920 --> 02:33:56,480
And so this is a very popular game.
3005
02:33:56,480 --> 02:33:58,280
But let's now try and formalize it and see
3006
02:33:58,280 --> 02:34:01,960
if we could train a computer to be able to play this game by reasoning
3007
02:34:01,960 --> 02:34:04,120
through it logically.
3008
02:34:04,120 --> 02:34:06,620
So in order to do this, we'll begin by thinking about what
3009
02:34:06,620 --> 02:34:09,480
propositional symbols we're ultimately going to need.
3010
02:34:09,480 --> 02:34:12,560
Remember, again, that propositional symbols are just some symbol,
3011
02:34:12,560 --> 02:34:17,560
some variable, that can be either true or false in the world.
3012
02:34:17,560 --> 02:34:20,040
And so in this case, the propositional symbols
3013
02:34:20,040 --> 02:34:25,000
are really just going to correspond to each of the possible things that
3014
02:34:25,000 --> 02:34:26,480
could be inside the envelope.
3015
02:34:26,480 --> 02:34:29,360
Mustard is a propositional symbol that, in this case,
3016
02:34:29,360 --> 02:34:32,640
will just be true if Colonel Mustard is inside the envelope,
3017
02:34:32,640 --> 02:34:35,200
if he is the murderer, and false otherwise.
3018
02:34:35,200 --> 02:34:38,520
And likewise for Plum, for Professor Plum, and Scarlet, for Miss Scarlet.
3019
02:34:38,520 --> 02:34:41,600
And likewise for each of the rooms and for each of the weapons.
3020
02:34:41,600 --> 02:34:46,120
We have one propositional symbol for each of these ideas.
3021
02:34:46,120 --> 02:34:48,560
Then using those propositional symbols, we
3022
02:34:48,560 --> 02:34:52,320
can begin to create logical sentences, create knowledge
3023
02:34:52,320 --> 02:34:54,320
that we know about the world.
3024
02:34:54,320 --> 02:34:57,520
So for example, we know that someone is the murderer,
3025
02:34:57,520 --> 02:35:00,560
that one of the three people is, in fact, the murderer.
3026
02:35:00,560 --> 02:35:01,880
And how would we encode that?
3027
02:35:01,880 --> 02:35:04,280
Well, we don't know for sure who the murderer is.
3028
02:35:04,280 --> 02:35:09,320
But we know it is one person or the second person or the third person.
3029
02:35:09,320 --> 02:35:10,760
So I could say something like this.
3030
02:35:10,760 --> 02:35:13,760
Mustard or Plum or Scarlet.
3031
02:35:13,760 --> 02:35:17,280
And this piece of knowledge encodes that one of these three people
3032
02:35:17,280 --> 02:35:17,960
is the murderer.
3033
02:35:17,960 --> 02:35:22,680
We don't know which, but one of these three things must be true.
3034
02:35:22,680 --> 02:35:24,320
What other information do we know?
3035
02:35:24,320 --> 02:35:26,640
Well, we know that, for example, one of the rooms
3036
02:35:26,640 --> 02:35:28,960
must have been the room in the envelope.
3037
02:35:28,960 --> 02:35:33,120
The crime was committed either in the ballroom or the kitchen or the library.
3038
02:35:33,120 --> 02:35:34,720
Again, right now, we don't know which.
3039
02:35:34,720 --> 02:35:36,440
But this is knowledge we know at the outset,
3040
02:35:36,440 --> 02:35:40,440
knowledge that one of these three must be inside the envelope.
3041
02:35:40,440 --> 02:35:42,480
And likewise, we can say the same thing about the weapon,
3042
02:35:42,480 --> 02:35:45,480
that it was either the knife or the revolver or the wrench,
3043
02:35:45,480 --> 02:35:48,480
that one of those weapons must have been the weapon of choice
3044
02:35:48,480 --> 02:35:51,640
and therefore the weapon in the envelope.
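Those three sentences might be written down like this, using small stand-in classes whose names mirror the lecture's library (the implementation here is assumed, not the actual logic.py source):

```python
# Minimal stand-ins for the lecture's Symbol/Or/And classes (a sketch).
class Symbol:
    def __init__(self, name): self.name = name
    def evaluate(self, model): return model[self.name]
    def formula(self): return self.name

class Or:
    def __init__(self, *args): self.args = list(args)
    def evaluate(self, model): return any(a.evaluate(model) for a in self.args)
    def formula(self): return "(" + " ∨ ".join(a.formula() for a in self.args) + ")"

class And:
    def __init__(self, *args): self.args = list(args)
    def add(self, arg): self.args.append(arg)  # append one more sentence
    def evaluate(self, model): return all(a.evaluate(model) for a in self.args)
    def formula(self): return " ∧ ".join(a.formula() for a in self.args)

mustard, plum, scarlet = Symbol("mustard"), Symbol("plum"), Symbol("scarlet")
ballroom, kitchen, library = Symbol("ballroom"), Symbol("kitchen"), Symbol("library")
knife, revolver, wrench = Symbol("knife"), Symbol("revolver"), Symbol("wrench")

# Initial knowledge: one person, one room, one weapon is in the envelope.
knowledge = And(
    Or(mustard, plum, scarlet),
    Or(ballroom, kitchen, library),
    Or(knife, revolver, wrench),
)
print(knowledge.formula())
```

Evaluating `knowledge` against any complete assignment of the nine symbols tells you whether that world is consistent with what you know so far.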
3045
02:35:51,640 --> 02:35:53,680
And then as the game progresses, the gameplay
3046
02:35:53,680 --> 02:35:55,840
works by people get various different cards.
3047
02:35:55,840 --> 02:35:59,000
And using those cards, you can deduce information.
3048
02:35:59,000 --> 02:36:01,080
That if someone gives you a card, for example,
3049
02:36:01,080 --> 02:36:04,200
I have the Professor Plum card in my hand,
3050
02:36:04,200 --> 02:36:07,960
then I know the Professor Plum card can't be inside the envelope.
3051
02:36:07,960 --> 02:36:11,320
I know that Professor Plum is not the criminal,
3052
02:36:11,320 --> 02:36:15,040
so I know a piece of information like not Plum, for example.
3053
02:36:15,040 --> 02:36:18,440
I know that Professor Plum has to be false.
3054
02:36:18,440 --> 02:36:21,360
This propositional symbol is not true.
3055
02:36:21,360 --> 02:36:24,760
And sometimes I might not know for sure that a particular card is not
3056
02:36:24,760 --> 02:36:27,080
in the middle, but sometimes someone will make a guess
3057
02:36:27,080 --> 02:36:30,280
and I'll know that one of three possibilities is not true.
3058
02:36:30,280 --> 02:36:33,600
Someone will guess Colonel Mustard in the library with the revolver
3059
02:36:33,600 --> 02:36:35,000
or something to that effect.
3060
02:36:35,000 --> 02:36:38,080
And in that case, a card might be revealed that I don't see.
3061
02:36:38,080 --> 02:36:43,040
But if it is a card and it is either Colonel Mustard or the revolver
3062
02:36:43,040 --> 02:36:46,760
or the library, then I know that at least one of them
3063
02:36:46,760 --> 02:36:47,760
can't be in the middle.
3064
02:36:47,760 --> 02:36:51,240
So I know something like it is either not Mustard
3065
02:36:51,240 --> 02:36:55,360
or it is not the library or it is not the revolver.
3066
02:36:55,360 --> 02:36:57,200
Now maybe multiple of these are not true,
3067
02:36:57,200 --> 02:37:01,200
but I know that at least one of Mustard, Library, and Revolver
3068
02:37:01,200 --> 02:37:03,920
must, in fact, be false.
3069
02:37:03,920 --> 02:37:07,920
And so this now is a propositional logic representation
3070
02:37:07,920 --> 02:37:10,640
of this game of Clue, a way of encoding the knowledge that we
3071
02:37:10,640 --> 02:37:13,560
know inside this game using propositional logic
3072
02:37:13,560 --> 02:37:15,960
that a computer algorithm, something like model checking
3073
02:37:15,960 --> 02:37:19,920
that we saw a moment ago, can actually look at and understand.
3074
02:37:19,920 --> 02:37:21,920
So let's now take a look at some code to see
3075
02:37:21,920 --> 02:37:26,920
how this algorithm might actually work in practice.
3076
02:37:26,920 --> 02:37:30,000
All right, so I'm now going to open up a file called clue.py, which
3077
02:37:30,000 --> 02:37:31,200
I've started already.
3078
02:37:31,200 --> 02:37:33,520
And what we'll see here is I've defined a couple of things.
3079
02:37:33,520 --> 02:37:35,720
I've defined some symbols initially. Notice I
3080
02:37:35,720 --> 02:37:38,680
have a symbol for Colonel Mustard, a symbol for Professor Plum,
3081
02:37:38,680 --> 02:37:40,480
a symbol for Miss Scarlet, all of which
3082
02:37:40,480 --> 02:37:42,600
I've put inside of this list of characters.
3083
02:37:42,600 --> 02:37:45,400
I have a symbol for Ballroom and Kitchen and Library
3084
02:37:45,400 --> 02:37:46,960
inside of a list of rooms.
3085
02:37:46,960 --> 02:37:49,440
And then I have symbols for Knife and Revolver and Wrench.
3086
02:37:49,440 --> 02:37:50,760
These are my weapons.
3087
02:37:50,760 --> 02:37:53,760
And so all of these characters and rooms and weapons altogether,
3088
02:37:53,760 --> 02:37:55,840
those are my symbols.
3089
02:37:55,840 --> 02:37:59,200
And now I also have this check knowledge function.
3090
02:37:59,200 --> 02:38:02,760
And what the check knowledge function does is it takes my knowledge
3091
02:38:02,760 --> 02:38:07,280
and it's going to try and draw conclusions about what I know.
3092
02:38:07,280 --> 02:38:10,920
So for example, we'll loop over all of the possible symbols
3093
02:38:10,920 --> 02:38:13,680
and we'll check, do I know that that symbol is true?
3094
02:38:13,680 --> 02:38:15,960
And a symbol is going to be something like Professor Plum
3095
02:38:15,960 --> 02:38:17,400
or the Knife or the Library.
3096
02:38:17,400 --> 02:38:19,520
And if I know that it is true, in other words,
3097
02:38:19,520 --> 02:38:22,400
I know that it must be the card in the envelope,
3098
02:38:22,400 --> 02:38:24,880
then I'm going to print out using a function called
3099
02:38:24,880 --> 02:38:26,720
cprint, which prints things in color.
3100
02:38:26,720 --> 02:38:28,520
I'm going to print out the word yes, and I'm
3101
02:38:28,520 --> 02:38:32,240
going to print that in green, just to make it very clear to us.
3102
02:38:32,240 --> 02:38:35,160
If we're not sure that the symbol is true,
3103
02:38:35,160 --> 02:38:38,640
maybe I can check to see if I'm sure that the symbol is not true.
3104
02:38:38,640 --> 02:38:42,560
Like if I know for sure that it is not Professor Plum, for example.
3105
02:38:42,560 --> 02:38:44,840
And I do that by running model check again,
3106
02:38:44,840 --> 02:38:48,320
this time checking if my knowledge is not the symbol,
3107
02:38:48,320 --> 02:38:52,200
if I know for sure that the symbol is not true.
3108
02:38:52,200 --> 02:38:55,160
And if I don't know for sure that the symbol is not true,
3109
02:38:55,160 --> 02:38:59,480
because I say if not model check, meaning I'm not sure that the symbol is
3110
02:38:59,480 --> 02:39:03,080
false, well, then I'll go ahead and print out maybe next to the symbol.
3111
02:39:03,080 --> 02:39:07,920
Because maybe the symbol is true, maybe it's not, I don't actually know.
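The yes/maybe/no logic just described can be sketched like this, again with a toy model_check standing in for the real library and the symbols trimmed to just the three people:

```python
from itertools import product

SYMBOLS = ["mustard", "plum", "scarlet"]  # trimmed to just the people

def model_check(knowledge, query):
    models = [dict(zip(SYMBOLS, values))
              for values in product([True, False], repeat=len(SYMBOLS))]
    return all(query(m) for m in models if knowledge(m))

def check_knowledge(knowledge):
    """Label each symbol YES (entailed), NO (negation entailed), or MAYBE."""
    results = {}
    for s in SYMBOLS:
        if model_check(knowledge, lambda m: m[s]):
            results[s] = "YES"            # must be in the envelope
        elif not model_check(knowledge, lambda m: not m[s]):
            results[s] = "MAYBE"          # neither it nor its negation entailed
        else:
            results[s] = "NO"             # its negation is entailed
    return results

# One of the three people did it, and we hold the Mustard and Plum cards:
kb = lambda m: ((m["mustard"] or m["plum"] or m["scarlet"])
                and not m["mustard"] and not m["plum"])
print(check_knowledge(kb))  # {'mustard': 'NO', 'plum': 'NO', 'scarlet': 'YES'}
```

The MAYBE branch is the interesting one: a symbol is a maybe precisely when neither the symbol nor its negation is entailed by the knowledge.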
3112
02:39:07,920 --> 02:39:10,280
So what knowledge do I actually have?
3113
02:39:10,280 --> 02:39:12,360
Well, let's try and represent my knowledge now.
3114
02:39:12,360 --> 02:39:16,440
So my knowledge is, I know a couple of things, so I'll put them in an and.
3115
02:39:16,440 --> 02:39:20,280
And I know that one of the three people must be the criminal.
3116
02:39:20,280 --> 02:39:23,920
So I know Or(mustard, plum, scarlet).
3117
02:39:23,920 --> 02:39:26,920
This is my way of encoding that it is either Colonel Mustard or Professor
3118
02:39:26,920 --> 02:39:28,680
Plum or Miss Scarlet.
3119
02:39:28,680 --> 02:39:31,080
I know that it must have happened in one of the rooms.
3120
02:39:31,080 --> 02:39:36,040
So I know Or(ballroom, kitchen, library), for example.
3121
02:39:36,040 --> 02:39:38,800
And I know that one of the weapons must have been used as well.
3122
02:39:38,800 --> 02:39:43,320
So I know Or(knife, revolver, wrench).
3123
02:39:43,320 --> 02:39:45,280
So that might be my initial knowledge, that I
3124
02:39:45,280 --> 02:39:47,040
know that it must have been one of the people,
3125
02:39:47,040 --> 02:39:48,840
I know it must have been in one of the rooms,
3126
02:39:48,840 --> 02:39:51,800
and I know that it must have been one of the weapons.
3127
02:39:51,800 --> 02:39:54,120
And I can see what that knowledge looks like as a formula
3128
02:39:54,120 --> 02:39:56,920
by printing out knowledge.formula.
3129
02:39:56,920 --> 02:39:58,960
So I'll run python clue.py.
3130
02:39:58,960 --> 02:40:02,440
And here now is the information that I know in logical format.
3131
02:40:02,440 --> 02:40:05,800
I know that it is Colonel Mustard or Professor Plum or Miss Scarlet.
3132
02:40:05,800 --> 02:40:08,760
And I know that it is the ballroom, the kitchen, or the library.
3133
02:40:08,760 --> 02:40:11,800
And I know that it is the knife, the revolver, or the wrench.
3134
02:40:11,800 --> 02:40:13,800
But I don't know much more than that.
3135
02:40:13,800 --> 02:40:16,240
I can't really draw any firm conclusions.
3136
02:40:16,240 --> 02:40:19,320
And in fact, we can see that if I try and do,
3137
02:40:19,320 --> 02:40:24,000
let me go ahead and run my knowledge check function on my knowledge.
3138
02:40:24,000 --> 02:40:27,880
Check knowledge, rather,
3139
02:40:27,880 --> 02:40:31,240
is this function that I just wrote that looks over all of the symbols
3140
02:40:31,240 --> 02:40:33,600
and tries to see what conclusions I can actually
3141
02:40:33,600 --> 02:40:36,280
draw about any of the symbols.
3142
02:40:36,280 --> 02:40:41,600
So I'll go ahead and run clue.py and see what it is that I know.
3143
02:40:41,600 --> 02:40:43,840
And it seems that I don't really know anything for sure.
3144
02:40:43,840 --> 02:40:47,200
I have all three people are maybes, all three of the rooms are maybes,
3145
02:40:47,200 --> 02:40:48,840
all three of the weapons are maybes.
3146
02:40:48,840 --> 02:40:52,120
I don't really know anything for certain just yet.
3147
02:40:52,120 --> 02:40:54,720
But now let me try and add some additional information
3148
02:40:54,720 --> 02:40:57,400
and see if additional information, additional knowledge,
3149
02:40:57,400 --> 02:41:00,560
can help us to logically reason our way through this process.
3150
02:41:00,560 --> 02:41:02,640
And we are just going to provide the information.
3151
02:41:02,640 --> 02:41:05,760
Our AI is going to take care of doing the inference
3152
02:41:05,760 --> 02:41:09,200
and figuring out what conclusions it's able to draw.
3153
02:41:09,200 --> 02:41:11,200
So I start with some cards.
3154
02:41:11,200 --> 02:41:12,720
And those cards tell me something.
3155
02:41:12,720 --> 02:41:15,600
So if I have the Colonel Mustard card, for example,
3156
02:41:15,600 --> 02:41:19,480
I know that the mustard symbol must be false.
3157
02:41:19,480 --> 02:41:22,480
In other words, mustard is not the one in the envelope,
3158
02:41:22,480 --> 02:41:23,680
is not the criminal.
3159
02:41:23,680 --> 02:41:26,840
Well, knowledge supports something:
3160
02:41:26,840 --> 02:41:30,240
every And in this library supports .add,
3161
02:41:30,240 --> 02:41:32,280
which is a way of adding knowledge or adding
3162
02:41:32,280 --> 02:41:35,480
an additional logical sentence to an And clause.
3163
02:41:35,480 --> 02:41:40,280
So I can say knowledge.add(Not(mustard)).
3164
02:41:40,280 --> 02:41:42,920
I happen to know, because I have the mustard card,
3165
02:41:42,920 --> 02:41:44,960
that Colonel Mustard is not the suspect.
3166
02:41:44,960 --> 02:41:46,840
And maybe I have a couple of other cards too.
3167
02:41:46,840 --> 02:41:49,200
Maybe I also have a card for the kitchen.
3168
02:41:49,200 --> 02:41:50,760
So I know it's not the kitchen.
3169
02:41:50,760 --> 02:41:54,480
And maybe I have another card that says that it is not the revolver.
3170
02:41:54,480 --> 02:41:57,480
So I have three cards: Colonel Mustard, the kitchen, and the revolver.
3171
02:41:57,480 --> 02:42:01,880
And I encode that into my AI this way by saying, it's not Colonel Mustard,
3172
02:42:01,880 --> 02:42:04,400
it's not the kitchen, and it's not the revolver.
3173
02:42:04,400 --> 02:42:06,320
And I know those to be true.
3174
02:42:06,320 --> 02:42:09,640
So now, when I rerun clue.py, we'll see that I've
3175
02:42:09,640 --> 02:42:12,240
been able to eliminate some possibilities.
3176
02:42:12,240 --> 02:42:15,920
Before, I wasn't sure if it was the knife or the revolver or the wrench.
3177
02:42:15,920 --> 02:42:18,760
The knife was a maybe, the revolver was a maybe, the wrench was a maybe.
3178
02:42:18,760 --> 02:42:21,080
Now I'm down to just the knife and the wrench.
3179
02:42:21,080 --> 02:42:23,080
Between those two, I don't know which one it is.
3180
02:42:23,080 --> 02:42:24,160
They're both maybes.
3181
02:42:24,160 --> 02:42:27,040
But I've been able to eliminate the revolver, which
3182
02:42:27,040 --> 02:42:31,840
is one that I know to be false, because I have the revolver card.
3183
02:42:31,840 --> 02:42:34,640
And so additional information might be acquired
3184
02:42:34,640 --> 02:42:36,080
over the course of this game.
3185
02:42:36,080 --> 02:42:41,280
And we would represent that just by adding knowledge to our knowledge set
3186
02:42:41,280 --> 02:42:43,320
or knowledge base that we've been building here.
3187
02:42:43,320 --> 02:42:46,120
So if, for example, we additionally got the information
3188
02:42:46,120 --> 02:42:49,320
that someone made a guess, someone guessed like Miss Scarlet
3189
02:42:49,320 --> 02:42:51,000
in the library with the wrench.
3190
02:42:51,000 --> 02:42:53,880
And we know that a card was revealed, which
3191
02:42:53,880 --> 02:42:56,520
means that one of those three cards, either Miss Scarlet
3192
02:42:56,520 --> 02:42:59,760
or the library or the wrench, one of those at minimum
3193
02:42:59,760 --> 02:43:02,400
must not be inside of the envelope.
3194
02:43:02,400 --> 02:43:05,640
So I could add some knowledge, say knowledge.add.
3195
02:43:05,640 --> 02:43:09,080
And I'm going to add an or clause, because I don't know for sure which one
3196
02:43:09,080 --> 02:43:12,200
it's not, but I know one of them is not in the envelope.
3197
02:43:12,200 --> 02:43:15,600
So it's either not Scarlet, or it's not the library,
3198
02:43:15,600 --> 02:43:17,080
and Or supports multiple arguments.
3199
02:43:17,080 --> 02:43:20,600
I can say it's also or not the wrench.
3200
02:43:20,600 --> 02:43:23,320
So of those three, Scarlet, library, and wrench,
3201
02:43:23,320 --> 02:43:25,280
At least one of those needs to be false.
3202
02:43:25,280 --> 02:43:26,320
I don't know which, though.
3203
02:43:26,320 --> 02:43:27,240
Maybe it's multiple.
3204
02:43:27,240 --> 02:43:32,280
Maybe it's just one, but at least one I know needs to hold.
3205
02:43:32,280 --> 02:43:35,120
And so now if I rerun clue.py, I don't actually
3206
02:43:35,120 --> 02:43:37,560
have any additional information just yet.
3207
02:43:37,560 --> 02:43:38,880
Nothing I can say conclusively.
3208
02:43:38,880 --> 02:43:41,960
I still know that maybe it's Professor Plum, maybe it's Miss Scarlet.
3209
02:43:41,960 --> 02:43:44,520
I haven't eliminated any options.
3210
02:43:44,520 --> 02:43:46,520
But let's imagine that I get some more information,
3211
02:43:46,520 --> 02:43:50,360
that someone shows me the Professor Plum card, for example.
3212
02:43:50,360 --> 02:43:57,040
So I say, all right, let's go back here: knowledge.add(Not(plum)).
3213
02:43:57,040 --> 02:43:58,440
So I have the Professor Plum card.
3214
02:43:58,440 --> 02:44:00,600
I know that Professor Plum is not in the middle.
3215
02:44:00,600 --> 02:44:02,400
I rerun clue.py.
3216
02:44:02,400 --> 02:44:04,920
And right now, I'm able to draw some conclusions.
3217
02:44:04,920 --> 02:44:07,160
Now I've been able to eliminate Professor Plum,
3218
02:44:07,160 --> 02:44:10,320
and the only remaining person it could be is Miss Scarlet.
3219
02:44:10,320 --> 02:44:14,320
So I know, yes, Miss Scarlet, this variable must be true.
3220
02:44:14,320 --> 02:44:17,720
And I've been able to infer that based on the information I already had.
3221
02:44:17,720 --> 02:44:20,640
Now between the ballroom and the library and the knife and the wrench,
3222
02:44:20,640 --> 02:44:22,600
for those two, I'm still not sure.
3223
02:44:22,600 --> 02:44:25,200
So let's add one more piece of information.
3224
02:44:25,200 --> 02:44:28,200
Let's say that I know that it's not the ballroom.
3225
02:44:28,200 --> 02:44:30,960
Someone has shown me the ballroom card, so I know it's not the ballroom.
3226
02:44:30,960 --> 02:44:33,960
Which means at this point, I should be able to conclude that it's the library.
3227
02:44:33,960 --> 02:44:35,040
Let's see.
3228
02:44:35,040 --> 02:44:40,000
I'll say knowledge.add(Not(ballroom)).
3229
02:44:40,000 --> 02:44:43,080
And we'll go ahead and run that.
3230
02:44:43,080 --> 02:44:46,240
And it turns out that after all of this, not only can I conclude that I
3231
02:44:46,240 --> 02:44:49,840
know that it's the library, but I also know that the weapon was the knife.
3232
02:44:49,840 --> 02:44:52,720
And that might have been an inference that was a little bit trickier, something
3233
02:44:52,720 --> 02:44:55,160
I wouldn't have realized immediately, but the AI,
3234
02:44:55,160 --> 02:44:58,320
via this model checking algorithm, is able to draw that conclusion,
3235
02:44:58,320 --> 02:45:02,400
that we know for sure that it must be Miss Scarlet in the library with the knife.
3236
02:45:02,400 --> 02:45:03,800
And how did we know that?
3237
02:45:03,800 --> 02:45:07,480
Well, we know it from this or clause up here,
3238
02:45:07,480 --> 02:45:11,520
that we know that it's either not Scarlet, or it's not the library,
3239
02:45:11,520 --> 02:45:13,440
or it's not the wrench.
3240
02:45:13,440 --> 02:45:16,000
And given that we know that it is Miss Scarlet,
3241
02:45:16,000 --> 02:45:20,200
and we know that it is the library, then the only remaining option for the weapon
3242
02:45:20,200 --> 02:45:24,360
is that it is not the wrench, which means that it must be the knife.
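The whole chain of deductions can be reproduced in a small self-contained sketch, with plain dictionaries and lambdas standing in for the lecture's logic.py classes:

```python
from itertools import product

symbols = ["mustard", "plum", "scarlet",
           "ballroom", "kitchen", "library",
           "knife", "revolver", "wrench"]

def entails(clauses, query):
    """True iff every model satisfying all clauses also satisfies the query."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(clause(model) for clause in clauses) and not query(model):
            return False
    return True

clauses = [
    lambda m: m["mustard"] or m["plum"] or m["scarlet"],      # some person
    lambda m: m["ballroom"] or m["kitchen"] or m["library"],  # some room
    lambda m: m["knife"] or m["revolver"] or m["wrench"],     # some weapon
    lambda m: not m["mustard"],    # cards in our own hand
    lambda m: not m["kitchen"],
    lambda m: not m["revolver"],
    # the refuted guess: Scarlet, library, wrench can't all be in the envelope
    lambda m: not m["scarlet"] or not m["library"] or not m["wrench"],
    lambda m: not m["plum"],       # shown the Professor Plum card
    lambda m: not m["ballroom"],   # shown the ballroom card
]

for s in ["scarlet", "library", "knife"]:
    print(s, entails(clauses, lambda m: m[s]))
```

With all nine pieces of knowledge in place, the enumeration confirms the conclusion from the lecture: Miss Scarlet, in the library, with the knife.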
3243
02:45:24,360 --> 02:45:26,760
So we as humans now can go back and reason through that,
3244
02:45:26,760 --> 02:45:28,920
even though it might not have been immediately clear.
3245
02:45:28,920 --> 02:45:32,840
And that's one of the advantages of using an AI or some sort of algorithm
3246
02:45:32,840 --> 02:45:36,520
in order to do this, is that the computer can exhaust all of these possibilities
3247
02:45:36,520 --> 02:45:40,720
and try and figure out what the solution actually should be.
3248
02:45:40,720 --> 02:45:43,240
And so for that reason, it's often helpful to be
3249
02:45:43,240 --> 02:45:45,040
able to represent knowledge in this way.
3250
02:45:45,040 --> 02:45:47,280
Knowledge engineering, some situation where
3251
02:45:47,280 --> 02:45:50,240
we can use a computer to be able to represent knowledge
3252
02:45:50,240 --> 02:45:52,760
and draw conclusions based on that knowledge.
3253
02:45:52,760 --> 02:45:56,440
And any time we can translate something into propositional logic symbols
3254
02:45:56,440 --> 02:45:59,400
like this, this type of approach can be useful.
3255
02:45:59,400 --> 02:46:01,360
So you might be familiar with logic puzzles,
3256
02:46:01,360 --> 02:46:04,120
where you have to puzzle your way through trying to figure something out.
3257
02:46:04,120 --> 02:46:06,520
This is what a classic logic puzzle might look like.
3258
02:46:06,520 --> 02:46:09,640
Something like Gilderoy, Minerva, Pomona, and Horace each
3259
02:46:09,640 --> 02:46:14,000
belong to a different one of the four houses, Gryffindor, Hufflepuff, Ravenclaw,
3260
02:46:14,000 --> 02:46:15,080
and Slytherin.
3261
02:46:15,080 --> 02:46:16,640
And then we have some information.
3262
02:46:16,640 --> 02:46:20,160
That Gilderoy belongs to Gryffindor or Ravenclaw, Pomona
3263
02:46:20,160 --> 02:46:24,360
does not belong in Slytherin, and Minerva does belong to Gryffindor.
3264
02:46:24,360 --> 02:46:26,200
So we have a couple pieces of information.
3265
02:46:26,200 --> 02:46:28,200
And using that information, we need to be
3266
02:46:28,200 --> 02:46:31,200
able to draw some conclusions about which person should
3267
02:46:31,200 --> 02:46:33,240
be assigned to which house.
3268
02:46:33,240 --> 02:46:37,600
And again, we can use the exact same idea to try and implement this notion.
3269
02:46:37,600 --> 02:46:39,720
So we need some propositional symbols.
3270
02:46:39,720 --> 02:46:41,440
And in this case, the propositional symbols
3271
02:46:41,440 --> 02:46:43,800
are going to get a little more complex, although we'll
3272
02:46:43,800 --> 02:46:46,480
see ways to make this a little bit cleaner later on.
3273
02:46:46,480 --> 02:46:51,960
But we'll need 16 propositional symbols, one for each person and house.
3274
02:46:51,960 --> 02:46:54,560
So we need to say, remember, every propositional symbol
3275
02:46:54,560 --> 02:46:56,480
is either true or false.
3276
02:46:56,480 --> 02:46:59,280
So Gilderoy Gryffindor is either true or false.
3277
02:46:59,280 --> 02:47:01,480
Either he's in Gryffindor or he is not.
3278
02:47:01,480 --> 02:47:03,720
Likewise, Gilderoy Hufflepuff also true or false.
3279
02:47:03,720 --> 02:47:05,880
Either it is true or it's false.
3280
02:47:05,880 --> 02:47:09,760
And that's true for every combination of person and house
3281
02:47:09,760 --> 02:47:10,880
that we could come up with.
3282
02:47:10,880 --> 02:47:14,880
We have some sort of propositional symbol for each one of those.
3283
02:47:14,880 --> 02:47:17,200
Using this type of knowledge, we can then
3284
02:47:17,200 --> 02:47:20,560
begin to think about what types of logical sentences
3285
02:47:20,560 --> 02:47:22,440
we can say about the puzzle.
3286
02:47:22,440 --> 02:47:25,480
Before we even think about the information we were
3287
02:47:25,480 --> 02:47:28,000
given, we can think about the premise of the problem,
3288
02:47:28,000 --> 02:47:31,560
that every person is assigned to a different house.
3289
02:47:31,560 --> 02:47:32,760
So what does that tell us?
3290
02:47:32,760 --> 02:47:34,440
Well, it tells us sentences like this.
3291
02:47:34,440 --> 02:47:39,880
It tells us like Pomona Slytherin implies not Pomona Hufflepuff.
3292
02:47:39,880 --> 02:47:42,240
Something like if Pomona is in Slytherin,
3293
02:47:42,240 --> 02:47:44,680
then we know that Pomona is not in Hufflepuff.
3294
02:47:44,680 --> 02:47:48,480
And we know this for all four people and for all combinations of houses,
3295
02:47:48,480 --> 02:47:51,320
that no matter what person you pick, if they're in one house,
3296
02:47:51,320 --> 02:47:53,600
then they're not in some other house.
3297
02:47:53,600 --> 02:47:56,120
So I'll probably have a whole bunch of knowledge statements
3298
02:47:56,120 --> 02:47:59,040
that are of this form, that if we know Pomona is in Slytherin,
3299
02:47:59,040 --> 02:48:01,760
then we know Pomona is not in Hufflepuff.
3300
02:48:01,760 --> 02:48:04,120
We were also given the information that each person
3301
02:48:04,120 --> 02:48:05,720
is in a different house.
3302
02:48:05,720 --> 02:48:08,560
So I also have pieces of knowledge that look something like this.
3303
02:48:08,560 --> 02:48:13,200
Minerva Ravenclaw implies not Gilderoy Ravenclaw.
3304
02:48:13,200 --> 02:48:16,600
If they're all in different houses, then if Minerva is in Ravenclaw,
3305
02:48:16,600 --> 02:48:20,040
then we know that Gilderoy is not in Ravenclaw as well.
3306
02:48:20,040 --> 02:48:22,040
And I have a whole bunch of similar sentences
3307
02:48:22,040 --> 02:48:26,320
like this that are expressing that idea for other people and other houses
3308
02:48:26,320 --> 02:48:27,480
as well.
3309
02:48:27,480 --> 02:48:29,760
And so in addition to sentences of these form,
3310
02:48:29,760 --> 02:48:32,120
I also have the knowledge that was given to me.
3311
02:48:32,120 --> 02:48:35,880
Information like Gilderoy was in Gryffindor or in Ravenclaw
3312
02:48:35,880 --> 02:48:39,640
that would be represented like this, Gilderoy Gryffindor or Gilderoy
3313
02:48:39,640 --> 02:48:40,640
Ravenclaw.
3314
02:48:40,640 --> 02:48:42,920
And then using these sorts of sentences,
3315
02:48:42,920 --> 02:48:46,720
I can begin to draw some conclusions about the world.
3316
02:48:46,720 --> 02:48:48,120
So let's see an example of this.
3317
02:48:48,120 --> 02:48:50,800
We'll go ahead and actually try and implement this logic puzzle
3318
02:48:50,800 --> 02:48:53,360
to see if we can figure out what the answer is.
3319
02:48:53,360 --> 02:48:56,680
I'll go ahead and open up puzzle.py, where I've already
3320
02:48:56,680 --> 02:48:58,840
started to implement this sort of idea.
3321
02:48:58,840 --> 02:49:01,880
I've defined a list of people and a list of houses.
3322
02:49:01,880 --> 02:49:06,760
And I've so far created one symbol for every person and for every house.
3323
02:49:06,760 --> 02:49:09,600
That's what this double for loop is doing, looping over all people,
3324
02:49:09,600 --> 02:49:13,560
looping over all houses, creating a new symbol for each of them.
3325
02:49:13,560 --> 02:49:16,240
And then I've added some information.
3326
02:49:16,240 --> 02:49:19,320
I know that every person belongs to a house,
3327
02:49:19,320 --> 02:49:24,200
so I've added the information for every person that person Gryffindor
3328
02:49:24,200 --> 02:49:28,240
or person Hufflepuff or person Ravenclaw or person Slytherin,
3329
02:49:28,240 --> 02:49:30,820
that one of those four things must be true.
3330
02:49:30,820 --> 02:49:33,220
Every person belongs to a house.
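The setup described here can be sketched in plain Python, with one string symbol per person-house pair and one "belongs to some house" disjunction per person (lists of symbols stand in for the lecture's logic-library objects, whose exact API isn't shown here):

```python
people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# one propositional symbol per person-house combination, e.g. "MinervaGryffindor"
symbols = [f"{person}{house}" for person in people for house in houses]

# "every person belongs to a house": a four-way or over that person's symbols
belongs_somewhere = [[f"{person}{house}" for house in houses] for person in people]
```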
3331
02:49:33,220 --> 02:49:34,840
What other information do I know?
3332
02:49:34,840 --> 02:49:37,960
I also know that there's only one house per person,
3333
02:49:37,960 --> 02:49:41,680
so no person belongs to multiple houses.
3334
02:49:41,680 --> 02:49:42,840
So how does this work?
3335
02:49:42,840 --> 02:49:44,720
Well, this is going to be true for all people.
3336
02:49:44,720 --> 02:49:47,080
So I'll loop over every person.
3337
02:49:47,080 --> 02:49:51,080
And then I need to loop over all different pairs of houses.
3338
02:49:51,080 --> 02:49:54,840
The idea is I want to encode the idea that if Minerva is in Gryffindor,
3339
02:49:54,840 --> 02:49:57,480
then Minerva can't be in Ravenclaw.
3340
02:49:57,480 --> 02:49:59,760
So I'll loop over all houses, h1.
3341
02:49:59,760 --> 02:50:02,580
And I'll loop over all houses again, h2.
3342
02:50:02,580 --> 02:50:06,200
And as long as they're different, h1 not equal to h2,
3343
02:50:06,200 --> 02:50:09,200
then I'll add to my knowledge base this piece of information.
3344
02:50:09,200 --> 02:50:14,320
That implication, in other words, an if-then: if the person is in h1,
3345
02:50:14,320 --> 02:50:18,560
then I know that they are not in house h2.
3346
02:50:18,560 --> 02:50:22,160
So these lines here are encoding the notion that for every person,
3347
02:50:22,160 --> 02:50:25,920
if they belong to house one, then they are not in house two.
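That double loop might look something like this sketch, with pair tuples standing in for the logic library's implication objects:

```python
people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# for every person: if they are in h1, then they are not in any different h2
implications = []
for person in people:
    for h1 in houses:
        for h2 in houses:
            if h1 != h2:
                implications.append((f"{person}{h1}", ("not", f"{person}{h2}")))
```

That yields 4 people times 4 × 3 ordered house pairs, so 48 implications in all.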
3348
02:50:25,920 --> 02:50:27,880
And the other piece of logic we need to encode
3349
02:50:27,880 --> 02:50:30,920
is the idea that every house can only have one person.
3350
02:50:30,920 --> 02:50:33,120
In other words, if Pomona is in Hufflepuff,
3351
02:50:33,120 --> 02:50:35,960
then nobody else is allowed to be in Hufflepuff either.
3352
02:50:35,960 --> 02:50:37,960
And that's the same logic, but sort of backwards.
3353
02:50:37,960 --> 02:50:42,840
I loop over all of the houses and loop over all different pairs of people.
3354
02:50:42,840 --> 02:50:45,600
So I loop over people once, loop over people again,
3355
02:50:45,600 --> 02:50:50,120
and only do this when the people are different, p1 not equal to p2.
3356
02:50:50,120 --> 02:50:54,960
And I add the knowledge that if, as given by the implication,
3357
02:50:54,960 --> 02:50:58,360
if person one belongs to the house, then it
3358
02:50:58,360 --> 02:51:03,880
is not the case that person two belongs to the same house.
3359
02:51:03,880 --> 02:51:05,800
So here I'm just encoding the knowledge that
3360
02:51:05,800 --> 02:51:07,880
represents the problem's constraints.
3361
02:51:07,880 --> 02:51:09,760
I know that everyone's in a different house.
3362
02:51:09,760 --> 02:51:12,800
I know that any person can only belong to one house.
3363
02:51:12,800 --> 02:51:17,480
And I can now take my knowledge and try and print out the information
3364
02:51:17,480 --> 02:51:18,600
that I happen to know.
3365
02:51:18,600 --> 02:51:22,120
So I'll go ahead and print out knowledge.formula,
3366
02:51:22,120 --> 02:51:24,880
just to see this in action, and I'll go ahead and skip this for now.
3367
02:51:24,880 --> 02:51:26,840
But we'll come back to this in a second.
3368
02:51:26,840 --> 02:51:31,840
Let's print out the knowledge that I know by running Python puzzle.py.
3369
02:51:31,840 --> 02:51:34,320
It's a lot of information, a lot that I have to scroll through,
3370
02:51:34,320 --> 02:51:36,960
because there are 16 different variables all going on.
3371
02:51:36,960 --> 02:51:39,320
But the basic idea, if we scroll up to the very top,
3372
02:51:39,320 --> 02:51:41,040
is I see my initial information.
3373
02:51:41,040 --> 02:51:44,560
Gilderoy is either in Gryffindor, or Gilderoy is in Hufflepuff,
3374
02:51:44,560 --> 02:51:48,040
or Gilderoy is in Ravenclaw, or Gilderoy is in Slytherin,
3375
02:51:48,040 --> 02:51:50,920
and then way more information as well.
3376
02:51:50,920 --> 02:51:54,040
So this is quite messy, more than we really want to be looking at.
3377
02:51:54,040 --> 02:51:55,920
And soon, too, we'll see ways of representing
3378
02:51:55,920 --> 02:51:58,200
this a little bit more nicely using logic.
3379
02:51:58,200 --> 02:52:00,400
But for now, we can just say these are the variables
3380
02:52:00,400 --> 02:52:01,520
that we're dealing with.
3381
02:52:01,520 --> 02:52:05,560
And now we'd like to add some information.
3382
02:52:05,560 --> 02:52:09,560
So the information we're going to add is Gilderoy is in Gryffindor,
3383
02:52:09,560 --> 02:52:10,680
or he is in Ravenclaw.
3384
02:52:10,680 --> 02:52:12,680
So that knowledge was given to us.
3385
02:52:12,680 --> 02:52:15,520
So I'll go ahead and say knowledge.add.
3386
02:52:15,520 --> 02:52:26,400
And I know that it's either Gilderoy Gryffindor or Gilderoy Ravenclaw.
3387
02:52:26,400 --> 02:52:29,280
One of those two things must be true.
3388
02:52:29,280 --> 02:52:32,200
I also know that Pomona was not in Slytherin,
3389
02:52:32,200 --> 02:52:37,680
so I can say knowledge.add not this symbol, not the Pomona-Slytherin
3390
02:52:37,680 --> 02:52:38,720
symbol.
3391
02:52:38,720 --> 02:52:42,240
And then I can add the knowledge that Minerva is in Gryffindor
3392
02:52:42,240 --> 02:52:46,760
by adding the symbol Minerva Gryffindor.
3393
02:52:46,760 --> 02:52:49,200
So those are the pieces of knowledge that I know.
3394
02:52:49,200 --> 02:52:52,920
And this loop here at the bottom just loops over all of my symbols,
3395
02:52:52,920 --> 02:52:56,040
checks to see if the knowledge entails that symbol
3396
02:52:56,040 --> 02:52:58,520
by calling this model check function again.
3397
02:52:58,520 --> 02:53:03,600
And if it does, if we know the symbol is true, we print out the symbol.
3398
02:53:03,600 --> 02:53:07,000
So now I can run python puzzle.py, and Python
3399
02:53:07,000 --> 02:53:08,880
is going to solve this puzzle for me.
3400
02:53:08,880 --> 02:53:11,520
We're able to conclude that Gilderoy belongs to Ravenclaw,
3401
02:53:11,520 --> 02:53:15,480
Pomona belongs to Hufflepuff, Minerva to Gryffindor, and Horace to Slytherin
3402
02:53:15,480 --> 02:53:18,120
just by encoding this knowledge inside the computer,
3403
02:53:18,120 --> 02:53:20,360
although it was quite tedious to do in this case.
3404
02:53:20,360 --> 02:53:24,880
And as a result, we were able to get the conclusion from that as well.
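The whole puzzle can be reproduced with a brute-force model check in plain Python, enumerating all 2^16 truth assignments over the person-house symbols (a sketch of the idea, not the lecture's logic library):

```python
from itertools import product

people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
symbols = [(p, h) for p in people for h in houses]

def satisfies(model):
    # structural constraints: each person in exactly one house,
    # and no house holding more than one person
    for p in people:
        if sum(model[(p, h)] for h in houses) != 1:
            return False
    for h in houses:
        if sum(model[(p, h)] for p in people) > 1:
            return False
    # the three given clues
    return ((model[("Gilderoy", "Gryffindor")] or model[("Gilderoy", "Ravenclaw")])
            and not model[("Pomona", "Slytherin")]
            and model[("Minerva", "Gryffindor")])

# model checking: enumerate every possible world, keep the consistent ones
consistent = []
for values in product([True, False], repeat=len(symbols)):
    model = dict(zip(symbols, values))
    if satisfies(model):
        consistent.append(model)

# a symbol is entailed when it is true in every consistent model
entailed = [s for s in symbols if all(m[s] for m in consistent)]
print(entailed)
```

Only one world survives, so the four true symbols — Gilderoy in Ravenclaw, Pomona in Hufflepuff, Minerva in Gryffindor, Horace in Slytherin — are all entailed.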
3405
02:53:24,880 --> 02:53:27,240
And you can imagine this being applied to many sorts
3406
02:53:27,240 --> 02:53:29,000
of different deductive situations.
3407
02:53:29,000 --> 02:53:31,120
So not only these situations where we're trying
3408
02:53:31,120 --> 02:53:33,640
to deal with Harry Potter characters in this puzzle,
3409
02:53:33,640 --> 02:53:35,800
but if you've ever played games like Mastermind, where
3410
02:53:35,800 --> 02:53:39,040
you're trying to figure out which order different colors go in
3411
02:53:39,040 --> 02:53:40,840
and trying to make predictions about it, I
3412
02:53:40,840 --> 02:53:44,600
could tell you, for example, let's play a simplified version of Mastermind
3413
02:53:44,600 --> 02:53:47,760
where there are four colors, red, blue, green, and yellow,
3414
02:53:47,760 --> 02:53:51,000
and they're in some order, but I'm not telling you what order.
3415
02:53:51,000 --> 02:53:53,080
You just have to make a guess, and I'll tell you
3416
02:53:53,080 --> 02:53:55,400
of red, blue, green, and yellow how many of the four
3417
02:53:55,400 --> 02:53:57,320
you got in the right position.
3418
02:53:57,320 --> 02:53:59,480
So a simplified version of this game, you
3419
02:53:59,480 --> 02:54:01,800
might make a guess like red, blue, green, yellow,
3420
02:54:01,800 --> 02:54:05,320
and I would tell you something like two of those four
3421
02:54:05,320 --> 02:54:08,040
are in the correct position, but the other two are not.
3422
02:54:08,040 --> 02:54:10,560
And then you could reasonably make a guess and say, all right,
3423
02:54:10,560 --> 02:54:13,000
look at this, blue, red, green, yellow.
3424
02:54:13,000 --> 02:54:16,040
Try switching two of them around, and this time maybe I tell you,
3425
02:54:16,040 --> 02:54:19,480
you know what, none of those are in the correct position.
3426
02:54:19,480 --> 02:54:23,000
And the question then is, all right, what is the correct order
3427
02:54:23,000 --> 02:54:24,360
of these four colors?
3428
02:54:24,360 --> 02:54:26,240
And we as humans could begin to reason this through.
3429
02:54:26,240 --> 02:54:28,760
All right, well, if none of these were correct,
3430
02:54:28,760 --> 02:54:31,280
but two of these were correct, well, it must have been
3431
02:54:31,280 --> 02:54:34,560
because I switched the red and the blue, which means red and blue here
3432
02:54:34,560 --> 02:54:37,440
must be correct, which means green and yellow are probably not correct.
3433
02:54:37,440 --> 02:54:40,400
You can begin to do this sort of deductive reasoning.
3434
02:54:40,400 --> 02:54:42,840
And we can also equivalently try and take this
3435
02:54:42,840 --> 02:54:45,400
and encode it inside of our computer as well.
3436
02:54:45,400 --> 02:54:48,000
And it's going to be very similar to the logic puzzle
3437
02:54:48,000 --> 02:54:49,480
that we just did a moment ago.
3438
02:54:49,480 --> 02:54:52,520
So I won't spend too much time on this code because it is fairly similar.
3439
02:54:52,520 --> 02:54:54,920
But again, we have a whole bunch of colors
3440
02:54:54,920 --> 02:54:58,600
and four different positions in which those colors can be.
3441
02:54:58,600 --> 02:55:00,440
And then we have some additional knowledge.
3442
02:55:00,440 --> 02:55:02,120
And I encode all of that knowledge.
3443
02:55:02,120 --> 02:55:04,960
And you can take a look at this code on your own time.
3444
02:55:04,960 --> 02:55:07,880
But I just want to demonstrate that when we run this code,
3445
02:55:07,880 --> 02:55:12,720
run python mastermind.py and see what we get,
3446
02:55:12,720 --> 02:55:16,880
we ultimately are able to compute red in the 0 position,
3447
02:55:16,880 --> 02:55:19,460
blue in the 1 position, yellow in the 2 position,
3448
02:55:19,460 --> 02:55:24,160
and green in the 3 position as the ordering of those symbols.
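Since the four colors are in some order, one simple way to sketch this (differing from the lecture's propositional encoding) is to enumerate the 24 permutations and keep those consistent with the feedback from the two guesses described above:

```python
from itertools import permutations

colors = ["red", "blue", "green", "yellow"]

def in_position(guess, order):
    """How many colors of the guess are in the correct position."""
    return sum(g == o for g, o in zip(guess, order))

guess1 = ["red", "blue", "green", "yellow"]   # feedback: exactly 2 correct
guess2 = ["blue", "red", "green", "yellow"]   # feedback: 0 correct

solutions = [order for order in permutations(colors)
             if in_position(guess1, order) == 2 and in_position(guess2, order) == 0]
print(solutions)   # the single consistent ordering
```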
3449
02:55:24,160 --> 02:55:25,840
Now, ultimately, what you might have noticed
3450
02:55:25,840 --> 02:55:28,560
is this process was taking quite a long time.
3451
02:55:28,560 --> 02:55:32,360
And in fact, model checking is not a particularly efficient algorithm, right?
3452
02:55:32,360 --> 02:55:34,320
What I need to do in order to model check
3453
02:55:34,320 --> 02:55:36,800
is take all of my possible different variables
3454
02:55:36,800 --> 02:55:39,480
and enumerate all of the possibilities that they could be in.
3455
02:55:39,480 --> 02:55:44,040
If I have n variables, I have 2 to the n possible worlds
3456
02:55:44,040 --> 02:55:45,840
that I need to be looking through in order
3457
02:55:45,840 --> 02:55:48,060
to perform this model checking algorithm.
3458
02:55:48,060 --> 02:55:50,440
And this is probably not tractable, especially
3459
02:55:50,440 --> 02:55:53,480
as we start to get to much larger and larger sets of data
3460
02:55:53,480 --> 02:55:56,320
where you have many, many more variables that are at play.
3461
02:55:56,320 --> 02:55:59,240
Right here, we only have a relatively small number of variables.
3462
02:55:59,240 --> 02:56:01,560
So this sort of approach can actually work.
3463
02:56:01,560 --> 02:56:04,800
But as the number of variables increases, model checking
3464
02:56:04,800 --> 02:56:07,240
becomes less and less good of a way of trying
3465
02:56:07,240 --> 02:56:09,560
to solve these sorts of problems.
3466
02:56:09,560 --> 02:56:12,240
So while it might have been OK for something like Mastermind
3467
02:56:12,240 --> 02:56:15,280
to conclude that this is indeed the correct sequence where all four
3468
02:56:15,280 --> 02:56:17,720
are in the correct position, what we'd like to do
3469
02:56:17,720 --> 02:56:21,760
is come up with some better ways to be able to make inferences rather than
3470
02:56:21,760 --> 02:56:24,600
just enumerate all of the possibilities.
3471
02:56:24,600 --> 02:56:26,960
And to do so, what we'll transition to next
3472
02:56:26,960 --> 02:56:29,880
is the idea of inference rules, some sort of rules
3473
02:56:29,880 --> 02:56:33,200
that we can apply to take knowledge that already exists
3474
02:56:33,200 --> 02:56:36,000
and translate it into new forms of knowledge.
3475
02:56:36,000 --> 02:56:38,440
And the general way we'll structure an inference rule
3476
02:56:38,440 --> 02:56:40,840
is by having a horizontal line here.
3477
02:56:40,840 --> 02:56:44,240
Anything above the line is going to represent a premise, something
3478
02:56:44,240 --> 02:56:45,960
that we know to be true.
3479
02:56:45,960 --> 02:56:48,680
And then anything below the line will be the conclusion
3480
02:56:48,680 --> 02:56:53,360
that we can arrive at after we apply the logic from the inference rule
3481
02:56:53,360 --> 02:56:54,640
that we're going to demonstrate.
3482
02:56:54,640 --> 02:56:56,140
So we'll do some of these inference rules
3483
02:56:56,140 --> 02:56:59,040
by demonstrating them in English first, but then translating them
3484
02:56:59,040 --> 02:57:01,120
into the world of propositional logic so you
3485
02:57:01,120 --> 02:57:04,800
can see what those inference rules actually look like.
3486
02:57:04,800 --> 02:57:07,000
So for example, let's imagine that I have access
3487
02:57:07,000 --> 02:57:08,720
to two pieces of information.
3488
02:57:08,720 --> 02:57:11,320
I know, for example, that if it is raining,
3489
02:57:11,320 --> 02:57:14,120
then Harry is inside, for example.
3490
02:57:14,120 --> 02:57:16,960
And let's say I also know it is raining.
3491
02:57:16,960 --> 02:57:19,460
Then most of us could reasonably look at this information
3492
02:57:19,460 --> 02:57:23,880
and conclude that, all right, Harry must be inside.
3493
02:57:23,880 --> 02:57:27,160
This inference rule is known as modus ponens,
3494
02:57:27,160 --> 02:57:29,920
and it's phrased more formally in logic as this.
3495
02:57:29,920 --> 02:57:35,480
If we know that alpha implies beta, in other words, if alpha, then beta,
3496
02:57:35,480 --> 02:57:38,380
and we also know that alpha is true, then we
3497
02:57:38,380 --> 02:57:41,560
should be able to conclude that beta is also true.
3498
02:57:41,560 --> 02:57:45,640
We can apply this inference rule to take these two pieces of information
3499
02:57:45,640 --> 02:57:47,860
and generate this new piece of information.
3500
02:57:47,860 --> 02:57:51,080
Notice that this is a totally different approach from the model checking
3501
02:57:51,080 --> 02:57:54,520
approach, where the approach was look at all of the possible worlds
3502
02:57:54,520 --> 02:57:56,720
and see what's true in each of these worlds.
3503
02:57:56,720 --> 02:57:59,240
Here, we're not dealing with any specific world.
3504
02:57:59,240 --> 02:58:01,560
We're just dealing with the knowledge that we know
3505
02:58:01,560 --> 02:58:04,480
and what conclusions we can arrive at based on that knowledge.
3506
02:58:04,480 --> 02:58:10,040
That I know that A implies B, and I know A, and the conclusion is B.
3507
02:58:10,040 --> 02:58:12,680
And this should seem like a relatively obvious rule.
3508
02:58:12,680 --> 02:58:16,160
But of course, if alpha, then beta, and we know alpha,
3509
02:58:16,160 --> 02:58:19,160
then we should be able to conclude that beta is also true.
3510
02:58:19,160 --> 02:58:21,400
And that's going to be true for many, maybe even
3511
02:58:21,400 --> 02:58:23,560
all of the inference rules that we'll take a look at.
3512
02:58:23,560 --> 02:58:25,320
You should be able to look at them and say,
3513
02:58:25,320 --> 02:58:27,360
yeah, of course that's going to be true.
3514
02:58:27,360 --> 02:58:30,440
But it's putting these all together, figuring out the right combination
3515
02:58:30,440 --> 02:58:32,920
of inference rules that can be applied that ultimately
3516
02:58:32,920 --> 02:58:38,440
is going to allow us to generate interesting knowledge inside of our AI.
3517
02:58:38,440 --> 02:58:41,440
So that's modus ponens, this application of implication,
3518
02:58:41,440 --> 02:58:44,640
that if we know alpha and we know that alpha implies beta,
3519
02:58:44,640 --> 02:58:47,280
then we can conclude beta.
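A minimal sketch of applying modus ponens repeatedly, with plain strings as symbols and a made-up `("implies", a, b)` tuple encoding for implications:

```python
def apply_modus_ponens(kb):
    """Repeatedly add b to the knowledge whenever ('implies', a, b) and a are both known."""
    derived = set(kb)
    changed = True
    while changed:
        changed = False
        for fact in list(derived):
            if isinstance(fact, tuple) and fact[0] == "implies":
                _, a, b = fact
                if a in derived and b not in derived:
                    derived.add(b)
                    changed = True
    return derived

# if it is raining, then Harry is inside; and it is raining
kb = {("implies", "rain", "harry inside"), "rain"}
```

Calling `apply_modus_ponens(kb)` would then also contain "harry inside", and chains of implications get followed too.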
3520
02:58:47,280 --> 02:58:48,760
Let's take a look at another example.
3521
02:58:48,760 --> 02:58:52,560
Fairly straightforward, something like Harry is friends with Ron and Hermione.
3522
02:58:52,560 --> 02:58:54,800
Based on that information, we can reasonably
3523
02:58:54,800 --> 02:58:56,920
conclude Harry is friends with Hermione.
3524
02:58:56,920 --> 02:58:58,760
That must also be true.
3525
02:58:58,760 --> 02:59:01,880
And this inference rule is known as and elimination.
3526
02:59:01,880 --> 02:59:06,920
And what and elimination says is that if we have a situation where alpha
3527
02:59:06,920 --> 02:59:11,560
and beta are both true, I have information alpha and beta,
3528
02:59:11,560 --> 02:59:14,440
well then, just alpha is true.
3529
02:59:14,440 --> 02:59:16,560
Or likewise, just beta is true.
3530
02:59:16,560 --> 02:59:19,800
That if I know that both parts are true, then one of those parts
3531
02:59:19,800 --> 02:59:21,040
must also be true.
3532
02:59:21,040 --> 02:59:24,360
Again, something obvious from the point of view of human intuition,
3533
02:59:24,360 --> 02:59:27,160
but a computer needs to be told this kind of information.
3534
02:59:27,160 --> 02:59:28,960
To be able to apply the inference rule, we
3535
02:59:28,960 --> 02:59:32,200
need to tell the computer that this is an inference rule that you can apply,
3536
02:59:32,200 --> 02:59:35,160
so the computer has access to it and is able to use it
3537
02:59:35,160 --> 02:59:39,880
in order to translate information from one form to another.
3538
02:59:39,880 --> 02:59:42,720
In addition to that, let's take a look at another example of an inference
3539
02:59:42,720 --> 02:59:48,600
rule, something like it is not true that Harry did not pass the test.
3540
02:59:48,600 --> 02:59:50,000
Bit of a tricky sentence to parse.
3541
02:59:50,000 --> 02:59:50,840
I'll read it again.
3542
02:59:50,840 --> 02:59:54,960
It is not true, or it is false, that Harry did not pass the test.
3543
02:59:54,960 --> 02:59:58,800
Well, if it is false that Harry did not pass the test,
3544
02:59:58,800 --> 03:00:02,840
then the only reasonable conclusion is that Harry did pass the test.
3545
03:00:02,840 --> 03:00:05,120
And so this, instead of being and elimination,
3546
03:00:05,120 --> 03:00:07,560
is what we call double negation elimination.
3547
03:00:07,560 --> 03:00:10,360
That if we have two negatives inside of our premise,
3548
03:00:10,360 --> 03:00:12,080
then we can just remove them altogether.
3549
03:00:12,080 --> 03:00:13,120
They cancel each other out.
3550
03:00:13,120 --> 03:00:17,440
One turns true to false, and the other one turns false back into true.
3551
03:00:17,440 --> 03:00:19,300
Phrased a little bit more formally, we say
3552
03:00:19,300 --> 03:00:23,800
that if the premise is not alpha, then the conclusion
3553
03:00:23,800 --> 03:00:25,780
we can draw is just alpha.
3554
03:00:25,780 --> 03:00:28,400
We can say that alpha is true.
3555
03:00:28,400 --> 03:00:30,280
We'll take a look at a couple more of these.
3556
03:00:30,280 --> 03:00:33,960
If I have it is raining, then Harry is inside.
3557
03:00:33,960 --> 03:00:35,920
How do I reframe this?
3558
03:00:35,920 --> 03:00:37,800
Well, this one is a little bit trickier.
3559
03:00:37,800 --> 03:00:41,080
But if I know if it is raining, then Harry is inside,
3560
03:00:41,080 --> 03:00:43,960
then I conclude one of two things must be true.
3561
03:00:43,960 --> 03:00:48,280
Either it is not raining, or Harry is inside.
3562
03:00:48,280 --> 03:00:49,280
Now, this one's trickier.
3563
03:00:49,280 --> 03:00:50,820
So let's think about it a little bit.
3564
03:00:50,820 --> 03:00:54,400
This first premise here, if it is raining, then Harry is inside,
3565
03:00:54,400 --> 03:00:59,200
is saying that if I know that it is raining, then Harry must be inside.
3566
03:00:59,200 --> 03:01:01,840
So what is the other possible case?
3567
03:01:01,840 --> 03:01:06,760
Well, if Harry is not inside, then I know that it must not be raining.
3568
03:01:06,760 --> 03:01:09,640
So one of those two situations must be true.
3569
03:01:09,640 --> 03:01:14,800
Either it's not raining, or it is raining, in which case Harry is inside.
3570
03:01:14,800 --> 03:01:18,280
So the conclusion I can draw is either it is not raining,
3571
03:01:18,280 --> 03:01:22,840
or it is raining, so therefore, Harry is inside.
3572
03:01:22,840 --> 03:01:28,000
And so this is a way to translate if-then statements into or statements.
3573
03:01:28,000 --> 03:01:31,000
And this is known as implication elimination.
3574
03:01:31,000 --> 03:01:33,360
And this is similar to what we actually did in the beginning
3575
03:01:33,360 --> 03:01:35,840
when we were first looking at those very first sentences
3576
03:01:35,840 --> 03:01:37,960
about Harry and Hagrid and Dumbledore.
3577
03:01:37,960 --> 03:01:39,800
And phrased a little bit more formally, this
3578
03:01:39,800 --> 03:01:43,560
says that if I have the implication, alpha implies beta,
3579
03:01:43,560 --> 03:01:49,120
that I can draw the conclusion that either not alpha or beta,
3580
03:01:49,120 --> 03:01:50,760
because there are only two possibilities.
3581
03:01:50,760 --> 03:01:54,040
Either alpha is true or alpha is not true.
3582
03:01:54,040 --> 03:01:57,320
So one of those possibilities is alpha is not true.
3583
03:01:57,320 --> 03:02:00,320
But if alpha is true, well, then we can draw the conclusion
3584
03:02:00,320 --> 03:02:01,560
that beta must be true.
3585
03:02:01,560 --> 03:02:07,920
So either alpha is not true or alpha is true, in which case beta is also true.
3586
03:02:07,920 --> 03:02:12,440
So this is one way to turn an implication into just a statement about or.
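We can verify implication elimination over all four truth assignments, treating "alpha implies beta" as false only in the one case where alpha is true and beta is false:

```python
from itertools import product

for alpha, beta in product([True, False], repeat=2):
    # "alpha implies beta" is false only when alpha holds but beta does not
    implication = not (alpha and not beta)
    assert implication == ((not alpha) or beta)
```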
3587
03:02:12,440 --> 03:02:14,560
In addition to eliminating implications,
3588
03:02:14,560 --> 03:02:17,640
we can also eliminate biconditionals as well.
3589
03:02:17,640 --> 03:02:19,960
So let's take an English example, something like,
3590
03:02:19,960 --> 03:02:23,600
it is raining if and only if Harry is inside.
3591
03:02:23,600 --> 03:02:26,960
And this if and only if really sounds like that biconditional,
3592
03:02:26,960 --> 03:02:31,200
that double arrow sign that we saw in propositional logic not too long ago.
3593
03:02:31,200 --> 03:02:33,800
And what does this actually mean if we were to translate this?
3594
03:02:33,800 --> 03:02:37,760
Well, this means that if it is raining, then Harry is inside.
3595
03:02:37,760 --> 03:02:40,200
And if Harry is inside, then it is raining,
3596
03:02:40,200 --> 03:02:43,040
that this implication goes both ways.
3597
03:02:43,040 --> 03:02:45,960
And this is what we would call biconditional elimination,
3598
03:02:45,960 --> 03:02:50,040
that I can take a biconditional, a if and only if b,
3599
03:02:50,040 --> 03:02:56,360
and translate that into something like this, a implies b, and b implies a.
3600
03:02:56,360 --> 03:03:00,400
So many of these inference rules are taking logic that uses certain symbols
3601
03:03:00,400 --> 03:03:03,960
and turning them into different symbols, taking an implication
3602
03:03:03,960 --> 03:03:06,680
and turning it into an or, or taking a biconditional
3603
03:03:06,680 --> 03:03:08,640
and turning it into implication.
3604
03:03:08,640 --> 03:03:11,640
And another example of it would be something like this.
3605
03:03:11,640 --> 03:03:16,200
It is not true that both Harry and Ron passed the test.
3606
03:03:16,200 --> 03:03:17,880
Well, all right, how do we translate that?
3607
03:03:17,880 --> 03:03:18,920
What does that mean?
3608
03:03:18,920 --> 03:03:22,920
Well, if it is not true that both of them passed the test, well,
3609
03:03:22,920 --> 03:03:25,080
then the reasonable conclusion we might draw
3610
03:03:25,080 --> 03:03:28,040
is that at least one of them didn't pass the test.
3611
03:03:28,040 --> 03:03:31,280
So the conclusion is either Harry did not pass the test
3612
03:03:31,280 --> 03:03:33,640
or Ron did not pass the test, or both.
3613
03:03:33,640 --> 03:03:35,240
This is not an exclusive or.
3614
03:03:35,240 --> 03:03:40,480
But if it is true that it is not true that both Harry and Ron passed the test,
3615
03:03:40,480 --> 03:03:45,240
well, then either Harry didn't pass the test or Ron didn't pass the test.
3616
03:03:45,240 --> 03:03:48,000
And this type of law is one of De Morgan's laws.
3617
03:03:48,000 --> 03:03:52,160
Quite famous in logic where the idea is that we can turn an and into an or.
3618
03:03:52,160 --> 03:03:56,360
We can take this and, that both Harry and Ron passed the test,
3619
03:03:56,360 --> 03:03:59,920
and turn it into an or by moving the nots around.
3620
03:03:59,920 --> 03:04:03,360
So if it is not true that Harry and Ron passed the test,
3621
03:04:03,360 --> 03:04:05,800
well, then either Harry did not pass the test
3622
03:04:05,800 --> 03:04:08,880
or Ron did not pass the test either.
3623
03:04:08,880 --> 03:04:12,280
And the way we frame that more formally using logic is to say this.
3624
03:04:12,280 --> 03:04:20,400
If it is not true that alpha and beta, well, then either not alpha or not beta.
3625
03:04:20,400 --> 03:04:22,320
The way I like to think about this is that if you
3626
03:04:22,320 --> 03:04:25,240
have a negation in front of an and expression,
3627
03:04:25,240 --> 03:04:27,880
you move the negation inwards, so to speak,
3628
03:04:27,880 --> 03:04:31,920
moving the negation into each of these individual sentences
3629
03:04:31,920 --> 03:04:34,720
and then flip the and into an or.
3630
03:04:34,720 --> 03:04:37,800
So the negation moves inwards and the and flips into an or.
3631
03:04:37,800 --> 03:04:43,600
So I go from not (a and b) to not a or not b.
3632
03:04:43,600 --> 03:04:45,640
And there's actually a reverse of De Morgan's law
3633
03:04:45,640 --> 03:04:48,320
that goes in the other direction for something like this.
3634
03:04:48,320 --> 03:04:52,240
If I say it is not true that Harry or Ron passed the test,
3635
03:04:52,240 --> 03:04:56,160
meaning neither of them passed the test, well, then the conclusion I can draw
3636
03:04:56,160 --> 03:05:01,040
is that Harry did not pass the test and Ron did not pass the test.
3637
03:05:01,040 --> 03:05:04,160
So in this case, instead of turning an and into an or,
3638
03:05:04,160 --> 03:05:06,560
we're turning an or into an and.
3639
03:05:06,560 --> 03:05:07,760
But the idea is the same.
3640
03:05:07,760 --> 03:05:10,880
And this, again, is another example of De Morgan's laws.
3641
03:05:10,880 --> 03:05:15,720
And the way that works is that if I have not (a or b) this time,
3642
03:05:15,720 --> 03:05:17,080
the same logic is going to apply.
3643
03:05:17,080 --> 03:05:19,240
I'm going to move the negation inwards.
3644
03:05:19,240 --> 03:05:22,640
And I'm going to flip this time, flip the or into an and.
3645
03:05:22,640 --> 03:05:28,520
So if not (alpha or beta), meaning it is not true that alpha or beta,
3646
03:05:28,520 --> 03:05:34,200
then I can say not alpha and not beta, moving the negation inwards
3647
03:05:34,200 --> 03:05:36,120
in order to make that conclusion.
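Both directions of De Morgan's laws can be checked mechanically by enumerating every truth assignment. This is a small sketch in Python, not part of the course's own code, with an illustrative helper named `equivalent`:

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Return True if formulas f and g agree on every truth assignment."""
    return all(f(*vals) == g(*vals)
               for vals in product([True, False], repeat=n_vars))

# not (a and b)  is equivalent to  (not a) or (not b)
law1 = equivalent(lambda a, b: not (a and b),
                  lambda a, b: (not a) or (not b), 2)

# not (a or b)  is equivalent to  (not a) and (not b)
law2 = equivalent(lambda a, b: not (a or b),
                  lambda a, b: (not a) and (not b), 2)

print(law1, law2)  # True True
```

Because there are only two symbols, checking all four assignments is enough to establish each equivalence.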
3648
03:05:36,120 --> 03:05:38,840
So those are De Morgan's laws and a couple other inference rules
3649
03:05:38,840 --> 03:05:40,680
that are worth just taking a look at.
3650
03:05:40,680 --> 03:05:43,360
One is the distributive law that works this way.
3651
03:05:43,360 --> 03:05:49,600
So if I have alpha and (beta or gamma), well, then much in the same way
3652
03:05:49,600 --> 03:05:52,640
that in math you can use distributive laws to distribute
3653
03:05:52,640 --> 03:05:55,440
operations like addition and multiplication,
3654
03:05:55,440 --> 03:06:01,120
I can do a similar thing here, where I can say if alpha and (beta or gamma),
3655
03:06:01,120 --> 03:06:06,600
then I can say something like (alpha and beta) or (alpha and gamma),
3656
03:06:06,600 --> 03:06:11,200
that I've been able to distribute this and sign throughout this expression.
3657
03:06:11,200 --> 03:06:13,200
So this is an example of the distributive property
3658
03:06:13,200 --> 03:06:16,960
or the distributive law as applied to logic in much the same way
3659
03:06:16,960 --> 03:06:19,800
that you would distribute a multiplication over the addition
3660
03:06:19,800 --> 03:06:22,160
of something, for example.
3661
03:06:22,160 --> 03:06:23,760
This works the other way too.
3662
03:06:23,760 --> 03:06:27,960
So if, for example, I have alpha or (beta and gamma),
3663
03:06:27,960 --> 03:06:30,280
I can distribute the or throughout the expression.
3664
03:06:30,280 --> 03:06:34,440
I can say (alpha or beta) and (alpha or gamma).
3665
03:06:34,440 --> 03:06:36,520
So the distributive law works in that way too.
3666
03:06:36,520 --> 03:06:40,320
And it's helpful if I want to take an or and move it into the expression.
3667
03:06:40,320 --> 03:06:43,160
And we'll see an example soon of why it is that we might actually
3668
03:06:43,160 --> 03:06:46,400
care to do something like that.
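Both forms of the distributive law can likewise be verified by brute force over all eight assignments of the three symbols. A quick sketch, not from the course itself:

```python
from itertools import product

# and distributes over or: a and (b or c) == (a and b) or (a and c)
ok_and_over_or = all(
    (a and (b or c)) == ((a and b) or (a and c))
    for a, b, c in product([True, False], repeat=3)
)

# or distributes over and: a or (b and c) == (a or b) and (a or c)
ok_or_over_and = all(
    (a or (b and c)) == ((a or b) and (a or c))
    for a, b, c in product([True, False], repeat=3)
)

print(ok_and_over_or, ok_or_over_and)  # True True
```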
3669
03:06:46,400 --> 03:06:49,640
All right, so now we've seen a lot of different inference rules.
3670
03:06:49,640 --> 03:06:53,640
And the question now is, how can we use those inference rules to actually try
3671
03:06:53,640 --> 03:06:57,120
and draw some conclusions, to actually try and prove something about entailment,
3672
03:06:57,120 --> 03:06:59,400
proving that given some initial knowledge base,
3673
03:06:59,400 --> 03:07:04,480
we would like to find some way to prove that a query is true?
3674
03:07:04,480 --> 03:07:06,520
Well, one way to think about it is actually
3675
03:07:06,520 --> 03:07:08,600
to think back to what we talked about last time
3676
03:07:08,600 --> 03:07:10,480
when we talked about search problems.
3677
03:07:10,480 --> 03:07:13,400
Recall again that search problems have some sort of initial state.
3678
03:07:13,400 --> 03:07:16,200
They have actions that you can take from one state to another
3679
03:07:16,200 --> 03:07:18,360
as defined by a transition model that tells you
3680
03:07:18,360 --> 03:07:20,240
how to get from one state to another.
3681
03:07:20,240 --> 03:07:22,800
We talked about testing to see if you were at a goal.
3682
03:07:22,800 --> 03:07:26,280
And then some path cost function to see how many steps
3683
03:07:26,280 --> 03:07:31,040
did you have to take or how costly was the solution that you found.
3684
03:07:31,040 --> 03:07:33,080
Now that we have these inference rules that
3685
03:07:33,080 --> 03:07:36,720
take some set of sentences in propositional logic
3686
03:07:36,720 --> 03:07:40,400
and get us some new set of sentences in propositional logic,
3687
03:07:40,400 --> 03:07:44,760
we can actually treat those sentences or those sets of sentences
3688
03:07:44,760 --> 03:07:47,320
as states inside of a search problem.
3689
03:07:47,320 --> 03:07:49,760
So if we want to prove that some query is true,
3690
03:07:49,760 --> 03:07:52,160
prove that some logical theorem is true,
3691
03:07:52,160 --> 03:07:55,860
we can treat theorem proving as a form of a search problem.
3692
03:07:55,860 --> 03:07:59,240
I can say that we begin in some initial state, where
3693
03:07:59,240 --> 03:08:02,040
that initial state is the knowledge base that I begin with,
3694
03:08:02,040 --> 03:08:05,600
the set of all of the sentences that I know to be true.
3695
03:08:05,600 --> 03:08:07,280
What actions are available to me?
3696
03:08:07,280 --> 03:08:09,520
Well, the actions are any of the inference rules
3697
03:08:09,520 --> 03:08:12,080
that I can apply at any given time.
3698
03:08:12,080 --> 03:08:16,440
The transition model just tells me after I apply the inference rule,
3699
03:08:16,440 --> 03:08:18,360
here is the new set of all of the knowledge
3700
03:08:18,360 --> 03:08:20,560
that I have, which will be the old set of knowledge,
3701
03:08:20,560 --> 03:08:23,540
plus some additional inference that I've been able to draw,
3702
03:08:23,540 --> 03:08:26,600
much in the same way as we saw when we applied those inference
3703
03:08:26,600 --> 03:08:28,720
rules and got some sort of conclusion.
3704
03:08:28,720 --> 03:08:31,600
That conclusion gets added to our knowledge base,
3705
03:08:31,600 --> 03:08:34,240
and our transition model will encode that.
3706
03:08:34,240 --> 03:08:35,440
What is the goal test?
3707
03:08:35,440 --> 03:08:38,160
Well, our goal test is checking to see if we
3708
03:08:38,160 --> 03:08:40,480
have proved the statement we're trying to prove,
3709
03:08:40,480 --> 03:08:44,880
if the thing we're trying to prove is inside of our knowledge base.
3710
03:08:44,880 --> 03:08:47,920
And the path cost function, the thing we're trying to minimize,
3711
03:08:47,920 --> 03:08:50,960
is maybe the number of inference rules that we needed to use,
3712
03:08:50,960 --> 03:08:54,840
the number of steps, so to speak, inside of our proof.
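That framing of theorem proving as a search problem can be sketched in a few lines. This is a minimal illustration with made-up names, not CS50's actual code: states are frozensets of sentences, actions are inference-rule applications, and the goal test asks whether the query has been derived.

```python
from collections import deque

def prove(knowledge_base, query, inference_rules):
    """Breadth-first search from the initial knowledge base to the query."""
    start = frozenset(knowledge_base)              # initial state
    frontier = deque([start])
    explored = {start}
    while frontier:
        state = frontier.popleft()
        if query in state:                         # goal test
            return True
        for rule in inference_rules:               # actions
            for conclusion in rule(state):
                successor = state | {conclusion}   # transition model
                if successor not in explored:
                    explored.add(successor)
                    frontier.append(successor)
    return False

# Toy inference rule: modus ponens over implications encoded as tuples.
def modus_ponens(state):
    return [q for s in state
            if isinstance(s, tuple) and s[0] == "implies"
            for (_, p, q) in [s] if p in state]

print(prove({"P", ("implies", "P", "Q")}, "Q", [modus_ponens]))  # True
```

Breadth-first search here also minimizes the path cost mentioned above: the number of inference steps in the proof.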
3713
03:08:54,840 --> 03:08:57,760
And so here we've been able to apply the same types of ideas
3714
03:08:57,760 --> 03:08:59,840
that we saw last time with search problems
3715
03:08:59,840 --> 03:09:02,560
to something like trying to prove something about knowledge
3716
03:09:02,560 --> 03:09:05,640
by taking our knowledge and framing it in terms
3717
03:09:05,640 --> 03:09:08,560
that we can understand as a search problem with an initial state,
3718
03:09:08,560 --> 03:09:10,920
with actions, with a transition model.
3719
03:09:10,920 --> 03:09:14,680
So this shows a couple of things, one being how versatile search problems
3720
03:09:14,680 --> 03:09:16,960
are: the same types of algorithms
3721
03:09:16,960 --> 03:09:19,280
that we use to solve a maze or figure out
3722
03:09:19,280 --> 03:09:22,360
how to get from point A to point B inside of driving directions,
3723
03:09:22,360 --> 03:09:25,480
for example, can also be used as a theorem proving
3724
03:09:25,480 --> 03:09:28,320
method of taking some sort of starting knowledge base
3725
03:09:28,320 --> 03:09:31,920
and trying to prove something about that knowledge.
3726
03:09:31,920 --> 03:09:35,120
So this, yet again, is a second way, in addition to model checking,
3727
03:09:35,120 --> 03:09:38,720
to try and prove that certain statements are true.
3728
03:09:38,720 --> 03:09:42,160
But it turns out there's yet another way that we can try and apply inference.
3729
03:09:42,160 --> 03:09:45,120
And we'll talk about this now, which is not the only way, but certainly one
3730
03:09:45,120 --> 03:09:48,560
of the most common, which is known as resolution.
3731
03:09:48,560 --> 03:09:51,880
And resolution is based on another inference rule
3732
03:09:51,880 --> 03:09:54,700
that we'll take a look at now, quite a powerful inference rule that
3733
03:09:54,700 --> 03:09:58,800
will let us prove anything that can be proven about a knowledge base.
3734
03:09:58,800 --> 03:10:01,440
And it's based on this basic idea.
3735
03:10:01,440 --> 03:10:05,360
Let's say I know that either Ron is in the Great Hall
3736
03:10:05,360 --> 03:10:08,040
or Hermione is in the library.
3737
03:10:08,040 --> 03:10:12,480
And let's say I also know that Ron is not in the Great Hall.
3738
03:10:12,480 --> 03:10:16,160
Based on those two pieces of information, what can I conclude?
3739
03:10:16,160 --> 03:10:18,640
Well, I could pretty reasonably conclude that Hermione
3740
03:10:18,640 --> 03:10:20,160
must be in the library.
3741
03:10:20,160 --> 03:10:21,160
How do I know that?
3742
03:10:21,160 --> 03:10:24,440
Well, it's because these two statements, these two
3743
03:10:24,440 --> 03:10:28,640
what we'll call complementary literals, literals that complement each other,
3744
03:10:28,640 --> 03:10:32,600
they're opposites of each other, seem to conflict with each other.
3745
03:10:32,600 --> 03:10:35,480
This sentence tells us that either Ron is in the Great Hall
3746
03:10:35,480 --> 03:10:37,680
or Hermione is in the library.
3747
03:10:37,680 --> 03:10:40,120
So if we know that Ron is not in the Great Hall,
3748
03:10:40,120 --> 03:10:45,720
that conflicts with this one, which means Hermione must be in the library.
3749
03:10:45,720 --> 03:10:48,640
And this we can frame as a more general rule
3750
03:10:48,640 --> 03:10:54,320
known as the unit resolution rule, a rule that says that if we have p or q
3751
03:10:54,320 --> 03:11:00,400
and we also know not p, well then from that we can reasonably conclude q.
3752
03:11:00,400 --> 03:11:03,880
That if p or q are true and we know that p is not true,
3753
03:11:03,880 --> 03:11:07,880
the only possibility is for q to then be true.
3754
03:11:07,880 --> 03:11:10,360
And this, it turns out, is quite a powerful inference rule
3755
03:11:10,360 --> 03:11:13,160
in terms of what it can do, in part because we can quickly
3756
03:11:13,160 --> 03:11:14,960
start to generalize this rule.
3757
03:11:14,960 --> 03:11:19,040
This q right here doesn't need to just be a single propositional symbol.
3758
03:11:19,040 --> 03:11:22,400
It could be multiple, all chained together in a single clause,
3759
03:11:22,400 --> 03:11:23,400
as we'll call it.
3760
03:11:23,400 --> 03:11:29,640
So if I had something like p or q1 or q2 or q3, so on and so forth, up until qn,
3761
03:11:29,640 --> 03:11:34,320
so I had n different other variables, and I have not p,
3762
03:11:34,320 --> 03:11:37,400
well then what happens when these two complement each other
3763
03:11:37,400 --> 03:11:40,720
is that these two clauses resolve, so to speak,
3764
03:11:40,720 --> 03:11:46,280
to produce a new clause that is just q1 or q2 all the way up to qn.
3765
03:11:46,280 --> 03:11:49,600
And in an or, the order of the arguments in the or doesn't actually matter.
3766
03:11:49,600 --> 03:11:50,960
The p doesn't need to be the first thing.
3767
03:11:50,960 --> 03:11:52,240
It could have been in the middle.
3768
03:11:52,240 --> 03:11:56,160
But the idea here is that if I have p in one clause and not
3769
03:11:56,160 --> 03:11:59,920
p in the other clause, well then I know that one of these remaining things
3770
03:11:59,920 --> 03:12:00,800
must be true.
3771
03:12:00,800 --> 03:12:04,640
I've resolved them in order to produce a new clause.
3772
03:12:04,640 --> 03:12:08,520
But it turns out we can generalize this idea even further, in fact,
3773
03:12:08,520 --> 03:12:12,640
and display even more power that we can have with this resolution rule.
3774
03:12:12,640 --> 03:12:14,520
So let's take another example.
3775
03:12:14,520 --> 03:12:17,240
Let's say, for instance, that I know the same piece of information
3776
03:12:17,240 --> 03:12:21,400
that either Ron is in the Great Hall or Hermione is in the library.
3777
03:12:21,400 --> 03:12:23,680
And the second piece of information I know
3778
03:12:23,680 --> 03:12:29,360
is that Ron is not in the Great Hall or Harry is sleeping.
3779
03:12:29,360 --> 03:12:31,520
So it's not just a single piece of information.
3780
03:12:31,520 --> 03:12:33,800
I have two different clauses.
3781
03:12:33,800 --> 03:12:37,360
And we'll define clauses more precisely in just a moment.
3782
03:12:37,360 --> 03:12:38,600
What do I know here?
3783
03:12:38,600 --> 03:12:42,360
Well again, for any propositional symbol like Ron is in the Great Hall,
3784
03:12:42,360 --> 03:12:44,320
there are only two possibilities.
3785
03:12:44,320 --> 03:12:48,520
Either Ron is in the Great Hall, in which case, based on resolution,
3786
03:12:48,520 --> 03:12:53,840
we know that Harry must be sleeping, or Ron is not in the Great Hall,
3787
03:12:53,840 --> 03:12:56,160
in which case we know based on the same rule
3788
03:12:56,160 --> 03:12:59,320
that Hermione must be in the library.
3789
03:12:59,320 --> 03:13:01,320
Based on those two things in combination,
3790
03:13:01,320 --> 03:13:03,920
I can say based on these two premises that I
3791
03:13:03,920 --> 03:13:10,400
can conclude that either Hermione is in the library or Harry is sleeping.
3792
03:13:10,400 --> 03:13:13,200
So again, because these two conflict with each other,
3793
03:13:13,200 --> 03:13:15,600
I know that one of these two must be true.
3794
03:13:15,600 --> 03:13:18,560
And you can take a closer look and try and reason through that logic.
3795
03:13:18,560 --> 03:13:22,400
Make sure you convince yourself that you believe this conclusion.
3796
03:13:22,400 --> 03:13:25,320
Stated more generally, we can name this resolution rule
3797
03:13:25,320 --> 03:13:28,680
by saying that if we know p or q is true,
3798
03:13:28,680 --> 03:13:33,040
and we also know that not p or r is true,
3799
03:13:33,040 --> 03:13:37,760
we resolve these two clauses together to get a new clause, q or r,
3800
03:13:37,760 --> 03:13:41,320
that either q or r must be true.
3801
03:13:41,320 --> 03:13:43,920
And again, much as in the last case, q and r
3802
03:13:43,920 --> 03:13:46,720
don't need to just be single propositional symbols.
3803
03:13:46,720 --> 03:13:48,160
It could be multiple symbols.
3804
03:13:48,160 --> 03:13:52,720
So if I had a rule that had p or q1 or q2 or q3, so on and so forth,
3805
03:13:52,720 --> 03:13:55,680
up until qn, where n is just some number.
3806
03:13:55,680 --> 03:14:02,440
And likewise, I had not p or r1 or r2, so on and so forth, up until rm,
3807
03:14:02,440 --> 03:14:05,340
where m, again, is just some other number.
3808
03:14:05,340 --> 03:14:09,680
I can resolve these two clauses together to get one of these must be true,
3809
03:14:09,680 --> 03:14:14,680
q1 or q2 up until qn or r1 or r2 up until rm.
3810
03:14:14,680 --> 03:14:19,520
And this is just a generalization of that same rule we saw before.
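The generalized resolution rule can be sketched by treating each clause as a set of literals, where a literal is written as a string like "P" or its negation "-P". These helper names are illustrative, not from the course:

```python
def negate(literal):
    """Return the complementary literal: P <-> -P."""
    return literal[1:] if literal.startswith("-") else "-" + literal

def resolve(clause_a, clause_b):
    """Yield every clause obtainable by resolving on a complementary pair."""
    for lit in clause_a:
        if negate(lit) in clause_b:
            # Drop the complementary pair, keep everything else.
            yield (clause_a - {lit}) | (clause_b - {negate(lit)})

# (P or Q) resolved with (-P or R) produces (Q or R).
result = next(resolve({"P", "Q"}, {"-P", "R"}))
print(sorted(result))  # ['Q', 'R']
```

Because the clauses are sets, the order of the literals inside each clause does not matter, just as the order of an or's arguments does not matter.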
3811
03:14:19,520 --> 03:14:23,160
Each of these things here are what we're going to call a clause,
3812
03:14:23,160 --> 03:14:27,760
where a clause is formally defined as a disjunction of literals,
3813
03:14:27,760 --> 03:14:31,720
where a disjunction means it's a bunch of things that are connected with or.
3814
03:14:31,720 --> 03:14:34,120
Disjunction means things connected with or.
3815
03:14:34,120 --> 03:14:37,400
Conjunction, meanwhile, is things connected with and.
3816
03:14:37,400 --> 03:14:40,360
And a literal is either a propositional symbol
3817
03:14:40,360 --> 03:14:42,320
or the opposite of a propositional symbol.
3818
03:14:42,320 --> 03:14:46,160
So it's something like p or q or not p or not q.
3819
03:14:46,160 --> 03:14:50,360
Those are all propositional symbols or not of the propositional symbols.
3820
03:14:50,360 --> 03:14:52,920
And we call those literals.
3821
03:14:52,920 --> 03:14:57,920
And so a clause is just something like this, p or q or r, for example.
3822
03:14:57,920 --> 03:15:00,440
Meanwhile, what this gives us an ability to do
3823
03:15:00,440 --> 03:15:04,520
is it gives us an ability to turn logic, any logical sentence,
3824
03:15:04,520 --> 03:15:07,960
into something called conjunctive normal form.
3825
03:15:07,960 --> 03:15:11,480
A conjunctive normal form sentence is a logical sentence
3826
03:15:11,480 --> 03:15:14,240
that is a conjunction of clauses.
3827
03:15:14,240 --> 03:15:18,760
Recall, again, conjunction means things are connected to one another using and.
3828
03:15:18,760 --> 03:15:23,840
And so a conjunction of clauses means it is an and of individual clauses,
3829
03:15:23,840 --> 03:15:25,440
each of which has ors in it.
3830
03:15:25,440 --> 03:15:32,240
So something like this, a or b or c, and d or not e, and f or g.
3831
03:15:32,240 --> 03:15:35,440
Everything in parentheses is one clause.
3832
03:15:35,440 --> 03:15:38,960
All of the clauses are connected to each other using an and.
3833
03:15:38,960 --> 03:15:43,080
And everything in the clause is separated using an or.
3834
03:15:43,080 --> 03:15:46,680
And this is just a standard form that we can translate a logical sentence
3835
03:15:46,680 --> 03:15:50,440
into that just makes it easy to work with and easy to manipulate.
3836
03:15:50,440 --> 03:15:53,360
And it turns out that we can take any sentence in logic
3837
03:15:53,360 --> 03:15:56,400
and turn it into conjunctive normal form just
3838
03:15:56,400 --> 03:15:59,960
by applying some inference rules and transformations to it.
3839
03:15:59,960 --> 03:16:03,080
So we'll take a look at how we can actually do that.
3840
03:16:03,080 --> 03:16:06,000
So what is the process for taking a logical formula
3841
03:16:06,000 --> 03:16:10,480
and converting it into conjunctive normal form, otherwise known as CNF?
3842
03:16:10,480 --> 03:16:12,520
Well, the process looks a little something like this.
3843
03:16:12,520 --> 03:16:14,840
We need to take all of the symbols that are not
3844
03:16:14,840 --> 03:16:16,200
part of conjunctive normal form.
3845
03:16:16,200 --> 03:16:18,920
The bi-conditionals and the implications and so forth,
3846
03:16:18,920 --> 03:16:23,320
and turn them into something that is more closely like conjunctive normal
3847
03:16:23,320 --> 03:16:24,160
form.
3848
03:16:24,160 --> 03:16:26,760
So the first step will be to eliminate bi-conditionals,
3849
03:16:26,760 --> 03:16:29,160
those if and only if double arrows.
3850
03:16:29,160 --> 03:16:31,120
And we know how to eliminate bi-conditionals
3851
03:16:31,120 --> 03:16:34,200
because we saw there was an inference rule to do just that.
3852
03:16:34,200 --> 03:16:38,400
Any time I have an expression like alpha if and only if beta,
3853
03:16:38,400 --> 03:16:43,400
I can turn that into alpha implies beta and beta implies alpha
3854
03:16:43,400 --> 03:16:46,480
based on that inference rule we saw before.
3855
03:16:46,480 --> 03:16:48,880
Likewise, in addition to eliminating bi-conditionals,
3856
03:16:48,880 --> 03:16:52,680
I can eliminate implications as well, the if then arrows.
3857
03:16:52,680 --> 03:16:56,120
And I can do that using the same inference rule we saw before too,
3858
03:16:56,120 --> 03:17:01,480
taking alpha implies beta and turning that into not alpha or beta
3859
03:17:01,480 --> 03:17:06,440
because that is logically equivalent to this first thing here.
3860
03:17:06,440 --> 03:17:08,760
Then we can move nots inwards because we don't
3861
03:17:08,760 --> 03:17:10,800
want nots on the outsides of our expressions.
3862
03:17:10,800 --> 03:17:14,280
Conjunctive normal form requires that it's just clause and clause
3863
03:17:14,280 --> 03:17:15,800
and clause and clause.
3864
03:17:15,800 --> 03:17:19,560
Any nots need to be immediately next to propositional symbols.
3865
03:17:19,560 --> 03:17:22,520
But we can move those nots around using De Morgan's laws
3866
03:17:22,520 --> 03:17:29,000
by taking something like not (A and B) and turning it into not A or not B,
3867
03:17:29,000 --> 03:17:31,800
for example, using De Morgan's laws to manipulate that.
3868
03:17:31,800 --> 03:17:34,600
And after that, all we'll be left with are ands and ors.
3869
03:17:34,600 --> 03:17:35,920
And those are easy to deal with.
3870
03:17:35,920 --> 03:17:39,160
We can use the distributive law to distribute the ors
3871
03:17:39,160 --> 03:17:42,760
so that the ors end up on the inside of the expression, so to speak,
3872
03:17:42,760 --> 03:17:45,320
and the ands end up on the outside.
3873
03:17:45,320 --> 03:17:47,900
So this is the general pattern for how we'll take a formula
3874
03:17:47,900 --> 03:17:50,160
and convert it into conjunctive normal form.
3875
03:17:50,160 --> 03:17:53,400
And let's now take a look at an example of how we would do this
3876
03:17:53,400 --> 03:17:57,520
and explore then why it is that we would want to do something like this.
3877
03:17:57,520 --> 03:17:58,600
Here's how we can do it.
3878
03:17:58,600 --> 03:18:00,600
Let's take this formula, for example.
3879
03:18:00,600 --> 03:18:06,160
(P or Q) implies R. And I'd like to convert this into conjunctive normal form,
3880
03:18:06,160 --> 03:18:10,800
where it's all ands of clauses, and every clause is a disjunctive clause.
3881
03:18:10,800 --> 03:18:12,400
It's ors together.
3882
03:18:12,400 --> 03:18:14,120
So what's the first thing I need to do?
3883
03:18:14,120 --> 03:18:15,840
Well, this is an implication.
3884
03:18:15,840 --> 03:18:18,160
So let me go ahead and remove that implication.
3885
03:18:18,160 --> 03:18:25,220
Using the implication inference rule, I can turn (P or Q) implies R
3886
03:18:25,220 --> 03:18:29,880
into not (P or Q) or R. So that's the first step.
3887
03:18:29,880 --> 03:18:32,100
I've gotten rid of the implication.
3888
03:18:32,100 --> 03:18:36,080
And next, I can get rid of the not on the outside of this expression, too.
3889
03:18:36,080 --> 03:18:41,560
I can move the nots inwards so they're closer to the literals themselves
3890
03:18:41,560 --> 03:18:43,080
by using De Morgan's laws.
3891
03:18:43,080 --> 03:18:50,480
And De Morgan's law says that not (P or Q) is equivalent to not P and not Q.
3892
03:18:50,480 --> 03:18:52,920
Again, here, just applying the inference rules
3893
03:18:52,920 --> 03:18:57,120
that we've already seen in order to translate these statements.
3894
03:18:57,120 --> 03:19:00,920
And now, I have two things that are separated by an or,
3895
03:19:00,920 --> 03:19:03,080
where this thing on the inside is an and.
3896
03:19:03,080 --> 03:19:06,560
What I'd really like is to move the ors so the ors are on the inside,
3897
03:19:06,560 --> 03:19:10,040
because conjunctive normal form means I need clause and clause
3898
03:19:10,040 --> 03:19:11,680
and clause and clause.
3899
03:19:11,680 --> 03:19:14,260
And so to do that, I can use the distributive law.
3900
03:19:14,260 --> 03:19:21,080
If I have (not P and not Q) or R, I can distribute the or R to both of these
3901
03:19:21,080 --> 03:19:26,800
to get not P or R and not Q or R using the distributive law.
3902
03:19:26,800 --> 03:19:30,520
And this now here at the bottom is in conjunctive normal form.
3903
03:19:30,520 --> 03:19:35,840
It is a conjunction, an and, of clauses that are disjunctions,
3904
03:19:35,840 --> 03:19:38,200
with literals just separated by ors.
3905
03:19:38,200 --> 03:19:42,120
So this process can be used on any formula to take a logical sentence
3906
03:19:42,120 --> 03:19:44,920
and turn it into this conjunctive normal form, where
3907
03:19:44,920 --> 03:19:49,800
I have clause and clause and clause and clause and clause and so on.
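Each step of that conversion is an equivalence, so the final CNF must agree with the original formula on every truth assignment. A brute-force check of the worked example, written as a sketch rather than course code:

```python
from itertools import product

def implies(x, y):
    """Material implication: x implies y is (not x) or y."""
    return (not x) or y

# Original formula: (P or Q) implies R
original = lambda p, q, r: implies(p or q, r)

# Its conjunctive normal form: (not P or R) and (not Q or R)
cnf = lambda p, q, r: ((not p) or r) and ((not q) or r)

same = all(original(p, q, r) == cnf(p, q, r)
           for p, q, r in product([True, False], repeat=3))
print(same)  # True
```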
3908
03:19:49,800 --> 03:19:50,800
So why is this helpful?
3909
03:19:50,800 --> 03:19:52,960
Why do we even care about taking all these sentences
3910
03:19:52,960 --> 03:19:54,640
and converting them into this form?
3911
03:19:54,640 --> 03:19:58,560
It's because once they're in this form where we have these clauses,
3912
03:19:58,560 --> 03:20:02,360
these clauses are the inputs to the resolution inference rule
3913
03:20:02,360 --> 03:20:05,640
that we saw a moment ago, that if I have two clauses where there's
3914
03:20:05,640 --> 03:20:08,040
something that conflicts or something complementary
3915
03:20:08,040 --> 03:20:10,680
between those two clauses, I can resolve them
3916
03:20:10,680 --> 03:20:13,160
to get a new clause, to draw a new conclusion.
3917
03:20:13,160 --> 03:20:16,220
And we call this process inference by resolution,
3918
03:20:16,220 --> 03:20:19,640
using the resolution rule to draw some sort of inference.
3919
03:20:19,640 --> 03:20:23,720
And it's based on the same idea, that if I have P or Q, this clause,
3920
03:20:23,720 --> 03:20:28,380
and I have not P or R, that I can resolve these two clauses together
3921
03:20:28,380 --> 03:20:32,960
to get Q or R as the resulting clause, a new piece of information
3922
03:20:32,960 --> 03:20:35,000
that I didn't have before.
3923
03:20:35,000 --> 03:20:37,500
Now, a couple of key points that are worth noting about this
3924
03:20:37,500 --> 03:20:39,720
before we talk about the actual algorithm.
3925
03:20:39,720 --> 03:20:43,560
One thing is that, let's imagine we have P or Q or S,
3926
03:20:43,560 --> 03:20:48,200
and I also have not P or R or S. The resolution rule
3927
03:20:48,200 --> 03:20:51,680
says that because this P conflicts with this not P,
3928
03:20:51,680 --> 03:20:57,000
we would resolve to put everything else together to get Q or S or R or S.
3929
03:20:57,000 --> 03:21:01,480
But it turns out that this double S is redundant, or S here and or S there.
3930
03:21:01,480 --> 03:21:03,680
It doesn't change the meaning of the sentence.
3931
03:21:03,680 --> 03:21:06,240
So in resolution, when we do this resolution process,
3932
03:21:06,240 --> 03:21:08,880
we'll usually also do a process known as factoring,
3933
03:21:08,880 --> 03:21:11,360
where we take any duplicate variables that show up
3934
03:21:11,360 --> 03:21:12,480
and just eliminate them.
3935
03:21:12,480 --> 03:21:18,880
So Q or S or R or S just becomes Q or R or S. The S only needs to appear once,
3936
03:21:18,880 --> 03:21:22,000
no need to include it multiple times.
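Factoring falls out for free if clauses are represented as Python sets, since a set cannot hold the same literal twice. A one-line illustration, not from the course's own code:

```python
# Literals left over after resolving (P or Q or S) with (-P or R or S):
raw_resolvent = ["Q", "S", "R", "S"]   # S appears twice

# Representing the clause as a set performs the factoring automatically.
factored = set(raw_resolvent)
print(sorted(factored))  # ['Q', 'R', 'S']
```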
3937
03:21:22,000 --> 03:21:24,120
Now, one final question worth considering
3938
03:21:24,120 --> 03:21:28,960
is what happens if I try to resolve P and not P together?
3939
03:21:28,960 --> 03:21:32,440
If I know that P is true and I know that not P is true,
3940
03:21:32,440 --> 03:21:35,240
well, resolution says I can merge these clauses together
3941
03:21:35,240 --> 03:21:37,160
and look at everything else.
3942
03:21:37,160 --> 03:21:39,320
Well, in this case, there is nothing else,
3943
03:21:39,320 --> 03:21:42,280
so I'm left with what we might call the empty clause.
3944
03:21:42,280 --> 03:21:43,840
I'm left with nothing.
3945
03:21:43,840 --> 03:21:46,920
And the empty clause is always false.
3946
03:21:46,920 --> 03:21:49,920
The empty clause is equivalent to just being false.
3947
03:21:49,920 --> 03:21:55,720
And that's pretty reasonable because it's impossible for both P and not P
3948
03:21:55,720 --> 03:21:57,400
to hold at the same time.
3949
03:21:57,400 --> 03:21:59,800
P is either true or it's not true, which
3950
03:21:59,800 --> 03:22:02,960
means that if P is true, then this must be false.
3951
03:22:02,960 --> 03:22:05,000
And if this is true, then this must be false.
3952
03:22:05,000 --> 03:22:07,880
There is no way for both of these to hold at the same time.
3953
03:22:07,880 --> 03:22:11,320
So if ever I try and resolve these two, it's a contradiction,
3954
03:22:11,320 --> 03:22:14,600
and I'll end up getting this empty clause where the empty clause I
3955
03:22:14,600 --> 03:22:17,440
can call equivalent to false.
3956
03:22:17,440 --> 03:22:21,400
And this idea that if I resolve these two contradictory terms,
3957
03:22:21,400 --> 03:22:25,280
I get the empty clause, this is the basis for our inference
3958
03:22:25,280 --> 03:22:26,880
by resolution algorithm.
3959
03:22:26,880 --> 03:22:29,480
Here's how we're going to perform inference by resolution
3960
03:22:29,480 --> 03:22:31,040
at a very high level.
3961
03:22:31,040 --> 03:22:35,760
We want to prove that our knowledge base entails some query alpha,
3962
03:22:35,760 --> 03:22:39,040
that based on the knowledge we have, we can prove conclusively
3963
03:22:39,040 --> 03:22:41,600
that alpha is going to be true.
3964
03:22:41,600 --> 03:22:43,200
How are we going to do that?
3965
03:22:43,200 --> 03:22:45,160
Well, in order to do that, we're going to try
3966
03:22:45,160 --> 03:22:49,440
to prove that if we know the knowledge and not alpha,
3967
03:22:49,440 --> 03:22:51,560
that that would be a contradiction.
3968
03:22:51,560 --> 03:22:53,560
And this is a common technique in computer science
3969
03:22:53,560 --> 03:22:57,440
more generally, this idea of proving something by contradiction.
3970
03:22:57,440 --> 03:23:00,200
If I want to prove that something is true,
3971
03:23:00,200 --> 03:23:04,000
I can do so by first assuming that it is false
3972
03:23:04,000 --> 03:23:06,160
and showing that it would be contradictory,
3973
03:23:06,160 --> 03:23:08,360
showing that it leads to some contradiction.
3974
03:23:08,360 --> 03:23:11,800
And if the thing I'm trying to prove, when I assume it's false,
3975
03:23:11,800 --> 03:23:14,760
leads to a contradiction, then it must be true.
3976
03:23:14,760 --> 03:23:18,560
And that's the logical approach or the idea behind a proof by contradiction.
3977
03:23:18,560 --> 03:23:20,160
And that's what we're going to do here.
3978
03:23:20,160 --> 03:23:23,400
We want to prove that this query alpha is true.
3979
03:23:23,400 --> 03:23:26,040
So we're going to assume that it's not true.
3980
03:23:26,040 --> 03:23:28,120
We're going to assume not alpha.
3981
03:23:28,120 --> 03:23:30,680
And we're going to try and prove that it's a contradiction.
3982
03:23:30,680 --> 03:23:32,960
If we do get a contradiction, well, then we
3983
03:23:32,960 --> 03:23:36,440
know that our knowledge entails the query alpha.
3984
03:23:36,440 --> 03:23:39,040
If we don't get a contradiction, there is no entailment.
3985
03:23:39,040 --> 03:23:41,400
This is this idea of a proof by contradiction
3986
03:23:41,400 --> 03:23:44,000
of assuming the opposite of what you're trying to prove.
3987
03:23:44,000 --> 03:23:46,520
And if you can demonstrate that that's a contradiction,
3988
03:23:46,520 --> 03:23:49,840
then what you're proving must be true.
3989
03:23:49,840 --> 03:23:51,760
But more formally, how do we actually do this?
3990
03:23:51,760 --> 03:23:56,160
How do we check that knowledge base and not alpha
3991
03:23:56,160 --> 03:23:58,000
is going to lead to a contradiction?
3992
03:23:58,000 --> 03:24:01,320
Well, here is where resolution comes into play.
3993
03:24:01,320 --> 03:24:05,160
To determine if our knowledge base entails some query alpha,
3994
03:24:05,160 --> 03:24:08,400
we're going to convert knowledge base and not alpha
3995
03:24:08,400 --> 03:24:10,520
to conjunctive normal form, that form where
3996
03:24:10,520 --> 03:24:14,400
we have a whole bunch of clauses that are all anded together.
3997
03:24:14,400 --> 03:24:16,680
And when we have these individual clauses,
3998
03:24:16,680 --> 03:24:21,600
now we can keep checking to see if we can use resolution
3999
03:24:21,600 --> 03:24:23,640
to produce a new clause.
4000
03:24:23,640 --> 03:24:26,720
We can take any pair of clauses and check,
4001
03:24:26,720 --> 03:24:29,920
is there some literal in one of them
4002
03:24:29,920 --> 03:24:32,240
that is complementary to a literal in the other?
4003
03:24:32,240 --> 03:24:35,880
For example, I have a p in one clause and a not p in another clause.
4004
03:24:35,880 --> 03:24:39,480
Or an r in one clause and a not r in another clause.
4005
03:24:39,480 --> 03:24:41,640
If ever I have that situation where once I
4006
03:24:41,640 --> 03:24:44,920
convert to conjunctive normal form and I have a whole bunch of clauses,
4007
03:24:44,920 --> 03:24:49,720
I see two clauses that I can resolve to produce a new clause, then I'll do so.
4008
03:24:49,720 --> 03:24:50,960
This process occurs in a loop.
4009
03:24:50,960 --> 03:24:53,960
I'm going to keep checking to see if I can use resolution
4010
03:24:53,960 --> 03:24:56,760
to produce a new clause and keep using those new clauses
4011
03:24:56,760 --> 03:25:00,520
to try to generate more new clauses after that.
4012
03:25:00,520 --> 03:25:03,000
Now, it just so may happen that eventually we
4013
03:25:03,000 --> 03:25:06,880
may produce the empty clause, the clause we were talking about before.
4014
03:25:06,880 --> 03:25:11,720
If I resolve p and not p together, that produces the empty clause
4015
03:25:11,720 --> 03:25:14,620
and the empty clause we know to be false.
4016
03:25:14,620 --> 03:25:18,280
Because we know that there's no way for both p and not p
4017
03:25:18,280 --> 03:25:21,200
to both simultaneously be true.
4018
03:25:21,200 --> 03:25:25,120
So if ever we produce the empty clause, then we have a contradiction.
4019
03:25:25,120 --> 03:25:27,720
And if we have a contradiction, that's exactly what we were trying
4020
03:25:27,720 --> 03:25:29,720
to do in a proof by contradiction.
4021
03:25:29,720 --> 03:25:32,360
If we have a contradiction, then we know that our knowledge base
4022
03:25:32,360 --> 03:25:34,400
must entail this query alpha.
4023
03:25:34,400 --> 03:25:37,600
And we know that alpha must be true.
4024
03:25:37,600 --> 03:25:39,920
And it turns out, and we won't go into the proof here,
4025
03:25:39,920 --> 03:25:43,760
but you can show that otherwise, if you don't produce the empty clause,
4026
03:25:43,760 --> 03:25:45,400
then there is no entailment.
4027
03:25:45,400 --> 03:25:48,680
If we run into a situation where there are no more new clauses to add,
4028
03:25:48,680 --> 03:25:50,960
we've done all the resolution that we can do,
4029
03:25:50,960 --> 03:25:53,400
and yet we still haven't produced the empty clause,
4030
03:25:53,400 --> 03:25:56,480
then there is no entailment in this case.
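This loop can be sketched in Python. The representation here is one common choice, not anything prescribed by the lecture: a clause is a frozenset of string literals, with "~" marking negation, and the function names `resolve` and `entails` are illustrative.

```python
def negate(literal):
    """Return the complementary literal: A <-> ~A."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolve(c1, c2):
    """Yield every clause obtainable by resolving clause c1 with clause c2."""
    for lit in c1:
        if negate(lit) in c2:
            # Drop the complementary pair and union what remains.
            yield frozenset((c1 - {lit}) | (c2 - {negate(lit)}))

def entails(knowledge, alpha):
    """Check KB entails alpha by refutation: add not-alpha, look for the empty clause."""
    clauses = set(knowledge) | {frozenset({negate(alpha)})}
    while True:
        new = set()
        for c1 in clauses:
            for c2 in clauses:
                for resolvent in resolve(c1, c2):
                    if not resolvent:      # empty clause: contradiction found
                        return True
                    new.add(resolvent)
        if new <= clauses:                 # no new clauses: no entailment
            return False
        clauses |= new
```

On the example coming up next, `entails({frozenset({"A", "B"}), frozenset({"~B", "C"}), frozenset({"~C"})}, "A")` works through exactly the loop described above.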
4031
03:25:56,480 --> 03:25:58,720
And this now is the resolution algorithm.
4032
03:25:58,720 --> 03:26:01,240
And it's very abstract looking, especially this idea of like,
4033
03:26:01,240 --> 03:26:03,560
what does it even mean to have the empty clause?
4034
03:26:03,560 --> 03:26:05,440
So let's take a look at an example, actually
4035
03:26:05,440 --> 03:26:11,320
try and prove some entailment by using this inference by resolution process.
4036
03:26:11,320 --> 03:26:12,680
So here's our question.
4037
03:26:12,680 --> 03:26:14,200
We have this knowledge base.
4038
03:26:14,200 --> 03:26:21,200
Here is the knowledge that we know, A or B, and not B or C, and not C.
4039
03:26:21,200 --> 03:26:25,840
And we want to know if all of this entails A.
4040
03:26:25,840 --> 03:26:28,600
So this is our knowledge base here, this whole long thing.
4041
03:26:28,600 --> 03:26:33,160
And our query alpha is just this propositional symbol, A.
4042
03:26:33,160 --> 03:26:34,240
So what do we do?
4043
03:26:34,240 --> 03:26:36,480
Well, first, we want to prove by contradiction.
4044
03:26:36,480 --> 03:26:39,600
So we want to first assume that A is false,
4045
03:26:39,600 --> 03:26:42,200
and see if that leads to some sort of contradiction.
4046
03:26:42,200 --> 03:26:46,880
So here is what we're going to start with, A or B, and not B or C, and not C.
4047
03:26:46,880 --> 03:26:48,680
This is our knowledge base.
4048
03:26:48,680 --> 03:26:51,280
And we're going to assume not A. We're going
4049
03:26:51,280 --> 03:26:56,760
to assume that the thing we're trying to prove is, in fact, false.
4050
03:26:56,760 --> 03:26:59,520
And so this is now in conjunctive normal form,
4051
03:26:59,520 --> 03:27:01,400
and I have four different clauses.
4052
03:27:01,400 --> 03:27:08,880
I have A or B. I have not B or C. I have not C, and I have not A.
4053
03:27:08,880 --> 03:27:12,800
And now, I can begin to just pick two clauses that I can resolve,
4054
03:27:12,800 --> 03:27:15,880
and apply the resolution rule to them.
4055
03:27:15,880 --> 03:27:20,320
And so looking at these four clauses, I see, all right, these two clauses
4056
03:27:20,320 --> 03:27:21,440
are ones I can resolve.
4057
03:27:21,440 --> 03:27:25,160
I can resolve them because there are complementary literals
4058
03:27:25,160 --> 03:27:26,040
that show up in them.
4059
03:27:26,040 --> 03:27:28,600
There's a C here, and a not C here.
4060
03:27:28,600 --> 03:27:34,240
So just looking at these two clauses, if I know that not B or C is true,
4061
03:27:34,240 --> 03:27:36,960
and I know that C is not true, well, then I
4062
03:27:36,960 --> 03:27:41,280
can resolve these two clauses to say, all right, not B, that must be true.
4063
03:27:41,280 --> 03:27:45,040
I can generate this new clause as a new piece of information
4064
03:27:45,040 --> 03:27:47,800
that I now know to be true.
4065
03:27:47,800 --> 03:27:50,800
And all right, now I can repeat this process, do the process again.
4066
03:27:50,800 --> 03:27:54,160
Can I use resolution again to get some new conclusion?
4067
03:27:54,160 --> 03:27:55,160
Well, it turns out I can.
4068
03:27:55,160 --> 03:27:58,720
I can use that new clause I just generated, along with this one here.
4069
03:27:58,720 --> 03:28:00,600
There are complementary literals.
4070
03:28:00,600 --> 03:28:06,280
This B is complementary to, or conflicts with, this not B over here.
4071
03:28:06,280 --> 03:28:12,320
And so if I know that A or B is true, and I know that B is not true,
4072
03:28:12,320 --> 03:28:15,560
well, then the only remaining possibility is that A must be true.
4073
03:28:15,560 --> 03:28:19,640
So now we have A. That is a new clause that I've been able to generate.
4074
03:28:19,640 --> 03:28:21,240
And now, I can do this one more time.
4075
03:28:21,240 --> 03:28:23,360
I'm looking for two clauses that can be resolved,
4076
03:28:23,360 --> 03:28:25,480
and you might programmatically do this by just looping
4077
03:28:25,480 --> 03:28:28,320
over all possible pairs of clauses and checking
4078
03:28:28,320 --> 03:28:30,240
for complementary literals in each.
4079
03:28:30,240 --> 03:28:34,560
And here, I can say, all right, I found two clauses, not A and A,
4080
03:28:34,560 --> 03:28:36,360
that conflict with each other.
4081
03:28:36,360 --> 03:28:38,600
And when I resolve these two together, well,
4082
03:28:38,600 --> 03:28:42,040
this is the same as when we were resolving P and not P from before.
4083
03:28:42,040 --> 03:28:45,760
When I resolve these two clauses together, I get rid of the As,
4084
03:28:45,760 --> 03:28:48,240
and I'm left with the empty clause.
4085
03:28:48,240 --> 03:28:51,920
And the empty clause we know to be false, which means we have a contradiction,
4086
03:28:51,920 --> 03:28:56,320
which means we can safely say that this whole knowledge base does entail A.
4087
03:28:56,320 --> 03:29:02,080
That if this sentence is true, that we know that A for sure is also true.
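The three resolution steps just described can be traced by hand with Python sets, where "~" marks negation. This is only a walkthrough of the example, with each resolution written out as the set operation it performs:

```python
c1 = {"A", "B"}    # A or B
c2 = {"~B", "C"}   # not B or C
c3 = {"~C"}        # not C
c4 = {"~A"}        # assume not A (the negated query)

# Resolve on the complementary pair C / ~C:
step1 = (c2 - {"C"}) | (c3 - {"~C"})      # leaves {"~B"}
# Resolve on B / ~B:
step2 = (c1 - {"B"}) | (step1 - {"~B"})   # leaves {"A"}
# Resolve on A / ~A:
step3 = (step2 - {"A"}) | (c4 - {"~A"})   # leaves set(), the empty clause

print(step1, step2, step3)
```

Reaching the empty set is the contradiction, so the knowledge base entails A.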
4088
03:29:02,080 --> 03:29:04,720
So this now, using inference by resolution,
4089
03:29:04,720 --> 03:29:07,740
is an entirely different way to take some statement
4090
03:29:07,740 --> 03:29:10,240
and try and prove that it is, in fact, true.
4091
03:29:10,240 --> 03:29:12,560
Instead of enumerating all of the possible worlds
4092
03:29:12,560 --> 03:29:15,840
that we might be in in order to try to figure out in which cases
4093
03:29:15,840 --> 03:29:18,760
is the knowledge base true and in which cases is our query true,
4094
03:29:18,760 --> 03:29:22,000
instead we use this resolution algorithm to say,
4095
03:29:22,000 --> 03:29:25,080
let's keep trying to figure out what conclusions we can draw
4096
03:29:25,080 --> 03:29:27,240
and see if we reach a contradiction.
4097
03:29:27,240 --> 03:29:28,920
And if we reach a contradiction, then that
4098
03:29:28,920 --> 03:29:31,840
tells us something about whether our knowledge actually
4099
03:29:31,840 --> 03:29:33,540
entails the query or not.
4100
03:29:33,540 --> 03:29:35,840
And it turns out there are many different algorithms that
4101
03:29:35,840 --> 03:29:37,520
can be used for inference.
4102
03:29:37,520 --> 03:29:39,840
What we've just looked at here are just a couple of them.
4103
03:29:39,840 --> 03:29:44,080
And in fact, all of this is just based on one particular type of logic.
4104
03:29:44,080 --> 03:29:47,900
It's based on propositional logic, where we have these individual symbols
4105
03:29:47,900 --> 03:29:52,640
and we connect them using and, or, not, implies, and biconditionals.
4106
03:29:52,640 --> 03:29:56,760
But propositional logic is not the only kind of logic that exists.
4107
03:29:56,760 --> 03:29:58,880
And in fact, we see that there are limitations
4108
03:29:58,880 --> 03:30:01,680
that exist in propositional logic, especially
4109
03:30:01,680 --> 03:30:06,000
as we saw in examples like with the mastermind example
4110
03:30:06,000 --> 03:30:08,560
or with the example with the logic puzzle where
4111
03:30:08,560 --> 03:30:12,260
we had different Hogwarts house people that belong to different houses
4112
03:30:12,260 --> 03:30:15,080
and we were trying to figure out who belonged to which houses.
4113
03:30:15,080 --> 03:30:18,280
There were a lot of different propositional symbols that we needed
4114
03:30:18,280 --> 03:30:21,680
in order to represent some fairly basic ideas.
4115
03:30:21,680 --> 03:30:24,640
So the final topic that we'll take a look at just before we end class
4116
03:30:24,640 --> 03:30:28,560
today is one final type of logic different from propositional logic
4117
03:30:28,560 --> 03:30:32,080
known as first order logic, which is a little bit more powerful than
4118
03:30:32,080 --> 03:30:34,620
propositional logic and is going to make it easier for us
4119
03:30:34,620 --> 03:30:37,240
to express certain types of ideas.
4120
03:30:37,240 --> 03:30:39,800
In propositional logic, if we think back to that puzzle
4121
03:30:39,800 --> 03:30:43,680
with the people in the Hogwarts houses, we had a whole bunch of symbols.
4122
03:30:43,680 --> 03:30:46,200
And every symbol could only be true or false.
4123
03:30:46,200 --> 03:30:49,240
We had a symbol for Minerva Gryffindor, which was true if Minerva
4124
03:30:49,240 --> 03:30:51,840
was in Gryffindor and false otherwise, and likewise
4125
03:30:51,840 --> 03:30:55,120
for Minerva Hufflepuff and Minerva Ravenclaw and Minerva Slytherin
4126
03:30:55,120 --> 03:30:56,920
and so forth.
4127
03:30:56,920 --> 03:30:58,920
But this was starting to get quite redundant.
4128
03:30:58,920 --> 03:31:01,120
We wanted some way to be able to express that there
4129
03:31:01,120 --> 03:31:03,360
is a relationship between these propositional symbols,
4130
03:31:03,360 --> 03:31:05,720
that Minerva shows up in all of them.
4131
03:31:05,720 --> 03:31:09,360
And also, I would have liked to not have had so many different symbols
4132
03:31:09,360 --> 03:31:13,360
to represent what really was a fairly straightforward problem.
4133
03:31:13,360 --> 03:31:15,480
So first order logic will give us a different way
4134
03:31:15,480 --> 03:31:19,520
of trying to deal with this idea by giving us two different types of symbols.
4135
03:31:19,520 --> 03:31:23,040
We're going to have constant symbols that are going to represent objects
4136
03:31:23,040 --> 03:31:24,880
like people or houses.
4137
03:31:24,880 --> 03:31:29,640
And then predicate symbols, which you can think of as relations or functions
4138
03:31:29,640 --> 03:31:33,240
that take an input and evaluate to true or false, for example,
4139
03:31:33,240 --> 03:31:37,400
that tell us whether or not some property of some constant
4140
03:31:37,400 --> 03:31:41,120
or some pair of constants or multiple constants actually holds.
4141
03:31:41,120 --> 03:31:43,120
So we'll see an example of that in just a moment.
4142
03:31:43,120 --> 03:31:46,640
For now, in this same problem, our constant symbols
4143
03:31:46,640 --> 03:31:49,240
might be objects, things like people or houses.
4144
03:31:49,240 --> 03:31:53,440
So Minerva, Pomona, Horace, Gilderoy, those are all constant symbols,
4145
03:31:53,440 --> 03:31:58,040
as are my four houses, Gryffindor, Hufflepuff, Ravenclaw, and Slytherin.
4146
03:31:58,040 --> 03:32:00,360
Predicates, meanwhile, these predicate symbols
4147
03:32:00,360 --> 03:32:03,880
are going to be properties that might hold true or false
4148
03:32:03,880 --> 03:32:06,120
of these individual constants.
4149
03:32:06,120 --> 03:32:09,480
So person might hold true of Minerva, but it
4150
03:32:09,480 --> 03:32:12,320
would be false for Gryffindor because Gryffindor is not a person.
4151
03:32:12,320 --> 03:32:15,280
And house is going to hold true for Ravenclaw,
4152
03:32:15,280 --> 03:32:17,640
but it's not going to hold true for Horace, for example,
4153
03:32:17,640 --> 03:32:19,800
because Horace is a person.
4154
03:32:19,800 --> 03:32:23,320
And belongs to, meanwhile, is going to be some relation that
4155
03:32:23,320 --> 03:32:26,280
is going to relate people to their houses.
4156
03:32:26,280 --> 03:32:30,440
And it's going to only tell me when someone belongs to a house or does not.
4157
03:32:30,440 --> 03:32:35,080
So let's take a look at some examples of what a sentence in first order logic
4158
03:32:35,080 --> 03:32:36,480
might actually look like.
4159
03:32:36,480 --> 03:32:38,320
A sentence might look like something like this.
4160
03:32:38,320 --> 03:32:42,960
Person Minerva, with Minerva in parentheses, and person being a predicate
4161
03:32:42,960 --> 03:32:45,880
symbol, Minerva being a constant symbol.
4162
03:32:45,880 --> 03:32:48,600
This sentence in first order logic effectively
4163
03:32:48,600 --> 03:32:54,440
means Minerva is a person, or the person property applies to the Minerva object.
4164
03:32:54,440 --> 03:32:56,920
So if I want to say something like Minerva is a person,
4165
03:32:56,920 --> 03:33:00,800
here is how I express that idea using first order logic.
4166
03:33:00,800 --> 03:33:03,720
Meanwhile, I can say something like, house Gryffindor,
4167
03:33:03,720 --> 03:33:07,320
to likewise express the idea that Gryffindor is a house.
4168
03:33:07,320 --> 03:33:08,800
I can do that this way.
4169
03:33:08,800 --> 03:33:10,920
And all of the same logical connectives that we
4170
03:33:10,920 --> 03:33:13,920
saw in propositional logic, those are going to work here too.
4171
03:33:13,920 --> 03:33:16,760
And, or, implication, biconditional, not.
4172
03:33:16,760 --> 03:33:20,920
In fact, I can use not to say something like, not house Minerva.
4173
03:33:20,920 --> 03:33:24,240
And this sentence in first order logic means something like,
4174
03:33:24,240 --> 03:33:26,080
Minerva is not a house.
4175
03:33:26,080 --> 03:33:31,640
It is not true that the house property applies to Minerva.
4176
03:33:31,640 --> 03:33:34,080
Meanwhile, in addition to some of these predicate symbols
4177
03:33:34,080 --> 03:33:36,880
that just take a single argument, some of our predicate symbols
4178
03:33:36,880 --> 03:33:39,840
are going to express binary relations, relations
4179
03:33:39,840 --> 03:33:42,080
between two of its arguments.
4180
03:33:42,080 --> 03:33:46,600
So I could say something like, belongs to, and then two inputs, Minerva
4181
03:33:46,600 --> 03:33:51,920
and Gryffindor, to express the idea that Minerva belongs to Gryffindor.
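One way to sketch these first-order symbols in Python: constant symbols are plain strings, and each predicate symbol is a function from constants to True or False. The particular house assignment below is illustrative, not part of the puzzle:

```python
PEOPLE = {"Minerva", "Pomona", "Horace", "Gilderoy"}
HOUSES = {"Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"}
ASSIGNMENT = {"Minerva": "Gryffindor", "Pomona": "Hufflepuff",
              "Horace": "Slytherin", "Gilderoy": "Ravenclaw"}

def person(x):
    """Unary predicate: does the person property hold of x?"""
    return x in PEOPLE

def house(x):
    """Unary predicate: does the house property hold of x?"""
    return x in HOUSES

def belongs_to(x, y):
    """Binary predicate: does x belong to house y?"""
    return ASSIGNMENT.get(x) == y

print(person("Minerva"))                    # True: Minerva is a person
print(not house("Minerva"))                 # True: Minerva is not a house
print(belongs_to("Minerva", "Gryffindor"))  # True: Minerva belongs to Gryffindor
```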
4182
03:33:51,920 --> 03:33:54,920
And so now here's the key difference, or one of the key differences,
4183
03:33:54,920 --> 03:33:56,920
between this and propositional logic.
4184
03:33:56,920 --> 03:34:00,640
In propositional logic, I needed one symbol for Minerva Gryffindor,
4185
03:34:00,640 --> 03:34:02,960
and one symbol for Minerva Hufflepuff, and one
4186
03:34:02,960 --> 03:34:06,360
symbol for every other combination of a person and a house.
4187
03:34:06,360 --> 03:34:10,560
In this case, I just need one symbol for each of my people,
4188
03:34:10,560 --> 03:34:13,200
and one symbol for each of my houses.
4189
03:34:13,200 --> 03:34:16,920
And then I can express as a predicate something like, belongs to,
4190
03:34:16,920 --> 03:34:21,520
and say, belongs to Minerva Gryffindor, to express the idea that Minerva
4191
03:34:21,520 --> 03:34:23,440
belongs to Gryffindor House.
4192
03:34:23,440 --> 03:34:27,180
So already we can see that first order logic is quite expressive in being
4193
03:34:27,180 --> 03:34:32,480
able to express these sorts of sentences using the existing constant symbols
4194
03:34:32,480 --> 03:34:36,240
and predicates that already exist, while minimizing the number of new symbols
4195
03:34:36,240 --> 03:34:37,120
that I need to create.
4196
03:34:37,120 --> 03:34:40,920
I can just use eight symbols, four for people and four for houses,
4197
03:34:40,920 --> 03:34:46,080
instead of 16 symbols for every possible combination of each.
4198
03:34:46,080 --> 03:34:49,000
But first order logic gives us a couple of additional features
4199
03:34:49,000 --> 03:34:52,000
that we can use to express even more complex ideas.
4200
03:34:52,000 --> 03:34:56,160
And these additional features are generally known as quantifiers.
4201
03:34:56,160 --> 03:34:58,800
And there are two main quantifiers in first order logic,
4202
03:34:58,800 --> 03:35:01,640
the first of which is universal quantification.
4203
03:35:01,640 --> 03:35:04,800
Universal quantification lets me express an idea
4204
03:35:04,800 --> 03:35:09,040
like something is going to be true for all values of a variable.
4205
03:35:09,040 --> 03:35:13,560
Like for all values of x, some statement is going to hold true.
4206
03:35:13,560 --> 03:35:16,600
So what might a sentence in universal quantification look like?
4207
03:35:16,600 --> 03:35:21,080
Well, we're going to use this upside down a to mean for all.
4208
03:35:21,080 --> 03:35:26,680
So upside-down A x means for all values of x, where x is any object,
4209
03:35:26,680 --> 03:35:28,840
this is going to hold true.
4210
03:35:28,840 --> 03:35:36,800
Belongs to x Gryffindor implies not belongs to x Hufflepuff.
4211
03:35:36,800 --> 03:35:38,160
So let's try and parse this out.
4212
03:35:38,160 --> 03:35:42,440
This means that for all values of x, if this holds true,
4213
03:35:42,440 --> 03:35:46,880
if x belongs to Gryffindor, then this does not hold true.
4214
03:35:46,880 --> 03:35:50,160
x does not belong to Hufflepuff.
4215
03:35:50,160 --> 03:35:52,560
So translated into English, this sentence
4216
03:35:52,560 --> 03:35:57,280
is saying something like for all objects x, if x belongs to Gryffindor,
4217
03:35:57,280 --> 03:36:00,720
then x does not belong to Hufflepuff, for example.
4218
03:36:00,720 --> 03:36:03,720
Or a phrase even more simply, anyone in Gryffindor
4219
03:36:03,720 --> 03:36:07,920
is not in Hufflepuff, simplified way of saying the same thing.
4220
03:36:07,920 --> 03:36:10,560
So this universal quantification lets us express
4221
03:36:10,560 --> 03:36:14,240
an idea like something is going to hold true for all values
4222
03:36:14,240 --> 03:36:16,400
of a particular variable.
4223
03:36:16,400 --> 03:36:18,520
In addition to universal quantification though,
4224
03:36:18,520 --> 03:36:21,880
we also have existential quantification.
4225
03:36:21,880 --> 03:36:24,400
Whereas universal quantification said that something
4226
03:36:24,400 --> 03:36:27,320
is going to be true for all values of a variable,
4227
03:36:27,320 --> 03:36:30,680
existential quantification says that some expression is going
4228
03:36:30,680 --> 03:36:36,680
to be true for some value of a variable, at least one value of the variable.
4229
03:36:36,680 --> 03:36:40,560
So let's take a look at a sample sentence using existential quantification.
4230
03:36:40,560 --> 03:36:42,480
One such sentence looks like this.
4231
03:36:42,480 --> 03:36:43,680
There exists an x.
4232
03:36:43,680 --> 03:36:46,360
This backwards e stands for exists.
4233
03:36:46,360 --> 03:36:51,560
And here we're saying there exists an x such that house x and belongs
4234
03:36:51,560 --> 03:36:53,400
to Minerva x.
4235
03:36:53,400 --> 03:36:57,480
In other words, there exists some object x where x is a house
4236
03:36:57,480 --> 03:37:00,480
and Minerva belongs to x.
4237
03:37:00,480 --> 03:37:02,640
Or phrased a little more succinctly in English,
4238
03:37:02,640 --> 03:37:05,400
I'm here just saying Minerva belongs to a house.
4239
03:37:05,400 --> 03:37:10,280
There's some object that is a house and Minerva belongs to a house.
4240
03:37:10,280 --> 03:37:13,280
And combining this universal and existential quantification,
4241
03:37:13,280 --> 03:37:16,280
we can create far more sophisticated logical statements
4242
03:37:16,280 --> 03:37:19,320
than we were able to just using propositional logic.
4243
03:37:19,320 --> 03:37:21,840
I could combine these to say something like this.
4244
03:37:21,840 --> 03:37:26,000
For all x, person x implies there exists
4245
03:37:26,000 --> 03:37:30,920
a y such that house y and belongs to xy.
4246
03:37:30,920 --> 03:37:31,400
All right.
4247
03:37:31,400 --> 03:37:33,600
So a lot of stuff going on there, a lot of symbols.
4248
03:37:33,600 --> 03:37:36,320
Let's try and parse it out and just understand what it's saying.
4249
03:37:36,320 --> 03:37:41,560
Here we're saying that for all values of x, if x is a person,
4250
03:37:41,560 --> 03:37:43,080
then this is true.
4251
03:37:43,080 --> 03:37:45,680
So in other words, I'm saying for all people,
4252
03:37:45,680 --> 03:37:48,960
and we call that person x, this statement is going to be true.
4253
03:37:48,960 --> 03:37:50,800
What statement is true of all people?
4254
03:37:50,800 --> 03:37:55,760
Well, there exists a y that is a house, so there exists some house,
4255
03:37:55,760 --> 03:37:58,760
and x belongs to y.
4256
03:37:58,760 --> 03:38:01,560
In other words, I'm saying that for all people out there,
4257
03:38:01,560 --> 03:38:07,520
there exists some house such that x, the person, belongs to y, the house.
4258
03:38:07,520 --> 03:38:08,920
This is phrased more succinctly.
4259
03:38:08,920 --> 03:38:12,480
I'm saying that every person belongs to a house, that for all x,
4260
03:38:12,480 --> 03:38:17,200
if x is a person, then there exists a house that x belongs to.
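Over a finite domain like this one, both quantifiers can be sketched with Python's built-ins: "for all" becomes all(...) and "there exists" becomes any(...). The domain, predicates, and assignment below are illustrative stand-ins, and "p implies q" is written in its equivalent form "not p or q":

```python
PEOPLE = {"Minerva", "Pomona"}
HOUSES = {"Gryffindor", "Hufflepuff"}
DOMAIN = PEOPLE | HOUSES
ASSIGNMENT = {"Minerva": "Gryffindor", "Pomona": "Hufflepuff"}

def person(x):
    return x in PEOPLE

def house(x):
    return x in HOUSES

def belongs_to(x, y):
    return ASSIGNMENT.get(x) == y

# For all x: person(x) implies (there exists y: house(y) and belongs_to(x, y)).
everyone_housed = all(
    not person(x) or any(house(y) and belongs_to(x, y) for y in DOMAIN)
    for x in DOMAIN
)

# For all x: belongs_to(x, Gryffindor) implies not belongs_to(x, Hufflepuff).
no_double_house = all(
    not belongs_to(x, "Gryffindor") or not belongs_to(x, "Hufflepuff")
    for x in DOMAIN
)

print(everyone_housed, no_double_house)  # True True
```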
4261
03:38:17,200 --> 03:38:20,760
And so we can now express a lot more powerful ideas using this idea now
4262
03:38:20,760 --> 03:38:21,920
of first order logic.
4263
03:38:21,920 --> 03:38:24,480
And it turns out there are many other kinds of logic out there.
4264
03:38:24,480 --> 03:38:27,040
There's second order logic and other higher order logic,
4265
03:38:27,040 --> 03:38:30,720
each of which allows us to express more and more complex ideas.
4266
03:38:30,720 --> 03:38:33,160
But all of it, in this case, is really in pursuit
4267
03:38:33,160 --> 03:38:36,280
of the same goal, which is the representation of knowledge.
4268
03:38:36,280 --> 03:38:39,800
We want our AI agents to be able to know information,
4269
03:38:39,800 --> 03:38:41,880
to represent that information, whether that's
4270
03:38:41,880 --> 03:38:45,440
using propositional logic or first order logic or some other logic,
4271
03:38:45,440 --> 03:38:49,080
and then be able to reason based on that, to be able to draw conclusions,
4272
03:38:49,080 --> 03:38:50,840
make inferences, figure out whether there's
4273
03:38:50,840 --> 03:38:54,920
some sort of entailment relationship, by using some sort of inference
4274
03:38:54,920 --> 03:38:58,560
algorithm, something like inference by resolution or model checking
4275
03:38:58,560 --> 03:39:01,600
or any number of these other algorithms that we can use in order
4276
03:39:01,600 --> 03:39:06,200
to take information that we know and translate it to additional conclusions.
4277
03:39:06,200 --> 03:39:08,880
So all of this has helped us to create AI that
4278
03:39:08,880 --> 03:39:13,640
is able to represent information about what it knows and what it doesn't know.
4279
03:39:13,640 --> 03:39:16,560
Next time, though, we'll take a look at how we can make our AI even more
4280
03:39:16,560 --> 03:39:20,520
powerful by not just encoding information that we know for sure to be true
4281
03:39:20,520 --> 03:39:23,920
and not to be true, but also to take a look at uncertainty,
4282
03:39:23,920 --> 03:39:27,240
to look at what happens if AI thinks that something might be probable
4283
03:39:27,240 --> 03:39:31,520
or maybe not very probable or somewhere in between those two extremes,
4284
03:39:31,520 --> 03:39:34,760
all in the pursuit of trying to build our intelligent systems
4285
03:39:34,760 --> 03:39:36,880
to be even more intelligent.
4286
03:39:36,880 --> 03:39:39,320
We'll see you next time.
4287
03:39:39,320 --> 03:39:57,760
Thank you.
4288
03:39:57,760 --> 03:39:59,880
All right, welcome back, everyone, to an introduction
4289
03:39:59,880 --> 03:40:02,040
to artificial intelligence with Python.
4290
03:40:02,040 --> 03:40:05,720
And last time, we took a look at how it is that AI inside of our computers
4291
03:40:05,720 --> 03:40:07,040
can represent knowledge.
4292
03:40:07,040 --> 03:40:10,120
We represented that knowledge in the form of logical sentences
4293
03:40:10,120 --> 03:40:12,080
in a variety of different logical languages.
4294
03:40:12,080 --> 03:40:15,640
And the idea was we wanted our AI to be able to represent knowledge
4295
03:40:15,640 --> 03:40:19,080
or information and somehow use those pieces of information
4296
03:40:19,080 --> 03:40:22,200
to be able to derive new pieces of information by inference,
4297
03:40:22,200 --> 03:40:24,680
to be able to take some information and deduce
4298
03:40:24,680 --> 03:40:27,240
some additional conclusions based on the information
4299
03:40:27,240 --> 03:40:29,160
that it already knew for sure.
4300
03:40:29,160 --> 03:40:32,320
But in reality, when we think about computers and we think about AI,
4301
03:40:32,320 --> 03:40:35,920
very rarely are our machines going to be able to know things for sure.
4302
03:40:35,920 --> 03:40:38,440
Oftentimes, there's going to be some amount of uncertainty
4303
03:40:38,440 --> 03:40:41,480
in the information that our AIs or our computers are dealing with,
4304
03:40:41,480 --> 03:40:44,200
where it might believe something with some probability,
4305
03:40:44,200 --> 03:40:46,960
as we'll soon discuss what probability is all about and what it means,
4306
03:40:46,960 --> 03:40:48,840
but not entirely for certain.
4307
03:40:48,840 --> 03:40:51,920
And we want to use the information that it has some knowledge about,
4308
03:40:51,920 --> 03:40:53,720
even if it doesn't have perfect knowledge,
4309
03:40:53,720 --> 03:40:57,280
to still be able to make inferences, still be able to draw conclusions.
4310
03:40:57,280 --> 03:41:00,480
So you might imagine, for example, in the context of a robot that
4311
03:41:00,480 --> 03:41:02,920
has some sensors and is exploring some environment,
4312
03:41:02,920 --> 03:41:06,200
it might not know exactly where it is or exactly what's around it,
4313
03:41:06,200 --> 03:41:08,880
but it does have access to some data that can allow it
4314
03:41:08,880 --> 03:41:10,840
to draw inferences with some probability.
4315
03:41:10,840 --> 03:41:13,520
There's some likelihood that one thing is true or another.
4316
03:41:13,520 --> 03:41:15,960
Or you can imagine in context where there is a little bit more
4317
03:41:15,960 --> 03:41:18,840
randomness and uncertainty, something like predicting the weather,
4318
03:41:18,840 --> 03:41:21,640
where you might not be able to know for sure what tomorrow's weather is
4319
03:41:21,640 --> 03:41:26,000
with 100% certainty, but you can probably infer with some probability
4320
03:41:26,000 --> 03:41:29,440
what tomorrow's weather is going to be based on maybe today's weather
4321
03:41:29,440 --> 03:41:32,280
and yesterday's weather and other data that you might have access
4322
03:41:32,280 --> 03:41:33,600
to as well.
4323
03:41:33,600 --> 03:41:36,920
And so oftentimes, we can distill this in terms of just possible events
4324
03:41:36,920 --> 03:41:39,760
that might happen and what the likelihood of those events are.
4325
03:41:39,760 --> 03:41:43,040
This comes a lot in games, for example, where there is an element of chance
4326
03:41:43,040 --> 03:41:44,280
inside of those games.
4327
03:41:44,280 --> 03:41:45,760
So you imagine rolling a dice.
4328
03:41:45,760 --> 03:41:48,240
You're not sure exactly what the die roll is going to be,
4329
03:41:48,240 --> 03:41:52,160
but you know it's going to be one of these possibilities from 1 to 6,
4330
03:41:52,160 --> 03:41:53,520
for example.
4331
03:41:53,520 --> 03:41:56,760
And so here now, we introduce the idea of probability theory.
4332
03:41:56,760 --> 03:41:58,720
And what we'll take a look at today is beginning
4333
03:41:58,720 --> 03:42:01,840
by looking at the mathematical foundations of probability theory,
4334
03:42:01,840 --> 03:42:05,400
getting an understanding for some of the key concepts within probability,
4335
03:42:05,400 --> 03:42:08,680
and then diving into how we can use probability and the ideas
4336
03:42:08,680 --> 03:42:12,400
that we look at mathematically to represent some ideas in terms of models
4337
03:42:12,400 --> 03:42:15,960
that we can put into our computers in order to program an AI that
4338
03:42:15,960 --> 03:42:19,280
is able to use information about probability to draw inferences,
4339
03:42:19,280 --> 03:42:22,280
to make some judgments about the world with some probability
4340
03:42:22,280 --> 03:42:25,040
or likelihood of being true.
4341
03:42:25,040 --> 03:42:27,920
So probability ultimately boils down to this idea
4342
03:42:27,920 --> 03:42:30,880
that there are possible worlds that we're here representing
4343
03:42:30,880 --> 03:42:32,920
using this little Greek letter omega.
4344
03:42:32,920 --> 03:42:36,400
And the idea of a possible world is that when I roll a die,
4345
03:42:36,400 --> 03:42:38,920
there are six possible worlds that could result from it.
4346
03:42:38,920 --> 03:42:42,840
I could roll a 1, or a 2, or a 3, or a 4, or a 5, or a 6.
4347
03:42:42,840 --> 03:42:45,040
And each of those are a possible world.
4348
03:42:45,040 --> 03:42:49,000
And each of those possible worlds has some probability of being true,
4349
03:42:49,000 --> 03:42:53,400
the probability that I do roll a 1, or a 2, or a 3, or something else.
4350
03:42:53,400 --> 03:42:57,040
And we represent that probability like this, using the capital letter P.
4351
03:42:57,040 --> 03:43:00,560
And then in parentheses, what it is that we want the probability of.
4352
03:43:00,560 --> 03:43:04,240
So this right here would be the probability of some possible world
4353
03:43:04,240 --> 03:43:07,040
as represented by the little letter omega.
4354
03:43:07,040 --> 03:43:09,760
Now, there are a couple of basic axioms of probability
4355
03:43:09,760 --> 03:43:13,000
that become relevant as we consider how we deal with probability
4356
03:43:13,000 --> 03:43:14,200
and how we think about it.
4357
03:43:14,200 --> 03:43:16,960
First and foremost, every probability value
4358
03:43:16,960 --> 03:43:20,160
must range between 0 and 1 inclusive.
4359
03:43:20,160 --> 03:43:23,920
So the smallest value any probability can have is the number 0,
4360
03:43:23,920 --> 03:43:25,800
which is an impossible event.
4361
03:43:25,800 --> 03:43:28,960
Something like rolling a die and having it come up as a 7.
4362
03:43:28,960 --> 03:43:33,000
If the die only has numbers 1 through 6, the event that I roll a 7
4363
03:43:33,000 --> 03:43:36,240
is impossible, so it would have probability 0.
4364
03:43:36,240 --> 03:43:38,320
And on the other end of the spectrum, probability
4365
03:43:38,320 --> 03:43:40,920
can range all the way up to the positive number 1,
4366
03:43:40,920 --> 03:43:43,800
meaning an event is certain to happen, that I roll a die
4367
03:43:43,800 --> 03:43:46,200
and the number is less than 10, for example.
4368
03:43:46,200 --> 03:43:49,560
That is an event that is guaranteed to happen if the only sides on my die
4369
03:43:49,560 --> 03:43:51,800
are 1 through 6, for instance.
4370
03:43:51,800 --> 03:43:55,240
And then they can range through any real number in between these two values,
4371
03:43:55,240 --> 03:43:58,240
where, generally speaking, a higher value for the probability
4372
03:43:58,240 --> 03:44:00,560
means an event is more likely to take place,
4373
03:44:00,560 --> 03:44:03,600
and a lower value for the probability means the event is less
4374
03:44:03,600 --> 03:44:05,680
likely to take place.
4375
03:44:05,680 --> 03:44:08,920
And the other key rule for probability looks a little bit like this.
4376
03:44:08,920 --> 03:44:11,840
This sigma notation, if you haven't seen it before,
4377
03:44:11,840 --> 03:44:13,920
refers to summation, the idea that we're going
4378
03:44:13,920 --> 03:44:16,160
to be adding up a whole sequence of values.
4379
03:44:16,160 --> 03:44:19,160
And this sigma notation is going to come up a couple of times today,
4380
03:44:19,160 --> 03:44:21,480
because as we deal with probability, oftentimes we're
4381
03:44:21,480 --> 03:44:25,120
adding up a whole bunch of individual values or individual probabilities
4382
03:44:25,120 --> 03:44:26,240
to get some other value.
4383
03:44:26,240 --> 03:44:28,200
So we'll see this come up a couple of times.
4384
03:44:28,200 --> 03:44:31,120
But what this notation means is that if I sum up
4385
03:44:31,120 --> 03:44:35,600
all of the possible worlds omega that are in big omega, which
4386
03:44:35,600 --> 03:44:38,280
represents the set of all the possible worlds,
4387
03:44:38,280 --> 03:44:42,120
meaning I take for all of the worlds in the set of possible worlds
4388
03:44:42,120 --> 03:44:47,000
and add up all of their probabilities, what I ultimately get is the number 1.
4389
03:44:47,000 --> 03:44:48,880
So if I take all the possible worlds, add up
4390
03:44:48,880 --> 03:44:52,280
what each of their probabilities is, I should get the number 1 at the end,
4391
03:44:52,280 --> 03:44:55,220
meaning all probabilities just need to sum to 1.
4392
03:44:55,220 --> 03:44:57,640
So take dice, for example,
4393
03:44:57,640 --> 03:45:00,400
and if you imagine I have a fair die with numbers 1 through 6
4394
03:45:00,400 --> 03:45:02,480
and I roll the die, each one of these rolls
4395
03:45:02,480 --> 03:45:04,800
has an equal probability of taking place.
4396
03:45:04,800 --> 03:45:07,960
And the probability is 1 over 6, for example.
4397
03:45:07,960 --> 03:45:12,160
So each of these probabilities is between 0 and 1, 0 meaning impossible
4398
03:45:12,160 --> 03:45:13,600
and 1 meaning for certain.
4399
03:45:13,600 --> 03:45:15,640
And if you add up all of these probabilities
4400
03:45:15,640 --> 03:45:18,960
for all of the possible worlds, you get the number 1.
4401
03:45:18,960 --> 03:45:22,200
And we can represent any one of those probabilities like this.
4402
03:45:22,200 --> 03:45:25,640
The probability that we roll the number 2, for example,
4403
03:45:25,640 --> 03:45:27,440
is just 1 over 6.
4404
03:45:27,440 --> 03:45:31,680
For every six times we roll the die, we'd expect that about one time,
4405
03:45:31,680 --> 03:45:33,280
the die might come up as a 2.
4406
03:45:33,280 --> 03:45:36,520
Its probability is not certain, but it's a little more than nothing,
4407
03:45:36,520 --> 03:45:38,120
for instance.
4408
03:45:38,120 --> 03:45:40,920
And so this is all fairly straightforward for just a single die.
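Both axioms can be checked directly in Python. This is a small sketch, not course code; the variable name `die` is just illustrative:

```python
from fractions import Fraction

# Probability distribution for a fair six-sided die, where each of
# the six possible worlds is equally likely.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Axiom 1: every probability value ranges between 0 and 1 inclusive.
assert all(0 <= p <= 1 for p in die.values())

# Axiom 2: summing the probabilities of all possible worlds gives 1.
assert sum(die.values()) == 1

# An impossible event, like rolling a 7, has probability 0.
assert die.get(7, Fraction(0)) == 0
```

Using exact fractions rather than floats keeps the sum-to-1 check free of rounding error.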
4409
03:45:40,920 --> 03:45:43,260
But things get more interesting as our models of the world
4410
03:45:43,260 --> 03:45:44,840
get a little bit more complex.
4411
03:45:44,840 --> 03:45:47,520
Let's imagine now that we're not just dealing with a single die,
4412
03:45:47,520 --> 03:45:49,720
but we have two dice, for example.
4413
03:45:49,720 --> 03:45:51,880
I have a red die here and a blue die there,
4414
03:45:51,880 --> 03:45:54,920
and I care not just about what the individual roll is,
4415
03:45:54,920 --> 03:45:56,880
but I care about the sum of the two rolls.
4416
03:45:56,880 --> 03:46:00,280
In this case, the sum of the two rolls is the number 3.
4417
03:46:00,280 --> 03:46:04,160
How do I begin to now reason about what does the probability look like
4418
03:46:04,160 --> 03:46:07,560
if instead of having one die, I now have two dice?
4419
03:46:07,560 --> 03:46:09,920
Well, what we might imagine is that we could first consider
4420
03:46:09,920 --> 03:46:12,480
what are all of the possible worlds.
4421
03:46:12,480 --> 03:46:14,480
And in this case, all of the possible worlds
4422
03:46:14,480 --> 03:46:18,120
are just every combination of the red and blue die that I could come up with.
4423
03:46:18,120 --> 03:46:22,640
For the red die, it could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6.
4424
03:46:22,640 --> 03:46:25,320
And for each of those possibilities, the blue die, likewise,
4425
03:46:25,320 --> 03:46:30,320
could also be either 1 or 2 or 3 or 4 or 5 or 6.
4426
03:46:30,320 --> 03:46:33,000
And it just so happens that in this particular case,
4427
03:46:33,000 --> 03:46:36,200
each of these possible combinations is equally likely.
4428
03:46:36,200 --> 03:46:39,400
Equally likely are all of these various different possible worlds.
4429
03:46:39,400 --> 03:46:41,080
That's not always going to be the case.
4430
03:46:41,080 --> 03:46:44,160
If you imagine more complex models that we could try to build and things
4431
03:46:44,160 --> 03:46:46,400
that we could try to represent in the real world,
4432
03:46:46,400 --> 03:46:49,560
it's probably not going to be the case that every single possible world is
4433
03:46:49,560 --> 03:46:50,920
always equally likely.
4434
03:46:50,920 --> 03:46:53,600
But in the case of fair dice, where in any given die roll,
4435
03:46:53,600 --> 03:46:57,080
any one number has just as good a chance of coming up as any other number,
4436
03:46:57,080 --> 03:47:01,360
we can consider all of these possible worlds to be equally likely.
4437
03:47:01,360 --> 03:47:04,120
But even though all of the possible worlds are equally likely,
4438
03:47:04,120 --> 03:47:07,320
that doesn't necessarily mean that their sums are equally likely.
4439
03:47:07,320 --> 03:47:10,320
So if we consider what the sum is of all of these two, so 1 plus 1,
4440
03:47:10,320 --> 03:47:11,240
that's a 2.
4441
03:47:11,240 --> 03:47:12,600
2 plus 1 is a 3.
4442
03:47:12,600 --> 03:47:15,320
And if we consider, for each of these possible pairs of numbers,
4443
03:47:15,320 --> 03:47:18,720
what their sum ultimately is, we can notice that there are some patterns
4444
03:47:18,720 --> 03:47:22,000
here, where it's not entirely the case that every number comes up
4445
03:47:22,000 --> 03:47:23,240
equally likely.
4446
03:47:23,240 --> 03:47:26,880
If you consider 7, for example, what's the probability that when I roll two
4447
03:47:26,880 --> 03:47:28,720
dice, their sum is 7?
4448
03:47:28,720 --> 03:47:30,280
There are several ways this can happen.
4449
03:47:30,280 --> 03:47:33,080
There are six possible worlds where the sum is 7.
4450
03:47:33,080 --> 03:47:37,480
It could be a 1 and a 6, or a 2 and a 5, or a 3 and a 4, or a 4 and a 3,
4451
03:47:37,480 --> 03:47:39,040
and so forth.
4452
03:47:39,040 --> 03:47:42,720
But if you instead consider what's the probability that I roll two dice,
4453
03:47:42,720 --> 03:47:45,920
and the sum of those two die rolls is 12, for example,
4454
03:47:45,920 --> 03:47:49,880
we're looking at this diagram, there's only one possible world in which that
4455
03:47:49,880 --> 03:47:50,400
can happen.
4456
03:47:50,400 --> 03:47:54,200
And that's the possible world where both the red die and the blue die
4457
03:47:54,200 --> 03:47:58,400
come up as sixes to give us a sum total of 12.
4458
03:47:58,400 --> 03:48:00,520
So based on just taking a look at this diagram,
4459
03:48:00,520 --> 03:48:03,000
we see that some of these probabilities are likely different.
4460
03:48:03,000 --> 03:48:07,200
The probability that the sum is a 7 must be greater than the probability
4461
03:48:07,200 --> 03:48:08,440
that the sum is a 12.
4462
03:48:08,440 --> 03:48:11,680
And we can represent that even more formally by saying, OK, the probability
4463
03:48:11,680 --> 03:48:15,320
that we sum to 12 is 1 out of 36.
4464
03:48:15,320 --> 03:48:18,680
Out of the 36 equally likely possible worlds,
4465
03:48:18,680 --> 03:48:22,040
6 squared, because we have six options for the red die and six
4466
03:48:22,040 --> 03:48:24,960
options for the blue die, out of those 36 options,
4467
03:48:24,960 --> 03:48:27,840
only one of them sums to 12.
4468
03:48:27,840 --> 03:48:29,600
Whereas on the other hand, the probability
4469
03:48:29,600 --> 03:48:33,360
that if we take two dice rolls and they sum up to the number 7, well,
4470
03:48:33,360 --> 03:48:37,840
out of those 36 possible worlds, there were six worlds where the sum was 7.
4471
03:48:37,840 --> 03:48:42,280
And so we get 6 over 36, which we can simplify as a fraction to just 1
4472
03:48:42,280 --> 03:48:43,720
over 6.
4473
03:48:43,720 --> 03:48:46,360
So here now, we're able to represent these different ideas
4474
03:48:46,360 --> 03:48:49,400
of probability, representing some events that might be more likely
4475
03:48:49,400 --> 03:48:52,720
and then other events that are less likely as well.
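This counting argument can be sketched by enumerating all 36 possible worlds in Python; the helper name `p_sum` is illustrative, not from the course:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely possible worlds for a red die and a blue die.
worlds = list(product(range(1, 7), repeat=2))
assert len(worlds) == 36

def p_sum(target):
    """Unconditional probability that the two dice sum to `target`."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6: six of the 36 worlds sum to 7
print(p_sum(12))  # 1/36: only red = 6, blue = 6 sums to 12
```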
4476
03:48:52,720 --> 03:48:55,840
And these sorts of judgments, where we're figuring out just in the abstract
4477
03:48:55,840 --> 03:48:58,760
what is the probability that this thing takes place,
4478
03:48:58,760 --> 03:49:01,680
are generally known as unconditional probabilities.
4479
03:49:01,680 --> 03:49:04,000
Some degree of belief we have in some proposition,
4480
03:49:04,000 --> 03:49:07,840
some fact about the world, in the absence of any other evidence.
4481
03:49:07,840 --> 03:49:10,600
Without knowing any additional information, if I roll a die,
4482
03:49:10,600 --> 03:49:12,240
what's the chance it comes up as a 2?
4483
03:49:12,240 --> 03:49:15,240
Or if I roll two dice, what's the chance that the sum of those two die
4484
03:49:15,240 --> 03:49:17,080
rolls is a 7?
4485
03:49:17,080 --> 03:49:20,080
But usually when we're thinking about probability, especially when
4486
03:49:20,080 --> 03:49:22,400
we're thinking about training an AI to intelligently
4487
03:49:22,400 --> 03:49:24,320
be able to know something about the world
4488
03:49:24,320 --> 03:49:26,600
and make predictions based on that information,
4489
03:49:26,600 --> 03:49:30,120
it's not unconditional probability that our AI is dealing with,
4490
03:49:30,120 --> 03:49:32,680
but rather conditional probability, probability
4491
03:49:32,680 --> 03:49:35,360
where rather than having no original knowledge,
4492
03:49:35,360 --> 03:49:37,600
we have some initial knowledge about the world
4493
03:49:37,600 --> 03:49:39,320
and how the world actually works.
4494
03:49:39,320 --> 03:49:43,120
So conditional probability is the degree of belief in a proposition
4495
03:49:43,120 --> 03:49:47,840
given some evidence that has already been revealed to us.
4496
03:49:47,840 --> 03:49:49,000
So what does this look like?
4497
03:49:49,000 --> 03:49:51,720
Well, it looks like this in terms of notation.
4498
03:49:51,720 --> 03:49:56,240
We're going to represent conditional probability as probability of A
4499
03:49:56,240 --> 03:49:59,920
and then this vertical bar and then B. And the way to read this
4500
03:49:59,920 --> 03:50:02,720
is the thing on the left-hand side of the vertical bar
4501
03:50:02,720 --> 03:50:05,000
is what we want the probability of.
4502
03:50:05,000 --> 03:50:08,200
Here now, I want the probability that A is true,
4503
03:50:08,200 --> 03:50:12,000
that it is the real world, that it is the event that actually does take place.
4504
03:50:12,000 --> 03:50:14,920
And then on the right side of the vertical bar is our evidence,
4505
03:50:14,920 --> 03:50:18,520
the information that we already know for certain about the world.
4506
03:50:18,520 --> 03:50:21,200
For example, that B is true.
4507
03:50:21,200 --> 03:50:23,080
So the way to read this entire expression
4508
03:50:23,080 --> 03:50:28,480
is what is the probability of A given B, the probability that A is true,
4509
03:50:28,480 --> 03:50:31,480
given that we already know that B is true.
4510
03:50:31,480 --> 03:50:34,120
And this type of judgment, conditional probability,
4511
03:50:34,120 --> 03:50:37,160
the probability of one thing given some other fact,
4512
03:50:37,160 --> 03:50:40,200
comes up quite a lot when we think about the types of calculations
4513
03:50:40,200 --> 03:50:42,240
we might want our AI to be able to do.
4514
03:50:42,240 --> 03:50:45,640
For example, we might care about the probability of rain today
4515
03:50:45,640 --> 03:50:47,720
given that we know that it rained yesterday.
4516
03:50:47,720 --> 03:50:51,000
We could think about the probability of rain today just in the abstract.
4517
03:50:51,000 --> 03:50:52,960
What is the chance that today it rains?
4518
03:50:52,960 --> 03:50:54,960
But usually, we have some additional evidence.
4519
03:50:54,960 --> 03:50:57,520
I know for certain that it rained yesterday.
4520
03:50:57,520 --> 03:51:00,920
And so I would like to calculate the probability that it rains today
4521
03:51:00,920 --> 03:51:03,240
given that I know that it rained yesterday.
4522
03:51:03,240 --> 03:51:06,200
Or you might imagine that I want to know the probability that my optimal
4523
03:51:06,200 --> 03:51:09,920
route to my destination changes given the current traffic conditions.
4524
03:51:09,920 --> 03:51:12,120
So whether or not traffic conditions change,
4525
03:51:12,120 --> 03:51:16,200
that might change the probability that this route is actually the optimal route.
4526
03:51:16,200 --> 03:51:18,160
Or you might imagine in a medical context,
4527
03:51:18,160 --> 03:51:22,480
I want to know the probability that a patient has a particular disease given
4528
03:51:22,480 --> 03:51:25,600
some results of some tests that have been performed on that patient.
4529
03:51:25,600 --> 03:51:28,440
And I have some evidence, the results of that test,
4530
03:51:28,440 --> 03:51:31,760
and I would like to know the probability that a patient has
4531
03:51:31,760 --> 03:51:33,080
a particular disease.
4532
03:51:33,080 --> 03:51:35,840
So this notion of conditional probability comes up everywhere.
4533
03:51:35,840 --> 03:51:38,320
So we begin to think about what we would like to reason about,
4534
03:51:38,320 --> 03:51:40,800
but being able to reason a little more intelligently
4535
03:51:40,800 --> 03:51:43,760
by taking into account evidence that we already have.
4536
03:51:43,760 --> 03:51:46,920
We're more able to get an accurate result for what is the likelihood
4537
03:51:46,920 --> 03:51:50,960
that someone has this disease if we know this evidence, the results of the test,
4538
03:51:50,960 --> 03:51:55,240
as opposed to if we were just calculating the unconditional probability of saying,
4539
03:51:55,240 --> 03:51:58,600
what is the probability they have the disease without any evidence
4540
03:51:58,600 --> 03:52:03,360
to try and back up our result one way or the other.
4541
03:52:03,360 --> 03:52:06,400
So now that we've got this idea of what conditional probability is,
4542
03:52:06,400 --> 03:52:08,200
the next question we have to ask is, all right,
4543
03:52:08,200 --> 03:52:10,200
how do we calculate conditional probability?
4544
03:52:10,200 --> 03:52:13,880
How do we figure out mathematically, if I have an expression like this,
4545
03:52:13,880 --> 03:52:15,240
how do I get a number from that?
4546
03:52:15,240 --> 03:52:17,560
What does conditional probability actually mean?
4547
03:52:17,560 --> 03:52:19,560
Well, the formula for conditional probability
4548
03:52:19,560 --> 03:52:21,120
looks a little something like this.
4549
03:52:21,120 --> 03:52:25,640
The probability of a given b, the probability that a is true,
4550
03:52:25,640 --> 03:52:29,320
given that we know that b is true, is equal to this fraction,
4551
03:52:29,320 --> 03:52:34,520
the probability that a and b are true, divided by just the probability
4552
03:52:34,520 --> 03:52:35,520
that b is true.
4553
03:52:35,520 --> 03:52:37,800
And the way to intuitively try to think about this
4554
03:52:37,800 --> 03:52:40,960
is that if I want to know the probability that a is true, given
4555
03:52:40,960 --> 03:52:46,000
that b is true, well, I want to consider all the ways they could both be true,
4556
03:52:46,000 --> 03:52:50,040
given that the only worlds I care about are the worlds where b is already true.
4557
03:52:50,040 --> 03:52:52,840
I can sort of ignore all the cases where b isn't true,
4558
03:52:52,840 --> 03:52:55,640
because those aren't relevant to my ultimate computation.
4559
03:52:55,640 --> 03:52:59,720
They're not relevant to what it is that I want to get information about.
4560
03:52:59,720 --> 03:53:01,220
So let's take a look at an example.
4561
03:53:01,220 --> 03:53:04,160
Let's go back to that example of rolling two dice and the idea
4562
03:53:04,160 --> 03:53:06,920
that those two dice might sum up to the number 12.
4563
03:53:06,920 --> 03:53:09,680
We discussed earlier that the unconditional probability
4564
03:53:09,680 --> 03:53:13,160
that if I roll two dice and they sum to 12 is 1 out of 36,
4565
03:53:13,160 --> 03:53:16,280
because out of the 36 possible worlds that I might care about,
4566
03:53:16,280 --> 03:53:19,280
in only one of them is the sum of those two dice 12.
4567
03:53:19,280 --> 03:53:22,880
It's only when red is 6 and blue is also 6.
4568
03:53:22,880 --> 03:53:25,400
But let's say now that I have some additional information.
4569
03:53:25,400 --> 03:53:29,400
I now want to know what is the probability that the two dice sum to 12,
4570
03:53:29,400 --> 03:53:33,720
given that I know that the red die was a 6.
4571
03:53:33,720 --> 03:53:35,320
So I already have some evidence.
4572
03:53:35,320 --> 03:53:36,960
I already know the red die is a 6.
4573
03:53:36,960 --> 03:53:38,320
I don't know what the blue die is.
4574
03:53:38,320 --> 03:53:41,200
That information isn't given to me in this expression.
4575
03:53:41,200 --> 03:53:44,080
But given the fact that I know that the red die rolled a 6,
4576
03:53:44,080 --> 03:53:47,080
what is the probability that we sum to 12?
4577
03:53:47,080 --> 03:53:50,040
And so we can begin to do the math using that expression from before.
4578
03:53:50,040 --> 03:53:52,440
Here, again, are all of the possibilities,
4579
03:53:52,440 --> 03:53:55,800
all of the possible combinations of red die being 1 through 6
4580
03:53:55,800 --> 03:53:58,600
and blue die being 1 through 6.
4581
03:53:58,600 --> 03:54:00,320
And I might consider first, all right, what
4582
03:54:00,320 --> 03:54:04,320
is the probability of my evidence, my B variable, where I want to know,
4583
03:54:04,320 --> 03:54:07,400
what is the probability that the red die is a 6?
4584
03:54:07,400 --> 03:54:11,200
Well, the probability that the red die is a 6 is just 1 out of 6.
4585
03:54:11,200 --> 03:54:14,800
So these six worlds, where the red die is a 6, are really the only ones
4586
03:54:14,800 --> 03:54:16,200
that I care about here now.
4587
03:54:16,200 --> 03:54:19,320
All the rest of them are irrelevant to my calculation,
4588
03:54:19,320 --> 03:54:22,200
because I already have this evidence that the red die was a 6,
4589
03:54:22,200 --> 03:54:26,280
so I don't need to care about all of the other possibilities that could result.
4590
03:54:26,280 --> 03:54:29,760
So now, in addition to the fact that the red die rolled as a 6
4591
03:54:29,760 --> 03:54:32,280
and the probability of that, the other piece of information
4592
03:54:32,280 --> 03:54:35,560
I need to know in order to calculate this conditional probability
4593
03:54:35,560 --> 03:54:39,480
is the probability that both of my variables, A and B, are true.
4594
03:54:39,480 --> 03:54:44,360
The probability that the red die is a 6 and that the two dice sum to 12.
4595
03:54:44,360 --> 03:54:47,120
So what is the probability that both of these things happen?
4596
03:54:47,120 --> 03:54:51,640
Well, it only happens in one possible case in 1 out of these 36 cases,
4597
03:54:51,640 --> 03:54:55,520
and it's the case where both the red and the blue die are equal to 6.
4598
03:54:55,520 --> 03:54:57,800
This is a piece of information that we already knew.
4599
03:54:57,800 --> 03:55:01,880
And so this probability is equal to 1 over 36.
4600
03:55:01,880 --> 03:55:05,680
And so to get the conditional probability that the sum is 12,
4601
03:55:05,680 --> 03:55:08,560
given that I know that the red die is equal to 6,
4602
03:55:08,560 --> 03:55:10,640
well, I just divide these two values together,
4603
03:55:10,640 --> 03:55:16,600
and 1 over 36 divided by 1 over 6 gives us this probability of 1 over 6.
4604
03:55:16,600 --> 03:55:19,960
Given that I know that the red die rolled a value of 6,
4605
03:55:19,960 --> 03:55:25,320
the probability that the sum of the two dice is 12 is also 1 over 6.
4606
03:55:25,320 --> 03:55:27,480
And that probably makes intuitive sense to you, too,
4607
03:55:27,480 --> 03:55:30,880
because if the red die is a 6, the only way for me to get to a 12
4608
03:55:30,880 --> 03:55:33,240
is if the blue die also rolls a 6, and we
4609
03:55:33,240 --> 03:55:37,040
know that the probability of the blue die rolling a 6 is 1 over 6.
4610
03:55:37,040 --> 03:55:40,680
So in this case, the conditional probability seems fairly straightforward.
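The same calculation can be sketched in Python by enumerating the possible worlds and applying the formula P(A | B) = P(A and B) / P(B); the helper names here are illustrative, not from the course:

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely possible worlds for two dice (red, blue).
worlds = list(product(range(1, 7), repeat=2))

def p(event):
    """Probability that `event` is true across all possible worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def sum_is_12(w):   # A: the two dice sum to 12
    return w[0] + w[1] == 12

def red_is_6(w):    # B: the red die came up as a 6
    return w[0] == 6

# P(A | B) = P(A and B) / P(B) = (1/36) / (1/6)
p_a_given_b = p(lambda w: sum_is_12(w) and red_is_6(w)) / p(red_is_6)
print(p_a_given_b)  # 1/6
```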
4611
03:55:40,680 --> 03:55:44,040
But this idea of calculating a conditional probability
4612
03:55:44,040 --> 03:55:47,880
by looking at the probability that both of these events take place
4613
03:55:47,880 --> 03:55:49,920
is an idea that's going to come up again and again.
4614
03:55:49,920 --> 03:55:52,880
This is the definition now of conditional probability.
4615
03:55:52,880 --> 03:55:54,800
And we're going to use that definition as we
4616
03:55:54,800 --> 03:55:56,960
think about probability more generally to be
4617
03:55:56,960 --> 03:55:59,120
able to draw conclusions about the world.
4618
03:55:59,120 --> 03:56:00,760
This, again, is that formula.
4619
03:56:00,760 --> 03:56:04,440
The probability of A given B is equal to the probability
4620
03:56:04,440 --> 03:56:08,840
that A and B take place divided by the probability of B.
4621
03:56:08,840 --> 03:56:11,880
And you'll see this formula sometimes written in a couple of different ways.
4622
03:56:11,880 --> 03:56:15,520
You could imagine algebraically multiplying both sides of this equation
4623
03:56:15,520 --> 03:56:18,720
by probability of B to get rid of the fraction,
4624
03:56:18,720 --> 03:56:20,320
and you'll get an expression like this.
4625
03:56:20,320 --> 03:56:24,520
The probability of A and B, which is this expression over here,
4626
03:56:24,520 --> 03:56:28,520
is just the probability of B times the probability of A given B.
4627
03:56:28,520 --> 03:56:31,840
Or you could represent this equivalently since A and B in this expression
4628
03:56:31,840 --> 03:56:32,840
are interchangeable.
4629
03:56:32,840 --> 03:56:36,440
A and B is the same thing as B and A. You could imagine also
4630
03:56:36,440 --> 03:56:41,040
representing the probability of A and B as the probability of A
4631
03:56:41,040 --> 03:56:45,080
times the probability of B given A, just switching all of the A's and B's.
4632
03:56:45,080 --> 03:56:47,280
These three are all equivalent ways of trying
4633
03:56:47,280 --> 03:56:49,760
to represent what joint probability means.
4634
03:56:49,760 --> 03:56:52,120
And so you'll sometimes see all of these equations,
4635
03:56:52,120 --> 03:56:55,680
and they might be useful to you as you begin to reason about probability
4636
03:56:55,680 --> 03:57:00,080
and to think about what values might be taking place in the real world.
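The three equivalent forms of the joint probability can be verified numerically with the same dice example; this is again a sketch with illustrative names, not course code:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # 36 two-dice worlds

def p(event):
    """Probability of an event over all equally likely worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def p_given(a, b):
    """Conditional probability: P(a | b) = P(a and b) / P(b)."""
    return p(lambda w: a(w) and b(w)) / p(b)

a = lambda w: w[0] + w[1] == 12  # the dice sum to 12
b = lambda w: w[0] == 6          # the red die is a 6

joint = p(lambda w: a(w) and b(w))
assert joint == p(b) * p_given(a, b)  # P(A and B) = P(B) * P(A | B)
assert joint == p(a) * p_given(b, a)  # P(A and B) = P(A) * P(B | A)
print(joint)  # 1/36
```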
4637
03:57:00,080 --> 03:57:02,120
Now, sometimes when we deal with probability,
4638
03:57:02,120 --> 03:57:05,320
we don't just care about a Boolean event like did this happen
4639
03:57:05,320 --> 03:57:06,720
or did this not happen.
4640
03:57:06,720 --> 03:57:10,160
Sometimes we might want the ability to represent variable values
4641
03:57:10,160 --> 03:57:13,400
in a probability space where some variable might take
4642
03:57:13,400 --> 03:57:16,080
on multiple different possible values.
4643
03:57:16,080 --> 03:57:19,440
And in probability, we call a variable in probability theory
4644
03:57:19,440 --> 03:57:21,040
a random variable.
4645
03:57:21,040 --> 03:57:25,440
A random variable in probability is just some variable in probability theory
4646
03:57:25,440 --> 03:57:28,800
that has some domain of values that it can take on.
4647
03:57:28,800 --> 03:57:29,920
So what do I mean by this?
4648
03:57:29,920 --> 03:57:32,640
Well, what I mean is I might have a random variable that is just
4649
03:57:32,640 --> 03:57:36,120
called roll, for example, that has six possible values.
4650
03:57:36,120 --> 03:57:39,720
Roll is my variable, and the possible values, the domain of values
4651
03:57:39,720 --> 03:57:43,160
that it can take on are 1, 2, 3, 4, 5, and 6.
4652
03:57:43,160 --> 03:57:45,520
And I might like to know the probability of each.
4653
03:57:45,520 --> 03:57:47,440
In this case, they happen to all be the same.
4654
03:57:47,440 --> 03:57:50,360
But in other random variables, that might not be the case.
4655
03:57:50,360 --> 03:57:52,160
For example, I might have a random variable
4656
03:57:52,160 --> 03:57:55,200
to represent the weather, for example, where the domain of values
4657
03:57:55,200 --> 03:57:59,680
it could take on are things like sun or cloudy or rainy or windy or snowy.
4658
03:57:59,680 --> 03:58:02,120
And each of those might have a different probability.
4659
03:58:02,120 --> 03:58:05,560
And I care about knowing what is the probability that the weather equals
4660
03:58:05,560 --> 03:58:08,600
sun or that the weather equals clouds, for instance.
4661
03:58:08,600 --> 03:58:11,080
And I might like to do some mathematical calculations
4662
03:58:11,080 --> 03:58:12,760
based on that information.
4663
03:58:12,760 --> 03:58:15,320
Other random variables might be something like traffic.
4664
03:58:15,320 --> 03:58:18,840
What are the odds that there is no traffic or light traffic or heavy traffic?
4665
03:58:18,840 --> 03:58:21,200
Traffic, in this case, is my random variable.
4666
03:58:21,200 --> 03:58:24,560
And the values that that random variable can take on are here.
4667
03:58:24,560 --> 03:58:26,760
It's either none or light or heavy.
4668
03:58:26,760 --> 03:58:28,640
And I, the person doing these calculations,
4669
03:58:28,640 --> 03:58:32,280
I, the person encoding these random variables into my computer,
4670
03:58:32,280 --> 03:58:36,600
need to make the decision as to what these possible values actually are.
4671
03:58:36,600 --> 03:58:38,880
You might imagine, for example, for a flight.
4672
03:58:38,880 --> 03:58:41,320
If I care about whether or not my flight is on time,
4673
03:58:41,320 --> 03:58:43,880
my flight has a couple of possible values that it could take on.
4674
03:58:43,880 --> 03:58:45,280
My flight could be on time.
4675
03:58:45,280 --> 03:58:46,520
My flight could be delayed.
4676
03:58:46,520 --> 03:58:47,800
My flight could be canceled.
4677
03:58:47,800 --> 03:58:51,480
So flight, in this case, is my random variable.
4678
03:58:51,480 --> 03:58:54,120
And these are the values that it can take on.
4679
03:58:54,120 --> 03:58:57,360
And often, I want to know something about the probability
4680
03:58:57,360 --> 03:59:00,880
that my random variable takes on each of those possible values.
4681
03:59:00,880 --> 03:59:04,360
And this is what we then call a probability distribution.
4682
03:59:04,360 --> 03:59:07,320
A probability distribution takes a random variable
4683
03:59:07,320 --> 03:59:12,040
and gives me the probability for each of the possible values in its domain.
4684
03:59:12,040 --> 03:59:15,600
So in the case of this flight, for example, my probability distribution
4685
03:59:15,600 --> 03:59:16,960
might look something like this.
4686
03:59:16,960 --> 03:59:19,920
My probability distribution says the probability
4687
03:59:19,920 --> 03:59:25,880
that the random variable flight is equal to the value on time is 0.6.
4688
03:59:25,880 --> 03:59:28,480
Or, put into more human-friendly English terms,
4689
03:59:28,480 --> 03:59:32,080
the likelihood that my flight is on time is 60%, for example.
4690
03:59:32,080 --> 03:59:35,760
And in this case, the probability that my flight is delayed is 30%.
4691
03:59:35,760 --> 03:59:39,720
The probability that my flight is canceled is 10% or 0.1.
4692
03:59:39,720 --> 03:59:42,480
And if you sum up all of these possible values,
4693
03:59:42,480 --> 03:59:43,840
the sum is going to be 1, right?
4694
03:59:43,840 --> 03:59:46,360
If you take all of the possible worlds, here
4695
03:59:46,360 --> 03:59:49,800
are my three possible worlds for the value of the random variable flight,
4696
03:59:49,800 --> 03:59:52,160
add them all up together, the result needs
4697
03:59:52,160 --> 03:59:55,280
to be the number 1 per that axiom of probability theory
4698
03:59:55,280 --> 03:59:57,160
that we've discussed before.
4699
03:59:57,160 --> 04:00:00,440
So this now is one way of representing this probability
4700
04:00:00,440 --> 04:00:03,600
distribution for the random variable flight.
4701
04:00:03,600 --> 04:00:06,160
Sometimes you'll see it represented a little bit more concisely
4702
04:00:06,160 --> 04:00:08,440
since this is pretty verbose for really just trying
4703
04:00:08,440 --> 04:00:10,720
to express three possible values.
4704
04:00:10,720 --> 04:00:13,280
And so often, you'll instead see the same notation
4705
04:00:13,280 --> 04:00:15,120
represented using a vector.
4706
04:00:15,120 --> 04:00:17,880
And all a vector is is a sequence of values.
4707
04:00:17,880 --> 04:00:21,160
As opposed to just a single value, I might have multiple values.
4708
04:00:21,160 --> 04:00:25,200
And so I could instead represent this idea this way.
4709
04:00:25,200 --> 04:00:29,920
Bold P, a capital P, generally meaning the probability distribution
4710
04:00:29,920 --> 04:00:35,520
of this variable flight is equal to this vector represented in angle brackets.
4711
04:00:35,520 --> 04:00:39,880
The probability distribution is 0.6, 0.3, and 0.1.
4712
04:00:39,880 --> 04:00:42,840
And I would just have to know that this probability distribution is
4713
04:00:42,840 --> 04:00:46,600
in order of on time, delayed, and canceled
4714
04:00:46,600 --> 04:00:48,280
to know how to interpret this vector.
4715
04:00:48,280 --> 04:00:51,000
Meaning, the first value in the vector is the probability
4716
04:00:51,000 --> 04:00:52,520
that my flight is on time.
4717
04:00:52,520 --> 04:00:56,040
The second value in the vector is the probability that my flight is delayed.
4718
04:00:56,040 --> 04:00:58,480
And the third value in the vector is the probability
4719
04:00:58,480 --> 04:01:00,560
that my flight is canceled.
4720
04:01:00,560 --> 04:01:03,720
And so this is just an alternate way of representing this idea,
4721
04:01:03,720 --> 04:01:05,040
a little more concisely.
4722
04:01:05,040 --> 04:01:08,840
But oftentimes, you'll see us just talk about a probability distribution
4723
04:01:08,840 --> 04:01:10,360
over a random variable.
4724
04:01:10,360 --> 04:01:12,600
And whenever we talk about that, what we're really doing
4725
04:01:12,600 --> 04:01:16,040
is trying to figure out the probabilities of each of the possible values
4726
04:01:16,040 --> 04:01:17,840
that that random variable can take on.
4727
04:01:17,840 --> 04:01:20,640
But this notation is just a little bit more succinct,
4728
04:01:20,640 --> 04:01:22,760
even though it can sometimes be a little confusing,
4729
04:01:22,760 --> 04:01:24,480
depending on the context in which you see it.
4730
04:01:24,480 --> 04:01:27,720
So we'll start to look at examples where we use this sort of notation
4731
04:01:27,720 --> 04:01:33,480
to describe probability and to describe events that might take place.
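As a quick illustrative sketch (not part of the lecture itself), this distribution notation could be mirrored in Python, with a dictionary as the verbose form and a list as the vector form whose value order we have to remember:

```python
# Hypothetical sketch: two equivalent representations of the
# probability distribution for the random variable Flight.

# Verbose form: an explicit mapping from each value to its probability.
flight_dist = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}

# Concise "vector" form: just the probabilities, in an agreed-upon
# order (on time, delayed, canceled) that we must remember.
flight_vector = [0.6, 0.3, 0.1]

# Axiom of probability theory: the probabilities over all possible
# worlds must sum to 1.
assert abs(sum(flight_dist.values()) - 1.0) < 1e-9
assert abs(sum(flight_vector) - 1.0) < 1e-9

print(flight_dist["on time"])  # 0.6
```

The dictionary is self-describing; the vector is shorter but only interpretable if you already know the ordering, which mirrors the trade-off described above.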
4732
04:01:33,480 --> 04:01:37,080
A couple of other important ideas to know with regards to probability theory.
4733
04:01:37,080 --> 04:01:39,480
One is this idea of independence.
4734
04:01:39,480 --> 04:01:43,080
And independence refers to the idea that the knowledge of one event
4735
04:01:43,080 --> 04:01:46,480
doesn't influence the probability of another event.
4736
04:01:46,480 --> 04:01:48,760
So for example, in the context of my two dice rolls,
4737
04:01:48,760 --> 04:01:51,560
where I had the red die and the blue die, the probability
4738
04:01:51,560 --> 04:01:54,040
that I roll the red die and the blue die,
4739
04:01:54,040 --> 04:01:57,120
those two events, red die and blue die, are independent.
4740
04:01:57,120 --> 04:02:00,160
Knowing the result of the red die doesn't change
4741
04:02:00,160 --> 04:02:01,520
the probabilities for the blue die.
4742
04:02:01,520 --> 04:02:03,960
It doesn't give me any additional information
4743
04:02:03,960 --> 04:02:06,920
about what the value of the blue die is ultimately going to be.
4744
04:02:06,920 --> 04:02:08,760
But that's not always going to be the case.
4745
04:02:08,760 --> 04:02:11,480
You might imagine that in the case of weather, something
4746
04:02:11,480 --> 04:02:15,240
like clouds and rain, those are probably not independent.
4747
04:02:15,240 --> 04:02:18,720
If it is cloudy, that might increase the probability that later
4748
04:02:18,720 --> 04:02:20,240
in the day it's going to rain.
4749
04:02:20,240 --> 04:02:24,680
So some information informs some other event or some other random variable.
4750
04:02:24,680 --> 04:02:29,080
So independence refers to the idea that one event doesn't influence the other.
4751
04:02:29,080 --> 04:02:34,280
And if they're not independent, then there might be some relationship.
4752
04:02:34,280 --> 04:02:37,440
So mathematically, formally, what does independence actually mean?
4753
04:02:37,440 --> 04:02:42,200
Well, recall this formula from before, that the probability of A and B
4754
04:02:42,200 --> 04:02:46,320
is the probability of A times the probability of B given A.
4755
04:02:46,320 --> 04:02:48,160
And the more intuitive way to think about this
4756
04:02:48,160 --> 04:02:51,680
is that to know how likely it is that A and B happen,
4757
04:02:51,680 --> 04:02:54,520
well, let's first figure out the likelihood that A happens.
4758
04:02:54,520 --> 04:02:56,880
And then given that we know that A happens,
4759
04:02:56,880 --> 04:02:58,720
let's figure out the likelihood that B happens
4760
04:02:58,720 --> 04:03:01,560
and multiply those two things together.
4761
04:03:01,560 --> 04:03:05,680
But if A and B were independent, meaning knowing A
4762
04:03:05,680 --> 04:03:09,440
doesn't change anything about the likelihood that B is true,
4763
04:03:09,440 --> 04:03:14,680
well, then the probability of B given A, meaning the probability that B is true,
4764
04:03:14,680 --> 04:03:17,680
given that I know A is true, well, that I know A is true
4765
04:03:17,680 --> 04:03:20,400
shouldn't really make a difference if these two things are independent,
4766
04:03:20,400 --> 04:03:22,880
that A shouldn't influence B at all.
4767
04:03:22,880 --> 04:03:27,760
So the probability of B given A is really just the probability of B.
4768
04:03:27,760 --> 04:03:30,800
If it is true that A and B are independent.
4769
04:03:30,800 --> 04:03:33,840
And so this right here is one example of a definition
4770
04:03:33,840 --> 04:03:36,440
for what it means for A and B to be independent.
4771
04:03:36,440 --> 04:03:39,600
The probability of A and B is just the probability
4772
04:03:39,600 --> 04:03:44,320
of A times the probability of B. Anytime you find two events A and B
4773
04:03:44,320 --> 04:03:49,640
where this relationship holds, then you can say that A and B are independent.
4774
04:03:49,640 --> 04:03:53,640
So an example of that might be the dice that we were taking a look at before.
4775
04:03:53,640 --> 04:03:58,320
Here, if I wanted the probability of red being a 6 and blue being a 6,
4776
04:03:58,320 --> 04:04:01,680
well, that's just the probability that red is a 6 multiplied
4777
04:04:01,680 --> 04:04:03,480
by the probability that blue is a 6.
4778
04:04:03,480 --> 04:04:05,760
It's both equal to 1 over 36.
4779
04:04:05,760 --> 04:04:10,320
So I can say that these two events are independent.
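As a sketch (hypothetical code, not from the course), the independence of the two dice can be checked by enumerating all 36 equally likely worlds:

```python
from itertools import product

# All 36 equally likely (red, blue) outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

p_red_6 = sum(1 for r, b in outcomes if r == 6) / len(outcomes)
p_blue_6 = sum(1 for r, b in outcomes if b == 6) / len(outcomes)
p_both = sum(1 for r, b in outcomes if r == 6 and b == 6) / len(outcomes)

# Independence: P(red=6 and blue=6) equals P(red=6) * P(blue=6) = 1/36.
assert abs(p_both - p_red_6 * p_blue_6) < 1e-12
print(p_both)  # 1/36, about 0.0278
```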
4780
04:04:10,320 --> 04:04:13,920
So what, then, wouldn't be independent?
4781
04:04:13,920 --> 04:04:16,320
So this, for example, has a probability of 1 over 36,
4782
04:04:16,320 --> 04:04:17,640
as we talked about before.
4783
04:04:17,640 --> 04:04:20,560
But what wouldn't be independent would be a case like this,
4784
04:04:20,560 --> 04:04:26,360
the probability that the red die rolls a 6 and the red die rolls a 4.
4785
04:04:26,360 --> 04:04:29,600
If you just naively took, OK, red die 6, red die 4,
4786
04:04:29,600 --> 04:04:31,280
well, if I'm only rolling the die once, you
4787
04:04:31,280 --> 04:04:34,120
might imagine the naive approach is to say, well, each of these
4788
04:04:34,120 --> 04:04:35,800
has a probability of 1 over 6.
4789
04:04:35,800 --> 04:04:39,440
So multiply them together, and the probability is 1 over 36.
4790
04:04:39,440 --> 04:04:41,720
But of course, if you're only rolling the red die once,
4791
04:04:41,720 --> 04:04:45,360
there's no way you could get two different values for the red die.
4792
04:04:45,360 --> 04:04:48,000
It couldn't both be a 6 and a 4.
4793
04:04:48,000 --> 04:04:50,200
So the probability should be 0.
4794
04:04:50,200 --> 04:04:53,680
But if you were to multiply probability of red 6 times
4795
04:04:53,680 --> 04:04:57,440
probability of red 4, well, that would equal 1 over 36.
4796
04:04:57,440 --> 04:04:58,760
But of course, that's not true.
4797
04:04:58,760 --> 04:05:01,800
Because we know that there is no way, probability 0,
4798
04:05:01,800 --> 04:05:06,200
that when we roll the red die once, we get both a 6 and a 4,
4799
04:05:06,200 --> 04:05:10,760
because only one of those possibilities can actually be the result.
4800
04:05:10,760 --> 04:05:14,280
And so we can say that the event that red roll is 6
4801
04:05:14,280 --> 04:05:18,360
and the event that red roll is 4, those two events are not independent.
4802
04:05:18,360 --> 04:05:23,200
If I know that the red roll is a 6, I know that the red roll cannot possibly
4803
04:05:23,200 --> 04:05:25,880
be a 4, so these things are not independent.
4804
04:05:25,880 --> 04:05:28,240
And instead, if I wanted to calculate the probability,
4805
04:05:28,240 --> 04:05:31,480
I would need to use this conditional probability
4806
04:05:31,480 --> 04:05:36,160
per the regular definition of the probability of two events taking place.
4807
04:05:36,160 --> 04:05:38,560
And the probability of this now, well, the probability
4808
04:05:38,560 --> 04:05:41,320
of the red roll being a 6, that's 1 over 6.
4809
04:05:41,320 --> 04:05:45,960
But what's the probability that the roll is a 4 given that the roll is a 6?
4810
04:05:45,960 --> 04:05:50,680
Well, this is just 0, because there's no way for the red roll to be a 4,
4811
04:05:50,680 --> 04:05:53,560
given that we already know the red roll is a 6.
4812
04:05:53,560 --> 04:05:59,320
And so if we do all that multiplication, we get the number 0.
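Continuing the sketch (hypothetical code), the same single red die shows why the naive product fails for dependent events:

```python
from fractions import Fraction

p_red_6 = Fraction(1, 6)
p_red_4 = Fraction(1, 6)

# One roll can't be both a 6 and a 4, so P(red=4 | red=6) = 0,
# and the true joint probability is P(red=6) * P(red=4 | red=6) = 0.
p_both = p_red_6 * Fraction(0)

# The naive product wrongly assumes independence.
naive = p_red_6 * p_red_4

print(p_both)  # 0
print(naive)   # 1/36
```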
4813
04:05:59,320 --> 04:06:02,520
So this idea of conditional probability is going to come up again and again,
4814
04:06:02,520 --> 04:06:06,400
especially as we begin to reason about multiple different random variables
4815
04:06:06,400 --> 04:06:08,760
that might be interacting with each other in some way.
4816
04:06:08,760 --> 04:06:10,880
And this gets us to one of the most important rules
4817
04:06:10,880 --> 04:06:14,400
in probability theory, which is known as Bayes rule.
4818
04:06:14,400 --> 04:06:17,000
And it turns out that just using the information we've already
4819
04:06:17,000 --> 04:06:20,440
learned about probability and just applying a little bit of algebra,
4820
04:06:20,440 --> 04:06:23,480
we can actually derive Bayes rule for ourselves.
4821
04:06:23,480 --> 04:06:26,200
But it's a very important rule when it comes to inference
4822
04:06:26,200 --> 04:06:28,640
and thinking about probability in the context of what
4823
04:06:28,640 --> 04:06:31,200
it is that a computer can do or what a mathematician could
4824
04:06:31,200 --> 04:06:34,920
do by having access to information about probability.
4825
04:06:34,920 --> 04:06:39,400
So let's go back to these equations to be able to derive Bayes rule ourselves.
4826
04:06:39,400 --> 04:06:43,800
We know the probability of A and B, the likelihood that A and B take place,
4827
04:06:43,800 --> 04:06:47,240
is the likelihood of B, and then the likelihood of A,
4828
04:06:47,240 --> 04:06:49,680
given that we know that B is already true.
4829
04:06:49,680 --> 04:06:52,800
And likewise, the probability of A and B
4830
04:06:52,800 --> 04:06:56,240
is the probability of A times the probability of B,
4831
04:06:56,240 --> 04:06:58,280
given that we know that A is already true.
4832
04:06:58,280 --> 04:07:00,280
This is sort of a symmetric relationship where
4833
04:07:00,280 --> 04:07:04,000
it doesn't matter the order of A and B and B and A mean the same thing.
4834
04:07:04,000 --> 04:07:07,520
And so in these equations, we can just swap out A and B
4835
04:07:07,520 --> 04:07:09,720
to be able to represent the exact same idea.
4836
04:07:09,720 --> 04:07:12,200
So we know that these two equations are already true.
4837
04:07:12,200 --> 04:07:13,480
We've seen that already.
4838
04:07:13,480 --> 04:07:17,000
And now let's just do a little bit of algebraic manipulation of this stuff.
4839
04:07:17,000 --> 04:07:19,800
Both of these expressions on the right-hand side
4840
04:07:19,800 --> 04:07:24,040
are equal to the probability of A and B. So what I can do
4841
04:07:24,040 --> 04:07:26,600
is take these two expressions on the right-hand side
4842
04:07:26,600 --> 04:07:28,760
and just set them equal to each other.
4843
04:07:28,760 --> 04:07:32,480
If they're both equal to the probability of A and B,
4844
04:07:32,480 --> 04:07:34,600
then they both must be equal to each other.
4845
04:07:34,600 --> 04:07:38,400
So probability of A times probability of B given A
4846
04:07:38,400 --> 04:07:44,360
is equal to the probability of B times the probability of A given B.
4847
04:07:44,360 --> 04:07:47,480
And now all we're going to do is do a little bit of division.
4848
04:07:47,480 --> 04:07:53,480
I'm going to divide both sides by P of A. And now I get what is Bayes' rule.
4849
04:07:53,480 --> 04:07:59,000
The probability of B given A is equal to the probability of B
4850
04:07:59,000 --> 04:08:03,120
times the probability of A given B divided by the probability of A.
4851
04:08:03,120 --> 04:08:05,040
And sometimes in Bayes' rule, you'll see the order
4852
04:08:05,040 --> 04:08:06,320
of these two arguments switched.
4853
04:08:06,320 --> 04:08:10,520
So instead of B times A given B, it'll be A given B times B.
4854
04:08:10,520 --> 04:08:12,940
That ultimately doesn't matter because in multiplication,
4855
04:08:12,940 --> 04:08:15,600
you can switch the order of the two things you're multiplying,
4856
04:08:15,600 --> 04:08:18,480
and it doesn't change the result. But this here right now
4857
04:08:18,480 --> 04:08:21,120
is the most common formulation of Bayes' rule.
4858
04:08:21,120 --> 04:08:26,240
The probability of B given A is equal to the probability of A given
4859
04:08:26,240 --> 04:08:31,200
B times the probability of B divided by the probability of A.
4860
04:08:31,200 --> 04:08:33,640
And this rule, it turns out, is really important
4861
04:08:33,640 --> 04:08:36,280
when it comes to trying to infer things about the world,
4862
04:08:36,280 --> 04:08:39,720
because it means you can express one conditional probability,
4863
04:08:39,720 --> 04:08:44,000
the conditional probability of B given A, using knowledge
4864
04:08:44,000 --> 04:08:47,960
about the probability of A given B, using the reverse
4865
04:08:47,960 --> 04:08:49,680
of that conditional probability.
4866
04:08:49,680 --> 04:08:51,960
So let's first do a little bit of an example with this,
4867
04:08:51,960 --> 04:08:54,200
just to see how we might use it, and then explore
4868
04:08:54,200 --> 04:08:56,680
what this means a little bit more generally.
4869
04:08:56,680 --> 04:08:59,840
So we're going to construct a situation where I have some information.
4870
04:08:59,840 --> 04:09:02,400
There are two events that I care about, the idea
4871
04:09:02,400 --> 04:09:05,240
that it's cloudy in the morning and the idea
4872
04:09:05,240 --> 04:09:07,600
that it is rainy in the afternoon.
4873
04:09:07,600 --> 04:09:10,240
Those are two different possible events that could take place,
4874
04:09:10,240 --> 04:09:13,680
cloudy in the morning, or the AM, rainy in the PM.
4875
04:09:13,680 --> 04:09:17,160
And what I care about is, given clouds in the morning,
4876
04:09:17,160 --> 04:09:19,840
what is the probability of rain in the afternoon?
4877
04:09:19,840 --> 04:09:22,040
A reasonable question I might ask, in the morning,
4878
04:09:22,040 --> 04:09:24,840
I look outside, or an AI's camera looks outside
4879
04:09:24,840 --> 04:09:27,480
and sees that there are clouds in the morning.
4880
04:09:27,480 --> 04:09:30,880
And we want to conclude, we want to figure out what is the probability
4881
04:09:30,880 --> 04:09:34,000
that in the afternoon, there is going to be rain.
4882
04:09:34,000 --> 04:09:36,080
Of course, in the abstract, we don't have access
4883
04:09:36,080 --> 04:09:38,600
to this kind of information, but we can use data
4884
04:09:38,600 --> 04:09:40,400
to begin to try and figure this out.
4885
04:09:40,400 --> 04:09:44,680
So let's imagine now that I have access to some pieces of information.
4886
04:09:44,680 --> 04:09:48,440
I have access to the idea that 80% of rainy afternoons
4887
04:09:48,440 --> 04:09:50,400
start out with a cloudy morning.
4888
04:09:50,400 --> 04:09:52,920
And you might imagine that I could have gathered this data just
4889
04:09:52,920 --> 04:09:54,640
by looking at data over a sequence of time,
4890
04:09:54,640 --> 04:09:58,360
that I know that 80% of the time when it's raining in the afternoon,
4891
04:09:58,360 --> 04:10:01,360
it was cloudy that morning.
4892
04:10:01,360 --> 04:10:04,760
I also know that 40% of days have cloudy mornings.
4893
04:10:04,760 --> 04:10:08,680
And I also know that 10% of days have rainy afternoons.
4894
04:10:08,680 --> 04:10:12,280
And now using this information, I would like to figure out,
4895
04:10:12,280 --> 04:10:15,320
given clouds in the morning, what is the probability
4896
04:10:15,320 --> 04:10:16,720
that it rains in the afternoon?
4897
04:10:16,720 --> 04:10:21,200
I want to know the probability of afternoon rain given morning clouds.
4898
04:10:21,200 --> 04:10:26,160
And I can do that, in particular, using this fact, the probability of,
4899
04:10:26,160 --> 04:10:29,880
so if I know that 80% of rainy afternoons start with cloudy mornings,
4900
04:10:29,880 --> 04:10:34,040
then I know the probability of cloudy mornings given rainy afternoons.
4901
04:10:34,040 --> 04:10:36,760
So using sort of the reverse conditional probability,
4902
04:10:36,760 --> 04:10:38,080
I can figure that out.
4903
04:10:38,080 --> 04:10:41,160
Expressed in terms of Bayes rule, this is what that would look like.
4904
04:10:41,160 --> 04:10:46,520
Probability of rain given clouds is the probability of clouds given rain
4905
04:10:46,520 --> 04:10:50,000
times the probability of rain divided by the probability of clouds.
4906
04:10:50,000 --> 04:10:53,160
Here I'm just substituting in for the values of a and b
4907
04:10:53,160 --> 04:10:55,280
from that equation of Bayes rule from before.
4908
04:10:55,280 --> 04:10:56,320
And then I can just do the math.
4909
04:10:56,320 --> 04:10:57,400
I have this information.
4910
04:10:57,400 --> 04:11:00,880
I know that 80% of the time, if it was raining,
4911
04:11:00,880 --> 04:11:01,960
then there were clouds in the morning.
4912
04:11:01,960 --> 04:11:03,240
So 0.8 here.
4913
04:11:03,240 --> 04:11:06,640
Probability of rain is 0.1, because 10% of days were rainy,
4914
04:11:06,640 --> 04:11:08,480
and 40% of days were cloudy.
4915
04:11:08,480 --> 04:11:11,560
I do the math, and I can figure out the answer is 0.2.
4916
04:11:11,560 --> 04:11:14,440
So the probability that it rains in the afternoon,
4917
04:11:14,440 --> 04:11:19,720
given that it was cloudy in the morning, is 0.2 in this case.
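The arithmetic above can be sketched as a small Python helper (a hypothetical function, not from the course's own code):

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(B | A) = P(A | B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# From the example: P(clouds | rain) = 0.8, P(rain) = 0.1, P(clouds) = 0.4.
p_rain_given_clouds = bayes(p_a_given_b=0.8, p_b=0.1, p_a=0.4)
print(round(p_rain_given_clouds, 4))  # 0.2
```

The same helper would apply to the later examples, such as inferring the probability of a disease from a test result, by swapping in the appropriate probabilities.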
4918
04:11:19,720 --> 04:11:22,120
And this now is an application of Bayes rule,
4919
04:11:22,120 --> 04:11:24,760
the idea that using one conditional probability,
4920
04:11:24,760 --> 04:11:27,720
we can get the reverse conditional probability.
4921
04:11:27,720 --> 04:11:31,040
And this is often useful when one of the conditional probabilities
4922
04:11:31,040 --> 04:11:34,840
might be easier for us to know about or easier for us to have data about.
4923
04:11:34,840 --> 04:11:37,520
And using that information, we can calculate
4924
04:11:37,520 --> 04:11:39,360
the other conditional probability.
4925
04:11:39,360 --> 04:11:40,640
So what does this look like?
4926
04:11:40,640 --> 04:11:43,720
Well, it means that knowing the probability of cloudy mornings
4927
04:11:43,720 --> 04:11:47,200
given rainy afternoons, we can calculate the probability
4928
04:11:47,200 --> 04:11:50,120
of rainy afternoons given cloudy mornings.
4929
04:11:50,120 --> 04:11:54,320
Or, for example, more generally, if we know the probability
4930
04:11:54,320 --> 04:11:58,480
of some visible effect, some effect that we can see and observe,
4931
04:11:58,480 --> 04:12:02,040
given some unknown cause that we're not sure about,
4932
04:12:02,040 --> 04:12:05,760
well, then we can calculate the probability of that unknown cause
4933
04:12:05,760 --> 04:12:08,440
given the visible effect.
4934
04:12:08,440 --> 04:12:10,080
So what might that look like?
4935
04:12:10,080 --> 04:12:12,200
Well, in the context of medicine, for example,
4936
04:12:12,200 --> 04:12:17,080
I might know the probability of some medical test result given a disease.
4937
04:12:17,080 --> 04:12:19,400
Like, I know that if someone has a disease,
4938
04:12:19,400 --> 04:12:23,040
then x% of the time the medical test result will show up as this,
4939
04:12:23,040 --> 04:12:24,000
for instance.
4940
04:12:24,000 --> 04:12:26,760
And using that information, then I can calculate, all right,
4941
04:12:26,760 --> 04:12:31,040
what is the probability that given I know the medical test result, what
4942
04:12:31,040 --> 04:12:33,120
is the likelihood that someone has the disease?
4943
04:12:33,120 --> 04:12:36,280
This is the piece of information that is usually easier to know,
4944
04:12:36,280 --> 04:12:38,760
easier to immediately have access to data for.
4945
04:12:38,760 --> 04:12:42,320
And this is the information that I actually want to calculate.
4946
04:12:42,320 --> 04:12:44,080
Or I might want to know, for example, if I
4947
04:12:44,080 --> 04:12:48,040
know that some probability of counterfeit bills
4948
04:12:48,040 --> 04:12:51,440
have blurry text around the edges, because counterfeit printers aren't
4949
04:12:51,440 --> 04:12:53,560
nearly as good at printing text precisely.
4950
04:12:53,560 --> 04:12:56,000
So I have some information about, given that something
4951
04:12:56,000 --> 04:12:59,160
is a counterfeit bill, like x% of counterfeit bills
4952
04:12:59,160 --> 04:13:01,120
have blurry text, for example.
4953
04:13:01,120 --> 04:13:04,480
And using that information, then I can calculate some piece of information
4954
04:13:04,480 --> 04:13:08,160
that I might want to know, like, given that I know there's blurry text
4955
04:13:08,160 --> 04:13:12,200
on a bill, what is the probability that that bill is counterfeit?
4956
04:13:12,200 --> 04:13:14,600
So given one conditional probability, I can
4957
04:13:14,600 --> 04:13:19,320
calculate the other conditional probability as well.
4958
04:13:19,320 --> 04:13:22,640
And so now we've taken a look at a couple of different types of probability.
4959
04:13:22,640 --> 04:13:24,840
And we've looked at unconditional probability,
4960
04:13:24,840 --> 04:13:27,920
where I just look at what is the probability of this event occurring,
4961
04:13:27,920 --> 04:13:31,040
given no additional evidence that I might have access to.
4962
04:13:31,040 --> 04:13:33,560
And we've also looked at conditional probability,
4963
04:13:33,560 --> 04:13:35,400
where I have some sort of evidence, and I
4964
04:13:35,400 --> 04:13:38,760
would like to, using that evidence, be able to calculate some other
4965
04:13:38,760 --> 04:13:40,480
probability as well.
4966
04:13:40,480 --> 04:13:43,560
And the other kind of probability that will be important for us to think about
4967
04:13:43,560 --> 04:13:45,280
is joint probability.
4968
04:13:45,280 --> 04:13:47,440
And this is when we're considering the likelihood
4969
04:13:47,440 --> 04:13:50,800
of multiple different events simultaneously.
4970
04:13:50,800 --> 04:13:52,200
And so what do we mean by this?
4971
04:13:52,200 --> 04:13:55,320
For example, I might have probability distributions
4972
04:13:55,320 --> 04:13:56,880
that look a little something like this.
4973
04:13:56,880 --> 04:13:59,800
Like, oh, I want to know the probability distribution of clouds
4974
04:13:59,800 --> 04:14:00,640
in the morning.
4975
04:14:00,640 --> 04:14:02,400
And that distribution looks like this.
4976
04:14:02,400 --> 04:14:06,080
40% of the time, C, which is my random variable here,
4977
04:14:06,080 --> 04:14:07,680
is equal to cloudy.
4978
04:14:07,680 --> 04:14:10,560
And 60% of the time, it's not cloudy.
4979
04:14:10,560 --> 04:14:13,040
So here is just a simple probability distribution
4980
04:14:13,040 --> 04:14:17,320
that is effectively telling me that 40% of the time, it's cloudy.
4981
04:14:17,320 --> 04:14:20,800
I might also have a probability distribution for rain in the afternoon,
4982
04:14:20,800 --> 04:14:24,240
where 10% of the time, or with probability 0.1,
4983
04:14:24,240 --> 04:14:25,800
it is raining in the afternoon.
4984
04:14:25,800 --> 04:14:30,680
And with probability 0.9, it is not raining in the afternoon.
4985
04:14:30,680 --> 04:14:34,080
And using just these two pieces of information,
4986
04:14:34,080 --> 04:14:36,160
I don't actually have a whole lot of information
4987
04:14:36,160 --> 04:14:39,480
about how these two variables relate to each other.
4988
04:14:39,480 --> 04:14:42,520
But I could if I had access to their joint probability,
4989
04:14:42,520 --> 04:14:45,160
meaning for every combination of these two things,
4990
04:14:45,160 --> 04:14:49,200
meaning morning cloudy and afternoon rain, morning cloudy and afternoon not
4991
04:14:49,200 --> 04:14:52,120
rain, morning not cloudy and afternoon rain,
4992
04:14:52,120 --> 04:14:54,760
and morning not cloudy and afternoon not raining,
4993
04:14:54,760 --> 04:14:57,320
if I had access to values for each of those four,
4994
04:14:57,320 --> 04:14:58,800
I'd have more information.
4995
04:14:58,800 --> 04:15:02,040
So information that'd be organized in a table like this,
4996
04:15:02,040 --> 04:15:05,320
and this, rather than just a probability distribution,
4997
04:15:05,320 --> 04:15:07,600
is a joint probability distribution.
4998
04:15:07,600 --> 04:15:10,720
It tells me the probability distribution of each
4999
04:15:10,720 --> 04:15:15,800
of the possible combinations of values that these random variables can take on.
5000
04:15:15,800 --> 04:15:19,280
So if I want to know what is the probability that on any given day
5001
04:15:19,280 --> 04:15:22,440
it is both cloudy and rainy, well, I would say, all right,
5002
04:15:22,440 --> 04:15:26,520
we're looking at cases where it is cloudy and cases where it is raining.
5003
04:15:26,520 --> 04:15:30,960
And the intersection of those two, that row in that column, is 0.08.
5004
04:15:30,960 --> 04:15:35,160
So that is the probability that it is both cloudy and rainy using
5005
04:15:35,160 --> 04:15:36,720
that information.
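As a sketch, the joint table could be stored in Python. Only the 0.08 and 0.02 entries are quoted here; the other two values below (0.32 and 0.58) are assumptions filled in so the table stays consistent with the 40% cloudy and 10% rainy figures from before:

```python
# Hypothetical joint distribution over (clouds, rain). The 0.32 and
# 0.58 entries are inferred from P(cloud) = 0.4 and P(rain) = 0.1.
joint = {
    ("cloud", "rain"): 0.08,
    ("cloud", "no rain"): 0.32,
    ("no cloud", "rain"): 0.02,
    ("no cloud", "no rain"): 0.58,
}

# All four possible worlds together must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

print(joint[("cloud", "rain")])  # 0.08
```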
5006
04:15:36,720 --> 04:15:39,640
And using this joint probability table,
5007
04:15:39,640 --> 04:15:41,880
I can
5008
04:15:41,880 --> 04:15:46,200
begin to draw other pieces of information about things like conditional
5009
04:15:46,200 --> 04:15:47,000
probability.
5010
04:15:47,000 --> 04:15:51,520
So I might ask a question like, what is the probability distribution of clouds
5011
04:15:51,520 --> 04:15:53,800
given that I know that it is raining?
5012
04:15:53,800 --> 04:15:56,280
Meaning I know for sure that it's raining.
5013
04:15:56,280 --> 04:15:59,800
Tell me the probability distribution over whether it's cloudy or not,
5014
04:15:59,800 --> 04:16:02,320
given that I know already that it is, in fact, raining.
5015
04:16:02,320 --> 04:16:05,080
And here I'm using C to stand for that random variable.
5016
04:16:05,080 --> 04:16:07,640
I'm looking for a distribution, meaning the answer to this
5017
04:16:07,640 --> 04:16:09,480
is not going to be a single value.
5018
04:16:09,480 --> 04:16:12,080
It's going to be two values, a vector of two values,
5019
04:16:12,080 --> 04:16:14,800
where the first value is probability of clouds,
5020
04:16:14,800 --> 04:16:17,600
the second value is probability that it is not cloudy,
5021
04:16:17,600 --> 04:16:19,880
but the sum of those two values is going to be 1.
5022
04:16:19,880 --> 04:16:23,280
Because when you add up the probabilities of all of the possible worlds,
5023
04:16:23,280 --> 04:16:26,840
the result that you get must be the number 1.
5024
04:16:26,840 --> 04:16:30,360
And well, what do we know about how to calculate a conditional probability?
5025
04:16:30,360 --> 04:16:33,600
Well, we know that the probability of A given B
5026
04:16:33,600 --> 04:16:38,960
is the probability of A and B divided by the probability of B.
5027
04:16:38,960 --> 04:16:40,280
So what does this mean?
5028
04:16:40,280 --> 04:16:43,240
Well, it means that I can calculate the probability of clouds
5029
04:16:43,240 --> 04:16:49,080
given that it's raining as the probability of clouds and raining
5030
04:16:49,080 --> 04:16:50,880
divided by the probability of rain.
5031
04:16:50,880 --> 04:16:53,640
And this comma here for the probability distribution
5032
04:16:53,640 --> 04:16:57,320
of clouds and rain, this comma sort of stands in for the word and.
5033
04:16:57,320 --> 04:16:59,920
You'll sometimes see the logical operator "and" and the comma
5034
04:16:59,920 --> 04:17:01,120
used interchangeably.
5035
04:17:01,120 --> 04:17:04,200
This means the probability distribution over the clouds
5036
04:17:04,200 --> 04:17:06,680
and knowing the fact that it is raining divided
5037
04:17:06,680 --> 04:17:09,160
by the probability of rain.
5038
04:17:09,160 --> 04:17:11,080
And the interesting thing to note here and what
5039
04:17:11,080 --> 04:17:13,640
we'll often do in order to simplify our mathematics
5040
04:17:13,640 --> 04:17:16,760
is that dividing by the probability of rain,
5041
04:17:16,760 --> 04:17:19,760
the probability of rain here is just some numerical constant.
5042
04:17:19,760 --> 04:17:20,560
It is some number.
5043
04:17:20,560 --> 04:17:24,480
Dividing by probability of rain is just dividing by some constant,
5044
04:17:24,480 --> 04:17:27,760
or in other words, multiplying by the inverse of that constant.
5045
04:17:27,760 --> 04:17:30,480
And it turns out that oftentimes we can just not
5046
04:17:30,480 --> 04:17:32,880
worry about what the exact value of this is
5047
04:17:32,880 --> 04:17:36,040
and just know that it is, in fact, a constant value.
5048
04:17:36,040 --> 04:17:37,280
And we'll see why in a moment.
5049
04:17:37,280 --> 04:17:41,040
So instead of expressing this as this joint probability divided
5050
04:17:41,040 --> 04:17:43,040
by the probability of rain, sometimes we'll
5051
04:17:43,040 --> 04:17:47,240
just represent it as alpha times the numerator here,
5052
04:17:47,240 --> 04:17:50,480
the probability distribution of C, this variable,
5053
04:17:50,480 --> 04:17:53,000
and that we know that it is raining, for instance.
5054
04:17:53,000 --> 04:17:57,920
So all we've done here is noted that this value, 1 over the probability of rain,
5055
04:17:57,920 --> 04:18:00,720
is really just a constant we're going to
5056
04:18:00,720 --> 04:18:02,800
multiply by at the end.
5057
04:18:02,800 --> 04:18:06,400
We'll just call it alpha for now and deal with it a little bit later.
5058
04:18:06,400 --> 04:18:09,800
But the key idea here now, and this is an idea that's going to come up again,
5059
04:18:09,800 --> 04:18:14,040
is that the conditional distribution of C given rain
5060
04:18:14,040 --> 04:18:17,120
is proportional to, meaning just some factor multiplied
5061
04:18:17,120 --> 04:18:22,200
by the joint probability of C and rain being true.
5062
04:18:22,200 --> 04:18:23,560
And so how do we figure this out?
5063
04:18:23,560 --> 04:18:25,760
Well, this is going to be the probability that it
5064
04:18:25,760 --> 04:18:28,440
is both cloudy and raining, which is 0.08,
5065
04:18:28,440 --> 04:18:30,680
and the probability that it's not cloudy
5066
04:18:30,680 --> 04:18:32,960
and raining, which is 0.02.
5067
04:18:32,960 --> 04:18:37,680
And so we get alpha times that probability distribution here.
5068
04:18:37,680 --> 04:18:40,000
0.08 is clouds and rain.
5069
04:18:40,000 --> 04:18:43,840
0.02 is not cloudy and rain.
5070
04:18:43,840 --> 04:18:47,920
But of course, 0.08 and 0.02 don't sum up to the number 1.
5071
04:18:47,920 --> 04:18:50,400
And we know that in a probability distribution,
5072
04:18:50,400 --> 04:18:52,680
if you consider all of the possible values,
5073
04:18:52,680 --> 04:18:55,360
they must sum up to a probability of 1.
5074
04:18:55,360 --> 04:18:57,600
And so we know that we just need to figure out
5075
04:18:57,600 --> 04:19:01,720
some constant to normalize, so to speak, these values, something
5076
04:19:01,720 --> 04:19:05,480
we can multiply or divide by to get it so that all these probabilities sum up
5077
04:19:05,480 --> 04:19:08,920
to 1, and it turns out that if we multiply both numbers by 10,
5078
04:19:08,920 --> 04:19:11,920
then we can get that result of 0.8 and 0.2.
5079
04:19:11,920 --> 04:19:15,640
The proportions are still equivalent, but now 0.8 plus 0.2,
5080
04:19:15,640 --> 04:19:18,280
those sum up to the number 1.
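The normalization step just described can be sketched in Python. This is a minimal illustration using the 0.08 and 0.02 values from the example; the variable names are my own:

```python
# Values proportional to P(C | rain): the joint probabilities P(C, rain)
# for each value of the cloudy variable C.
unnormalized = {"cloudy": 0.08, "not cloudy": 0.02}

# Alpha is 1 divided by the sum of the unnormalized values, so that
# after multiplying by it, the probabilities sum to 1.
alpha = 1 / sum(unnormalized.values())

distribution = {value: alpha * p for value, p in unnormalized.items()}

print(distribution)  # approximately {'cloudy': 0.8, 'not cloudy': 0.2}
```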
5081
04:19:18,280 --> 04:19:21,400
So take a look at this and see if you can understand step by step
5082
04:19:21,400 --> 04:19:23,600
how it is we're getting from one point to another.
5083
04:19:23,600 --> 04:19:27,840
The key idea here is that by using the joint probabilities,
5084
04:19:27,840 --> 04:19:31,360
these probabilities that it is both cloudy and rainy
5085
04:19:31,360 --> 04:19:35,240
and that it is not cloudy and rainy, I can take that information
5086
04:19:35,240 --> 04:19:39,440
and figure out the conditional probability given that it's raining.
5087
04:19:39,440 --> 04:19:41,960
What is the chance that it's cloudy versus not cloudy?
5088
04:19:41,960 --> 04:19:46,320
Just by multiplying by some normalization constant, so to speak.
5089
04:19:46,320 --> 04:19:48,520
And this is what a computer can begin to use
5090
04:19:48,520 --> 04:19:52,880
to be able to interact with these various different types of probabilities.
5091
04:19:52,880 --> 04:19:55,420
And it turns out there are a number of other probability rules
5092
04:19:55,420 --> 04:19:57,440
that are going to be useful to us as we begin
5093
04:19:57,440 --> 04:20:01,200
to explore how we can actually use this information to encode
5094
04:20:01,200 --> 04:20:05,640
into our computers some more complex analysis that we might want to do
5095
04:20:05,640 --> 04:20:08,840
about probability and distributions and random variables
5096
04:20:08,840 --> 04:20:10,440
that we might be interacting with.
5097
04:20:10,440 --> 04:20:12,840
So here are a couple of those important probability rules.
5098
04:20:12,840 --> 04:20:15,480
One of the simplest rules is just this negation rule.
5099
04:20:15,480 --> 04:20:19,080
What is the probability of not event A?
5100
04:20:19,080 --> 04:20:21,600
So A is an event that has some probability,
5101
04:20:21,600 --> 04:20:25,480
and I would like to know what is the probability that A does not occur.
5102
04:20:25,480 --> 04:20:29,980
And it turns out it's just 1 minus P of A, which makes sense.
5103
04:20:29,980 --> 04:20:33,720
Because if those are the two possible cases, either A happens or A
5104
04:20:33,720 --> 04:20:37,600
doesn't happen, then when you add up those two cases, you must get 1,
5105
04:20:37,600 --> 04:20:42,600
which means that P of not A must just be 1 minus P of A.
5106
04:20:42,600 --> 04:20:46,560
Because P of A and P of not A must sum up to the number 1.
5107
04:20:46,560 --> 04:20:49,680
They must include all of the possible cases.
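As a quick sanity check of the negation rule, here is a sketch that verifies P(not A) = 1 - P(A) by enumerating the faces of a fair die (my own example, not from the lecture):

```python
from fractions import Fraction

# P(A): probability of rolling a 6 on a fair die.
p_a = Fraction(1, 6)

# Negation rule: P(not A) = 1 - P(A).
p_not_a = 1 - p_a

# Verify by enumeration: 5 of the 6 faces are not a 6.
enumerated = Fraction(sum(1 for face in range(1, 7) if face != 6), 6)

assert p_not_a == enumerated == Fraction(5, 6)
```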
5108
04:20:49,680 --> 04:20:53,640
We've seen an expression for calculating the probability of A and B.
5109
04:20:53,640 --> 04:20:57,840
We might also reasonably want to calculate the probability of A or B.
5110
04:20:57,840 --> 04:21:01,200
What is the probability that one thing happens or another thing happens?
5111
04:21:01,200 --> 04:21:04,520
So for example, I might want to calculate what is the probability
5112
04:21:04,520 --> 04:21:07,880
that if I roll two dice, a red die and a blue die, what is the likelihood
5113
04:21:07,880 --> 04:21:11,480
that A is a 6 or B is a 6, like one or the other?
5114
04:21:11,480 --> 04:21:14,480
And what you might imagine you could do, and the wrong way to approach it,
5115
04:21:14,480 --> 04:21:19,000
would be just to say, all right, well, A, the red die,
5116
04:21:19,000 --> 04:21:21,560
comes up as a 6 with probability 1 over 6.
5117
04:21:21,560 --> 04:21:23,720
The same for the blue die, it's also 1 over 6.
5118
04:21:23,720 --> 04:21:27,160
Add them together, and you get 2 over 6, otherwise known as 1 third.
5119
04:21:27,160 --> 04:21:30,480
But this suffers from a problem of overcounting,
5120
04:21:30,480 --> 04:21:34,560
that we've double counted the case, where both A and B, both the red die
5121
04:21:34,560 --> 04:21:37,320
and the blue die, come up as a 6 on the same roll.
5122
04:21:37,320 --> 04:21:39,440
And I've counted that instance twice.
5123
04:21:39,440 --> 04:21:43,880
So to resolve this, the actual expression for calculating the probability of A
5124
04:21:43,880 --> 04:21:47,720
or B uses what we call the inclusion-exclusion formula.
5125
04:21:47,720 --> 04:21:51,120
So I take the probability of A, add it to the probability of B.
5126
04:21:51,120 --> 04:21:52,520
That's all same as before.
5127
04:21:52,520 --> 04:21:56,080
But then I need to exclude the cases that I've double counted.
5128
04:21:56,080 --> 04:22:01,240
So I subtract from that the probability of A and B.
5129
04:22:01,240 --> 04:22:05,160
And that gets me the result for A or B. I consider all the cases where A is true
5130
04:22:05,160 --> 04:22:07,000
and all the cases where B is true.
5131
04:22:07,000 --> 04:22:09,920
And if you imagine this is like a Venn diagram of cases where A is true,
5132
04:22:09,920 --> 04:22:12,800
cases where B is true, I just need to subtract out the middle
5133
04:22:12,800 --> 04:22:16,720
to get rid of the cases that I have overcounted by double counting them
5134
04:22:16,720 --> 04:22:21,160
inside of both of these individual expressions.
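The inclusion-exclusion formula from the two-dice example can be checked directly by enumerating all 36 outcomes; a short sketch (the setup mirrors the red-die/blue-die example above):

```python
from fractions import Fraction
from itertools import product

# P(A): red die is a 6; P(B): blue die is a 6; P(A and B): both are 6s.
p_a = Fraction(1, 6)
p_b = Fraction(1, 6)
p_a_and_b = Fraction(1, 36)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B),
# subtracting out the double-counted case where both dice show a 6.
p_a_or_b = p_a + p_b - p_a_and_b

# Verify by enumerating every ordered (red, blue) outcome.
outcomes = list(product(range(1, 7), repeat=2))
favorable = sum(1 for red, blue in outcomes if red == 6 or blue == 6)

assert p_a_or_b == Fraction(favorable, len(outcomes)) == Fraction(11, 36)
```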
5135
04:22:21,160 --> 04:22:23,160
One other rule that's going to be quite helpful
5136
04:22:23,160 --> 04:22:25,400
is a rule called marginalization.
5137
04:22:25,400 --> 04:22:27,520
So marginalization is answering the question
5138
04:22:27,520 --> 04:22:31,760
of how do I figure out the probability of A using some other variable
5139
04:22:31,760 --> 04:22:33,600
that I might have access to, like B?
5140
04:22:33,600 --> 04:22:35,840
Even if I don't know additional information about it,
5141
04:22:35,840 --> 04:22:40,320
I know that B, some event, can have two possible states, either B
5142
04:22:40,320 --> 04:22:44,720
happens or B doesn't happen, assuming it's a Boolean, true or false.
5143
04:22:44,720 --> 04:22:47,160
And well, what that means is that for me to be
5144
04:22:47,160 --> 04:22:50,760
able to calculate the probability of A, there are only two cases.
5145
04:22:50,760 --> 04:22:55,560
Either A happens and B happens, or A happens and B doesn't happen.
5146
04:22:55,560 --> 04:22:58,840
And those are two disjoint cases, meaning they can't both happen together.
5147
04:22:58,840 --> 04:23:01,160
Either B happens or B doesn't happen.
5148
04:23:01,160 --> 04:23:03,280
They're disjoint or separate cases.
5149
04:23:03,280 --> 04:23:05,680
And so I can figure out the probability of A
5150
04:23:05,680 --> 04:23:07,800
just by adding up those two cases.
5151
04:23:07,800 --> 04:23:13,360
The probability that A is true is the probability that A and B is true,
5152
04:23:13,360 --> 04:23:16,520
plus the probability that A is true and B isn't true.
5153
04:23:16,520 --> 04:23:19,880
So by marginalizing, I've looked at the two possible cases
5154
04:23:19,880 --> 04:23:23,600
that might take place, either B happens or B doesn't happen.
5155
04:23:23,600 --> 04:23:25,560
And in either of those cases, I look at what's
5156
04:23:25,560 --> 04:23:27,240
the probability that A happens.
5157
04:23:27,240 --> 04:23:30,080
And if I add those together, well, then I get the probability
5158
04:23:30,080 --> 04:23:32,360
that A happens as a whole.
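The boolean form of marginalization is just a sum of two disjoint cases; a minimal sketch with made-up joint probabilities (the numbers here are hypothetical, chosen only to illustrate the rule):

```python
# Hypothetical joint probabilities (made-up numbers for illustration).
p_a_and_b = 0.25        # P(A, B)
p_a_and_not_b = 0.125   # P(A, not B)

# Marginalization: the two cases are disjoint -- either B happens or it
# doesn't -- so P(A) is just their sum.
p_a = p_a_and_b + p_a_and_not_b

print(p_a)  # 0.375
```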
5159
04:23:32,360 --> 04:23:33,640
So take a look at that rule.
5160
04:23:33,640 --> 04:23:36,760
It doesn't matter what B is or how it's related to A.
5161
04:23:36,760 --> 04:23:39,200
So long as I know these joint distributions,
5162
04:23:39,200 --> 04:23:42,120
I can figure out the overall probability of A.
5163
04:23:42,120 --> 04:23:44,760
And this can be a useful way if I have a joint distribution,
5164
04:23:44,760 --> 04:23:48,200
like the joint distribution of A and B, to just figure out
5165
04:23:48,200 --> 04:23:51,320
some unconditional probability, like the probability of A.
5166
04:23:51,320 --> 04:23:54,160
And we'll see examples of this soon as well.
5167
04:23:54,160 --> 04:23:55,920
Now, sometimes these might not just be random,
5168
04:23:55,920 --> 04:23:58,680
might not just be variables that are events that are like they happened
5169
04:23:58,680 --> 04:24:00,800
or they didn't happen, like B is here.
5170
04:24:00,800 --> 04:24:03,320
They might be some broader probability distribution
5171
04:24:03,320 --> 04:24:05,520
where there are multiple possible values.
5172
04:24:05,520 --> 04:24:08,360
And so here, in order to use this marginalization rule,
5173
04:24:08,360 --> 04:24:11,720
I need to sum up not just over B and not B,
5174
04:24:11,720 --> 04:24:15,760
but for all of the possible values that the other random variable could take
5175
04:24:15,760 --> 04:24:16,320
on.
5176
04:24:16,320 --> 04:24:19,000
And so here, we'll see a version of this rule for random variables.
5177
04:24:19,000 --> 04:24:21,280
And it's going to include that summation notation
5178
04:24:21,280 --> 04:24:25,800
to indicate that I'm summing up, adding up a whole bunch of individual values.
5179
04:24:25,800 --> 04:24:26,800
So here's the rule.
5180
04:24:26,800 --> 04:24:28,760
Looks a lot more complicated, but it's actually
5181
04:24:28,760 --> 04:24:30,960
exactly the same rule.
5182
04:24:30,960 --> 04:24:35,120
What I'm saying here is that if I have two random variables, one called x
5183
04:24:35,120 --> 04:24:41,000
and one called y, well, the probability that x is equal to some value x sub i,
5184
04:24:41,000 --> 04:24:43,800
this is just some value that this variable takes on.
5185
04:24:43,800 --> 04:24:45,120
How do I figure it out?
5186
04:24:45,120 --> 04:24:48,720
Well, I'm going to sum up over j, where j is going
5187
04:24:48,720 --> 04:24:53,000
to range over all of the possible values that y can take on.
5188
04:24:53,000 --> 04:24:58,240
Well, let's look at the probability that x equals xi and y equals yj.
5189
04:24:58,240 --> 04:25:00,240
So the exact same rule, the only difference here
5190
04:25:00,240 --> 04:25:03,000
is now I'm summing up over all of the possible values
5191
04:25:03,000 --> 04:25:06,960
that y can take on, saying let's add up all of those possible cases
5192
04:25:06,960 --> 04:25:10,760
and look at this joint distribution, this joint probability,
5193
04:25:10,760 --> 04:25:15,640
that x takes on the value I care about, given all of the possible values for y.
5194
04:25:15,640 --> 04:25:18,560
And if I add all those up, then I can get
5195
04:25:18,560 --> 04:25:22,360
this unconditional probability of what x is equal to,
5196
04:25:22,360 --> 04:25:26,080
the probability that x is equal to some value x sub i.
5197
04:25:26,080 --> 04:25:27,880
So let's take a look at this rule, because it
5198
04:25:27,880 --> 04:25:29,000
does look a little bit complicated.
5199
04:25:29,000 --> 04:25:31,280
Let's try and put a concrete example to it.
5200
04:25:31,280 --> 04:25:34,080
Here again is that same joint distribution from before.
5201
04:25:34,080 --> 04:25:38,120
I have cloudy, not cloudy, rainy, not rainy.
5202
04:25:38,120 --> 04:25:40,480
And maybe I want to access some variable.
5203
04:25:40,480 --> 04:25:44,520
I want to know what is the probability that it is cloudy.
5204
04:25:44,520 --> 04:25:48,120
Well, marginalization says that if I have this joint distribution
5205
04:25:48,120 --> 04:25:51,600
and I want to know what is the probability that it is cloudy,
5206
04:25:51,600 --> 04:25:55,320
well, I need to consider the other variable, the variable that's not here,
5207
04:25:55,320 --> 04:25:56,720
the idea that it's rainy.
5208
04:25:56,720 --> 04:26:00,440
And I consider the two cases, either it's raining or it's not raining.
5209
04:26:00,440 --> 04:26:04,000
And I just sum up the values for each of those possibilities.
5210
04:26:04,000 --> 04:26:07,040
In other words, the probability that it is cloudy
5211
04:26:07,040 --> 04:26:12,320
is equal to the sum of the probability that it's cloudy and it's rainy
5212
04:26:12,320 --> 04:26:17,720
and the probability that it's cloudy and it is not raining.
5213
04:26:17,720 --> 04:26:20,080
And so these now are values that I have access to.
5214
04:26:20,080 --> 04:26:24,480
These are values that are just inside of this joint probability table.
5215
04:26:24,480 --> 04:26:27,600
What is the probability that it is both cloudy and rainy?
5216
04:26:27,600 --> 04:26:31,000
Well, it's just the intersection of these two here, which is 0.08.
5217
04:26:31,000 --> 04:26:34,240
And the probability that it's cloudy and not raining is, all right,
5218
04:26:34,240 --> 04:26:36,120
here's cloudy, here's not raining.
5219
04:26:36,120 --> 04:26:37,640
It's 0.32.
5220
04:26:37,640 --> 04:26:42,240
So it's 0.08 plus 0.32, which just gives us equal to 0.4.
5221
04:26:42,240 --> 04:26:46,560
That is the unconditional probability that it is, in fact, cloudy.
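The cloudy calculation can be written as a loop over every value of the other variable, matching the summation form of the rule. The 0.08, 0.02, and 0.32 entries come from the table above; the not-cloudy/not-raining entry, 0.58, isn't quoted in this passage and is simply whatever makes the table sum to 1:

```python
# Joint distribution over (cloudy?, raining?).
joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,  # implied by the table summing to 1
}

def marginalize(joint, value):
    """P(X = value): sum the joint probability over every value of the other variable."""
    return sum(p for (x, _), p in joint.items() if x == value)

p_cloudy = marginalize(joint, "cloudy")
print(round(p_cloudy, 2))  # 0.4
```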
5222
04:26:46,560 --> 04:26:50,800
And so marginalization gives us a way to go from these joint distributions
5223
04:26:50,800 --> 04:26:53,960
to just some individual probability that I might care about.
5224
04:26:53,960 --> 04:26:56,680
And you'll see a little bit later why it is that we care about that
5225
04:26:56,680 --> 04:26:59,280
and why that's actually useful to us as we begin
5226
04:26:59,280 --> 04:27:01,860
doing some of these calculations.
5227
04:27:01,860 --> 04:27:04,020
Last rule we'll take a look at before transitioning
5228
04:27:04,020 --> 04:27:06,840
to something a little bit different is this rule of conditioning,
5229
04:27:06,840 --> 04:27:09,760
very similar to the marginalization rule.
5230
04:27:09,760 --> 04:27:12,240
But it says that, again, if I have two events, a and b,
5231
04:27:12,240 --> 04:27:15,440
but instead of having access to their joint probabilities,
5232
04:27:15,440 --> 04:27:17,820
I have access to their conditional probabilities,
5233
04:27:17,820 --> 04:27:19,520
how they relate to each other.
5234
04:27:19,520 --> 04:27:22,960
Well, again, if I want to know the probability that a happens,
5235
04:27:22,960 --> 04:27:26,480
and I know that there's some other variable b, either b happens or b
5236
04:27:26,480 --> 04:27:30,320
doesn't happen, and so I can say that the probability of a
5237
04:27:30,320 --> 04:27:35,840
is the probability of a given b times the probability of b, meaning b happened.
5238
04:27:35,840 --> 04:27:39,080
And given that I know b happened, what's the likelihood that a happened?
5239
04:27:39,080 --> 04:27:42,200
And then I consider the other case, that b didn't happen.
5240
04:27:42,200 --> 04:27:44,960
So here's the probability that b didn't happen.
5241
04:27:44,960 --> 04:27:47,160
And here's the probability that a happens,
5242
04:27:47,160 --> 04:27:49,520
given that I know that b didn't happen.
5243
04:27:49,520 --> 04:27:51,580
And this is really the equivalent rule just
5244
04:27:51,580 --> 04:27:55,280
using conditional probability instead of joint probability,
5245
04:27:55,280 --> 04:27:59,440
where I'm saying let's look at both of these two cases and condition on b.
5246
04:27:59,440 --> 04:28:03,120
Look at the case where b happens, and look at the case where b doesn't happen,
5247
04:28:03,120 --> 04:28:06,200
and look at what probabilities I get as a result.
5248
04:28:06,200 --> 04:28:08,320
And just as in the case of marginalization,
5249
04:28:08,320 --> 04:28:10,520
where there was an equivalent rule for random variables
5250
04:28:10,520 --> 04:28:14,480
that could take on multiple possible values in a domain of possible values,
5251
04:28:14,480 --> 04:28:17,160
here, too, conditioning has the same equivalent rule.
5252
04:28:17,160 --> 04:28:19,640
Again, there's a summation to mean I'm summing over
5253
04:28:19,640 --> 04:28:23,720
all of the possible values that some random variable y could take on.
5254
04:28:23,720 --> 04:28:27,760
But if I want to know what is the probability that x takes on this value,
5255
04:28:27,760 --> 04:28:31,840
then I'm going to sum up over all the values j that y could take on,
5256
04:28:31,840 --> 04:28:35,800
and say, all right, what's the chance that y takes on that value yj?
5257
04:28:35,800 --> 04:28:38,360
And multiply it by the conditional probability
5258
04:28:38,360 --> 04:28:42,840
that x takes on this value, given that y took on that value yj.
5259
04:28:42,840 --> 04:28:46,120
So equivalent rule just using conditional probabilities
5260
04:28:46,120 --> 04:28:47,760
instead of joint probabilities.
5261
04:28:47,760 --> 04:28:50,400
And using the equation we know about joint probabilities,
5262
04:28:50,400 --> 04:28:53,440
we can translate between these two.
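The conditioning rule has the same shape as marginalization, but sums conditional probabilities weighted by the probability of each case. A sketch with made-up numbers (all values here are hypothetical, chosen only to show the computation):

```python
# Hypothetical distributions (made-up numbers, just to illustrate the rule).
p_y = {"rain": 0.1, "no rain": 0.9}            # P(Y = y_j)
p_x_given_y = {"rain": 0.8, "no rain": 0.2}    # P(X = cloudy | Y = y_j)

# Conditioning: P(X = cloudy) = sum over j of P(X = cloudy | Y = y_j) * P(Y = y_j).
p_cloudy = sum(p_x_given_y[y] * p_y[y] for y in p_y)

print(round(p_cloudy, 2))  # 0.26
```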
5263
04:28:53,440 --> 04:28:55,440
So all right, we've seen a whole lot of mathematics,
5264
04:28:55,440 --> 04:28:57,760
and we've just laid the foundation for mathematics.
5265
04:28:57,760 --> 04:29:00,840
And no need to worry if you haven't seen probability in too much detail
5266
04:29:00,840 --> 04:29:02,000
up until this point.
5267
04:29:02,000 --> 04:29:05,080
These are the foundations of the ideas that are going to come up
5268
04:29:05,080 --> 04:29:09,600
as we begin to explore how we can now take these ideas from probability
5269
04:29:09,600 --> 04:29:12,720
and begin to apply them to represent something inside of our computer,
5270
04:29:12,720 --> 04:29:16,160
something inside of the AI agent we're trying to design that
5271
04:29:16,160 --> 04:29:18,920
is able to represent information and probabilities
5272
04:29:18,920 --> 04:29:22,240
and the likelihoods between various different events.
5273
04:29:22,240 --> 04:29:24,640
So there are a number of different probabilistic models
5274
04:29:24,640 --> 04:29:26,840
that we can generate, but the first of the models
5275
04:29:26,840 --> 04:29:30,160
we're going to talk about are what are known as Bayesian networks.
5276
04:29:30,160 --> 04:29:34,000
And a Bayesian network is just going to be some network of random variables,
5277
04:29:34,000 --> 04:29:37,160
connected to one another in a way that represents
5278
04:29:37,160 --> 04:29:39,880
the dependence between these random variables.
5279
04:29:39,880 --> 04:29:43,080
The odds are most random variables in this world
5280
04:29:43,080 --> 04:29:45,200
are not independent from each other, but there's
5281
04:29:45,200 --> 04:29:48,360
some relationship between things that are happening that we care about.
5282
04:29:48,360 --> 04:29:51,840
If it is rainy today, that might increase the likelihood
5283
04:29:51,840 --> 04:29:54,400
that my flight or my train gets delayed, for example.
5284
04:29:54,400 --> 04:29:57,240
There are some dependence between these random variables,
5285
04:29:57,240 --> 04:30:01,960
and a Bayesian network is going to be able to capture those dependencies.
5286
04:30:01,960 --> 04:30:03,280
So what is a Bayesian network?
5287
04:30:03,280 --> 04:30:06,040
What is its actual structure, and how does it work?
5288
04:30:06,040 --> 04:30:08,760
Well, a Bayesian network is going to be a directed graph.
5289
04:30:08,760 --> 04:30:10,800
And again, we've seen directed graphs before.
5290
04:30:10,800 --> 04:30:13,800
They are individual nodes with arrows or edges
5291
04:30:13,800 --> 04:30:18,520
that connect one node to another node pointing in a particular direction.
5292
04:30:18,520 --> 04:30:20,600
And so this directed graph is going to have nodes
5293
04:30:20,600 --> 04:30:23,480
as well, where each node in this directed graph
5294
04:30:23,480 --> 04:30:27,040
is going to represent a random variable, something like the weather,
5295
04:30:27,040 --> 04:30:30,880
or something like whether my train was on time or delayed.
5296
04:30:30,880 --> 04:30:34,440
And we're going to have an arrow from a node x to a node y
5297
04:30:34,440 --> 04:30:37,080
to mean that x is a parent of y.
5298
04:30:37,080 --> 04:30:38,200
So that'll be our notation.
5299
04:30:38,200 --> 04:30:42,600
If there's an arrow from x to y, x is going to be considered a parent of y.
5300
04:30:42,600 --> 04:30:46,000
And the reason that's important is because each of these nodes
5301
04:30:46,000 --> 04:30:48,840
is going to have a probability distribution that we're
5302
04:30:48,840 --> 04:30:52,280
going to store along with it, which is the distribution of x
5303
04:30:52,280 --> 04:30:56,160
given some evidence, given the parents of x.
5304
04:30:56,160 --> 04:30:58,120
So the way to more intuitively think about this
5305
04:30:58,120 --> 04:31:01,880
is that the parents can be thought of as causes for some effect
5306
04:31:01,880 --> 04:31:04,240
that we're going to observe.
5307
04:31:04,240 --> 04:31:07,400
And so let's take a look at an actual example of a Bayesian network
5308
04:31:07,400 --> 04:31:09,880
and think about the types of logic that might be involved
5309
04:31:09,880 --> 04:31:11,680
in reasoning about that network.
5310
04:31:11,680 --> 04:31:15,200
Let's imagine for a moment that I have an appointment out of town,
5311
04:31:15,200 --> 04:31:18,200
and I need to take a train in order to get to that appointment.
5312
04:31:18,200 --> 04:31:19,960
So what are the things I might care about?
5313
04:31:19,960 --> 04:31:22,240
Well, I care about getting to my appointment on time.
5314
04:31:22,240 --> 04:31:24,720
Whether I make it to my appointment and I'm able to attend it
5315
04:31:24,720 --> 04:31:26,360
or I miss the appointment.
5316
04:31:26,360 --> 04:31:29,120
And you might imagine that that's influenced by the train,
5317
04:31:29,120 --> 04:31:33,680
that the train is either on time or it's delayed, for example.
5318
04:31:33,680 --> 04:31:36,000
But that train itself is also influenced.
5319
04:31:36,000 --> 04:31:39,680
Whether the train is on time or not depends maybe on the rain.
5320
04:31:39,680 --> 04:31:40,520
Is there no rain?
5321
04:31:40,520 --> 04:31:41,180
Is it light rain?
5322
04:31:41,180 --> 04:31:42,480
Is there heavy rain?
5323
04:31:42,480 --> 04:31:44,720
And it might also be influenced by other variables too.
5324
04:31:44,720 --> 04:31:47,000
It might be influenced as well by whether or not
5325
04:31:47,000 --> 04:31:49,200
there's maintenance on the train track, for example.
5326
04:31:49,200 --> 04:31:51,080
If there is maintenance on the train track,
5327
04:31:51,080 --> 04:31:55,480
that probably increases the likelihood that my train is delayed.
5328
04:31:55,480 --> 04:31:57,640
And so we can represent all of these ideas
5329
04:31:57,640 --> 04:32:01,000
using a Bayesian network that looks a little something like this.
5330
04:32:01,000 --> 04:32:05,080
Here I have four nodes representing four random variables
5331
04:32:05,080 --> 04:32:06,600
that I would like to keep track of.
5332
04:32:06,600 --> 04:32:08,800
I have one random variable called rain that
5333
04:32:08,800 --> 04:32:12,840
can take on three possible values in its domain, either none or light
5334
04:32:12,840 --> 04:32:16,160
or heavy, for no rain, light rain, or heavy rain.
5335
04:32:16,160 --> 04:32:18,280
I have a variable called maintenance for whether or not
5336
04:32:18,280 --> 04:32:20,240
there is maintenance on the train track, which
5337
04:32:20,240 --> 04:32:22,600
it has two possible values, just either yes or no.
5338
04:32:22,600 --> 04:32:26,160
Either there is maintenance or there's no maintenance happening on the track.
5339
04:32:26,160 --> 04:32:28,840
Then I have a random variable for the train indicating whether or not
5340
04:32:28,840 --> 04:32:30,120
the train was on time.
5341
04:32:30,120 --> 04:32:33,480
That random variable has two possible values in its domain.
5342
04:32:33,480 --> 04:32:37,360
The train is either on time or the train is delayed.
5343
04:32:37,360 --> 04:32:39,480
And then finally, I have a random variable
5344
04:32:39,480 --> 04:32:41,120
for whether I make it to my appointment.
5345
04:32:41,120 --> 04:32:43,600
For my appointment down here, I have a random variable
5346
04:32:43,600 --> 04:32:49,120
called appointment that itself has two possible values, attend and miss.
5347
04:32:49,120 --> 04:32:50,560
And so here are the possible values.
5348
04:32:50,560 --> 04:32:54,040
Here are my four nodes, each of which represents a random variable, each
5349
04:32:54,040 --> 04:32:58,120
of which has a domain of possible values that it can take on.
5350
04:32:58,120 --> 04:33:01,600
And the arrows, the edges pointing from one node to another,
5351
04:33:01,600 --> 04:33:05,880
encode some notion of dependence inside of this graph,
5352
04:33:05,880 --> 04:33:08,440
that whether I make it to my appointment or not
5353
04:33:08,440 --> 04:33:12,200
is dependent upon whether the train is on time or delayed.
5354
04:33:12,200 --> 04:33:14,320
And whether the train is on time or delayed
5355
04:33:14,320 --> 04:33:18,520
is dependent on two things given by the two arrows pointing at this node.
5356
04:33:18,520 --> 04:33:22,000
It is dependent on whether or not there was maintenance on the train track.
5357
04:33:22,000 --> 04:33:25,640
And it is also dependent upon whether or not
5358
04:33:25,640 --> 04:33:27,360
it is raining.
5359
04:33:27,360 --> 04:33:29,360
And just to make things a little complicated,
5360
04:33:29,360 --> 04:33:32,920
let's say as well that whether or not there is maintenance on the track,
5361
04:33:32,920 --> 04:33:34,920
this too might be influenced by the rain.
5362
04:33:34,920 --> 04:33:37,360
That if there's heavier rain, well, maybe it's
5363
04:33:37,360 --> 04:33:40,320
less likely that there's going to be maintenance on the train track that day
5364
04:33:40,320 --> 04:33:43,360
because they're more likely to want to do maintenance on the track on days
5365
04:33:43,360 --> 04:33:45,000
when it's not raining, for example.
5366
04:33:45,000 --> 04:33:47,920
And so these nodes might have different relationships between them.
5367
04:33:47,920 --> 04:33:51,360
But the idea is that we can come up with a probability distribution
5368
04:33:51,360 --> 04:33:56,000
for each of these nodes based only upon its parents.
5369
04:33:56,000 --> 04:33:59,760
And so let's look node by node at what this probability distribution might
5370
04:33:59,760 --> 04:34:00,480
actually look like.
5371
04:34:00,480 --> 04:34:03,600
And we'll go ahead and begin with this root node, this rain node here,
5372
04:34:03,600 --> 04:34:07,440
which is at the top, and has no arrows pointing into it, which
5373
04:34:07,440 --> 04:34:10,160
means its probability distribution is not
5374
04:34:10,160 --> 04:34:11,920
going to be a conditional distribution.
5375
04:34:11,920 --> 04:34:13,520
It's not based on anything.
5376
04:34:13,520 --> 04:34:17,920
I just have some probability distribution over the possible values
5377
04:34:17,920 --> 04:34:20,280
for the rain random variable.
5378
04:34:20,280 --> 04:34:23,200
And that distribution might look a little something like this.
5379
04:34:23,200 --> 04:34:25,800
None, light, and heavy each have a probability.
5380
04:34:25,800 --> 04:34:31,120
Here I'm saying the likelihood of no rain is 0.7, of light rain is 0.2,
5381
04:34:31,120 --> 04:34:33,880
of heavy rain is 0.1, for example.
5382
04:34:33,880 --> 04:34:38,080
So here is a probability distribution for this root node in this Bayesian
5383
04:34:38,080 --> 04:34:39,360
network.
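The root node's distribution can be written down directly, using the values just given. Since rain has no parents, this is an unconditional distribution, and the only constraint is that it sums to 1:

```python
# Unconditional distribution for the root node Rain (values from the lecture).
rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# A root node has no parents, so this is not a conditional distribution;
# it must simply sum to 1 across its domain.
assert abs(sum(rain.values()) - 1.0) < 1e-9
```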
5384
04:34:39,360 --> 04:34:42,640
And let's now consider the next node in the network, maintenance.
5385
04:34:42,640 --> 04:34:44,680
Track maintenance is yes or no.
5386
04:34:44,680 --> 04:34:47,960
And the general idea of what this distribution is going to encode,
5387
04:34:47,960 --> 04:34:52,120
at least in this story, is the idea that the heavier the rain is,
5388
04:34:52,120 --> 04:34:55,240
the less likely it is that there's going to be maintenance on the track.
5389
04:34:55,240 --> 04:34:57,620
Because the people that are doing maintenance on the track probably
5390
04:34:57,620 --> 04:35:00,480
want to wait until a day when it's not as rainy in order
5391
04:35:00,480 --> 04:35:02,520
to do the track maintenance, for example.
5392
04:35:02,520 --> 04:35:05,120
And so what might that probability distribution look like?
5393
04:35:05,120 --> 04:35:08,720
Well, this now is going to be a conditional probability distribution,
5394
04:35:08,720 --> 04:35:12,400
that here are the three possible values for the rain random variable, which
5395
04:35:12,400 --> 04:35:15,680
I'm here just going to abbreviate to R, either no rain, light rain,
5396
04:35:15,680 --> 04:35:17,080
or heavy rain.
5397
04:35:17,080 --> 04:35:19,640
And for each of those possible values, either there
5398
04:35:19,640 --> 04:35:22,820
is yes track maintenance or no track maintenance.
5399
04:35:22,820 --> 04:35:25,760
And those have probabilities associated with them.
5400
04:35:25,760 --> 04:35:30,280
I see here that if it is not raining,
5401
04:35:30,280 --> 04:35:33,280
then there is a probability of 0.4 that there's track maintenance
5402
04:35:33,280 --> 04:35:36,000
and a probability of 0.6 that there isn't.
5403
04:35:36,000 --> 04:35:38,840
But if there's heavy rain, then here the chance
5404
04:35:38,840 --> 04:35:41,640
that there is track maintenance is 0.1 and the chance
5405
04:35:41,640 --> 04:35:44,200
that there is not track maintenance is 0.9.
5406
04:35:44,200 --> 04:35:47,160
Each of these rows is going to sum up to 1.
5407
04:35:47,160 --> 04:35:49,640
Because each of these represent different values
5408
04:35:49,640 --> 04:35:52,360
of whether or not it's raining, the three possible values
5409
04:35:52,360 --> 04:35:54,320
that that random variable can take on.
5410
04:35:54,320 --> 04:35:57,800
And each is associated with its own probability distribution
5411
04:35:57,800 --> 04:36:02,080
that is ultimately all going to add up to the number 1.
5412
04:36:02,080 --> 04:36:05,920
So that there is our distribution for this random variable called maintenance,
5413
04:36:05,920 --> 04:36:09,720
about whether or not there is maintenance on the train track.
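A conditional distribution like this one can be stored as a table keyed by the parent's value. The rows for no rain and heavy rain use the numbers quoted above; the light-rain row isn't given in this passage, so the 0.2/0.8 values below are placeholders I made up:

```python
# Conditional distribution P(Maintenance | Rain). The "none" and "heavy"
# rows come from the lecture; the "light" row is an assumed placeholder.
maintenance = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # assumed values, not quoted above
    "heavy": {"yes": 0.1, "no": 0.9},
}

# Each row conditions on one value of Rain, so each row must sum to 1 on its own.
for rain_value, row in maintenance.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```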
5414
04:36:09,720 --> 04:36:11,680
And now let's consider the next variable.
5415
04:36:11,680 --> 04:36:15,040
Here we have a node inside of our Bayesian network called train
5416
04:36:15,040 --> 04:36:18,080
that has two possible values, on time and delayed.
5417
04:36:18,080 --> 04:36:21,800
And this node is going to be dependent upon the two nodes that
5418
04:36:21,800 --> 04:36:23,800
are pointing towards it, that whether or not
5419
04:36:23,800 --> 04:36:27,200
the train is on time or delayed depends on whether or not
5420
04:36:27,200 --> 04:36:28,520
there is track maintenance.
5421
04:36:28,520 --> 04:36:30,480
And it depends on whether or not there is rain,
5422
04:36:30,480 --> 04:36:35,160
that heavier rain probably means more likely that my train is delayed.
5423
04:36:35,160 --> 04:36:38,200
And if there is track maintenance, that also probably
5424
04:36:38,200 --> 04:36:41,880
means it's more likely that my train is delayed as well.
5425
04:36:41,880 --> 04:36:45,000
And so you could construct a larger probability distribution,
5426
04:36:45,000 --> 04:36:47,760
a conditional probability distribution, that instead
5427
04:36:47,760 --> 04:36:51,160
of conditioning on just one variable, as was the case here,
5428
04:36:51,160 --> 04:36:54,000
is now conditioning on two variables, conditioning
5429
04:36:54,000 --> 04:36:58,920
both on rain, represented by r, and on maintenance, represented by m.
5430
04:36:58,920 --> 04:37:02,680
Again, each of these rows has two values that sum up to the number 1,
5431
04:37:02,680 --> 04:37:06,920
one for whether the train is on time, one for whether the train is delayed.
5432
04:37:06,920 --> 04:37:08,880
And here I can say something like, all right,
5433
04:37:08,880 --> 04:37:12,600
if I know there was light rain and track maintenance, well, OK,
5434
04:37:12,600 --> 04:37:16,120
that would be r is light and m is yes.
5435
04:37:16,120 --> 04:37:19,840
Well, then there is a probability of 0.6 that my train is on time,
5436
04:37:19,840 --> 04:37:23,200
and a probability of 0.4 the train is delayed.
5437
04:37:23,200 --> 04:37:25,480
And you can imagine gathering this data just
5438
04:37:25,480 --> 04:37:28,960
by looking at real world data, looking at data about, all right,
5439
04:37:28,960 --> 04:37:31,800
if I knew that it was light rain and there was track maintenance,
5440
04:37:31,800 --> 04:37:33,880
how often was a train delayed or not delayed?
5441
04:37:33,880 --> 04:37:35,680
And you could begin to construct this thing.
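That kind of two-parent conditional probability table can be sketched as a dictionary keyed by (rain, maintenance) pairs. The (light, yes) and (none, yes) rows use values stated in the lecture; the (light, no) row is an assumed placeholder:

```python
# P(Train | Rain, Maintenance): a CPT with two parents, keyed by
# (rain, maintenance) pairs. Only three of the six rows are shown.
p_train = {
    ("none",  "yes"): {"on time": 0.8, "delayed": 0.2},  # from the lecture
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # from the lecture
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed placeholder
}

# If I know there was light rain and track maintenance...
row = p_train[("light", "yes")]
print(row["on time"])  # 0.6
print(row["delayed"])  # 0.4
```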
5442
04:37:35,680 --> 04:37:37,920
The interesting thing is trying to intelligently
5443
04:37:37,920 --> 04:37:40,880
figure out how you might go about ordering these things,
5444
04:37:40,880 --> 04:37:46,320
what things might influence other nodes inside of this Bayesian network.
5445
04:37:46,320 --> 04:37:50,480
And the last thing I care about is whether or not I make it to my appointment.
5446
04:37:50,480 --> 04:37:52,800
So did I attend or miss the appointment?
5447
04:37:52,800 --> 04:37:55,760
And ultimately, whether I attend or miss the appointment,
5448
04:37:55,760 --> 04:37:59,520
it is influenced by track maintenance, because it's indirectly this idea that,
5449
04:37:59,520 --> 04:38:01,240
all right, if there is track maintenance,
5450
04:38:01,240 --> 04:38:02,940
well, then my train might more likely be delayed.
5451
04:38:02,940 --> 04:38:04,740
And if my train is more likely to be delayed,
5452
04:38:04,740 --> 04:38:06,880
then I'm more likely to miss my appointment.
5453
04:38:06,880 --> 04:38:09,240
But what we encode in this Bayesian network
5454
04:38:09,240 --> 04:38:12,440
are just what we might consider to be more direct relationships.
5455
04:38:12,440 --> 04:38:15,300
So the train has a direct influence on the appointment.
5456
04:38:15,300 --> 04:38:18,300
And given that I know whether the train is on time or delayed,
5457
04:38:18,300 --> 04:38:20,440
knowing whether there's track maintenance isn't
5458
04:38:20,440 --> 04:38:24,080
going to give me any additional information that I didn't already have.
5459
04:38:24,080 --> 04:38:27,680
That if I know train, these other nodes that are up above
5460
04:38:27,680 --> 04:38:30,840
aren't really going to influence the result.
5461
04:38:30,840 --> 04:38:34,500
And so here we might represent it using another conditional probability
5462
04:38:34,500 --> 04:38:36,900
distribution that looks a little something like this.
5463
04:38:36,900 --> 04:38:39,780
The train can take on two possible values.
5464
04:38:39,780 --> 04:38:42,360
Either my train is on time or my train is delayed.
5465
04:38:42,360 --> 04:38:44,120
And for each of those two possible values,
5466
04:38:44,120 --> 04:38:46,840
I have a distribution for what are the odds that I'm
5467
04:38:46,840 --> 04:38:49,720
able to attend the meeting and what are the odds that I missed the meeting.
5468
04:38:49,720 --> 04:38:51,640
And obviously, if my train is on time, I'm
5469
04:38:51,640 --> 04:38:53,760
much more likely to be able to attend the meeting
5470
04:38:53,760 --> 04:38:57,760
than if my train is delayed, in which case I'm more likely to miss that
5471
04:38:57,760 --> 04:38:59,000
meeting.
5472
04:38:59,000 --> 04:39:03,360
So all of these nodes put all together here represent this Bayesian network,
5473
04:39:03,360 --> 04:39:07,120
this network of random variables whose values I ultimately care about,
5474
04:39:07,120 --> 04:39:09,920
and that have some sort of relationship between them,
5475
04:39:09,920 --> 04:39:13,320
some sort of dependence where these arrows from one node to another
5476
04:39:13,320 --> 04:39:15,360
indicate some dependence, that I can calculate
5477
04:39:15,360 --> 04:39:21,400
the probability of some node given the parents that happen to exist there.
5478
04:39:21,400 --> 04:39:24,540
So now that we've been able to describe the structure of this Bayesian
5479
04:39:24,540 --> 04:39:27,320
network and the relationships between each of these nodes
5480
04:39:27,320 --> 04:39:30,720
by associating each of the nodes in the network with a probability
5481
04:39:30,720 --> 04:39:34,480
distribution, whether that's an unconditional probability distribution
5482
04:39:34,480 --> 04:39:36,720
in the case of this root node here, like rain,
5483
04:39:36,720 --> 04:39:39,560
or a conditional probability distribution in the case
5484
04:39:39,560 --> 04:39:42,000
of all of the other nodes whose probabilities are
5485
04:39:42,000 --> 04:39:44,560
dependent upon the values of their parents,
5486
04:39:44,560 --> 04:39:47,800
we can begin to do some computation and calculation using
5487
04:39:47,800 --> 04:39:50,120
the information inside of that table.
5488
04:39:50,120 --> 04:39:51,960
So let's imagine, for example, that I just
5489
04:39:51,960 --> 04:39:55,560
wanted to compute something simple like the probability of light rain.
5490
04:39:55,560 --> 04:39:57,760
How would I get the probability of light rain?
5491
04:39:57,760 --> 04:40:01,000
Well, light rain, rain here is a root node.
5492
04:40:01,000 --> 04:40:03,400
And so if I wanted to calculate that probability,
5493
04:40:03,400 --> 04:40:06,360
I could just look at the probability distribution for rain
5494
04:40:06,360 --> 04:40:10,680
and extract from it the probability of light rain, just a single value
5495
04:40:10,680 --> 04:40:12,840
that I already have access to.
5496
04:40:12,840 --> 04:40:16,160
But we could also imagine wanting to compute more complex joint
5497
04:40:16,160 --> 04:40:21,200
probabilities, like the probability that there is light rain and also
5498
04:40:21,200 --> 04:40:22,240
no track maintenance.
5499
04:40:22,240 --> 04:40:27,080
This is a joint probability of two values, light rain and no track
5500
04:40:27,080 --> 04:40:27,960
maintenance.
5501
04:40:27,960 --> 04:40:30,960
And the way I might do that is first by starting by saying, all right,
5502
04:40:30,960 --> 04:40:33,400
well, let me get the probability of light rain.
5503
04:40:33,400 --> 04:40:36,800
But now I also want the probability of no track maintenance.
5504
04:40:36,800 --> 04:40:41,360
But of course, this node is dependent upon the value of rain.
5505
04:40:41,360 --> 04:40:44,560
So what I really want is the probability of no track maintenance,
5506
04:40:44,560 --> 04:40:47,160
given that I know that there was light rain.
5507
04:40:47,160 --> 04:40:51,280
And so the expression for calculating this idea that the probability of light
5508
04:40:51,280 --> 04:40:56,040
rain and no track maintenance is really just the probability of light rain
5509
04:40:56,040 --> 04:40:58,840
and the probability that there is no track maintenance,
5510
04:40:58,840 --> 04:41:01,840
given that I know that there already is light rain.
5511
04:41:01,840 --> 04:41:05,160
So I take the unconditional probability of light rain,
5512
04:41:05,160 --> 04:41:09,800
multiply it by the conditional probability of no track maintenance,
5513
04:41:09,800 --> 04:41:12,320
given that I know there is light rain.
5514
04:41:12,320 --> 04:41:15,400
And you can continue to do this again and again for every variable
5515
04:41:15,400 --> 04:41:18,040
that you want to add into this joint probability
5516
04:41:18,040 --> 04:41:19,320
that I might want to calculate.
5517
04:41:19,320 --> 04:41:23,240
If I wanted to know the probability of light rain and no track maintenance
5518
04:41:23,240 --> 04:41:27,960
and a delayed train, well, that's going to be the probability of light rain,
5519
04:41:27,960 --> 04:41:31,880
multiplied by the probability of no track maintenance, given light rain,
5520
04:41:31,880 --> 04:41:36,400
multiplied by the probability of a delayed train, given light rain
5521
04:41:36,400 --> 04:41:37,400
and no track maintenance.
5522
04:41:37,400 --> 04:41:39,640
Because whether the train is on time or delayed
5523
04:41:39,640 --> 04:41:42,920
is dependent upon both of these other two variables.
5524
04:41:42,920 --> 04:41:45,200
And so I have two pieces of evidence that go
5525
04:41:45,200 --> 04:41:48,480
into the calculation of that conditional probability.
5526
04:41:48,480 --> 04:41:51,120
And each of these three values is just a value
5527
04:41:51,120 --> 04:41:55,280
that I can look up by looking at one of these individual probability
5528
04:41:55,280 --> 04:41:59,760
distributions that is encoded into my Bayesian network.
5529
04:41:59,760 --> 04:42:03,040
And if I wanted a joint probability over all four of the variables,
5530
04:42:03,040 --> 04:42:06,840
something like the probability of light rain and no track maintenance
5531
04:42:06,840 --> 04:42:09,760
and a delayed train and I miss my appointment,
5532
04:42:09,760 --> 04:42:12,520
well, that's going to be multiplying four different values, one
5533
04:42:12,520 --> 04:42:14,520
from each of these individual nodes.
5534
04:42:14,520 --> 04:42:16,600
It's going to be the probability of light rain,
5535
04:42:16,600 --> 04:42:20,600
then of no track maintenance given light rain, then of a delayed train,
5536
04:42:20,600 --> 04:42:22,720
given light rain and no track maintenance.
5537
04:42:22,720 --> 04:42:25,000
And then finally, for this node here, for whether I
5538
04:42:25,000 --> 04:42:26,840
make it to my appointment or not, it's not
5539
04:42:26,840 --> 04:42:29,360
dependent upon these two variables, given
5540
04:42:29,360 --> 04:42:31,880
that I know whether or not the train is on time.
5541
04:42:31,880 --> 04:42:34,680
I only need to care about the conditional probability
5542
04:42:34,680 --> 04:42:37,800
that I miss my train, or that I miss my appointment,
5543
04:42:37,800 --> 04:42:39,880
given that the train happens to be delayed.
5544
04:42:39,880 --> 04:42:43,720
And so that's represented here by four probabilities, each of which
5545
04:42:43,720 --> 04:42:47,040
is located inside of one of these probability distributions
5546
04:42:47,040 --> 04:42:50,760
for each of the nodes, all multiplied together.
5547
04:42:50,760 --> 04:42:52,920
And so I can take a variable like that and figure out
5548
04:42:52,920 --> 04:42:55,520
what the joint probability is by multiplying
5549
04:42:55,520 --> 04:42:59,640
a whole bunch of these individual probabilities from the Bayesian network.
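The chain-rule product just described can be sketched directly. Only the first factor (0.2 for light rain) is stated in the lecture; the other three factors are assumed placeholders, since the full tables aren't reproduced in this passage:

```python
# P(light, no maintenance, delayed, miss) as a product of one factor per
# node, each conditioned only on that node's parents.
p_light = 0.2                    # P(light rain), from the lecture
p_no_maint_given_light = 0.8     # P(no maintenance | light), assumed
p_delayed_given_light_no = 0.3   # P(delayed | light, no maintenance), assumed
p_miss_given_delayed = 0.6       # P(miss | delayed), assumed

joint = (p_light
         * p_no_maint_given_light
         * p_delayed_given_light_no
         * p_miss_given_delayed)
print(joint)  # ≈ 0.0288 with these numbers
```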
5550
04:42:59,640 --> 04:43:02,720
But of course, just as with last time, where what I really wanted to do
5551
04:43:02,720 --> 04:43:05,240
was to be able to get new pieces of information,
5552
04:43:05,240 --> 04:43:08,280
here, too, this is what we're going to want to do with our Bayesian network.
5553
04:43:08,280 --> 04:43:11,360
In the context of knowledge, we talked about the problem of inference.
5554
04:43:11,360 --> 04:43:14,900
Given things that I know to be true, can I draw conclusions,
5555
04:43:14,900 --> 04:43:19,880
make deductions about other facts about the world that I also know to be true?
5556
04:43:19,880 --> 04:43:23,800
And what we're going to do now is apply the same sort of idea to probability.
5557
04:43:23,800 --> 04:43:26,600
Using information about which I have some knowledge,
5558
04:43:26,600 --> 04:43:28,920
whether some evidence or some probabilities,
5559
04:43:28,920 --> 04:43:32,000
can I figure out not other variables for certain,
5560
04:43:32,000 --> 04:43:35,000
but can I figure out the probabilities of other variables
5561
04:43:35,000 --> 04:43:36,800
taking on particular values?
5562
04:43:36,800 --> 04:43:41,240
And so here, we introduce the problem of inference in a probabilistic setting,
5563
04:43:41,240 --> 04:43:44,920
in a case where variables might not necessarily be true for sure,
5564
04:43:44,920 --> 04:43:48,480
but they might be random variables that take on different values
5565
04:43:48,480 --> 04:43:50,160
with some probability.
5566
04:43:50,160 --> 04:43:53,400
So how do we formally define what exactly this inference problem actually
5567
04:43:53,400 --> 04:43:54,120
is?
5568
04:43:54,120 --> 04:43:57,000
Well, the inference problem has a couple of parts to it.
5569
04:43:57,000 --> 04:43:59,780
We have some query, some variable x that we
5570
04:43:59,780 --> 04:44:01,360
want to compute the distribution for.
5571
04:44:01,360 --> 04:44:04,520
Maybe I want the probability that I miss my train,
5572
04:44:04,520 --> 04:44:08,600
or I want the probability that there is track maintenance,
5573
04:44:08,600 --> 04:44:11,200
something that I want information about.
5574
04:44:11,200 --> 04:44:13,200
And then I have some evidence variables.
5575
04:44:13,200 --> 04:44:14,740
Maybe it's just one piece of evidence.
5576
04:44:14,740 --> 04:44:16,400
Maybe it's multiple pieces of evidence.
5577
04:44:16,400 --> 04:44:20,320
But I've observed certain variables for some sort of event.
5578
04:44:20,320 --> 04:44:23,440
So for example, I might have observed that it is raining.
5579
04:44:23,440 --> 04:44:24,600
This is evidence that I have.
5580
04:44:24,600 --> 04:44:27,680
I know that there is light rain, or I know that there is heavy rain.
5581
04:44:27,680 --> 04:44:28,760
And that is evidence I have.
5582
04:44:28,760 --> 04:44:32,400
And using that evidence, I want to know what is the probability
5583
04:44:32,400 --> 04:44:34,960
that my train is delayed, for example.
5584
04:44:34,960 --> 04:44:38,080
And that is a query that I might want to ask based on this evidence.
5585
04:44:38,080 --> 04:44:39,880
So I have a query, some variable.
5586
04:44:39,880 --> 04:44:41,800
Evidence, which are some other variables that I
5587
04:44:41,800 --> 04:44:44,240
have observed inside of my Bayesian network.
5588
04:44:44,240 --> 04:44:46,960
And of course, that does leave some hidden variables.
5589
04:44:46,960 --> 04:44:47,720
Y.
5590
04:44:47,720 --> 04:44:52,160
These are variables that are not evidence variables and not query variables.
5591
04:44:52,160 --> 04:44:55,720
So you might imagine in the case where I know whether or not it's raining,
5592
04:44:55,720 --> 04:44:59,560
and I want to know whether my train is going to be delayed or not,
5593
04:44:59,560 --> 04:45:02,200
the hidden variable, the thing I don't have access to,
5594
04:45:02,200 --> 04:45:04,520
is something like, is there maintenance on the track?
5595
04:45:04,520 --> 04:45:07,000
Or am I going to make or not make my appointment, for example?
5596
04:45:07,000 --> 04:45:09,040
These are variables that I don't have access to.
5597
04:45:09,040 --> 04:45:12,080
They're hidden because they're not things I observed,
5598
04:45:12,080 --> 04:45:14,720
and they're also not the query, the thing that I'm asking.
5599
04:45:14,720 --> 04:45:17,080
And so ultimately, what we want to calculate
5600
04:45:17,080 --> 04:45:21,240
is I want to know the probability distribution of x given
5601
04:45:21,240 --> 04:45:22,600
e, the event that I observed.
5602
04:45:22,600 --> 04:45:25,760
So given that I observed some event, I observed that it is raining,
5603
04:45:25,760 --> 04:45:29,600
I would like to know what is the distribution over the possible values
5604
04:45:29,600 --> 04:45:31,280
of the train random variable.
5605
04:45:31,280 --> 04:45:32,240
Is it on time?
5606
04:45:32,240 --> 04:45:33,080
Is it delayed?
5607
04:45:33,080 --> 04:45:35,400
What's the likelihood it's going to be there?
5608
04:45:35,400 --> 04:45:37,800
And it turns out we can do this calculation just
5609
04:45:37,800 --> 04:45:42,040
using a lot of the probability rules that we've already seen in action.
5610
04:45:42,040 --> 04:45:44,480
And ultimately, we're going to take a look at the math
5611
04:45:44,480 --> 04:45:46,800
at a little bit of a high level, at an abstract level.
5612
04:45:46,800 --> 04:45:49,520
But ultimately, we can allow computers and programming libraries
5613
04:45:49,520 --> 04:45:52,240
that already exist to begin to do some of this math for us.
5614
04:45:52,240 --> 04:45:55,280
But it's good to get a general sense for what's actually happening
5615
04:45:55,280 --> 04:45:57,640
when this inference process takes place.
5616
04:45:57,640 --> 04:46:00,820
Let's imagine, for example, that I want to compute the probability
5617
04:46:00,820 --> 04:46:05,000
distribution of the appointment random variable given some evidence,
5618
04:46:05,000 --> 04:46:07,040
given that I know that there was light rain
5619
04:46:07,040 --> 04:46:08,920
and no track maintenance.
5620
04:46:08,920 --> 04:46:12,440
So there's my evidence, these two variables that I observe the values of.
5621
04:46:12,440 --> 04:46:14,240
I observe the value of rain.
5622
04:46:14,240 --> 04:46:15,560
I know there's light rain.
5623
04:46:15,560 --> 04:46:18,480
And I know that there is no track maintenance going on today.
5624
04:46:18,480 --> 04:46:22,440
And what I care about knowing, my query, is this random variable appointment.
5625
04:46:22,440 --> 04:46:25,800
I want to know the distribution of this random variable appointment,
5626
04:46:25,800 --> 04:46:28,360
like what is the chance that I'm able to attend my appointment?
5627
04:46:28,360 --> 04:46:32,000
What is the chance that I miss my appointment given this evidence?
5628
04:46:32,000 --> 04:46:35,520
And the hidden variable, the information that I don't have access to,
5629
04:46:35,520 --> 04:46:36,800
is this variable train.
5630
04:46:36,800 --> 04:46:38,920
This is information that is not part of the evidence
5631
04:46:38,920 --> 04:46:41,280
that I see, not something that I observe.
5632
04:46:41,280 --> 04:46:44,600
But it is also not the query that I'm asking for.
5633
04:46:44,600 --> 04:46:47,000
And so what might this inference procedure look like?
5634
04:46:47,000 --> 04:46:50,440
Well, if you recall back from when we were defining conditional probability
5635
04:46:50,440 --> 04:46:52,880
and doing math with conditional probabilities,
5636
04:46:52,880 --> 04:46:57,720
we know that a conditional probability is proportional to the joint
5637
04:46:57,720 --> 04:46:58,680
probability.
5638
04:46:58,680 --> 04:47:01,920
And we remembered this by recalling that the probability of A given
5639
04:47:01,920 --> 04:47:06,680
B is just some constant factor alpha multiplied by the probability of A
5640
04:47:06,680 --> 04:47:08,800
and B. That constant factor alpha turns out
5641
04:47:08,800 --> 04:47:10,960
to be like dividing over the probability of B.
5642
04:47:10,960 --> 04:47:14,560
But the important thing is that it's just some constant multiplied
5643
04:47:14,560 --> 04:47:17,080
by the joint distribution, the probability
5644
04:47:17,080 --> 04:47:19,680
that all of these individual things happen.
5645
04:47:19,680 --> 04:47:23,280
So in this case, I can take the probability of the appointment random
5646
04:47:23,280 --> 04:47:27,000
variable given light rain and no track maintenance
5647
04:47:27,000 --> 04:47:30,720
and say that is just going to be proportional, some constant alpha,
5648
04:47:30,720 --> 04:47:33,400
multiplied by the joint probability, the probability
5649
04:47:33,400 --> 04:47:36,060
of a particular value for the appointment random variable
5650
04:47:36,060 --> 04:47:40,040
and light rain and no track maintenance.
5651
04:47:40,040 --> 04:47:43,160
Well, all right, how do I calculate this, probability of appointment
5652
04:47:43,160 --> 04:47:46,240
and light rain and no track maintenance, when what I really care about
5653
04:47:46,240 --> 04:47:48,760
is knowing I need all four of these values
5654
04:47:48,760 --> 04:47:52,200
to be able to calculate a joint distribution across everything
5655
04:47:52,200 --> 04:47:56,040
because a particular appointment depends upon the value of train?
5656
04:47:56,040 --> 04:47:59,400
Well, in order to do that, here I can begin to use that marginalization
5657
04:47:59,400 --> 04:48:02,240
trick, that there are only two ways I can get
5658
04:48:02,240 --> 04:48:05,520
any configuration of an appointment, light rain, and no track maintenance.
5659
04:48:05,520 --> 04:48:07,760
Either this particular setting of variables
5660
04:48:07,760 --> 04:48:12,000
happens and the train is on time, or this particular setting of variables
5661
04:48:12,000 --> 04:48:13,800
happens and the train is delayed.
5662
04:48:13,800 --> 04:48:17,160
Those are two possible cases that I would want to consider.
5663
04:48:17,160 --> 04:48:19,760
And if I add those two cases up, well, then I
5664
04:48:19,760 --> 04:48:23,360
get the result just by adding up all of the possibilities
5665
04:48:23,360 --> 04:48:26,600
for the hidden variable, or variables if there are multiple.
5666
04:48:26,600 --> 04:48:30,260
But since there's only one hidden variable here, train, all I need to do
5667
04:48:30,260 --> 04:48:34,040
is iterate over all the possible values for that hidden variable train
5668
04:48:34,040 --> 04:48:36,160
and add up their probabilities.
5669
04:48:36,160 --> 04:48:40,440
So this probability expression here becomes probability distribution
5670
04:48:40,440 --> 04:48:44,080
over appointment, light rain, no track maintenance, and train is on time,
5671
04:48:44,080 --> 04:48:47,560
and the probability distribution over the appointment, light rain,
5672
04:48:47,560 --> 04:48:51,360
no track maintenance, and that the train is delayed, for example.
5673
04:48:51,360 --> 04:48:55,280
So I take both of the possible values for train, go ahead and add them up.
5674
04:48:55,280 --> 04:48:57,520
These are just joint probabilities that we saw earlier,
5675
04:48:57,520 --> 04:48:59,920
how to calculate just by going parent, parent, parent, parent,
5676
04:48:59,920 --> 04:49:03,320
and calculating those probabilities and multiplying them together.
5677
04:49:03,320 --> 04:49:05,440
And then you'll need to normalize them at the end,
5678
04:49:05,440 --> 04:49:09,560
speaking at a high level, to make sure that everything adds up to the number 1.
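That sum-then-normalize step can be sketched with placeholder numbers; all four joint probabilities below are assumed, purely to show the mechanics:

```python
# Assumed placeholder joints standing in for
# P(appointment value, light rain, no maintenance, train value).
joint = {
    ("attend", "on time"): 0.09,
    ("attend", "delayed"): 0.03,
    ("miss",   "on time"): 0.05,
    ("miss",   "delayed"): 0.03,
}

# Marginalize: sum over the hidden train variable for each appointment value.
unnormalized = {
    a: joint[(a, "on time")] + joint[(a, "delayed")]
    for a in ("attend", "miss")
}

# Normalize so the resulting distribution over appointment sums to 1.
alpha = 1 / sum(unnormalized.values())
p_appointment = {a: alpha * v for a, v in unnormalized.items()}
print(p_appointment)  # the two values sum to 1
```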
5679
04:49:09,560 --> 04:49:13,480
So the formula for how you do this in a process known as inference by enumeration
5680
04:49:13,480 --> 04:49:16,280
looks a little bit complicated, but ultimately it looks like this.
5681
04:49:16,280 --> 04:49:20,040
And let's now try to distill what it is that all of these symbols actually mean.
5682
04:49:20,040 --> 04:49:21,040
Let's start here.
5683
04:49:21,040 --> 04:49:25,680
What I care about knowing is the probability of x, my query variable,
5684
04:49:25,680 --> 04:49:28,000
given some sort of evidence.
5685
04:49:28,000 --> 04:49:30,040
What do I know about conditional probabilities?
5686
04:49:30,040 --> 04:49:34,640
Well, a conditional probability is proportional to the joint probability.
5687
04:49:34,640 --> 04:49:37,480
So it is some alpha, some normalizing constant,
5688
04:49:37,480 --> 04:49:41,480
multiplied by this joint probability of x and evidence.
5689
04:49:41,480 --> 04:49:42,920
And how do I calculate that?
5690
04:49:42,920 --> 04:49:45,360
Well, to do that, I'm going to marginalize
5691
04:49:45,360 --> 04:49:47,760
over all of the hidden variables, all the variables
5692
04:49:47,760 --> 04:49:50,080
that I don't directly observe the values for.
5693
04:49:50,080 --> 04:49:53,020
I'm basically going to iterate over all of the possibilities
5694
04:49:53,020 --> 04:49:55,560
that it could happen and just sum them all up.
5695
04:49:55,560 --> 04:49:58,720
And so I can translate this into a sum over all y,
5696
04:49:58,720 --> 04:50:02,080
which ranges over all the possible hidden variables and the values
5697
04:50:02,080 --> 04:50:06,880
that they could take on, and adds up all of those possible individual
5698
04:50:06,880 --> 04:50:07,920
probabilities.
5699
04:50:07,920 --> 04:50:11,960
And that is going to allow me to do this process of inference by enumeration.
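Putting those pieces together, inference by enumeration can be sketched on a slice of this network, with Train as the query, light rain as the evidence, and Maintenance as the hidden variable. The rain distribution and the (light, yes) train row come from the lecture; the other table entries are assumed placeholders:

```python
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}       # from the lecture
p_maint_given_rain = {"light": {"yes": 0.2, "no": 0.8}}  # assumed row
p_train_given = {
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # from the lecture
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed row
}

def joint(train_value, maint_value):
    """P(train_value, light rain, maint_value), via the chain rule."""
    return (p_rain["light"]
            * p_maint_given_rain["light"][maint_value]
            * p_train_given[("light", maint_value)][train_value])

# P(Train | light rain) = alpha * sum over the hidden maintenance values.
unnormalized = {t: sum(joint(t, m) for m in ("yes", "no"))
                for t in ("on time", "delayed")}
alpha = 1 / sum(unnormalized.values())
p_train_given_light = {t: alpha * v for t, v in unnormalized.items()}
print(p_train_given_light)  # on time ≈ 0.68, delayed ≈ 0.32 with these numbers
```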
5700
04:50:11,960 --> 04:50:14,080
Now, ultimately, it's pretty annoying if we as humans
5701
04:50:14,080 --> 04:50:16,320
have to do all this math for ourselves.
5702
04:50:16,320 --> 04:50:19,480
But turns out this is where computers and AI can be particularly helpful,
5703
04:50:19,480 --> 04:50:22,800
that we can program a computer to understand a Bayesian network,
5704
04:50:22,800 --> 04:50:25,240
to be able to understand these inference procedures,
5705
04:50:25,240 --> 04:50:27,180
and to be able to do these calculations.
5706
04:50:27,180 --> 04:50:29,040
And using the information you've seen here,
5707
04:50:29,040 --> 04:50:31,760
you could implement a Bayesian network from scratch yourself.
5708
04:50:31,760 --> 04:50:34,920
But turns out there are a lot of libraries, especially written in Python,
5709
04:50:34,920 --> 04:50:38,400
that allow us to make it easier to do this sort of probabilistic inference,
5710
04:50:38,400 --> 04:50:41,480
to be able to take a Bayesian network and do these sorts of calculations,
5711
04:50:41,480 --> 04:50:44,480
so that you don't need to know and understand all of the underlying math,
5712
04:50:44,480 --> 04:50:46,920
though it's helpful to have a general sense for how it works.
5713
04:50:46,920 --> 04:50:49,980
But you just need to be able to describe the structure of the network
5714
04:50:49,980 --> 04:50:53,960
and make queries in order to be able to produce the result.
5715
04:50:53,960 --> 04:50:56,680
And so let's take a look at an example of that right now.
5716
04:50:56,680 --> 04:50:59,040
It turns out that there are a lot of possible libraries
5717
04:50:59,040 --> 04:51:01,600
that exist in Python for doing this sort of inference.
5718
04:51:01,600 --> 04:51:04,000
It doesn't matter too much which specific library you use.
5719
04:51:04,000 --> 04:51:05,880
They all behave in fairly similar ways.
5720
04:51:05,880 --> 04:51:08,800
But the library I'm going to use here is one known as pomegranate.
5721
04:51:08,800 --> 04:51:13,440
And here inside of model.py, I have defined a Bayesian network,
5722
04:51:13,440 --> 04:51:17,800
just using the structure and the syntax that the pomegranate library expects.
5723
04:51:17,800 --> 04:51:20,560
And what I'm effectively doing is just, in Python,
5724
04:51:20,560 --> 04:51:24,400
creating nodes to represent each of the nodes of the Bayesian network
5725
04:51:24,400 --> 04:51:26,600
that you saw me describe a moment ago.
5726
04:51:26,600 --> 04:51:29,400
So here on line four, after I've imported pomegranate,
5727
04:51:29,400 --> 04:51:31,520
I'm defining a variable called rain that is going
5728
04:51:31,520 --> 04:51:35,640
to represent a node inside of my Bayesian network.
5729
04:51:35,640 --> 04:51:39,160
It's going to be a node that follows this distribution, where
5730
04:51:39,160 --> 04:51:42,320
there are three possible values, none for no rain, light for light rain,
5731
04:51:42,320 --> 04:51:43,600
heavy for heavy rain.
5732
04:51:43,600 --> 04:51:46,840
And these are the probabilities of each of those taking place.
5733
04:51:46,840 --> 04:51:53,280
0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain.
5734
04:51:53,280 --> 04:51:55,400
Then after that, we go to the next variable,
5735
04:51:55,400 --> 04:51:57,920
the variable for track maintenance, for example,
5736
04:51:57,920 --> 04:52:00,520
which is dependent upon that rain variable.
5737
04:52:00,520 --> 04:52:03,520
And this, instead of being an unconditional distribution,
5738
04:52:03,520 --> 04:52:05,720
is a conditional distribution, as indicated
5739
04:52:05,720 --> 04:52:07,960
by a conditional probability table here.
5740
04:52:07,960 --> 04:52:11,720
And the idea is that the distribution I'm following is conditional
5741
04:52:11,720 --> 04:52:13,520
on the distribution of rain.
5742
04:52:13,520 --> 04:52:17,000
So if there is no rain, then the chance that there is, yes, track maintenance
5743
04:52:17,000 --> 04:52:17,840
is 0.4.
5744
04:52:17,840 --> 04:52:21,360
If there's no rain, the chance that there is no track maintenance is 0.6.
5745
04:52:21,360 --> 04:52:23,360
Likewise, for light rain, I have a distribution.
5746
04:52:23,360 --> 04:52:25,400
For heavy rain, I have a distribution as well.
5747
04:52:25,400 --> 04:52:27,760
But I'm effectively encoding the same information
5748
04:52:27,760 --> 04:52:29,720
you saw represented graphically a moment ago.
5749
04:52:29,720 --> 04:52:33,320
But I'm telling this Python program that the maintenance node
5750
04:52:33,320 --> 04:52:37,200
obeys this particular conditional probability distribution.
5751
04:52:37,200 --> 04:52:40,720
And we do the same thing for the other random variables as well.
5752
04:52:40,720 --> 04:52:44,480
Train was a node inside my network that
5753
04:52:44,480 --> 04:52:47,680
had a conditional probability table with two parents.
5754
04:52:47,680 --> 04:52:51,040
It was dependent not only on rain, but also on track maintenance.
5755
04:52:51,040 --> 04:52:53,080
And so here I'm saying something like, given
5756
04:52:53,080 --> 04:52:55,840
that there is no rain and, yes, track maintenance,
5757
04:52:55,840 --> 04:52:59,240
the probability that my train is on time is 0.8.
5758
04:52:59,240 --> 04:53:01,880
And the probability that it's delayed is 0.2.
5759
04:53:01,880 --> 04:53:03,840
And likewise, I can do the same thing for all
5760
04:53:03,840 --> 04:53:07,960
of the other possible values of the parents of the train node
5761
04:53:07,960 --> 04:53:12,440
inside of my Bayesian network by saying, for all of those possible values,
5762
04:53:12,440 --> 04:53:16,160
here is the distribution that the train node should follow.
5763
04:53:16,160 --> 04:53:18,360
Then I do the same thing for an appointment
5764
04:53:18,360 --> 04:53:21,440
based on the distribution of the variable train.
5765
04:53:21,440 --> 04:53:24,960
Then at the end, what I do is actually construct this network
5766
04:53:24,960 --> 04:53:27,480
by describing what the states of the network are
5767
04:53:27,480 --> 04:53:30,240
and by adding edges between the dependent nodes.
5768
04:53:30,240 --> 04:53:33,440
So I create a new Bayesian network, add states to it, one for rain,
5769
04:53:33,440 --> 04:53:36,280
one for maintenance, one for the train, one for the appointment.
5770
04:53:36,280 --> 04:53:40,120
And then I add edges connecting the related pieces.
5771
04:53:40,120 --> 04:53:44,200
Rain has an arrow to maintenance because rain influences track maintenance.
5772
04:53:44,200 --> 04:53:46,120
Rain also influences the train.
5773
04:53:46,120 --> 04:53:48,160
Maintenance also influences the train.
5774
04:53:48,160 --> 04:53:50,800
And train influences whether I make it to my appointment.
5775
04:53:50,800 --> 04:53:54,440
And bake just finalizes the model and does some additional computation.
5776
04:53:54,440 --> 04:53:57,880
So the specific syntax of this is not really the important part.
5777
04:53:57,880 --> 04:54:00,640
Pomegranate just happens to be one of several different libraries
5778
04:54:00,640 --> 04:54:02,640
that can all be used for similar purposes.
5779
04:54:02,640 --> 04:54:05,840
And you could describe and define a library for yourself
5780
04:54:05,840 --> 04:54:07,560
that implemented similar things.
5781
04:54:07,560 --> 04:54:11,160
But the key idea here is that someone can design a library
5782
04:54:11,160 --> 04:54:15,320
for a general Bayesian network that has nodes that depend upon their parents.
5783
04:54:15,320 --> 04:54:18,840
And then all a programmer needs to do using one of those libraries
5784
04:54:18,840 --> 04:54:23,040
is to define what those nodes and what those probability distributions are.
5785
04:54:23,040 --> 04:54:26,600
And we can begin to do some interesting logic based on it.
5786
04:54:26,600 --> 04:54:30,800
So let's try doing that conditional or joint probability calculation
5787
04:54:30,800 --> 04:54:36,600
that we did by hand before by going into likelihood.py, where
5788
04:54:36,600 --> 04:54:40,000
here I'm importing the model that I just defined a moment ago.
5789
04:54:40,000 --> 04:54:42,880
And here I'd just like to calculate model.probability, which
5790
04:54:42,880 --> 04:54:46,000
calculates the probability for a given observation.
5791
04:54:46,000 --> 04:54:51,480
And I'd like to calculate the probability of no rain, no track maintenance,
5792
04:54:51,480 --> 04:54:54,600
my train is on time, and I'm able to attend the meeting.
5793
04:54:54,600 --> 04:54:58,200
So sort of the optimal scenario that there is no rain and no maintenance
5794
04:54:58,200 --> 04:55:01,240
on the track, my train is on time, and I'm able to attend the meeting.
5795
04:55:01,240 --> 04:55:04,560
What is the probability that all of that actually happens?
5796
04:55:04,560 --> 04:55:08,840
And I can calculate that using the library and just print out its probability.
5797
04:55:08,840 --> 04:55:12,400
And so I'll go ahead and run python likelihood.py.
5798
04:55:12,400 --> 04:55:16,840
And I see that, OK, the probability is about 0.34.
5799
04:55:16,840 --> 04:55:20,480
So about a third of the time, everything goes right for me in this case.
5800
04:55:20,480 --> 04:55:22,840
No rain, no track maintenance, train is on time,
5801
04:55:22,840 --> 04:55:24,760
and I'm able to attend the meeting.
5802
04:55:24,760 --> 04:55:28,280
But I could experiment with this, try and calculate other probabilities as well.
5803
04:55:28,280 --> 04:55:31,480
What's the probability that everything goes right up until the train,
5804
04:55:31,480 --> 04:55:33,680
but I still miss my meeting?
5805
04:55:33,680 --> 04:55:37,520
So no rain, no track maintenance, train is on time,
5806
04:55:37,520 --> 04:55:39,320
but I miss the appointment.
5807
04:55:39,320 --> 04:55:41,280
Let's calculate that probability.
5808
04:55:41,280 --> 04:55:44,240
And all right, that has a probability of about 0.04.
5809
04:55:44,240 --> 04:55:47,400
So about 4% of the time, the train will be on time,
5810
04:55:47,400 --> 04:55:49,240
there won't be any rain, no track maintenance,
5811
04:55:49,240 --> 04:55:52,200
and yet I'll still miss the meeting.
5812
04:55:52,200 --> 04:55:54,440
And so this is really just an implementation
5813
04:55:54,440 --> 04:55:57,560
of the calculation of the joint probabilities that we did before.
5814
04:55:57,560 --> 04:56:00,320
What this library is likely doing is first figuring out
5815
04:56:00,320 --> 04:56:03,400
the probability of no rain, then figuring out
5816
04:56:03,400 --> 04:56:06,760
the probability of no track maintenance given no rain,
5817
04:56:06,760 --> 04:56:10,160
then the probability that my train is on time given both of these values,
5818
04:56:10,160 --> 04:56:13,600
and then the probability that I miss my appointment given that I
5819
04:56:13,600 --> 04:56:15,600
know that the train was on time.
5820
04:56:15,600 --> 04:56:18,800
So this, again, is the calculation of that joint probability.
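That chain-rule walkthrough can be sketched by hand like this. It is an illustration of what a library such as pomegranate is likely computing, not its actual implementation; only the CPT rows this one query needs are included, with the on-time row chosen so the result reproduces the 0.34 quoted above.

```python
# Joint probability via the chain rule: multiply each node's CPT entry,
# conditioning each factor on the values of the node's parents.

rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance = {"none": {"yes": 0.4, "no": 0.6}}             # P(maintenance | rain=none)
train = {("none", "no"): {"on time": 0.9, "delayed": 0.1}}  # P(train | rain=none, maintenance=no)
appointment = {"on time": {"attend": 0.9, "miss": 0.1}}     # P(appointment | train=on time)

def joint(r, m, t, a):
    """P(rain=r, maintenance=m, train=t, appointment=a)."""
    return (rain[r]
            * maintenance[r][m]
            * train[(r, m)][t]
            * appointment[t][a])

print(joint("none", "no", "on time", "attend"))  # about 0.34, as in the lecture
print(joint("none", "no", "on time", "miss"))    # about 0.04
```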
5821
04:56:18,800 --> 04:56:22,000
And turns out we can also begin to have our computer solve inference problems
5822
04:56:22,000 --> 04:56:26,560
as well, to begin to infer, based on information, evidence that we see,
5823
04:56:26,560 --> 04:56:30,640
what is the likelihood of other variables also being true.
5824
04:56:30,640 --> 04:56:33,720
So let's go into inference.py, for example.
5825
04:56:33,720 --> 04:56:36,760
Here, I'm again importing that exact same model from before,
5826
04:56:36,760 --> 04:56:38,920
importing all the nodes and all the edges
5827
04:56:38,920 --> 04:56:42,840
and the probability distribution that is encoded there as well.
5828
04:56:42,840 --> 04:56:45,960
And now there's a function for doing some sort of prediction.
5829
04:56:45,960 --> 04:56:50,400
And here, into this model, I pass in the evidence that I observe.
5830
04:56:50,400 --> 04:56:54,400
So here, I've encoded into this Python program the evidence
5831
04:56:54,400 --> 04:56:55,400
that I have observed.
5832
04:56:55,400 --> 04:56:58,600
I have observed the fact that the train is delayed.
5833
04:56:58,600 --> 04:57:01,840
And that is the value for one of the four random variables
5834
04:57:01,840 --> 04:57:03,800
inside of this Bayesian network.
5835
04:57:03,800 --> 04:57:07,320
And using that information, I would like to be able to draw inferences
5836
04:57:07,320 --> 04:57:09,680
and conclusions about the values
5837
04:57:09,680 --> 04:57:13,120
of the other random variables that are inside of my Bayesian network.
5838
04:57:13,120 --> 04:57:15,920
I would like to make predictions about everything else.
5839
04:57:15,920 --> 04:57:19,960
So all of the actual computational logic is happening in just these three lines,
5840
04:57:19,960 --> 04:57:21,920
where I'm making this call to this prediction.
5841
04:57:21,920 --> 04:57:25,720
Down below, I'm just iterating over all of the states and all the predictions
5842
04:57:25,720 --> 04:57:29,360
and just printing them out so that we can visually see what the results are.
5843
04:57:29,360 --> 04:57:31,640
But let's find out, given the train is delayed,
5844
04:57:31,640 --> 04:57:35,840
what can I predict about the values of the other random variables?
5845
04:57:35,840 --> 04:57:38,960
Let's go ahead and run python inference.py.
5846
04:57:38,960 --> 04:57:41,520
I run that, and all right, here is the result that I get.
5847
04:57:41,520 --> 04:57:44,280
Given the fact that I know that the train is delayed,
5848
04:57:44,280 --> 04:57:46,400
this is evidence that I have observed.
5849
04:57:46,400 --> 04:57:50,120
Well, given that, there is about a 46% chance
5850
04:57:50,120 --> 04:57:52,720
that there was no rain, a 31% chance there was light rain,
5851
04:57:52,720 --> 04:57:56,360
a 23% chance there was heavy rain, I can see a probability distribution
5852
04:57:56,360 --> 04:57:58,720
of a track maintenance and a probability distribution
5853
04:57:58,720 --> 04:58:01,760
over whether I'm able to attend or miss my appointment.
5854
04:58:01,760 --> 04:58:04,560
Now, we know that whether I attend or miss the appointment,
5855
04:58:04,560 --> 04:58:07,960
that is only dependent upon the train being delayed or not delayed.
5856
04:58:07,960 --> 04:58:10,160
It shouldn't depend on anything else.
5857
04:58:10,160 --> 04:58:14,240
So let's imagine, for example, that I knew that there was heavy rain.
5858
04:58:14,240 --> 04:58:18,240
That shouldn't affect the distribution for making the appointment.
5859
04:58:18,240 --> 04:58:21,000
And indeed, if I go up here and add some evidence,
5860
04:58:21,000 --> 04:58:23,680
say that I know that the value of rain is heavy.
5861
04:58:23,680 --> 04:58:25,520
That is evidence that I now have access to.
5862
04:58:25,520 --> 04:58:27,040
I now have two pieces of evidence.
5863
04:58:27,040 --> 04:58:31,600
I know that the rain is heavy, and I know that my train is delayed.
5864
04:58:31,600 --> 04:58:35,160
I can calculate the probability by running this inference procedure again
5865
04:58:35,160 --> 04:58:37,960
and seeing the result. I know that the rain is heavy.
5866
04:58:37,960 --> 04:58:39,480
I know my train is delayed.
5867
04:58:39,480 --> 04:58:42,680
The probability distribution for track maintenance changed.
5868
04:58:42,680 --> 04:58:44,680
Given that I know that there's heavy rain,
5869
04:58:44,680 --> 04:58:48,240
now it's more likely that there is no track maintenance, 88%,
5870
04:58:48,240 --> 04:58:51,880
as opposed to 64% from before.
5871
04:58:51,880 --> 04:58:55,680
And now, what is the probability that I make the appointment?
5872
04:58:55,680 --> 04:58:57,120
Well, that's the same as before.
5873
04:58:57,120 --> 04:59:00,720
It's still going to be attend the appointment with probability 0.6,
5874
04:59:00,720 --> 04:59:03,080
missed the appointment with probability 0.4,
5875
04:59:03,080 --> 04:59:05,440
because it was only dependent upon whether or not
5876
04:59:05,440 --> 04:59:07,760
my train was on time or delayed.
5877
04:59:07,760 --> 04:59:11,240
And so this here is implementing that idea of that inference algorithm
5878
04:59:11,240 --> 04:59:14,600
to be able to figure out, based on the evidence that I have,
5879
04:59:14,600 --> 04:59:18,800
what can we infer about the values of the other variables that exist as well.
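The enumeration idea behind that inference can be sketched by hand: to get the distribution of rain given that the train is delayed, sum the joint probability over every value of the hidden maintenance variable, then normalize. This is a sketch of the idea, not the library's predict function, and the light/heavy CPT rows are illustrative assumptions.

```python
# Inference by enumeration for P(rain | train = delayed):
# marginalize out the hidden maintenance variable, then normalize.

rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},   # assumed
    "heavy": {"yes": 0.1, "no": 0.9},   # assumed
}
train = {
    ("none", "yes"):  {"on time": 0.8, "delayed": 0.2},
    ("none", "no"):   {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # assumed
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
    ("heavy", "no"):  {"on time": 0.5, "delayed": 0.5},  # assumed
}

def rain_posterior(t):
    """P(rain | train = t), summing the joint over hidden maintenance."""
    unnormalized = {
        r: sum(rain[r] * maintenance[r][m] * train[(r, m)][t]
               for m in ("yes", "no"))
        for r in rain
    }
    total = sum(unnormalized.values())
    return {r: p / total for r, p in unnormalized.items()}

posterior = rain_posterior("delayed")
for r, p in posterior.items():
    print(f"{r}: {p:.2f}")   # no rain comes out most likely, roughly
                             # matching the percentages quoted above
```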
5880
04:59:18,800 --> 04:59:22,520
So inference by enumeration is one way of doing this inference procedure,
5881
04:59:22,520 --> 04:59:26,360
just looping over all of the values the hidden variables could take on
5882
04:59:26,360 --> 04:59:29,080
and figuring out what the probability is.
5883
04:59:29,080 --> 04:59:31,640
Now, it turns out this is not particularly efficient.
5884
04:59:31,640 --> 04:59:35,800
And there are definitely optimizations you can make by avoiding repeated work.
5885
04:59:35,800 --> 04:59:38,680
If you're calculating the same sort of probability multiple times,
5886
04:59:38,680 --> 04:59:40,840
there are ways of optimizing the program to avoid
5887
04:59:40,840 --> 04:59:44,280
having to recalculate the same probabilities again and again.
5888
04:59:44,280 --> 04:59:47,240
But even then, as the number of variables gets large,
5889
04:59:47,240 --> 04:59:50,640
and as the number of possible values those variables could take on gets large,
5890
04:59:50,640 --> 04:59:52,920
we're going to start to have to do a lot of computation,
5891
04:59:52,920 --> 04:59:55,800
a lot of calculation, to be able to do this inference.
5892
04:59:55,800 --> 04:59:58,560
And at that point, it might start to get unreasonable,
5893
04:59:58,560 --> 05:00:00,680
in terms of the amount of time that it would take
5894
05:00:00,680 --> 05:00:04,280
to be able to do this sort of exact inference.
5895
05:00:04,280 --> 05:00:06,080
And it's for that reason that oftentimes, when
5896
05:00:06,080 --> 05:00:09,560
it comes towards probability and things we're not entirely sure about,
5897
05:00:09,560 --> 05:00:11,880
we don't always care about doing exact inference
5898
05:00:11,880 --> 05:00:14,640
and knowing exactly what the probability is.
5899
05:00:14,640 --> 05:00:17,120
But if we can approximate the inference procedure,
5900
05:00:17,120 --> 05:00:21,160
do some sort of approximate inference, that can be pretty good as well.
5901
05:00:21,160 --> 05:00:23,160
That if I don't know the exact probability,
5902
05:00:23,160 --> 05:00:25,120
but I have a general sense for the probability
5903
05:00:25,120 --> 05:00:28,000
that I can get increasingly accurate with more time,
5904
05:00:28,000 --> 05:00:30,360
then that's probably pretty good, especially
5905
05:00:30,360 --> 05:00:33,200
if I can get that to happen even faster.
5906
05:00:33,200 --> 05:00:37,520
So how could I do approximate inference inside of a Bayesian network?
5907
05:00:37,520 --> 05:00:40,080
Well, one method is through a procedure known as sampling.
5908
05:00:40,080 --> 05:00:42,200
In the process of sampling, I'm going to take
5909
05:00:42,200 --> 05:00:46,440
a sample of all of the variables inside of this Bayesian network here.
5910
05:00:46,440 --> 05:00:47,840
And how am I going to sample?
5911
05:00:47,840 --> 05:00:51,840
Well, I'm going to sample one of the values from each of these nodes
5912
05:00:51,840 --> 05:00:54,160
according to their probability distribution.
5913
05:00:54,160 --> 05:00:56,120
So how might I take a sample of all these nodes?
5914
05:00:56,120 --> 05:00:57,040
Well, I'll start at the root.
5915
05:00:57,040 --> 05:00:58,080
I'll start with rain.
5916
05:00:58,080 --> 05:00:59,800
Here's the distribution for rain.
5917
05:00:59,800 --> 05:01:03,520
And I'll go ahead and, using a random number generator or something like it,
5918
05:01:03,520 --> 05:01:05,400
randomly pick one of these three values.
5919
05:01:05,400 --> 05:01:09,360
I'll pick none with probability 0.7, light with probability 0.2,
5920
05:01:09,360 --> 05:01:11,080
and heavy with probability 0.1.
5921
05:01:11,080 --> 05:01:14,400
So I'll randomly just pick one of them according to that distribution.
5922
05:01:14,400 --> 05:01:17,480
And maybe in this case, I pick none, for example.
5923
05:01:17,480 --> 05:01:19,440
Then I do the same thing for the other variable.
5924
05:01:19,440 --> 05:01:22,120
Maintenance also has a probability distribution.
5925
05:01:22,120 --> 05:01:23,680
And I'm going to sample.
5926
05:01:23,680 --> 05:01:26,120
Now, there are three probability distributions here.
5927
05:01:26,120 --> 05:01:29,360
But I'm only going to sample from this first row here,
5928
05:01:29,360 --> 05:01:33,880
because I've observed already in my sample that the value of rain is none.
5929
05:01:33,880 --> 05:01:37,960
So given that rain is none, I'm going to sample from this distribution to say,
5930
05:01:37,960 --> 05:01:40,040
all right, what should the value of maintenance be?
5931
05:01:40,040 --> 05:01:42,800
And in this case, maintenance is going to be, let's just say yes,
5932
05:01:42,800 --> 05:01:47,560
which happens 40% of the time in the event that there is no rain, for example.
5933
05:01:47,560 --> 05:01:50,360
And we'll sample all of the rest of the nodes in this way as well,
5934
05:01:50,360 --> 05:01:52,480
that I want to sample from the train distribution.
5935
05:01:52,480 --> 05:01:56,680
And I'll sample from this first row here, where there is no rain,
5936
05:01:56,680 --> 05:01:58,200
but there is track maintenance.
5937
05:01:58,200 --> 05:02:00,160
And I'll sample 80% of the time.
5938
05:02:00,160 --> 05:02:01,560
I'll say the train is on time.
5939
05:02:01,560 --> 05:02:04,320
20% of the time, I'll say the train is delayed.
5940
05:02:04,320 --> 05:02:07,280
And finally, we'll do the same thing for whether I make it to my appointment
5941
05:02:07,280 --> 05:02:07,560
or not.
5942
05:02:07,560 --> 05:02:09,120
Did I attend or miss the appointment?
5943
05:02:09,120 --> 05:02:11,640
We'll sample based on this distribution and maybe say
5944
05:02:11,640 --> 05:02:13,760
that in this case, I attend the appointment, which
5945
05:02:13,760 --> 05:02:18,480
happens 90% of the time when the train is actually on time.
5946
05:02:18,480 --> 05:02:22,560
So by going through these nodes, I can very quickly just do some sampling
5947
05:02:22,560 --> 05:02:26,200
and get a sample of the possible values that could come up
5948
05:02:26,200 --> 05:02:28,600
from going through this entire Bayesian network
5949
05:02:28,600 --> 05:02:31,160
according to those probability distributions.
5950
05:02:31,160 --> 05:02:34,040
And where this becomes powerful is if I do this not once,
5951
05:02:34,040 --> 05:02:36,640
but I do this thousands or tens of thousands of times
5952
05:02:36,640 --> 05:02:39,960
and generate a whole bunch of samples all using this distribution.
5953
05:02:39,960 --> 05:02:41,040
I get different samples.
5954
05:02:41,040 --> 05:02:42,480
Maybe some of them are the same.
5955
05:02:42,480 --> 05:02:47,320
But I get a value for each of the possible variables that could come up.
5956
05:02:47,320 --> 05:02:49,320
And so then if I'm ever faced with a question,
5957
05:02:49,320 --> 05:02:53,480
a question like, what is the probability that the train is on time,
5958
05:02:53,480 --> 05:02:55,520
you could do an exact inference procedure.
5959
05:02:55,520 --> 05:02:58,240
This is no different than the inference problem we had before
5960
05:02:58,240 --> 05:03:01,400
where I could just marginalize, look at all the possible other values
5961
05:03:01,400 --> 05:03:05,080
of the variables, and do the computation of inference by enumeration
5962
05:03:05,080 --> 05:03:07,840
to find out this probability exactly.
5963
05:03:07,840 --> 05:03:10,680
But I could also, if I don't care about the exact probability,
5964
05:03:10,680 --> 05:03:12,800
just sample it, approximate it to get close.
5965
05:03:12,800 --> 05:03:16,240
And this is a powerful tool in AI where we don't need to be right 100%
5966
05:03:16,240 --> 05:03:18,440
of the time or we don't need to be exactly right.
5967
05:03:18,440 --> 05:03:20,760
If we just need to be right with some probability,
5968
05:03:20,760 --> 05:03:23,800
we can often do so more effectively, more efficiently.
5969
05:03:23,800 --> 05:03:26,920
And so if here now are all of those possible samples,
5970
05:03:26,920 --> 05:03:30,000
I'll highlight the ones where the train is on time.
5971
05:03:30,000 --> 05:03:32,240
I'm ignoring the ones where the train is delayed.
5972
05:03:32,240 --> 05:03:35,640
And in this case, six out of eight of the samples
5973
05:03:35,640 --> 05:03:37,320
have the train arriving on time.
5974
05:03:37,320 --> 05:03:40,960
And so maybe in this case, I can say that in six out of eight cases,
5975
05:03:40,960 --> 05:03:43,200
that's the likelihood that the train is on time.
5976
05:03:43,200 --> 05:03:45,640
And with eight samples, that might not be a great prediction.
5977
05:03:45,640 --> 05:03:48,160
But if I had thousands upon thousands of samples,
5978
05:03:48,160 --> 05:03:51,240
then this could be a much better inference procedure
5979
05:03:51,240 --> 05:03:53,320
to be able to do these sorts of calculations.
5980
05:03:53,320 --> 05:03:56,960
So this is a direct sampling method to just do a bunch of samples
5981
05:03:56,960 --> 05:04:00,920
and then figure out what the probability of some event is.
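That top-down pass, sampling the root first and each child conditioned on its already-sampled parents, might look like this as a library-free sketch (the light/heavy CPT rows are assumptions made for illustration):

```python
# Direct (forward) sampling from the Bayesian network, then estimating
# P(train = on time) as the fraction of samples where it holds.
import random

random.seed(0)  # fixed seed so repeated runs are repeatable

rain_cpt = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance_cpt = {"none": {"yes": 0.4, "no": 0.6},
                   "light": {"yes": 0.2, "no": 0.8},    # assumed
                   "heavy": {"yes": 0.1, "no": 0.9}}    # assumed
train_cpt = {("none", "yes"): {"on time": 0.8, "delayed": 0.2},
             ("none", "no"): {"on time": 0.9, "delayed": 0.1},
             ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # assumed
             ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed
             ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
             ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed

def pick(dist):
    """Draw one value from a {value: probability} distribution."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights)[0]

def sample():
    """Sample nodes in topological order, parents before children."""
    r = pick(rain_cpt)               # root: unconditional distribution
    m = pick(maintenance_cpt[r])     # conditioned on sampled rain
    t = pick(train_cpt[(r, m)])      # conditioned on both parents
    return r, m, t

N = 10_000
on_time = sum(sample()[2] == "on time" for _ in range(N))
print(on_time / N)  # an estimate of P(train = on time); roughly 0.79
                    # under these tables
```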
5982
05:04:00,920 --> 05:04:03,960
Now, this from before was an unconditional probability.
5983
05:04:03,960 --> 05:04:07,080
What is the probability that the train is on time?
5984
05:04:07,080 --> 05:04:09,880
And I did that by looking at all the samples and figuring out, right,
5985
05:04:09,880 --> 05:04:12,120
here are the ones where the train is on time.
5986
05:04:12,120 --> 05:04:16,000
But sometimes what I want to calculate is not an unconditional probability,
5987
05:04:16,000 --> 05:04:18,360
but rather a conditional probability, something
5988
05:04:18,360 --> 05:04:21,240
like what is the probability that there is light rain,
5989
05:04:21,240 --> 05:04:24,600
given that the train is on time, something to that effect.
5990
05:04:24,600 --> 05:04:28,200
And to do that kind of calculation, well, what I might do
5991
05:04:28,200 --> 05:04:31,360
is here are all the samples that I have.
5992
05:04:31,360 --> 05:04:33,920
And I want to calculate a probability distribution,
5993
05:04:33,920 --> 05:04:36,920
given that I know that the train is on time.
5994
05:04:36,920 --> 05:04:38,800
So to be able to do that, I can kind of look
5995
05:04:38,800 --> 05:04:43,280
at the two cases where the train was delayed and ignore or reject them,
5996
05:04:43,280 --> 05:04:47,400
sort of exclude them from the possible samples that I'm considering.
5997
05:04:47,400 --> 05:04:50,760
And now I want to look at these remaining cases where the train is on time.
5998
05:04:50,760 --> 05:04:53,480
Here are the cases where there is light rain.
5999
05:04:53,480 --> 05:04:56,440
And I say, OK, these are two out of the six possible cases.
6000
05:04:56,440 --> 05:05:00,200
That can give me an approximation for the probability of light rain,
6001
05:05:00,200 --> 05:05:03,080
given the fact that I know the train was on time.
6002
05:05:03,080 --> 05:05:05,340
And I did that in almost exactly the same way,
6003
05:05:05,340 --> 05:05:08,600
just by adding an additional step, by saying that, all right,
6004
05:05:08,600 --> 05:05:12,080
when I take each sample, let me reject all of the samples that
6005
05:05:12,080 --> 05:05:14,960
don't match my evidence and only consider
6006
05:05:14,960 --> 05:05:19,200
the samples that do match what it is that I have in my evidence
6007
05:05:19,200 --> 05:05:21,640
that I want to make some sort of calculation about.
6008
05:05:21,640 --> 05:05:25,560
And it turns out, using the libraries that we've had for Bayesian networks,
6009
05:05:25,560 --> 05:05:28,180
we can begin to implement this same sort of idea,
6010
05:05:28,180 --> 05:05:31,520
like implement rejection sampling, which is what this method is called,
6011
05:05:31,520 --> 05:05:35,480
to be able to figure out some probability, not via direct inference,
6012
05:05:35,480 --> 05:05:37,600
but instead by sampling.
6013
05:05:37,600 --> 05:05:39,920
So what I have here is a program called sample.py.
6014
05:05:39,920 --> 05:05:41,840
Imports the exact same model.
6015
05:05:41,840 --> 05:05:45,000
And what I define first is a function to generate a sample.
6016
05:05:45,000 --> 05:05:48,720
And the way I generate a sample is just by looping over all of the states.
6017
05:05:48,720 --> 05:05:50,520
The states need to be in some sort of order
6018
05:05:50,520 --> 05:05:52,360
to make sure I'm looping in the correct order.
6019
05:05:52,360 --> 05:05:55,640
But effectively, if it is a conditional distribution,
6020
05:05:55,640 --> 05:05:58,040
I'm going to sample based on the parents.
6021
05:05:58,040 --> 05:06:00,240
And otherwise, I'm just going to directly sample
6022
05:06:00,240 --> 05:06:02,280
the variable, like rain, which has no parents.
6023
05:06:02,280 --> 05:06:05,000
It's just an unconditional distribution and keep
6024
05:06:05,000 --> 05:06:08,240
track of all those parent samples and return the final sample.
6025
05:06:08,240 --> 05:06:11,040
The exact syntax of this, again, not particularly important.
6026
05:06:11,040 --> 05:06:13,680
It just happens to be part of the implementation details
6027
05:06:13,680 --> 05:06:15,440
of this particular library.
6028
05:06:15,440 --> 05:06:17,920
The interesting logic is down below.
6029
05:06:17,920 --> 05:06:20,440
Now that I have the ability to generate a sample,
6030
05:06:20,440 --> 05:06:24,280
if I want to know the distribution of the appointment random variable,
6031
05:06:24,280 --> 05:06:26,520
given that the train is delayed, well, then I
6032
05:06:26,520 --> 05:06:28,400
can begin to do calculations like this.
6033
05:06:28,400 --> 05:06:32,080
Let me take 10,000 samples and assemble all my results
6034
05:06:32,080 --> 05:06:33,440
in this list called data.
6035
05:06:33,440 --> 05:06:36,760
I'll go ahead and loop n times, in this case, 10,000 times.
6036
05:06:36,760 --> 05:06:38,720
I'll generate a sample.
6037
05:06:38,720 --> 05:06:41,320
And I want to know the distribution of appointment,
6038
05:06:41,320 --> 05:06:43,040
given that the train is delayed.
6039
05:06:43,040 --> 05:06:45,520
So according to rejection sampling, I'm only
6040
05:06:45,520 --> 05:06:47,840
going to consider samples where the train is delayed.
6041
05:06:47,840 --> 05:06:51,400
If the train is not delayed, I'm not going to consider those values at all.
6042
05:06:51,400 --> 05:06:53,400
So I'm going to say, all right, if I take the sample,
6043
05:06:53,400 --> 05:06:57,560
look at the value of the train random variable, if the train is delayed,
6044
05:06:57,560 --> 05:06:59,320
well, let me go ahead and add to my data
6045
05:06:59,320 --> 05:07:02,640
that I'm collecting the value of the appointment random variable
6046
05:07:02,640 --> 05:07:05,400
that it took on in this particular sample.
6047
05:07:05,400 --> 05:07:08,240
So I'm only considering the samples where the train is delayed.
6048
05:07:08,240 --> 05:07:11,840
And for each of those samples, considering what the value of appointment
6049
05:07:11,840 --> 05:07:14,440
is, and then at the end, I'm using a Python class called
6050
05:07:14,440 --> 05:07:18,120
Counter, which quickly counts up all the values inside of a data set.
6051
05:07:18,120 --> 05:07:20,560
So I can take this list of data and figure out
6052
05:07:20,560 --> 05:07:25,680
how many times was my appointment made and how many times was my appointment
6053
05:07:25,680 --> 05:07:27,080
missed.
6054
05:07:27,080 --> 05:07:29,240
And so this here, with just a couple lines of code,
6055
05:07:29,240 --> 05:07:32,720
is an implementation of rejection sampling.
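A self-contained sketch of the same rejection-sampling loop as in sample.py might look like this. Sample generation is simplified to a fixed topological order, and the light/heavy CPT rows are assumptions; this is not the lecture's exact sample.py.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

rain_cpt = {"none": 0.7, "light": 0.2, "heavy": 0.1}
maintenance_cpt = {"none": {"yes": 0.4, "no": 0.6},
                   "light": {"yes": 0.2, "no": 0.8},    # assumed
                   "heavy": {"yes": 0.1, "no": 0.9}}    # assumed
train_cpt = {("none", "yes"): {"on time": 0.8, "delayed": 0.2},
             ("none", "no"): {"on time": 0.9, "delayed": 0.1},
             ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # assumed
             ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed
             ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
             ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed
appointment_cpt = {"on time": {"attend": 0.9, "miss": 0.1},
                   "delayed": {"attend": 0.6, "miss": 0.4}}

def pick(dist):
    """Draw one value from a {value: probability} distribution."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights)[0]

def generate_sample():
    """Sample every node in topological order, parents first."""
    r = pick(rain_cpt)
    m = pick(maintenance_cpt[r])
    t = pick(train_cpt[(r, m)])
    a = pick(appointment_cpt[t])
    return {"rain": r, "maintenance": m, "train": t, "appointment": a}

# Rejection sampling: keep only samples matching the evidence
# (train is delayed) and record each survivor's appointment value.
data = []
for _ in range(10_000):
    s = generate_sample()
    if s["train"] == "delayed":
        data.append(s["appointment"])

print(Counter(data))  # attend outnumbers miss, roughly 60/40
```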
6056
05:07:32,720 --> 05:07:37,800
And I can run it by going ahead and running Python sample.py.
6057
05:07:37,800 --> 05:07:39,840
And when I do that, here is the result I get.
6058
05:07:39,840 --> 05:07:41,760
This is the result of the counter.
6059
05:07:41,760 --> 05:07:45,400
1,251 times, I was able to attend the meeting.
6060
05:07:45,400 --> 05:07:48,520
And 856 times, I missed the meeting.
6061
05:07:48,520 --> 05:07:51,080
And you can imagine, by doing more and more samples,
6062
05:07:51,080 --> 05:07:54,120
I'll be able to get a better and better, more accurate result.
6063
05:07:54,120 --> 05:07:55,680
And this is a randomized process.
6064
05:07:55,680 --> 05:07:58,560
It's going to be an approximation of the probability.
6065
05:07:58,560 --> 05:08:01,760
If I run it a different time, you'll notice the numbers are similar:
6066
05:08:01,760 --> 05:08:03,760
1,272 and 905.
6067
05:08:03,760 --> 05:08:07,560
But they're not identical because there's some randomization, some likelihood
6068
05:08:07,560 --> 05:08:09,280
that things might be higher or lower.
6069
05:08:09,280 --> 05:08:12,800
And so this is why we generally want to try and use more samples so that we
6070
05:08:12,800 --> 05:08:15,520
can have a greater amount of confidence in our result,
6071
05:08:15,520 --> 05:08:18,840
be more sure about the result that we're getting of whether or not
6072
05:08:18,840 --> 05:08:23,520
it accurately reflects or represents the actual underlying probabilities that
6073
05:08:23,520 --> 05:08:26,680
are inherent inside of this distribution.
6074
05:08:26,680 --> 05:08:29,720
And so this, then, was an instance of rejection sampling.
6075
05:08:29,720 --> 05:08:32,280
And it turns out there are a number of other sampling methods
6076
05:08:32,280 --> 05:08:34,720
that you could use to begin to try to sample.
6077
05:08:34,720 --> 05:08:37,160
One problem that rejection sampling has is
6078
05:08:37,160 --> 05:08:41,800
that if the evidence you're looking for is a fairly unlikely event,
6079
05:08:41,800 --> 05:08:44,240
well, you're going to be rejecting a lot of samples.
6080
05:08:44,240 --> 05:08:48,160
Like if I'm looking for the probability of x given some evidence e,
6081
05:08:48,160 --> 05:08:52,320
if e is very unlikely to occur, like occurs maybe one every 1,000 times,
6082
05:08:52,320 --> 05:08:56,120
then I'm only going to be considering 1 out of every 1,000 samples that I do,
6083
05:08:56,120 --> 05:08:59,760
which is a pretty inefficient method for trying to do this sort of calculation.
6084
05:08:59,760 --> 05:09:01,720
I'm throwing away a lot of samples.
6085
05:09:01,720 --> 05:09:05,040
And it takes computational effort to be able to generate those samples.
6086
05:09:05,040 --> 05:09:07,320
So I'd like to not have to do something like that.
6087
05:09:07,320 --> 05:09:09,880
So there are other sampling methods that can try and address this.
6088
05:09:09,880 --> 05:09:13,320
One such sampling method is called likelihood weighting.
6089
05:09:13,320 --> 05:09:16,600
In likelihood weighting, we follow a slightly different procedure.
6090
05:09:16,600 --> 05:09:20,680
And the goal is to avoid needing to throw out samples
6091
05:09:20,680 --> 05:09:22,240
that didn't match the evidence.
6092
05:09:22,240 --> 05:09:26,400
And so what we'll do is we'll start by fixing the values for the evidence
6093
05:09:26,400 --> 05:09:26,920
variables.
6094
05:09:26,920 --> 05:09:29,080
Rather than sample everything, we're going
6095
05:09:29,080 --> 05:09:33,480
to fix the values of the evidence variables and not sample those.
6096
05:09:33,480 --> 05:09:36,640
Then we're going to sample all the other non-evidence variables
6097
05:09:36,640 --> 05:09:38,920
in the same way, just using the Bayesian network looking
6098
05:09:38,920 --> 05:09:43,640
at the probability distributions, sampling all the non-evidence variables.
6099
05:09:43,640 --> 05:09:48,080
But then what we need to do is weight each sample by its likelihood.
6100
05:09:48,080 --> 05:09:50,120
If our evidence is really unlikely, we want
6101
05:09:50,120 --> 05:09:53,840
to make sure that we've taken into account how likely was the evidence
6102
05:09:53,840 --> 05:09:55,920
to actually show up in the sample.
6103
05:09:55,920 --> 05:09:58,200
If I have a sample where the evidence was much more
6104
05:09:58,200 --> 05:10:00,360
likely to show up than another sample, then I
6105
05:10:00,360 --> 05:10:02,680
want to weight the more likely one higher.
6106
05:10:02,680 --> 05:10:06,080
So we're going to weight each sample by its likelihood, where likelihood is just
6107
05:10:06,080 --> 05:10:09,120
defined as the probability of all the evidence.
6108
05:10:09,120 --> 05:10:11,720
Given all the evidence we have, what is the probability
6109
05:10:11,720 --> 05:10:14,280
that it would happen in that particular sample?
6110
05:10:14,280 --> 05:10:16,860
So before, all of our samples were weighted equally.
6111
05:10:16,860 --> 05:10:19,360
They all had a weight of 1 when we were calculating
6112
05:10:19,360 --> 05:10:20,600
the overall average.
6113
05:10:20,600 --> 05:10:22,640
In this case, we're going to weight each sample,
6114
05:10:22,640 --> 05:10:25,840
multiply each sample by its likelihood in order
6115
05:10:25,840 --> 05:10:28,880
to get the more accurate distribution.
6116
05:10:28,880 --> 05:10:30,080
So what would this look like?
6117
05:10:30,080 --> 05:10:33,520
Well, if I ask the same question, what is the probability of light rain,
6118
05:10:33,520 --> 05:10:36,680
given that the train is on time, when I do the sampling procedure
6119
05:10:36,680 --> 05:10:40,720
and start by trying to sample, I'm going to start by fixing the evidence
6120
05:10:40,720 --> 05:10:41,280
variable.
6121
05:10:41,280 --> 05:10:44,280
I'm already going to have in my sample the train is on time.
6122
05:10:44,280 --> 05:10:46,480
That way, I don't have to throw out anything.
6123
05:10:46,480 --> 05:10:50,280
I'm only sampling things where I know the value of the variables that
6124
05:10:50,280 --> 05:10:53,440
are my evidence are what I expect them to be.
6125
05:10:53,440 --> 05:10:55,200
So I'll go ahead and sample from rain.
6126
05:10:55,200 --> 05:10:58,160
And maybe this time, I sample light rain instead of no rain.
6127
05:10:58,160 --> 05:11:00,000
Then I'll sample from track maintenance and say,
6128
05:11:00,000 --> 05:11:01,720
maybe, yes, there's track maintenance.
6129
05:11:01,720 --> 05:11:04,800
Then for train, well, I've already fixed it in place.
6130
05:11:04,800 --> 05:11:06,840
Train was an evidence variable.
6131
05:11:06,840 --> 05:11:09,000
So I'm not going to bother sampling again.
6132
05:11:09,000 --> 05:11:10,520
I'll just go ahead and move on.
6133
05:11:10,520 --> 05:11:14,880
I'll move on to appointment and go ahead and sample from appointment as well.
6134
05:11:14,880 --> 05:11:16,680
So now I've generated a sample.
6135
05:11:16,680 --> 05:11:19,840
I've generated a sample by fixing this evidence variable
6136
05:11:19,840 --> 05:11:22,000
and sampling the other three.
6137
05:11:22,000 --> 05:11:24,000
And the last step is now weighting the sample.
6138
05:11:24,000 --> 05:11:25,560
How much weight should it have?
6139
05:11:25,560 --> 05:11:28,520
And the weight is based on how probable is it
6140
05:11:28,520 --> 05:11:32,080
that the train was actually on time, this evidence actually happened,
6141
05:11:32,080 --> 05:11:35,080
given the values of these other variables, light rain and the fact
6142
05:11:35,080 --> 05:11:37,280
that, yes, there was track maintenance.
6143
05:11:37,280 --> 05:11:39,880
Well, to do that, I can just go back to the train variable
6144
05:11:39,880 --> 05:11:43,280
and say, all right, if there was light rain and track maintenance,
6145
05:11:43,280 --> 05:11:46,800
the likelihood of my evidence, the likelihood that my train was on time,
6146
05:11:46,800 --> 05:11:48,200
is 0.6.
6147
05:11:48,200 --> 05:11:52,880
And so this particular sample would have a weight of 0.6.
6148
05:11:52,880 --> 05:11:55,360
And I could repeat the sampling procedure again and again.
6149
05:11:55,360 --> 05:11:57,760
Each time every sample would be given a weight
6150
05:11:57,760 --> 05:12:02,560
according to the probability of the evidence that I see associated with it.
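The weighting procedure described here can be sketched in plain Python. This is a minimal illustration, not the course's code: the network structure mirrors the lecture's rain/maintenance/train example, but every probability below is an illustrative stand-in except the 0.6 chance of the train being on time given light rain and track maintenance, which matches the lecture.

```python
import random

# Likelihood weighting for the query P(Rain = light | Train = on time).
# All numbers are illustrative stand-ins except P(on time | light, yes) = 0.6.
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_maintenance = {  # P(Maintenance = yes/no | Rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
P_train_on_time = {  # P(Train = on time | Rain, Maintenance)
    ("none", "yes"): 0.8, ("none", "no"): 0.9,
    ("light", "yes"): 0.6, ("light", "no"): 0.7,
    ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5,
}

def weighted_sample():
    """Sample the non-evidence variables; fix Train = on time as evidence."""
    rain = random.choices(list(P_rain), weights=list(P_rain.values()))[0]
    m = P_maintenance[rain]
    maintenance = random.choices(list(m), weights=list(m.values()))[0]
    # The sample's weight is the probability of the evidence given its parents.
    return rain, P_train_on_time[(rain, maintenance)]

def likelihood_weighting(n=100_000):
    totals = {r: 0.0 for r in P_rain}
    for _ in range(n):
        rain, weight = weighted_sample()
        totals[rain] += weight
    z = sum(totals.values())  # normalize the accumulated weights
    return {r: w / z for r, w in totals.items()}

print(likelihood_weighting())
```

For these stand-in numbers, the normalized weight for light rain comes out around 0.17; the point is only that every sample counts toward the answer, scaled by how likely the fixed evidence was.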
6151
05:12:02,560 --> 05:12:04,920
And there are other sampling methods that exist as well,
6152
05:12:04,920 --> 05:12:07,320
but all of them are designed to get at the same idea,
6153
05:12:07,320 --> 05:12:13,160
to approximate the inference procedure of figuring out the value of a variable.
6154
05:12:13,160 --> 05:12:15,200
So we've now dealt with probability as it
6155
05:12:15,200 --> 05:12:18,480
pertains to particular variables that have these discrete values.
6156
05:12:18,480 --> 05:12:22,880
But what we haven't really considered is how values might change over time.
6157
05:12:22,880 --> 05:12:25,120
That we've considered something like a variable for rain,
6158
05:12:25,120 --> 05:12:28,920
where rain can take on values of none or light rain or heavy rain.
6159
05:12:28,920 --> 05:12:32,600
But in practice, usually when we consider values for variables like rain,
6160
05:12:32,600 --> 05:12:37,120
we like to consider it over time: how do the values of these variables
6161
05:12:37,120 --> 05:12:37,640
change?
6162
05:12:37,640 --> 05:12:40,320
What do we do when we're dealing with uncertainty
6163
05:12:40,320 --> 05:12:43,240
over a period of time, which can come up in the context of weather,
6164
05:12:43,240 --> 05:12:46,360
for example, if I have sunny days and I have rainy days.
6165
05:12:46,360 --> 05:12:51,080
And I'd like to know not just what is the probability that it's raining now,
6166
05:12:51,080 --> 05:12:53,480
but what is the probability that it rains tomorrow,
6167
05:12:53,480 --> 05:12:55,520
or the day after that, or the day after that.
6168
05:12:55,520 --> 05:12:57,280
And so to do this, we're going to introduce
6169
05:12:57,280 --> 05:12:58,960
a slightly different kind of model.
6170
05:12:58,960 --> 05:13:02,920
But here, we're going to have a random variable, not just one for the weather,
6171
05:13:02,920 --> 05:13:05,360
but for every possible time step.
6172
05:13:05,360 --> 05:13:07,200
And you can define time step however you like.
6173
05:13:07,200 --> 05:13:10,280
A simple way is just to use days as your time step.
6174
05:13:10,280 --> 05:13:13,840
And so we can define a variable called x sub t, which
6175
05:13:13,840 --> 05:13:16,280
is going to be the weather at time t.
6176
05:13:16,280 --> 05:13:19,200
So x sub 0 might be the weather on day 0.
6177
05:13:19,200 --> 05:13:22,040
x sub 1 might be the weather on day 1, so on and so forth.
6178
05:13:22,040 --> 05:13:24,720
x sub 2 is the weather on day 2.
6179
05:13:24,720 --> 05:13:26,560
But as you can imagine, if we start to do this
6180
05:13:26,560 --> 05:13:28,560
over longer and longer periods of time, there's
6181
05:13:28,560 --> 05:13:30,840
an incredible amount of data that might go into this.
6182
05:13:30,840 --> 05:13:33,600
If you're keeping track of data about the weather for a year,
6183
05:13:33,600 --> 05:13:36,400
now suddenly you might be trying to predict the weather tomorrow,
6184
05:13:36,400 --> 05:13:40,000
given 365 days of previous pieces of evidence.
6185
05:13:40,000 --> 05:13:43,200
And that's a lot of evidence to have to deal with and manipulate and calculate.
6186
05:13:43,200 --> 05:13:47,080
Probably nobody knows what the exact conditional probability distribution
6187
05:13:47,080 --> 05:13:49,880
is for all of those combinations of variables.
6188
05:13:49,880 --> 05:13:52,560
And so when we're trying to do this inference inside of a computer,
6189
05:13:52,560 --> 05:13:56,280
when we're trying to reasonably do this sort of analysis,
6190
05:13:56,280 --> 05:13:58,800
it's helpful to make some simplifying assumptions,
6191
05:13:58,800 --> 05:14:01,920
some assumptions about the problem that we can just assume are true,
6192
05:14:01,920 --> 05:14:03,600
to make our lives a little bit easier.
6193
05:14:03,600 --> 05:14:05,920
Even if they're not totally accurate assumptions,
6194
05:14:05,920 --> 05:14:09,520
if they're close to accurate or approximate, they're usually pretty good.
6195
05:14:09,520 --> 05:14:13,160
And the assumption we're going to make is called the Markov assumption, which
6196
05:14:13,160 --> 05:14:16,640
is the assumption that the current state depends only
6197
05:14:16,640 --> 05:14:19,880
on a finite fixed number of previous states.
6198
05:14:19,880 --> 05:14:23,880
So the current day's weather depends not on all the previous day's weather
6199
05:14:23,880 --> 05:14:26,720
for the rest of all of history, but the current day's weather
6200
05:14:26,720 --> 05:14:29,520
I can predict just based on yesterday's weather,
6201
05:14:29,520 --> 05:14:32,680
or just based on the last two days' weather, or the last three days' weather.
6202
05:14:32,680 --> 05:14:36,960
But oftentimes, we're going to deal with just the one previous state
6203
05:14:36,960 --> 05:14:39,720
that helps to predict this current state.
6204
05:14:39,720 --> 05:14:42,280
And by putting a whole bunch of these random variables together,
6205
05:14:42,280 --> 05:14:46,120
using this Markov assumption, we can create what's called a Markov chain,
6206
05:14:46,120 --> 05:14:49,560
where a Markov chain is just some sequence of random variables
6207
05:14:49,560 --> 05:14:53,120
where each variable's distribution follows that Markov assumption.
6208
05:14:53,120 --> 05:14:56,040
And so we'll do an example of this where the Markov assumption is,
6209
05:14:56,040 --> 05:14:57,200
I can predict the weather.
6210
05:14:57,200 --> 05:14:58,760
Is it sunny or rainy?
6211
05:14:58,760 --> 05:15:01,160
And we'll just consider those two possibilities for now,
6212
05:15:01,160 --> 05:15:02,920
even though there are other types of weather.
6213
05:15:02,920 --> 05:15:06,280
But I can predict each day's weather just from the prior day's weather:
6214
05:15:06,280 --> 05:15:10,040
using today's weather, I can come up with a probability distribution
6215
05:15:10,040 --> 05:15:11,480
for tomorrow's weather.
6216
05:15:11,480 --> 05:15:13,320
And here's what this weather might look like.
6217
05:15:13,320 --> 05:15:16,640
It's formatted in terms of a matrix, as you might describe it,
6218
05:15:16,640 --> 05:15:21,040
as rows and columns of values, where on the left-hand side,
6219
05:15:21,040 --> 05:15:25,480
I have today's weather, represented by the variable x sub t.
6220
05:15:25,480 --> 05:15:28,360
And over here in the columns, I have tomorrow's weather,
6221
05:15:28,360 --> 05:15:34,440
represented by the variable x sub t plus 1, day t plus 1's weather instead.
6222
05:15:34,440 --> 05:15:38,600
And what this matrix is saying is, if today is sunny,
6223
05:15:38,600 --> 05:15:42,040
well, then it's more likely than not that tomorrow is also sunny.
6224
05:15:42,040 --> 05:15:45,520
Oftentimes, the weather stays consistent for multiple days in a row.
6225
05:15:45,520 --> 05:15:47,840
And for example, let's say that if today is sunny,
6226
05:15:47,840 --> 05:15:52,440
our model says that tomorrow, with probability 0.8, it will also be sunny.
6227
05:15:52,440 --> 05:15:55,240
And with probability 0.2, it will be raining.
6228
05:15:55,240 --> 05:15:59,920
And likewise, if today is raining, then it's more likely than not
6229
05:15:59,920 --> 05:16:01,120
that tomorrow is also raining.
6230
05:16:01,120 --> 05:16:06,320
With probability 0.7, it'll be raining. With probability 0.3, it will be sunny.
6231
05:16:06,320 --> 05:16:10,760
So this matrix, this description of how it is we transition from one state
6232
05:16:10,760 --> 05:16:14,160
to the next state is what we're going to call the transition model.
6233
05:16:14,160 --> 05:16:16,680
And using the transition model, you can begin
6234
05:16:16,680 --> 05:16:20,360
to construct this Markov chain by just predicting,
6235
05:16:20,360 --> 05:16:23,300
given today's weather, what's the likelihood of tomorrow's weather
6236
05:16:23,300 --> 05:16:23,800
happening.
6237
05:16:23,800 --> 05:16:27,500
And you can imagine doing a similar sampling procedure,
6238
05:16:27,500 --> 05:16:30,880
where you take this information, you sample what tomorrow's weather is
6239
05:16:30,880 --> 05:16:31,600
going to be.
6240
05:16:31,600 --> 05:16:33,640
Using that, you sample the next day's weather.
6241
05:16:33,640 --> 05:16:38,040
And the result of that is you can form this Markov chain of x0,
6242
05:16:38,040 --> 05:16:40,760
x1, x2, and so on, where day zero is sunny, the next day is sunny,
6243
05:16:40,760 --> 05:16:43,880
maybe the next day it changes to raining, then raining, then raining.
6244
05:16:43,880 --> 05:16:46,600
And the pattern that this Markov chain follows,
6245
05:16:46,600 --> 05:16:50,320
given the distribution that we had access to, this transition model here,
6246
05:16:50,320 --> 05:16:53,280
is that when it's sunny, it tends to stay sunny for a little while.
6247
05:16:53,280 --> 05:16:55,760
The next couple of days tend to be sunny too.
6248
05:16:55,760 --> 05:16:59,360
And when it's raining, it tends to be raining as well.
6249
05:16:59,360 --> 05:17:01,400
And so you get a Markov chain that looks like this,
6250
05:17:01,400 --> 05:17:02,720
and you can do analysis on this.
6251
05:17:02,720 --> 05:17:06,380
You can say, given that today is raining, what is the probability
6252
05:17:06,380 --> 05:17:07,420
that tomorrow is raining?
6253
05:17:07,420 --> 05:17:09,400
Or you can begin to ask probability questions
6254
05:17:09,400 --> 05:17:13,600
like, what is the probability of this sequence of five values, sun, sun,
6255
05:17:13,600 --> 05:17:17,120
rain, rain, rain, and answer those sorts of questions too.
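A question like that last one falls straight out of the transition model: multiply the starting probability of the first state by one transition probability per consecutive pair of days. A short sketch, using the 0.8/0.2 and 0.7/0.3 transition probabilities from the lecture and assuming a 50-50 starting distribution:

```python
# Probability of the sequence sun, sun, rain, rain, rain under the
# lecture's transition model, with an assumed 50-50 starting distribution.
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sequence_probability(states):
    p = start[states[0]]
    # multiply in one transition probability per consecutive pair of days
    for today, tomorrow in zip(states, states[1:]):
        p *= transition[today][tomorrow]
    return p

# 0.5 * 0.8 * 0.2 * 0.7 * 0.7 = 0.0392
print(sequence_probability(["sun", "sun", "rain", "rain", "rain"]))
```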
6256
05:17:17,120 --> 05:17:19,640
And it turns out there are, again, many Python libraries
6257
05:17:19,640 --> 05:17:23,160
for interacting with models like this of probabilities
6258
05:17:23,160 --> 05:17:25,320
that have distributions and random variables that
6259
05:17:25,320 --> 05:17:29,340
are based on previous variables according to this Markov assumption.
6260
05:17:29,340 --> 05:17:32,720
And pomegranate, too, has ways of dealing with these sorts of variables.
6261
05:17:32,720 --> 05:17:39,440
So I'll go ahead and go into the chain directory,
6262
05:17:39,440 --> 05:17:42,200
where I have some information about Markov chains.
6263
05:17:42,200 --> 05:17:45,240
And here, I've defined a file called model.py,
6264
05:17:45,240 --> 05:17:47,960
where I've defined this model in a very similar syntax.
6265
05:17:47,960 --> 05:17:50,720
And again, the exact syntax doesn't matter so much as the idea
6266
05:17:50,720 --> 05:17:54,080
that I'm encoding this information into a Python program
6267
05:17:54,080 --> 05:17:56,940
so that the program has access to these distributions.
6268
05:17:56,940 --> 05:17:59,560
I've here defined some starting distribution.
6269
05:17:59,560 --> 05:18:02,640
So every Markov model begins at some point in time,
6270
05:18:02,640 --> 05:18:04,720
and I need to give it some starting distribution.
6271
05:18:04,720 --> 05:18:08,480
And so we'll just say, you know, at the start, you can pick 50-50 between sunny
6272
05:18:08,480 --> 05:18:09,120
and rainy.
6273
05:18:09,120 --> 05:18:13,000
We'll say it's sunny 50% of the time, rainy 50% of the time.
6274
05:18:13,000 --> 05:18:16,080
And then down below, I've here defined the transition model,
6275
05:18:16,080 --> 05:18:19,320
how it is that I transition from one day to the next.
6276
05:18:19,320 --> 05:18:22,160
And here, I've encoded that exact same matrix from before,
6277
05:18:22,160 --> 05:18:24,840
that if it was sunny today, then with probability 0.8,
6278
05:18:24,840 --> 05:18:26,280
it will be sunny tomorrow.
6279
05:18:26,280 --> 05:18:29,180
And it'll be rainy tomorrow with probability 0.2.
6280
05:18:29,180 --> 05:18:34,400
And I likewise have another distribution for if it was raining today instead.
6281
05:18:34,400 --> 05:18:36,640
And so that alone defines the Markov model.
6282
05:18:36,640 --> 05:18:39,040
You can begin to answer questions using that model.
6283
05:18:39,040 --> 05:18:42,320
But one thing I'll just do is sample from the Markov chain.
6284
05:18:42,320 --> 05:18:45,640
It turns out there is a method built into this Markov chain library
6285
05:18:45,640 --> 05:18:48,120
that allows me to sample 50 states from the chain,
6286
05:18:48,120 --> 05:18:52,640
basically just simulating like 50 instances of weather.
6287
05:18:52,640 --> 05:18:54,400
And so let me go ahead and run this.
6288
05:18:54,400 --> 05:18:57,840
Python model.py.
6289
05:18:57,840 --> 05:18:59,920
And when I run it, what I get is that it's
6290
05:18:59,920 --> 05:19:04,480
going to sample from this Markov chain 50 states, 50 days worth of weather
6291
05:19:04,480 --> 05:19:06,240
that it's just going to randomly sample.
6292
05:19:06,240 --> 05:19:09,040
And you can imagine sampling many times to be able to get more data,
6293
05:19:09,040 --> 05:19:10,480
to be able to do more analysis.
6294
05:19:10,480 --> 05:19:13,800
But here, for example, it's sunny two days in a row,
6295
05:19:13,800 --> 05:19:17,000
rainy a whole bunch of days in a row before it changes back to sun.
6296
05:19:17,000 --> 05:19:20,080
And so you get this model that follows the distribution
6297
05:19:20,080 --> 05:19:23,600
that we originally described, that follows the distribution of sunny days
6298
05:19:23,600 --> 05:19:25,240
tend to lead to more sunny days.
6299
05:19:25,240 --> 05:19:29,400
Rainy days tend to lead to more rainy days.
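The sampling loop that model.py runs through the library can also be sketched by hand, which makes the idea concrete without depending on any particular library API. This is an illustrative stand-in, not the course's pomegranate code:

```python
import random

# Sample 50 days of weather from the Markov chain by repeatedly
# applying the transition model, starting from a 50-50 distribution.
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample_chain(n_days, seed=None):
    rng = random.Random(seed)
    state = rng.choices(list(start), weights=list(start.values()))[0]
    states = [state]
    for _ in range(n_days - 1):
        nxt = transition[state]  # distribution over tomorrow given today
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        states.append(state)
    return states

print(sample_chain(50))
```

Running it produces exactly the behavior described above: runs of sunny days and runs of rainy days, because each state tends to persist under this transition model.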
6300
05:19:29,400 --> 05:19:31,680
And that then is a Markov model.
6301
05:19:31,680 --> 05:19:34,800
And Markov models rely on us knowing the values
6302
05:19:34,800 --> 05:19:35,880
of these individual states.
6303
05:19:35,880 --> 05:19:38,640
I know that today is sunny or that today is raining.
6304
05:19:38,640 --> 05:19:41,880
And using that information, I can draw some sort of inference
6305
05:19:41,880 --> 05:19:44,320
about what tomorrow is going to be like.
6306
05:19:44,320 --> 05:19:46,640
But in practice, this often isn't the case.
6307
05:19:46,640 --> 05:19:49,200
It often isn't the case that I know for certain what
6308
05:19:49,200 --> 05:19:51,280
the exact state of the world is.
6309
05:19:51,280 --> 05:19:54,320
Oftentimes, the state of the world is exactly unknown.
6310
05:19:54,320 --> 05:19:58,120
But I'm able to somehow sense some information about that state,
6311
05:19:58,120 --> 05:20:01,040
that a robot or an AI doesn't have exact knowledge
6312
05:20:01,040 --> 05:20:02,200
about the world around it.
6313
05:20:02,200 --> 05:20:05,120
But it has some sort of sensor, whether that sensor is a camera
6314
05:20:05,120 --> 05:20:09,040
or sensors that detect distance or just a microphone that is sensing audio,
6315
05:20:09,040 --> 05:20:09,920
for example.
6316
05:20:09,920 --> 05:20:11,400
It is sensing data.
6317
05:20:11,400 --> 05:20:14,240
And using that data, that data is somehow related
6318
05:20:14,240 --> 05:20:17,000
to the state of the world, even if it doesn't actually know,
6319
05:20:17,000 --> 05:20:20,720
our AI doesn't know, what the underlying true state of the world
6320
05:20:20,720 --> 05:20:22,200
actually is.
6321
05:20:22,200 --> 05:20:25,120
And for that, we need to get into the world of sensor models,
6322
05:20:25,120 --> 05:20:28,040
the way of describing how it is that we translate
6323
05:20:28,040 --> 05:20:31,200
what the hidden state, the underlying true state of the world,
6324
05:20:31,200 --> 05:20:36,120
is with what the observation, what it is that the AI knows or the AI has
6325
05:20:36,120 --> 05:20:38,360
access to, actually is.
6326
05:20:38,360 --> 05:20:42,520
And so for example, a hidden state might be a robot's position.
6327
05:20:42,520 --> 05:20:45,160
If a robot is exploring new uncharted territory,
6328
05:20:45,160 --> 05:20:48,240
the robot likely doesn't know exactly where it is.
6329
05:20:48,240 --> 05:20:49,640
But it does have an observation.
6330
05:20:49,640 --> 05:20:52,920
It has robot sensor data, where it can sense how far away
6331
05:20:52,920 --> 05:20:54,880
are possible obstacles around it.
6332
05:20:54,880 --> 05:20:58,880
And using that information, using the observed information that it has,
6333
05:20:58,880 --> 05:21:01,920
it can infer something about the hidden state.
6334
05:21:01,920 --> 05:21:05,880
Because what the true hidden state is influences those observations.
6335
05:21:05,880 --> 05:21:10,160
Whatever the robot's true position is affects, or has some effect
6336
05:21:10,160 --> 05:21:13,480
upon, what sensor data the robot is able to collect,
6337
05:21:13,480 --> 05:21:18,720
even if the robot doesn't actually know for certain what its true position is.
6338
05:21:18,720 --> 05:21:21,960
Likewise, if you think about a voice recognition or a speech recognition
6339
05:21:21,960 --> 05:21:25,280
program that listens to you and is able to respond to you, something
6340
05:21:25,280 --> 05:21:29,640
like Alexa or what Apple and Google are doing with their voice recognition
6341
05:21:29,640 --> 05:21:33,720
as well, that you might imagine that the hidden state, the underlying state,
6342
05:21:33,720 --> 05:21:35,360
is what words are actually spoken.
6343
05:21:35,360 --> 05:21:38,240
The true nature of the world contains you saying
6344
05:21:38,240 --> 05:21:42,920
a particular sequence of words, but your phone or your smart home device
6345
05:21:42,920 --> 05:21:45,560
doesn't know for sure exactly what words you said.
6346
05:21:45,560 --> 05:21:50,720
The only observation that the AI has access to is some audio waveforms.
6347
05:21:50,720 --> 05:21:54,800
And those audio waveforms are, of course, dependent upon this hidden state.
6348
05:21:54,800 --> 05:21:57,560
And you can infer, based on those audio waveforms,
6349
05:21:57,560 --> 05:22:00,160
what the words spoken likely were.
6350
05:22:00,160 --> 05:22:04,600
But you might not know with 100% certainty what that hidden state actually
6351
05:22:04,600 --> 05:22:05,100
is.
6352
05:22:05,100 --> 05:22:08,440
And it might be a task to try and predict, given this observation,
6353
05:22:08,440 --> 05:22:12,600
given these audio waveforms, can you figure out what the actual words spoken
6354
05:22:12,600 --> 05:22:13,760
are.
6355
05:22:13,760 --> 05:22:16,680
And likewise, you might imagine on a website, true user engagement
6356
05:22:16,680 --> 05:22:19,160
might be information you don't directly have access to.
6357
05:22:19,160 --> 05:22:22,060
But you can observe data, like website or app analytics,
6358
05:22:22,060 --> 05:22:25,280
about how often was this button clicked or how often are people interacting
6359
05:22:25,280 --> 05:22:26,840
with a page in a particular way.
6360
05:22:26,840 --> 05:22:30,840
And you can use that to infer things about your users as well.
6361
05:22:30,840 --> 05:22:33,440
So this type of problem comes up all the time
6362
05:22:33,440 --> 05:22:36,400
when we're dealing with AI and trying to infer things about the world.
6363
05:22:36,400 --> 05:22:40,400
That often AI doesn't really know the hidden true state of the world.
6364
05:22:40,400 --> 05:22:43,560
All the AI has access to is some observation
6365
05:22:43,560 --> 05:22:45,920
that is related to the hidden true state.
6366
05:22:45,920 --> 05:22:47,080
But it's not direct.
6367
05:22:47,080 --> 05:22:48,440
There might be some noise there.
6368
05:22:48,440 --> 05:22:50,720
The audio waveform might have some additional noise
6369
05:22:50,720 --> 05:22:52,000
that might be difficult to parse.
6370
05:22:52,000 --> 05:22:54,560
The sensor data might not be exactly correct.
6371
05:22:54,560 --> 05:22:57,760
There's some noise that might not allow you to conclude with certainty what
6372
05:22:57,760 --> 05:23:01,880
the hidden state is, but can allow you to infer what it might be.
6373
05:23:01,880 --> 05:23:04,040
And so the simple example we'll take a look at here
6374
05:23:04,040 --> 05:23:07,040
is imagining the hidden state as the weather, whether it's sunny or rainy
6375
05:23:07,040 --> 05:23:07,720
or not.
6376
05:23:07,720 --> 05:23:11,360
And imagine you are programming an AI inside of a building that maybe has
6377
05:23:11,360 --> 05:23:14,400
access to just a camera inside the building.
6378
05:23:14,400 --> 05:23:17,280
And all you have access to is an observation
6379
05:23:17,280 --> 05:23:19,600
as to whether or not employees are bringing
6380
05:23:19,600 --> 05:23:21,440
an umbrella into the building or not.
6381
05:23:21,440 --> 05:23:24,000
You can detect whether it's an umbrella or not.
6382
05:23:24,000 --> 05:23:26,640
And so you might have an observation as to whether or not
6383
05:23:26,640 --> 05:23:28,960
an umbrella is brought into the building or not.
6384
05:23:28,960 --> 05:23:32,840
And using that information, you want to predict whether it's sunny or rainy,
6385
05:23:32,840 --> 05:23:35,600
even if you don't know what the underlying weather is.
6386
05:23:35,600 --> 05:23:37,680
So the underlying weather might be sunny or rainy.
6387
05:23:37,680 --> 05:23:41,120
And if it's raining, obviously people are more likely to bring an umbrella.
6388
05:23:41,120 --> 05:23:44,320
And so whether or not people bring an umbrella, your observation,
6389
05:23:44,320 --> 05:23:46,560
tells you something about the hidden state.
6390
05:23:46,560 --> 05:23:48,600
And of course, this is a bit of a contrived example,
6391
05:23:48,600 --> 05:23:51,640
but the idea here is to think about this more
6392
05:23:51,640 --> 05:23:54,000
broadly: any time you observe something,
6393
05:23:54,000 --> 05:23:57,680
it has to do with some underlying hidden state.
6394
05:23:57,680 --> 05:23:59,720
And so to try and model this type of idea where
6395
05:23:59,720 --> 05:24:02,000
we have these hidden states and observations,
6396
05:24:02,000 --> 05:24:05,320
rather than just use a Markov model, which has state, state, state, state,
6397
05:24:05,320 --> 05:24:08,560
each of which is connected by that transition matrix that we described
6398
05:24:08,560 --> 05:24:12,280
before, we're going to use what we call a hidden Markov model.
6399
05:24:12,280 --> 05:24:14,600
Very similar to a Markov model, but this is going
6400
05:24:14,600 --> 05:24:17,560
to allow us to model a system that has hidden states
6401
05:24:17,560 --> 05:24:21,160
that we don't directly observe, along with some observed event
6402
05:24:21,160 --> 05:24:23,360
that we do actually see.
6403
05:24:23,360 --> 05:24:25,800
And so in addition to that transition model that we still
6404
05:24:25,800 --> 05:24:28,400
need of saying, given the underlying state of the world,
6405
05:24:28,400 --> 05:24:32,080
if it's sunny or rainy, what's the probability of tomorrow's weather?
6406
05:24:32,080 --> 05:24:35,800
We also need another model that, given some state,
6407
05:24:35,800 --> 05:24:38,920
is going to give us an observation of green, yes, someone brings
6408
05:24:38,920 --> 05:24:43,560
an umbrella into the office, or red, no, nobody brings umbrellas into the office.
6409
05:24:43,560 --> 05:24:46,840
And so the observation might be that if it's sunny,
6410
05:24:46,840 --> 05:24:49,400
then odds are nobody is going to bring an umbrella to the office.
6411
05:24:49,400 --> 05:24:51,400
But maybe some people are just being cautious,
6412
05:24:51,400 --> 05:24:54,120
and they do bring an umbrella to the office anyways.
6413
05:24:54,120 --> 05:24:57,400
And if it's raining, then with much higher probability,
6414
05:24:57,400 --> 05:24:59,720
people are going to bring umbrellas into the office.
6415
05:24:59,720 --> 05:25:02,900
But maybe if the rain was unexpected, people didn't bring an umbrella.
6416
05:25:02,900 --> 05:25:05,520
And so it might have some other probability as well.
6417
05:25:05,520 --> 05:25:07,560
And so using the observations, you can begin
6418
05:25:07,560 --> 05:25:11,680
to predict with reasonable likelihood what the underlying state is,
6419
05:25:11,680 --> 05:25:15,080
even if you don't actually get to observe the underlying state,
6420
05:25:15,080 --> 05:25:18,640
if you don't get to see what the hidden state is actually equal to.
6421
05:25:18,640 --> 05:25:21,040
This here we'll often call the sensor model.
6422
05:25:21,040 --> 05:25:23,920
It's also often called the emission probabilities,
6423
05:25:23,920 --> 05:25:27,760
because the state, the underlying state, emits some sort of emission
6424
05:25:27,760 --> 05:25:29,160
that you then observe.
6425
05:25:29,160 --> 05:25:32,840
And so that can be another way of describing that same idea.
6426
05:25:32,840 --> 05:25:35,480
And the sensor Markov assumption that we're going to use
6427
05:25:35,480 --> 05:25:38,960
is this assumption that the evidence variable, the thing we observe,
6428
05:25:38,960 --> 05:25:43,120
the emission that gets produced, depends only on the corresponding state,
6429
05:25:43,120 --> 05:25:46,600
meaning I can predict whether or not people will bring umbrellas
6430
05:25:46,600 --> 05:25:50,920
based entirely on whether it is sunny or rainy today.
6431
05:25:50,920 --> 05:25:53,560
Of course, again, this assumption might not hold in practice,
6432
05:25:53,560 --> 05:25:55,680
that in practice, whether or not
6433
05:25:55,680 --> 05:25:58,240
people bring umbrellas might depend not just on today's weather,
6434
05:25:58,240 --> 05:26:00,560
but also on yesterday's weather and the day before.
6435
05:26:00,560 --> 05:26:04,480
But for simplification purposes, it can be helpful to apply this sort
6436
05:26:04,480 --> 05:26:07,000
of assumption just to allow us to be able to reason
6437
05:26:07,000 --> 05:26:09,680
about these probabilities a little more easily.
6438
05:26:09,680 --> 05:26:14,440
And if we're able to approximate it, we can still often get a very good answer.
6439
05:26:14,440 --> 05:26:16,960
And so what these hidden Markov models end up looking like
6440
05:26:16,960 --> 05:26:20,000
is a little something like this, where now, rather than just have
6441
05:26:20,000 --> 05:26:23,520
one chain of states, like sun, sun, rain, rain, rain,
6442
05:26:23,520 --> 05:26:29,280
we instead have this upper level, which is the underlying state of the world.
6443
05:26:29,280 --> 05:26:30,560
Is it sunny or is it rainy?
6444
05:26:30,560 --> 05:26:34,360
And those are connected by that transition matrix we described before.
6445
05:26:34,360 --> 05:26:37,160
But each of these states produces an emission,
6446
05:26:37,160 --> 05:26:41,200
produces an observation that I see, that on this day, it was sunny
6447
05:26:41,200 --> 05:26:43,200
and people didn't bring umbrellas.
6448
05:26:43,200 --> 05:26:46,000
And on this day, it was sunny, but people did bring umbrellas.
6449
05:26:46,000 --> 05:26:48,160
And on this day, it was raining and people did bring umbrellas,
6450
05:26:48,160 --> 05:26:49,680
and so on and so forth.
6451
05:26:49,680 --> 05:26:52,560
And so each of these underlying states represented
6452
05:26:52,560 --> 05:26:56,400
by x sub t, for t equals 0, 1, 2, so on and so forth,
6453
05:26:56,400 --> 05:26:59,000
produces some sort of observation or emission,
6454
05:26:59,000 --> 05:27:04,320
which is what the e stands for, e sub 0, e sub 1, e sub 2, so on and so forth.
6455
05:27:04,320 --> 05:27:07,600
And so this, too, is a way of trying to represent this idea.
6456
05:27:07,600 --> 05:27:10,240
And what you want to think about is that these underlying states are
6457
05:27:10,240 --> 05:27:14,360
the true nature of the world, the robot's position as it moves over time,
6458
05:27:14,360 --> 05:27:17,720
and that produces some sort of sensor data that might be observed,
6459
05:27:17,720 --> 05:27:21,640
or what people are actually saying, using the emission data of what
6460
05:27:21,640 --> 05:27:24,880
audio waveforms you detect in order to process that data
6461
05:27:24,880 --> 05:27:26,200
and try and figure it out.
6462
05:27:26,200 --> 05:27:29,440
And there are a number of possible tasks that you might want to do
6463
05:27:29,440 --> 05:27:30,800
given this kind of information.
6464
05:27:30,800 --> 05:27:33,720
And one of the simplest is trying to infer something
6465
05:27:33,720 --> 05:27:37,520
about the future or the past or about these sort of hidden states that
6466
05:27:37,520 --> 05:27:38,560
might exist.
6467
05:27:38,560 --> 05:27:40,520
And so the tasks that you'll often see, and we're not
6468
05:27:40,520 --> 05:27:42,520
going to go into the mathematics of these tasks,
6469
05:27:42,520 --> 05:27:45,960
but they're all based on the same idea of conditional probabilities
6470
05:27:45,960 --> 05:27:48,440
and using the probability distributions we
6471
05:27:48,440 --> 05:27:51,200
have to draw these sorts of conclusions.
6472
05:27:51,200 --> 05:27:55,440
One task is called filtering, which is given observations from the start
6473
05:27:55,440 --> 05:27:59,320
until now, calculate the distribution for the current state,
6474
05:27:59,320 --> 05:28:03,360
meaning given information about from the beginning of time until now,
6475
05:28:03,360 --> 05:28:06,720
on which days do people bring an umbrella or not bring an umbrella,
6476
05:28:06,720 --> 05:28:10,280
can I calculate the probability of the current state that today,
6477
05:28:10,280 --> 05:28:12,440
is it sunny or is it raining?
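The filtering task is classically handled by the forward algorithm: alternate between pushing the current belief through the transition model and reweighting it by the likelihood of each new observation. A minimal sketch, using the lecture's transition probabilities; the emission (umbrella) probabilities here are assumed purely for illustration:

```python
# Forward-algorithm filtering: given umbrella observations from day 0
# until now, compute the distribution over today's hidden weather state.
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}
emission = {  # P(observation | hidden state); assumed numbers
    "sun":  {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

def filtering(observations):
    # initialize with P(state) * P(first observation | state)
    belief = {s: start[s] * emission[s][observations[0]] for s in start}
    for obs in observations[1:]:
        # predict: push the belief through the transition model,
        # then update: weight by the likelihood of the new observation
        belief = {
            s: emission[s][obs] * sum(belief[p] * transition[p][s] for p in belief)
            for s in start
        }
    z = sum(belief.values())  # normalize to a probability distribution
    return {s: b / z for s, b in belief.items()}

print(filtering(["umbrella", "umbrella"]))
```

With these assumed numbers, two umbrella days in a row leave the belief leaning heavily toward rain.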
6478
05:28:12,440 --> 05:28:14,640
Another task that might be possible is prediction,
6479
05:28:14,640 --> 05:28:16,240
which is looking towards the future.
6480
05:28:16,240 --> 05:28:18,640
Given observations about people bringing umbrellas
6481
05:28:18,640 --> 05:28:22,240
from the beginning of when we started counting time until now,
6482
05:28:22,240 --> 05:28:25,600
can I figure out the distribution for tomorrow: is it sunny or is it
6483
05:28:25,600 --> 05:28:26,680
raining?
6484
05:28:26,680 --> 05:28:29,520
And you can also go backwards as well via smoothing,
6485
05:28:29,520 --> 05:28:32,560
where I can say given observations from start until now,
6486
05:28:32,560 --> 05:28:35,360
calculate the distributions for some past state.
6487
05:28:35,360 --> 05:28:38,920
Like I know that today people brought umbrellas and tomorrow people
6488
05:28:38,920 --> 05:28:39,920
brought umbrellas.
6489
05:28:39,920 --> 05:28:42,760
And so given two days worth of data of people bringing umbrellas,
6490
05:28:42,760 --> 05:28:45,720
what's the probability that yesterday it was raining?
6491
05:28:45,720 --> 05:28:47,880
And that I know that people brought umbrellas today,
6492
05:28:47,880 --> 05:28:50,160
that might inform that decision as well.
6493
05:28:50,160 --> 05:28:52,680
It might influence those probabilities.
6494
05:28:52,680 --> 05:28:56,280
And there's also a most likely explanation task,
6495
05:28:56,280 --> 05:28:58,560
in addition to other tasks that might exist as well, which
6496
05:28:58,560 --> 05:29:01,720
is, combining some of these: given observations from the start up
6497
05:29:01,720 --> 05:29:04,960
until now, figuring out the most likely sequence of states.
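As a concrete illustration of the filtering task described above, here is a minimal self-contained sketch of the forward algorithm. The emission probabilities (0.2/0.8 for sun, 0.9/0.1 for rain) and the 50-50 starting distribution come from the model discussed in this lecture; the specific transition numbers are assumptions, chosen only to be consistent with "tomorrow is more likely to be the same as today."

```python
# Sketch of filtering for the umbrella hidden Markov model.
# Transition values below are assumed, not taken from the lecture.
start = {"sun": 0.5, "rain": 0.5}
transition = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.3, "rain": 0.7}}
emission = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
            "rain": {"umbrella": 0.9, "no umbrella": 0.1}}

def filter_current(observations):
    """Forward algorithm: distribution over the current hidden state,
    given all observations from the start until now."""
    # Initial update: prior belief times how well each state
    # explains the first observation.
    belief = {s: start[s] * emission[s][observations[0]] for s in start}
    for obs in observations[1:]:
        # Predict one step forward through the transition model,
        # then weight by the new observation.
        belief = {s: emission[s][obs] *
                     sum(belief[p] * transition[p][s] for p in belief)
                  for s in belief}
    # Normalize so the values form a probability distribution.
    total = sum(belief.values())
    return {s: v / total for s, v in belief.items()}

print(filter_current(["umbrella", "umbrella"]))
```

After two umbrella days, most of the probability mass ends up on rain, matching the intuition in the lecture.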
6498
05:29:04,960 --> 05:29:07,960
And this is what we're going to take a look at now, this idea that if I
6499
05:29:07,960 --> 05:29:11,560
have all these observations, umbrella, no umbrella, umbrella, no umbrella,
6500
05:29:11,560 --> 05:29:15,880
can I calculate the most likely states of sun, rain, sun, rain, and whatnot
6501
05:29:15,880 --> 05:29:18,520
that actually represented the true weather that
6502
05:29:18,520 --> 05:29:20,680
would produce these observations?
6503
05:29:20,680 --> 05:29:23,960
And this is quite common when you're trying to do something like voice
6504
05:29:23,960 --> 05:29:27,520
recognition, for example, that you have these emissions of the audio waveforms,
6505
05:29:27,520 --> 05:29:30,560
and you would like to calculate based on all of the observations
6506
05:29:30,560 --> 05:29:34,480
that you have, what is the most likely sequence of actual words, or syllables,
6507
05:29:34,480 --> 05:29:38,200
or sounds that the user actually made when they were speaking
6508
05:29:38,200 --> 05:29:41,760
to this particular device, or other tasks that might come up in that context
6509
05:29:41,760 --> 05:29:43,000
as well.
6510
05:29:43,000 --> 05:29:47,680
And so we can try this out by going ahead and going into the HMM directory,
6511
05:29:47,680 --> 05:29:50,800
HMM for Hidden Markov Model.
6512
05:29:50,800 --> 05:29:57,160
And here, what I've done is I've defined a model where this model first defines
6513
05:29:57,160 --> 05:30:02,200
my possible states, sun and rain, along with their emission probabilities,
6514
05:30:02,200 --> 05:30:06,240
the observation model, or the emission model, where here, given
6515
05:30:06,240 --> 05:30:09,040
that I know that it's sunny, the probability
6516
05:30:09,040 --> 05:30:11,680
that I see people bring an umbrella is 0.2,
6517
05:30:11,680 --> 05:30:14,560
the probability of no umbrella is 0.8.
6518
05:30:14,560 --> 05:30:16,600
And likewise, if it's raining, then people
6519
05:30:16,600 --> 05:30:18,000
are more likely to bring an umbrella.
6520
05:30:18,000 --> 05:30:21,720
Umbrella has probability 0.9, no umbrella has probability 0.1.
6521
05:30:21,720 --> 05:30:26,520
So the actual underlying hidden states, those states are sun and rain,
6522
05:30:26,520 --> 05:30:29,560
but the things that I observe, the observations that I can see,
6523
05:30:29,560 --> 05:30:35,320
are either umbrella or no umbrella as the things that I observe as a result.
6524
05:30:35,320 --> 05:30:39,840
So this then, I also need to add to it a transition matrix, same as before,
6525
05:30:39,840 --> 05:30:43,640
saying that if today is sunny, then tomorrow is more likely to be sunny.
6526
05:30:43,640 --> 05:30:47,000
And if today is rainy, then tomorrow is more likely to be raining.
6527
05:30:47,000 --> 05:30:49,320
As before, I give it some starting probabilities,
6528
05:30:49,320 --> 05:30:53,120
saying at first, 50-50 chance for whether it's sunny or rainy.
6529
05:30:53,120 --> 05:30:56,640
And then I can create the model based on that information.
6530
05:30:56,640 --> 05:30:59,160
Again, the exact syntax of this is not so important,
6531
05:30:59,160 --> 05:31:02,600
so much as it is the data that I am now encoding into a program,
6532
05:31:02,600 --> 05:31:06,400
such that now I can begin to do some inference.
6533
05:31:06,400 --> 05:31:10,160
So I can give my program, for example, a list of observations,
6534
05:31:10,160 --> 05:31:13,560
umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth,
6535
05:31:13,560 --> 05:31:14,960
no umbrella, no umbrella.
6536
05:31:14,960 --> 05:31:18,080
And I would like to figure out the most likely
6537
05:31:18,080 --> 05:31:20,360
explanation for these observations.
6538
05:31:20,360 --> 05:31:23,600
What is most likely: was this rain, rain, rain,
6539
05:31:23,600 --> 05:31:25,960
or is it more likely that this was actually sunny,
6540
05:31:25,960 --> 05:31:28,000
and then it switched back to being rainy?
6541
05:31:28,000 --> 05:31:29,440
And that's an interesting question.
6542
05:31:29,440 --> 05:31:31,640
We might not be sure, because it might just
6543
05:31:31,640 --> 05:31:34,640
be that it just so happened on this rainy day,
6544
05:31:34,640 --> 05:31:36,560
people decided not to bring an umbrella.
6545
05:31:36,560 --> 05:31:40,360
Or it could be that it switched from rainy to sunny back to rainy,
6546
05:31:40,360 --> 05:31:43,680
which doesn't seem too likely, but it certainly could happen.
6547
05:31:43,680 --> 05:31:46,280
And using the data we give to the hidden Markov model,
6548
05:31:46,280 --> 05:31:49,840
our model can begin to predict these answers, can begin to figure it out.
6549
05:31:49,840 --> 05:31:53,400
So we're going to go ahead and just predict these observations.
6550
05:31:53,400 --> 05:31:56,080
And then for each of those predictions, go ahead and print out
6551
05:31:56,080 --> 05:31:56,880
what the prediction is.
6552
05:31:56,880 --> 05:31:59,400
And this library just so happens to have a function called
6553
05:31:59,400 --> 05:32:03,040
predict that does this prediction process for me.
6554
05:32:03,040 --> 05:32:06,240
So I'll run python sequence.py.
6555
05:32:06,240 --> 05:32:07,880
And the result I get is this.
6556
05:32:07,880 --> 05:32:10,640
This is the prediction based on the observations
6557
05:32:10,640 --> 05:32:12,680
of what all of those states are likely to be.
6558
05:32:12,680 --> 05:32:14,400
And it's likely to be rain and rain.
6559
05:32:14,400 --> 05:32:16,640
In this case, it thinks that what most likely happened
6560
05:32:16,640 --> 05:32:19,560
is that it was sunny for a day and then went back to being rainy.
6561
05:32:19,560 --> 05:32:22,320
But in different situations, if it was rainy for longer maybe,
6562
05:32:22,320 --> 05:32:24,280
or if the probabilities were slightly different,
6563
05:32:24,280 --> 05:32:27,840
you might imagine that it's more likely that it was rainy all the way through.
6564
05:32:27,840 --> 05:32:32,960
And it just so happened on one rainy day, people decided not to bring umbrellas.
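The library's `predict` function computes this most likely explanation with the Viterbi algorithm. As a rough, self-contained sketch of what that computation does (not the library's actual API): the emission and starting probabilities below come from the model in the lecture, while the transition numbers are assumed values consistent with "tomorrow is more likely to be the same."

```python
# Minimal Viterbi sketch for the umbrella hidden Markov model.
# Transition values below are assumptions, not from the lecture.
start = {"sun": 0.5, "rain": 0.5}
transition = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.3, "rain": 0.7}}
emission = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
            "rain": {"umbrella": 0.9, "no umbrella": 0.1}}

def most_likely_states(observations):
    """Return the most likely hidden state sequence (Viterbi algorithm)."""
    states = list(start)
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start[s] * emission[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((best[prev][0] * transition[prev][s] * emission[s][obs],
                  best[prev][1] + [s])
                 for prev in states),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

print(most_likely_states(
    ["umbrella", "umbrella", "no umbrella", "umbrella", "umbrella"]))
```

With these assumed transition numbers, the model may well prefer "rainy all the way through" over a one-day switch to sun, which is exactly the sensitivity to the probabilities that the lecture points out.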
6565
05:32:32,960 --> 05:32:35,360
And so here, too, Python libraries can begin
6566
05:32:35,360 --> 05:32:38,240
to allow for this sort of inference procedure.
6567
05:32:38,240 --> 05:32:40,480
And by taking what we know and by putting it
6568
05:32:40,480 --> 05:32:43,080
in terms of these tasks that already exist,
6569
05:32:43,080 --> 05:32:45,960
these general tasks that work with hidden Markov models,
6570
05:32:45,960 --> 05:32:50,160
then any time we can take an idea and formulate it as a hidden Markov model,
6571
05:32:50,160 --> 05:32:52,800
formulate it as something that has hidden states
6572
05:32:52,800 --> 05:32:55,320
and observed emissions that result from those states,
6573
05:32:55,320 --> 05:32:57,360
then we can take advantage of these algorithms
6574
05:32:57,360 --> 05:33:01,360
that are known to exist for trying to do this sort of inference.
6575
05:33:01,360 --> 05:33:05,360
So now we've seen a couple of ways that AI can begin to deal with uncertainty.
6576
05:33:05,360 --> 05:33:08,460
We've taken a look at probability and how we can use probability
6577
05:33:08,460 --> 05:33:11,840
to describe numerically things that are likely or more likely or less
6578
05:33:11,840 --> 05:33:14,640
likely to happen than other events or other variables.
6579
05:33:14,640 --> 05:33:17,400
And using that information, we can begin to construct
6580
05:33:17,400 --> 05:33:20,960
these standard types of models, things like Bayesian networks and Markov
6581
05:33:20,960 --> 05:33:25,200
chains and hidden Markov models that all allow us to be able to describe
6582
05:33:25,200 --> 05:33:27,720
how particular events relate to other events
6583
05:33:27,720 --> 05:33:30,800
or how the values of particular variables relate to other variables,
6584
05:33:30,800 --> 05:33:34,200
not for certain, but with some sort of probability distribution.
6585
05:33:34,200 --> 05:33:37,600
And by formulating things in terms of these models that already exist,
6586
05:33:37,600 --> 05:33:39,920
we can take advantage of Python libraries that
6587
05:33:39,920 --> 05:33:42,560
implement these sort of models already and allow us just
6588
05:33:42,560 --> 05:33:46,520
to be able to use them to produce some sort of resulting effect.
6589
05:33:46,520 --> 05:33:48,520
So all of this then allows our AI to begin
6590
05:33:48,520 --> 05:33:50,920
to deal with these sort of uncertain problems
6591
05:33:50,920 --> 05:33:53,360
so that our AI doesn't need to know things for certain
6592
05:33:53,360 --> 05:33:56,720
but can infer based on information it doesn't know.
6593
05:33:56,720 --> 05:33:59,560
Next time, we'll take a look at additional types of problems
6594
05:33:59,560 --> 05:34:02,520
that we can solve by taking advantage of AI-related algorithms,
6595
05:34:02,520 --> 05:34:05,760
even beyond the world of the types of problems we've already explored.
6596
05:34:05,760 --> 05:34:08,480
We'll see you next time.
6597
05:34:08,480 --> 05:34:27,360
OK.
6598
05:34:27,360 --> 05:34:30,120
Welcome back, everyone, to an introduction to artificial intelligence
6599
05:34:30,120 --> 05:34:31,080
with Python.
6600
05:34:31,080 --> 05:34:32,880
And now, so far, we've taken a look at a couple
6601
05:34:32,880 --> 05:34:34,600
of different types of problems.
6602
05:34:34,600 --> 05:34:36,320
We've seen classical search problems where
6603
05:34:36,320 --> 05:34:38,680
we're trying to get from an initial state to a goal
6604
05:34:38,680 --> 05:34:40,560
by figuring out some optimal path.
6605
05:34:40,560 --> 05:34:42,360
We've taken a look at adversarial search where
6606
05:34:42,360 --> 05:34:45,360
we have a game-playing agent that is trying to make the best move.
6607
05:34:45,360 --> 05:34:48,080
We've seen knowledge-based problems where we're trying to use logic
6608
05:34:48,080 --> 05:34:50,320
and inference to be able to figure out and draw
6609
05:34:50,320 --> 05:34:51,800
some additional conclusions.
6610
05:34:51,800 --> 05:34:54,400
And we've seen some probabilistic models as well where we might not
6611
05:34:54,400 --> 05:34:56,280
have certain information about the world,
6612
05:34:56,280 --> 05:34:59,480
but we want to use the knowledge about probabilities that we do have
6613
05:34:59,480 --> 05:35:01,480
to be able to draw some conclusions.
6614
05:35:01,480 --> 05:35:04,400
Today, we're going to turn our attention to another category of problems
6615
05:35:04,400 --> 05:35:08,480
generally known as optimization problems, where optimization is really
6616
05:35:08,480 --> 05:35:12,400
all about choosing the best option from a set of possible options.
6617
05:35:12,400 --> 05:35:14,680
And we've already seen optimization in some contexts,
6618
05:35:14,680 --> 05:35:17,140
like game-playing, where we're trying to create an AI that
6619
05:35:17,140 --> 05:35:19,760
chooses the best move out of a set of possible moves.
6620
05:35:19,760 --> 05:35:23,120
But what we'll take a look at today is a category of types of problems
6621
05:35:23,120 --> 05:35:25,360
and algorithms to solve them that can be used
6622
05:35:25,360 --> 05:35:29,720
in order to deal with a broader range of potential optimization problems.
6623
05:35:29,720 --> 05:35:32,040
And the first of the algorithms that we'll take a look at
6624
05:35:32,040 --> 05:35:34,280
is known as a local search.
6625
05:35:34,280 --> 05:35:36,320
And local search differs from search algorithms
6626
05:35:36,320 --> 05:35:38,960
we've seen before in the sense that the search algorithms we've
6627
05:35:38,960 --> 05:35:42,400
looked at so far, which are things like breadth-first search or A-star search,
6628
05:35:42,400 --> 05:35:45,800
for example, generally maintain a whole bunch of different paths
6629
05:35:45,800 --> 05:35:47,920
that we're simultaneously exploring, and we're
6630
05:35:47,920 --> 05:35:50,240
looking at a bunch of different paths at once trying
6631
05:35:50,240 --> 05:35:51,920
to find our way to the solution.
6632
05:35:51,920 --> 05:35:53,920
On the other hand, in local search, this is going
6633
05:35:53,920 --> 05:35:57,520
to be a search algorithm that's really just going to maintain a single node,
6634
05:35:57,520 --> 05:35:59,240
looking at a single state.
6635
05:35:59,240 --> 05:36:02,860
And we'll generally run this algorithm by maintaining that single node
6636
05:36:02,860 --> 05:36:05,680
and then moving ourselves to one of the neighboring nodes
6637
05:36:05,680 --> 05:36:07,600
throughout this search process.
6638
05:36:07,600 --> 05:36:10,900
And this is generally useful in contexts unlike these problems, which
6639
05:36:10,900 --> 05:36:13,360
we've seen before, like a maze-solving situation where
6640
05:36:13,360 --> 05:36:16,060
we're trying to find our way from the initial state to the goal
6641
05:36:16,060 --> 05:36:17,680
by following some path.
6642
05:36:17,680 --> 05:36:20,200
But local search is most applicable when we really
6643
05:36:20,200 --> 05:36:23,320
don't care about the path at all, and all we care about
6644
05:36:23,320 --> 05:36:24,880
is what the solution is.
6645
05:36:24,880 --> 05:36:27,460
And in the case of solving a maze, the solution was always obvious.
6646
05:36:27,460 --> 05:36:28,840
You could point to the solution.
6647
05:36:28,840 --> 05:36:31,280
You know exactly what the goal is, and the real question
6648
05:36:31,280 --> 05:36:33,160
is, what is the path to get there?
6649
05:36:33,160 --> 05:36:35,120
But local search is going to come up in cases
6650
05:36:35,120 --> 05:36:37,440
where figuring out exactly what the solution is,
6651
05:36:37,440 --> 05:36:41,640
exactly what the goal looks like, is actually the heart of the challenge.
6652
05:36:41,640 --> 05:36:44,140
And to give an example of one of these kinds of problems,
6653
05:36:44,140 --> 05:36:46,800
we'll consider a scenario where we have two types of buildings,
6654
05:36:46,800 --> 05:36:47,440
for example.
6655
05:36:47,440 --> 05:36:49,520
We have houses and hospitals.
6656
05:36:49,520 --> 05:36:52,520
And our goal might be in a world that's formatted as this grid,
6657
05:36:52,520 --> 05:36:55,080
where we have a whole bunch of houses, a house here, house here,
6658
05:36:55,080 --> 05:36:58,360
two houses over there, maybe we want to try and find a way
6659
05:36:58,360 --> 05:37:01,240
to place two hospitals on this map.
6660
05:37:01,240 --> 05:37:04,120
So maybe a hospital here and a hospital there.
6661
05:37:04,120 --> 05:37:07,160
And the problem now is we want to place two hospitals on the map,
6662
05:37:07,160 --> 05:37:09,960
but we want to do so with some sort of objective.
6663
05:37:09,960 --> 05:37:12,880
And our objective in this case is to try and minimize
6664
05:37:12,880 --> 05:37:16,280
the distance of any of the houses from a hospital.
6665
05:37:16,280 --> 05:37:18,320
So you might imagine, all right, what's the distance
6666
05:37:18,320 --> 05:37:20,440
from each of the houses to their nearest hospital?
6667
05:37:20,440 --> 05:37:23,040
There are a number of ways we could calculate that distance.
6668
05:37:23,040 --> 05:37:25,440
But one way is using a heuristic we've looked at before,
6669
05:37:25,440 --> 05:37:28,320
which is the Manhattan distance, this idea of how many rows
6670
05:37:28,320 --> 05:37:32,000
and columns would you have to move inside of this grid layout in order
6671
05:37:32,000 --> 05:37:34,360
to get to a hospital, for example.
6672
05:37:34,360 --> 05:37:36,760
And it turns out, if you take each of these four houses
6673
05:37:36,760 --> 05:37:39,600
and figure out, all right, how close are they to their nearest hospital,
6674
05:37:39,600 --> 05:37:42,960
you get something like this, where this house is three away from a hospital,
6675
05:37:42,960 --> 05:37:46,040
this house is six away, and these two houses are each four away.
6676
05:37:46,040 --> 05:37:48,040
And if you add all those numbers up together,
6677
05:37:48,040 --> 05:37:51,840
you get a total cost of 17, for example.
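That cost calculation can be sketched in a few lines. The grid coordinates here are hypothetical, chosen only so that the per-house distances come out to 3 + 6 + 4 + 4 = 17 as in the example; the lecture does not specify exact positions.

```python
# Sketch of the cost function just described: the sum, over all houses,
# of each house's Manhattan distance to its nearest hospital.

def manhattan(a, b):
    """Manhattan distance: rows plus columns between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def cost(houses, hospitals):
    """Total distance from every house to its nearest hospital."""
    return sum(min(manhattan(house, h) for h in hospitals)
               for house in houses)

# Hypothetical layout reproducing the distances 3, 6, 4, and 4.
houses = [(1, 2), (0, 6), (5, 1), (2, 4)]
hospitals = [(0, 0), (5, 5)]
print(cost(houses, hospitals))  # → 17
```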
6678
05:37:51,840 --> 05:37:55,360
So for this particular configuration of hospitals, a hospital here
6679
05:37:55,360 --> 05:37:58,160
and a hospital there, that state, we might say,
6680
05:37:58,160 --> 05:37:59,920
has a cost of 17.
6681
05:37:59,920 --> 05:38:01,840
And the goal of this problem now that we would
6682
05:38:01,840 --> 05:38:04,160
like to apply a search algorithm to figure out
6683
05:38:04,160 --> 05:38:08,440
is, can you solve this problem to find a way to minimize that cost?
6684
05:38:08,440 --> 05:38:11,880
Minimize the total amount if you sum up all of the distances
6685
05:38:11,880 --> 05:38:14,040
from all the houses to the nearest hospital.
6686
05:38:14,040 --> 05:38:16,600
How can we minimize that final value?
6687
05:38:16,600 --> 05:38:19,320
And if we think about this problem a little bit more abstractly,
6688
05:38:19,320 --> 05:38:21,400
abstracting away from this specific problem
6689
05:38:21,400 --> 05:38:23,880
and thinking more generally about problems like it,
6690
05:38:23,880 --> 05:38:26,800
you can often formulate these problems by thinking about them
6691
05:38:26,800 --> 05:38:29,720
as a state-space landscape, as we'll soon call it.
6692
05:38:29,720 --> 05:38:32,120
Here in this diagram of a state-space landscape,
6693
05:38:32,120 --> 05:38:35,760
each of these vertical bars represents a particular state
6694
05:38:35,760 --> 05:38:37,040
that our world could be in.
6695
05:38:37,040 --> 05:38:39,320
So for example, each of these vertical bars
6696
05:38:39,320 --> 05:38:43,200
represents a particular configuration of two hospitals.
6697
05:38:43,200 --> 05:38:45,680
And the height of this vertical bar is generally
6698
05:38:45,680 --> 05:38:50,160
going to represent some function of that state, some value of that state.
6699
05:38:50,160 --> 05:38:52,560
So maybe in this case, the height of the vertical bar
6700
05:38:52,560 --> 05:38:56,160
represents what is the cost of this particular configuration
6701
05:38:56,160 --> 05:38:59,720
of hospitals: the sum total of all the distances
6702
05:38:59,720 --> 05:39:03,320
from all of the houses to their nearest hospital.
6703
05:39:03,320 --> 05:39:06,360
And generally speaking, when we have a state-space landscape,
6704
05:39:06,360 --> 05:39:08,640
we want to do one of two things.
6705
05:39:08,640 --> 05:39:12,080
We might be trying to maximize the value of this function,
6706
05:39:12,080 --> 05:39:16,280
trying to find a global maximum, so to speak, of this state-space landscape,
6707
05:39:16,280 --> 05:39:20,360
a single state whose value is higher than all of the other states
6708
05:39:20,360 --> 05:39:22,040
that we could possibly choose from.
6709
05:39:22,040 --> 05:39:25,040
And generally in this case, when we're trying to find a global maximum,
6710
05:39:25,040 --> 05:39:27,720
we'll call the function that we're trying to optimize
6711
05:39:27,720 --> 05:39:30,120
some objective function, some function that
6712
05:39:30,120 --> 05:39:34,040
measures for any given state how good is that state,
6713
05:39:34,040 --> 05:39:37,160
such that we can take any state, pass it into the objective function,
6714
05:39:37,160 --> 05:39:39,640
and get a value for how good that state is.
6715
05:39:39,640 --> 05:39:42,760
And ultimately, what our goal is is to find one of these states
6716
05:39:42,760 --> 05:39:46,840
that has the highest possible value for that objective function.
6717
05:39:46,840 --> 05:39:49,280
An equivalent but reversed problem is the problem
6718
05:39:49,280 --> 05:39:52,400
of finding a global minimum, some state that has a value
6719
05:39:52,400 --> 05:39:55,960
after you pass it into this function that is lower than all of the other
6720
05:39:55,960 --> 05:39:57,840
possible values that we might choose from.
6721
05:39:57,840 --> 05:40:00,560
And generally speaking, when we're trying to find a global minimum,
6722
05:40:00,560 --> 05:40:03,720
we call the function that we're calculating a cost function.
6723
05:40:03,720 --> 05:40:05,960
Generally, each state has some sort of cost,
6724
05:40:05,960 --> 05:40:08,840
whether that cost is a monetary cost, or a time cost,
6725
05:40:08,840 --> 05:40:10,720
or in the case of the houses and hospitals,
6726
05:40:10,720 --> 05:40:13,360
we've been looking at just now, a distance cost in terms
6727
05:40:13,360 --> 05:40:17,000
of how far away each of the houses is from a hospital.
6728
05:40:17,000 --> 05:40:19,080
And we're trying to minimize the cost, find
6729
05:40:19,080 --> 05:40:23,560
the state that has the lowest possible value of that cost.
6730
05:40:23,560 --> 05:40:25,520
So these are the general types of ideas we
6731
05:40:25,520 --> 05:40:28,160
might be trying to go for within a state-space landscape,
6732
05:40:28,160 --> 05:40:32,240
trying to find a global maximum, or trying to find a global minimum.
6733
05:40:32,240 --> 05:40:33,960
And how exactly do we do that?
6734
05:40:33,960 --> 05:40:36,160
We'll recall that in local search, we generally
6735
05:40:36,160 --> 05:40:39,160
operate this algorithm by maintaining just a single state,
6736
05:40:39,160 --> 05:40:41,960
just some current state represented inside of some node,
6737
05:40:41,960 --> 05:40:43,800
maybe inside of a data structure, where we're
6738
05:40:43,800 --> 05:40:46,280
keeping track of where we are currently.
6739
05:40:46,280 --> 05:40:49,320
And then ultimately, what we're going to do is from that state,
6740
05:40:49,320 --> 05:40:51,640
move to one of its neighbor states.
6741
05:40:51,640 --> 05:40:54,140
So in this case, represented in this one-dimensional space
6742
05:40:54,140 --> 05:40:57,000
by just the state immediately to the left or to the right of it.
6743
05:40:57,000 --> 05:40:58,960
But for any different problem, you might define
6744
05:40:58,960 --> 05:41:02,080
what it means for there to be a neighbor of a particular state.
6745
05:41:02,080 --> 05:41:05,000
In the case of a hospital, for example, that we were just looking at,
6746
05:41:05,000 --> 05:41:08,620
a neighbor might be moving one hospital one space to the left
6747
05:41:08,620 --> 05:41:10,280
or to the right or up or down.
6748
05:41:10,280 --> 05:41:14,560
Some state that is close to our current state, but slightly different,
6749
05:41:14,560 --> 05:41:17,040
and as a result, might have a slightly different value
6750
05:41:17,040 --> 05:41:21,600
in terms of its objective function or in terms of its cost function.
6751
05:41:21,600 --> 05:41:24,240
So this is going to be our general strategy in local search,
6752
05:41:24,240 --> 05:41:27,140
to be able to take a state, maintaining some current node,
6753
05:41:27,140 --> 05:41:29,960
and move where we're looking at in the state-space landscape
6754
05:41:29,960 --> 05:41:33,800
in order to try to find a global maximum or a global minimum somehow.
6755
05:41:33,800 --> 05:41:35,760
And perhaps the simplest of algorithms that we
6756
05:41:35,760 --> 05:41:38,720
could use to implement this idea of local search
6757
05:41:38,720 --> 05:41:41,120
is an algorithm known as hill climbing.
6758
05:41:41,120 --> 05:41:43,160
And the basic idea of hill climbing is, let's
6759
05:41:43,160 --> 05:41:46,720
say I'm trying to maximize the value of my state.
6760
05:41:46,720 --> 05:41:49,160
I'm trying to figure out where the global maximum is.
6761
05:41:49,160 --> 05:41:50,720
I'm going to start at a state.
6762
05:41:50,720 --> 05:41:53,120
And generally, what hill climbing is going to do
6763
05:41:53,120 --> 05:41:55,720
is it's going to consider the neighbors of that state,
6764
05:41:55,720 --> 05:41:58,720
that from this state, all right, I could go left or I could go right,
6765
05:41:58,720 --> 05:42:01,880
and this neighbor happens to be higher and this neighbor happens to be lower.
6766
05:42:01,880 --> 05:42:04,880
And in hill climbing, if I'm trying to maximize the value,
6767
05:42:04,880 --> 05:42:07,680
I'll generally pick the highest one I can between the state
6768
05:42:07,680 --> 05:42:08,920
to the left and right of me.
6769
05:42:08,920 --> 05:42:10,120
This one is higher.
6770
05:42:10,120 --> 05:42:13,600
So I'll go ahead and move myself to consider that state instead.
6771
05:42:13,600 --> 05:42:17,160
And then I'll repeat this process, continually looking at all of my neighbors
6772
05:42:17,160 --> 05:42:19,360
and picking the highest neighbor, doing the same thing,
6773
05:42:19,360 --> 05:42:21,880
looking at my neighbors, picking the highest of my neighbors,
6774
05:42:21,880 --> 05:42:25,960
until I get to a point like right here, where I consider both of my neighbors
6775
05:42:25,960 --> 05:42:29,040
and both of my neighbors have a lower value than I do.
6776
05:42:29,040 --> 05:42:32,840
This current state has a value that is higher than any of its neighbors.
6777
05:42:32,840 --> 05:42:34,640
And at that point, the algorithm terminates.
6778
05:42:34,640 --> 05:42:38,320
And I can say, all right, here I have now found the solution.
6779
05:42:38,320 --> 05:42:40,760
And the same thing works in exactly the opposite way
6780
05:42:40,760 --> 05:42:42,120
for trying to find a global minimum.
6781
05:42:42,120 --> 05:42:44,120
But the algorithm is fundamentally the same.
6782
05:42:44,120 --> 05:42:47,360
If I'm trying to find a global minimum and say my current state starts here,
6783
05:42:47,360 --> 05:42:50,240
I'll continually look at my neighbors, pick the lowest value
6784
05:42:50,240 --> 05:42:53,160
that I possibly can, until I eventually, hopefully,
6785
05:42:53,160 --> 05:42:55,680
find that global minimum, a point at which when
6786
05:42:55,680 --> 05:42:58,600
I look at both of my neighbors, they each have a higher value.
6787
05:42:58,600 --> 05:43:02,560
And I'm trying to minimize the total score or cost or value
6788
05:43:02,560 --> 05:43:06,840
that I get as a result of calculating some sort of cost function.
6789
05:43:06,840 --> 05:43:09,880
So we can formulate this graphical idea in terms of pseudocode.
6790
05:43:09,880 --> 05:43:12,480
And the pseudocode for hill climbing might look like this.
6791
05:43:12,480 --> 05:43:15,080
We define some function called hill climb that
6792
05:43:15,080 --> 05:43:17,760
takes as input the problem that we're trying to solve.
6793
05:43:17,760 --> 05:43:21,200
And generally, we're going to start in some sort of initial state.
6794
05:43:21,200 --> 05:43:23,160
So I'll start with a variable called current
6795
05:43:23,160 --> 05:43:26,920
that is keeping track of my initial state, like an initial configuration
6796
05:43:26,920 --> 05:43:27,960
of hospitals.
6797
05:43:27,960 --> 05:43:30,480
And maybe some problems lend themselves to an initial state,
6798
05:43:30,480 --> 05:43:31,840
some place where you begin.
6799
05:43:31,840 --> 05:43:34,920
In other cases, maybe not, in which case we might just randomly
6800
05:43:34,920 --> 05:43:38,640
generate some initial state, just by choosing two locations for hospitals
6801
05:43:38,640 --> 05:43:41,160
at random, for example, and figuring out from there
6802
05:43:41,160 --> 05:43:42,800
how we might be able to improve.
6803
05:43:42,800 --> 05:43:46,240
But that initial state, we're going to store inside of current.
6804
05:43:46,240 --> 05:43:48,840
And now, here comes our loop, some repetitive process
6805
05:43:48,840 --> 05:43:52,040
we're going to do again and again until the algorithm terminates.
6806
05:43:52,040 --> 05:43:55,040
And what we're going to do is first say, let's
6807
05:43:55,040 --> 05:43:57,920
figure out all of the neighbors of the current state.
6808
05:43:57,920 --> 05:43:59,920
From my state, what are all of the neighboring
6809
05:43:59,920 --> 05:44:02,800
states for some definition of what it means to be a neighbor?
6810
05:44:02,800 --> 05:44:06,560
And I'll go ahead and choose the highest value of all of those neighbors
6811
05:44:06,560 --> 05:44:09,080
and save it inside of this variable called neighbor.
6812
05:44:09,080 --> 05:44:11,160
So keep track of the highest-valued neighbor.
6813
05:44:11,160 --> 05:44:14,080
This is in the case where I'm trying to maximize the value.
6814
05:44:14,080 --> 05:44:15,880
In the case where I'm trying to minimize the value,
6815
05:44:15,880 --> 05:44:17,360
you might imagine here, you'll pick the neighbor
6816
05:44:17,360 --> 05:44:18,880
with the lowest possible value.
6817
05:44:18,880 --> 05:44:21,640
But these ideas are really fundamentally interchangeable.
6818
05:44:21,640 --> 05:44:24,720
And it's possible, in some cases, there might be multiple neighbors
6819
05:44:24,720 --> 05:44:28,200
that each have an equally high value or an equally low value
6820
05:44:28,200 --> 05:44:29,480
in the minimizing case.
6821
05:44:29,480 --> 05:44:31,920
And in that case, we can just choose randomly from among them.
6822
05:44:31,920 --> 05:44:35,480
Choose one of them and save it inside of this variable neighbor.
6823
05:44:35,480 --> 05:44:39,680
And then the key question to ask is, is this neighbor better
6824
05:44:39,680 --> 05:44:41,600
than my current state?
6825
05:44:41,600 --> 05:44:44,840
And if the neighbor, the best neighbor that I was able to find,
6826
05:44:44,840 --> 05:44:48,440
is not better than my current state, well, then the algorithm is over.
6827
05:44:48,440 --> 05:44:50,520
And I'll just go ahead and return the current state.
6828
05:44:50,520 --> 05:44:53,800
If none of my neighbors are better, then I may as well stay where I am,
6829
05:44:53,800 --> 05:44:56,520
is the general logic of the hill climbing algorithm.
6830
05:44:56,520 --> 05:44:59,200
But otherwise, if the neighbor is better, then I may as well
6831
05:44:59,200 --> 05:45:00,320
move to that neighbor.
6832
05:45:00,320 --> 05:45:04,160
So you might imagine setting current equal to neighbor, where the general idea
6833
05:45:04,160 --> 05:45:07,040
is if I'm at a current state and I see a neighbor that is better than me,
6834
05:45:07,040 --> 05:45:08,360
then I'll go ahead and move there.
6835
05:45:08,360 --> 05:45:11,840
And then I'll repeat the process, continually moving to a better neighbor
6836
05:45:11,840 --> 05:45:15,760
until I reach a point at which none of my neighbors are better than I am.
6837
05:45:15,760 --> 05:45:19,600
And at that point, we'd say the algorithm can just terminate there.
6838
05:45:19,600 --> 05:45:21,640
So let's take a look at a real example of this
6839
05:45:21,640 --> 05:45:23,240
with these houses and hospitals.
6840
05:45:23,240 --> 05:45:26,480
So we've seen now that if we put the hospitals in these two locations,
6841
05:45:26,480 --> 05:45:28,360
that has a total cost of 17.
6842
05:45:28,360 --> 05:45:31,280
And now we need to define, if we're going to implement this hill climbing
6843
05:45:31,280 --> 05:45:34,920
algorithm, what it means to take this particular configuration
6844
05:45:34,920 --> 05:45:39,760
of hospitals, this particular state, and get a neighbor of that state.
6845
05:45:39,760 --> 05:45:42,080
And a simple definition of neighbor might be just,
6846
05:45:42,080 --> 05:45:46,680
let's pick one of the hospitals and move it by one square, to the left or right
6847
05:45:46,680 --> 05:45:48,520
or up or down, for example.
6848
05:45:48,520 --> 05:45:50,960
And that would mean we have six possible neighbors
6849
05:45:50,960 --> 05:45:52,440
from this particular configuration.
6850
05:45:52,440 --> 05:45:56,200
We could take this hospital and move it to any of these three possible squares,
6851
05:45:56,200 --> 05:46:00,000
or we could take this hospital and move it to any of those three possible squares.
6852
05:46:00,000 --> 05:46:02,640
And each of those would generate a neighbor.
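The neighbor definition just described — move one hospital one square up, down, left, or right — might be sketched like this; the function name and signature are made up for illustration, not taken from the course's `hospitals.py`:

```python
def get_neighbors(hospitals, height, width, houses):
    """All configurations reachable by moving one hospital one square
    up, down, left, or right, staying on the grid and avoiding squares
    already occupied by a house or another hospital."""
    occupied = set(houses) | set(hospitals)
    neighbors = []
    for i, (row, col) in enumerate(hospitals):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width and (r, c) not in occupied:
                candidate = list(hospitals)
                candidate[i] = (r, c)  # move hospital i to the new square
                neighbors.append(candidate)
    return neighbors
```

A hospital in the middle of an empty grid yields four neighbors; one in a corner yields at most two, matching the "six possible neighbors" count for two hospitals with three legal moves each.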
6853
05:46:02,640 --> 05:46:04,800
And what I might do is say, all right, here's
6854
05:46:04,800 --> 05:46:07,720
the locations and the distances between each of the houses
6855
05:46:07,720 --> 05:46:09,240
and their nearest hospital.
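The cost being minimized — the sum of each house's distance to its nearest hospital — might be computed with Manhattan distance, roughly as follows (representing coordinates as `(row, column)` tuples is an assumption for this sketch):

```python
def total_cost(houses, hospitals):
    """Sum over all houses of the Manhattan distance to the nearest hospital."""
    return sum(
        min(abs(hr - r) + abs(hc - c) for (r, c) in hospitals)
        for (hr, hc) in houses
    )
```

With houses at `(0, 0)` and `(0, 4)` and a single hospital at `(0, 1)`, the cost is 1 + 3 = 4.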
6856
05:46:09,240 --> 05:46:12,280
Let me consider all of the neighbors and see if any of them
6857
05:46:12,280 --> 05:46:14,640
can do better than a cost of 17.
6858
05:46:14,640 --> 05:46:17,240
And it turns out there are a couple of ways that we could do that.
6859
05:46:17,240 --> 05:46:19,080
And it doesn't matter if we randomly choose
6860
05:46:19,080 --> 05:46:20,720
among all the ways that are the best.
6861
05:46:20,720 --> 05:46:24,880
But one such possible way is by taking a look at this hospital here
6862
05:46:24,880 --> 05:46:27,040
and considering the directions in which it might move.
6863
05:46:27,040 --> 05:46:30,360
If we hold this hospital constant, if we take this hospital
6864
05:46:30,360 --> 05:46:33,680
and move it one square up, for example, that doesn't really help us.
6865
05:46:33,680 --> 05:46:36,400
It gets closer to the house up here, but it gets further away
6866
05:46:36,400 --> 05:46:37,640
from the house down here.
6867
05:46:37,640 --> 05:46:40,080
And it doesn't really change anything for the two houses
6868
05:46:40,080 --> 05:46:41,800
along the left-hand side.
6869
05:46:41,800 --> 05:46:45,600
But if we take this hospital on the right and move it one square down,
6870
05:46:45,600 --> 05:46:46,600
it's the opposite problem.
6871
05:46:46,600 --> 05:46:49,000
It gets further away from the house up above,
6872
05:46:49,000 --> 05:46:51,160
and it gets closer to the house down below.
6873
05:46:51,160 --> 05:46:54,640
The real idea, the goal should be to be able to take this hospital
6874
05:46:54,640 --> 05:46:56,760
and move it one square to the left.
6875
05:46:56,760 --> 05:46:59,280
By moving it one square to the left, we move it closer
6876
05:46:59,280 --> 05:47:02,280
to both of these houses on the right without changing anything
6877
05:47:02,280 --> 05:47:03,360
about the houses on the left.
6878
05:47:03,360 --> 05:47:06,760
For them, this hospital is still the closer one, so they aren't affected.
6879
05:47:06,760 --> 05:47:10,480
So we're able to improve the situation by picking a neighbor that
6880
05:47:10,480 --> 05:47:13,000
results in a decrease in our total cost.
6881
05:47:13,000 --> 05:47:14,000
And so we might do that.
6882
05:47:14,000 --> 05:47:16,640
Move ourselves from this current state to a neighbor
6883
05:47:16,640 --> 05:47:19,440
by just taking that hospital and moving it.
6884
05:47:19,440 --> 05:47:21,160
And at this point, there's not a whole lot
6885
05:47:21,160 --> 05:47:22,640
that can be done with this hospital.
6886
05:47:22,640 --> 05:47:25,320
But there's still other optimizations we can make, other neighbors
6887
05:47:25,320 --> 05:47:27,920
we can move to that are going to have a better value.
6888
05:47:27,920 --> 05:47:29,960
If we consider this hospital, for example,
6889
05:47:29,960 --> 05:47:32,680
we might imagine that right now it's a bit far up,
6890
05:47:32,680 --> 05:47:34,840
that both of these houses are a little bit lower.
6891
05:47:34,840 --> 05:47:37,480
So we might be able to do better by taking this hospital
6892
05:47:37,480 --> 05:47:40,680
and moving it one square down, moving it down so that now instead
6893
05:47:40,680 --> 05:47:43,600
of a cost of 15, we're down to a cost of 13
6894
05:47:43,600 --> 05:47:45,320
for this particular configuration.
6895
05:47:45,320 --> 05:47:47,680
And we can do even better by taking the hospital
6896
05:47:47,680 --> 05:47:49,520
and moving it one square to the left.
6897
05:47:49,520 --> 05:47:52,360
Now instead of a cost of 13, we have a cost of 11,
6898
05:47:52,360 --> 05:47:54,880
because this house is one away from the hospital.
6899
05:47:54,880 --> 05:47:56,360
This one is four away.
6900
05:47:56,360 --> 05:47:57,520
This one is three away.
6901
05:47:57,520 --> 05:47:59,440
And this one is also three away.
6902
05:47:59,440 --> 05:48:02,160
So we've been able to do much better than that initial cost
6903
05:48:02,160 --> 05:48:04,680
that we had using the initial configuration.
6904
05:48:04,680 --> 05:48:07,720
Just by taking every state and asking ourselves the question,
6905
05:48:07,720 --> 05:48:11,120
can we do better by just making small incremental changes,
6906
05:48:11,120 --> 05:48:12,960
moving to a neighbor, moving to a neighbor,
6907
05:48:12,960 --> 05:48:15,360
and moving to a neighbor after that?
6908
05:48:15,360 --> 05:48:18,880
And now we can see that, at this point,
6909
05:48:18,880 --> 05:48:20,280
the algorithm is going to terminate.
6910
05:48:20,280 --> 05:48:22,680
There's actually no neighbor we can move to
6911
05:48:22,680 --> 05:48:27,120
that is going to improve the situation, get us a cost that is less than 11.
6912
05:48:27,120 --> 05:48:29,600
Because if we take this hospital and move it up or to the right,
6913
05:48:29,600 --> 05:48:31,320
well, that's going to make it further away.
6914
05:48:31,320 --> 05:48:34,480
If we take it and move it down, that doesn't really change the situation.
6915
05:48:34,480 --> 05:48:37,400
It gets further away from this house but closer to that house.
6916
05:48:37,400 --> 05:48:40,120
And likewise, the same story was true for this hospital.
6917
05:48:40,120 --> 05:48:42,880
Any neighbor we move it to, up, left, down, or right,
6918
05:48:42,880 --> 05:48:46,920
is either going to make it further away from the houses and increase the cost,
6919
05:48:46,920 --> 05:48:51,080
or it's going to have no effect on the cost whatsoever.
6920
05:48:51,080 --> 05:48:54,360
And so the question we might now ask is, is this the best we could do?
6921
05:48:54,360 --> 05:48:57,840
Is this the best placement of the hospitals we could possibly have?
6922
05:48:57,840 --> 05:49:00,560
And it turns out the answer is no, because there's a better way
6923
05:49:00,560 --> 05:49:02,720
that we could place these hospitals.
6924
05:49:02,720 --> 05:49:05,120
And in particular, there are a number of ways you could do this.
6925
05:49:05,120 --> 05:49:07,760
But one of the ways is by taking this hospital here
6926
05:49:07,760 --> 05:49:10,760
and moving it to this square, for example, moving it diagonally
6927
05:49:10,760 --> 05:49:13,520
by one square, which was not part of our definition of neighbor.
6928
05:49:13,520 --> 05:49:15,720
We could only move left, right, up, or down.
6929
05:49:15,720 --> 05:49:17,240
But this is, in fact, better.
6930
05:49:17,240 --> 05:49:18,760
It has a total cost of 9.
6931
05:49:18,760 --> 05:49:21,040
It is now closer to both of these houses.
6932
05:49:21,040 --> 05:49:24,240
And as a result, the total cost is less.
6933
05:49:24,240 --> 05:49:27,480
But we weren't able to find it, because in order to get there,
6934
05:49:27,480 --> 05:49:31,320
we had to go through a state that actually wasn't any better than the current
6935
05:49:31,320 --> 05:49:33,600
state that we had been on previously.
6936
05:49:33,600 --> 05:49:36,920
And so this appears to be a limitation, or a concern you might have
6937
05:49:36,920 --> 05:49:39,960
as you go about trying to implement a hill climbing algorithm,
6938
05:49:39,960 --> 05:49:43,320
is that it might not always give you the optimal solution.
6939
05:49:43,320 --> 05:49:46,400
If we're trying to maximize the value of any particular state,
6940
05:49:46,400 --> 05:49:49,000
we're trying to find the global maximum, a concern
6941
05:49:49,000 --> 05:49:53,040
might be that we could get stuck at one of the local maxima,
6942
05:49:53,040 --> 05:49:57,840
highlighted here in blue, where a local maximum is any state whose value is
6943
05:49:57,840 --> 05:49:59,360
higher than any of its neighbors.
6944
05:49:59,360 --> 05:50:02,040
If we ever find ourselves at one of these two states
6945
05:50:02,040 --> 05:50:04,320
when we're trying to maximize the value of the state,
6946
05:50:04,320 --> 05:50:05,820
we're not going to make any changes.
6947
05:50:05,820 --> 05:50:07,320
We're not going to move left or right.
6948
05:50:07,320 --> 05:50:10,760
We're not going to move left here, because those states are worse.
6949
05:50:10,760 --> 05:50:13,280
But yet, we haven't found the global optimum.
6950
05:50:13,280 --> 05:50:15,560
We haven't done the best we could do.
6951
05:50:15,560 --> 05:50:18,100
And likewise, in the case of the hospitals, what we're ultimately
6952
05:50:18,100 --> 05:50:20,960
trying to do is find a global minimum, find a value that
6953
05:50:20,960 --> 05:50:22,720
is lower than all of the others.
6954
05:50:22,720 --> 05:50:26,640
But we have the potential to get stuck at one of the local minima,
6955
05:50:26,640 --> 05:50:30,160
any of these states whose value is lower than all of its neighbors,
6956
05:50:30,160 --> 05:50:33,680
but still not as low as the global minimum.
6957
05:50:33,680 --> 05:50:36,800
And so the takeaway here is that it's not always
6958
05:50:36,800 --> 05:50:40,280
going to be the case that when we run this naive hill climbing algorithm,
6959
05:50:40,280 --> 05:50:42,280
that we're always going to find the optimal solution.
6960
05:50:42,280 --> 05:50:43,960
There are things that could go wrong.
6961
05:50:43,960 --> 05:50:47,640
If we started here, for example, and tried to maximize our value as much
6962
05:50:47,640 --> 05:50:50,800
as possible, we might move to the highest possible neighbor,
6963
05:50:50,800 --> 05:50:54,000
move to the highest possible neighbor, move to the highest possible neighbor,
6964
05:50:54,000 --> 05:50:57,960
and stop, and never realize that there's actually a better state way over there
6965
05:50:57,960 --> 05:51:00,280
that we could have gone to instead.
6966
05:51:00,280 --> 05:51:03,200
And other problems you might imagine just by taking a look at this state
6967
05:51:03,200 --> 05:51:06,800
space landscape are these various different types of plateaus,
6968
05:51:06,800 --> 05:51:09,200
something like this flat local maximum here,
6969
05:51:09,200 --> 05:51:12,840
where all six of these states each have the exact same value.
6970
05:51:12,840 --> 05:51:15,800
And so in the case of the algorithm we showed before,
6971
05:51:15,800 --> 05:51:17,800
none of the neighbors are better, so we might just
6972
05:51:17,800 --> 05:51:19,800
get stuck at this flat local maximum.
6973
05:51:19,800 --> 05:51:22,360
And even if you allowed yourself to move to one of the neighbors,
6974
05:51:22,360 --> 05:51:25,120
it wouldn't be clear which neighbor you would ultimately move to,
6975
05:51:25,120 --> 05:51:27,280
and you could get stuck here as well.
6976
05:51:27,280 --> 05:51:28,680
And there's another one over here.
6977
05:51:28,680 --> 05:51:30,040
This one is called a shoulder.
6978
05:51:30,040 --> 05:51:32,000
It's not really a local maximum, because there's still
6979
05:51:32,000 --> 05:51:35,240
places where we can go higher, not a local minimum, because we can go lower.
6980
05:51:35,240 --> 05:51:38,680
So we can still make progress, but it's still this flat area,
6981
05:51:38,680 --> 05:51:40,720
where if you have a local search algorithm,
6982
05:51:40,720 --> 05:51:44,560
there's potential to get lost here, unable to make some upward or downward
6983
05:51:44,560 --> 05:51:48,040
progress, depending on whether we're trying to maximize or minimize it,
6984
05:51:48,040 --> 05:51:50,200
and therefore another potential for us to be
6985
05:51:50,200 --> 05:51:54,960
able to find a solution that might not actually be the optimal solution.
6986
05:51:54,960 --> 05:51:57,880
And so because of this potential, the potential that hill climbing
6987
05:51:57,880 --> 05:52:00,500
has to not always find us the optimal result,
6988
05:52:00,500 --> 05:52:03,520
it turns out there are a number of different varieties and variations
6989
05:52:03,520 --> 05:52:07,360
on the hill climbing algorithm that help to solve the problem better
6990
05:52:07,360 --> 05:52:10,520
depending on the context, and depending on the specific type of problem,
6991
05:52:10,520 --> 05:52:13,240
some of these variants might be more applicable than others.
6992
05:52:13,240 --> 05:52:16,000
What we've taken a look at so far is a version of hill climbing
6993
05:52:16,000 --> 05:52:19,280
generally called steepest ascent hill climbing,
6994
05:52:19,280 --> 05:52:21,520
where the idea of steepest ascent hill climbing
6995
05:52:21,520 --> 05:52:24,440
is we are going to choose the highest valued neighbor,
6996
05:52:24,440 --> 05:52:27,320
in the case where we're trying to maximize or the lowest valued neighbor
6997
05:52:27,320 --> 05:52:28,860
in cases where we're trying to minimize.
6998
05:52:28,860 --> 05:52:31,160
But generally speaking, if I have five neighbors
6999
05:52:31,160 --> 05:52:33,240
and they're all better than my current state,
7000
05:52:33,240 --> 05:52:36,000
I will pick the best one of those five.
7001
05:52:36,000 --> 05:52:37,480
Now, sometimes that might work pretty well.
7002
05:52:37,480 --> 05:52:40,560
It's sort of a greedy approach of trying to take the best operation
7003
05:52:40,560 --> 05:52:43,360
at any particular time step, but it might not always work.
7004
05:52:43,360 --> 05:52:45,080
There might be cases where actually I want
7005
05:52:45,080 --> 05:52:47,320
to choose an option that is slightly better than me,
7006
05:52:47,320 --> 05:52:50,280
but maybe not the best one because that later on might
7007
05:52:50,280 --> 05:52:52,080
lead to a better outcome ultimately.
7008
05:52:52,080 --> 05:52:54,320
So there are other variants that we might consider
7009
05:52:54,320 --> 05:52:56,520
of this basic hill climbing algorithm.
7010
05:52:56,520 --> 05:52:58,560
One is known as stochastic hill climbing.
7011
05:52:58,560 --> 05:53:02,200
And in this case, we choose randomly from all of our higher value neighbors.
7012
05:53:02,200 --> 05:53:04,800
So if I'm at my current state and there are five neighbors that
7013
05:53:04,800 --> 05:53:07,680
are all better than I am, rather than choosing the best one,
7014
05:53:07,680 --> 05:53:10,320
as steepest-ascent would do, stochastic will just choose
7015
05:53:10,320 --> 05:53:13,880
randomly from one of them, thinking that if it's better, then it's better.
7016
05:53:13,880 --> 05:53:16,120
And maybe there's a potential to make forward progress,
7017
05:53:16,120 --> 05:53:20,680
even if it is not locally the best option I could possibly choose.
7018
05:53:20,680 --> 05:53:24,120
First choice hill climbing ends up just choosing the very first highest
7019
05:53:24,120 --> 05:53:27,040
valued neighbor that it finds, following a similar idea,
7020
05:53:27,040 --> 05:53:28,960
rather than consider all of the neighbors.
7021
05:53:28,960 --> 05:53:31,800
As soon as we find a neighbor that is better than our current state,
7022
05:53:31,800 --> 05:53:33,080
we'll go ahead and move there.
7023
05:53:33,080 --> 05:53:35,000
There may be some efficiency improvements there
7024
05:53:35,000 --> 05:53:37,200
and maybe has the potential to find a solution
7025
05:53:37,200 --> 05:53:39,920
that the other strategies weren't able to find.
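A single step of these two variants might be sketched as follows; both function names are hypothetical, chosen just to contrast the selection rules:

```python
import random

def stochastic_step(current, neighbors, cost):
    """Stochastic hill climbing: choose uniformly at random among all
    neighbors that improve on the current state (None = local optimum)."""
    better = [n for n in neighbors(current) if cost(n) < cost(current)]
    return random.choice(better) if better else None

def first_choice_step(current, neighbors, cost):
    """First-choice hill climbing: take the first improving neighbor
    encountered, without evaluating the rest."""
    for n in neighbors(current):
        if cost(n) < cost(current):
            return n  # better than current, so move there immediately
    return None
```

First-choice can be cheaper when there are many neighbors, since it stops evaluating as soon as any improvement appears, while stochastic still scores every neighbor before picking randomly among the improving ones.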
7026
05:53:39,920 --> 05:53:43,800
And with all of these variants, we still suffer from the same potential risk,
7027
05:53:43,800 --> 05:53:48,120
this risk that we might end up at a local minimum or a local maximum.
7028
05:53:48,120 --> 05:53:52,080
And we can reduce that risk by repeating the process multiple times.
7029
05:53:52,080 --> 05:53:55,760
So one variant of hill climbing is random restart hill climbing,
7030
05:53:55,760 --> 05:53:59,880
where the general idea is we'll conduct hill climbing multiple times.
7031
05:53:59,880 --> 05:54:02,680
If we apply steepest-ascent hill climbing, for example,
7032
05:54:02,680 --> 05:54:04,840
we'll start at some random state, try and figure out
7033
05:54:04,840 --> 05:54:06,560
how to solve the problem and figure out what
7034
05:54:06,560 --> 05:54:09,440
is the local maximum or local minimum we get to.
7035
05:54:09,440 --> 05:54:11,720
And then we'll just randomly restart and try again,
7036
05:54:11,720 --> 05:54:14,520
choose a new starting configuration, try and figure out
7037
05:54:14,520 --> 05:54:17,800
what the local maximum or minimum is, and do this some number of times.
7038
05:54:17,800 --> 05:54:19,920
And then after we've done it some number of times,
7039
05:54:19,920 --> 05:54:23,480
we can pick the best one out of all of the ones that we've taken a look at.
7040
05:54:23,480 --> 05:54:26,600
So there's another option we have access to as well.
7041
05:54:26,600 --> 05:54:29,360
And then, although I said that generally local search will usually
7042
05:54:29,360 --> 05:54:33,160
just keep track of a single node and then move to one of its neighbors,
7043
05:54:33,160 --> 05:54:36,880
there are variants of hill climbing that are known as local beam searches,
7044
05:54:36,880 --> 05:54:39,880
where rather than keep track of just one current best state,
7045
05:54:39,880 --> 05:54:43,880
we're keeping track of k highest valued neighbors, such that rather than
7046
05:54:43,880 --> 05:54:46,320
starting at one random initial configuration,
7047
05:54:46,320 --> 05:54:50,200
I might start with 3 or 4 or 5, randomly generate all the neighbors,
7048
05:54:50,200 --> 05:54:54,440
and then pick the 3 or 4 or 5 best of all of the neighbors that I find,
7049
05:54:54,440 --> 05:54:57,040
and continually repeat this process, with the idea
7050
05:54:57,040 --> 05:55:00,040
being that now I have more options that I'm considering,
7051
05:55:00,040 --> 05:55:02,680
more ways that I could potentially navigate myself
7052
05:55:02,680 --> 05:55:07,160
to the optimal solution that might exist for a particular problem.
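A rough sketch of local beam search under the same assumptions (generic `neighbors` and `cost` callables, a fixed iteration count; not the course's code):

```python
def local_beam_search(initial_states, neighbors, cost, k, iterations):
    """Track the k lowest-cost states each round instead of one current
    state, pooling all of their neighbors together before selecting."""
    states = list(initial_states)
    for _ in range(iterations):
        # Keep the current states in the pool so the beam never gets worse.
        candidates = states + [n for s in states for n in neighbors(s)]
        candidates.sort(key=cost)
        states = candidates[:k]
    return states[0]
```

Because the beam pools neighbors from all k states, a promising region found by one starting state can attract the whole beam, which is the extra flexibility described above.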
7053
05:55:07,160 --> 05:55:09,440
So let's now take a look at some actual code that
7054
05:55:09,440 --> 05:55:11,800
can implement some of these kinds of ideas, something
7055
05:55:11,800 --> 05:55:14,440
like steepest ascent hill climbing, for example,
7056
05:55:14,440 --> 05:55:17,280
for trying to solve this hospital problem.
7057
05:55:17,280 --> 05:55:20,280
So I'm going to go ahead and go into my hospitals directory, where
7058
05:55:20,280 --> 05:55:24,240
I've actually set up the basic framework for solving this type of problem.
7059
05:55:24,240 --> 05:55:26,400
I'll go ahead and go into hospitals.py, and we'll
7060
05:55:26,400 --> 05:55:28,200
take a look at the code we've created here.
7061
05:55:28,200 --> 05:55:32,720
I've defined a class that is going to represent the state space.
7062
05:55:32,720 --> 05:55:36,760
So the space has a height, and a width, and also some number of hospitals.
7063
05:55:36,760 --> 05:55:41,040
So you can configure how big is your map, how many hospitals should go here.
7064
05:55:41,040 --> 05:55:44,000
We have a function for adding a new house to the state space,
7065
05:55:44,000 --> 05:55:45,880
and then some functions that are going to get
7066
05:55:45,880 --> 05:55:49,280
me all of the available spaces for if I want to randomly place hospitals
7067
05:55:49,280 --> 05:55:50,880
in particular locations.
7068
05:55:50,880 --> 05:55:54,240
And here now is the hill climbing algorithm.
7069
05:55:54,240 --> 05:55:56,800
So what are we going to do in the hill climbing algorithm?
7070
05:55:56,800 --> 05:55:59,920
Well, we're going to start by randomly initializing
7071
05:55:59,920 --> 05:56:01,360
where the hospitals are going to go.
7072
05:56:01,360 --> 05:56:03,480
We don't know where the hospitals should actually be,
7073
05:56:03,480 --> 05:56:05,360
so let's just randomly place them.
7074
05:56:05,360 --> 05:56:08,760
So here I'm running a loop for each of the hospitals that I have.
7075
05:56:08,760 --> 05:56:13,520
I'm going to go ahead and add a new hospital at some random location.
7076
05:56:13,520 --> 05:56:15,800
So I basically get all of the available spaces,
7077
05:56:15,800 --> 05:56:17,960
and I randomly choose one of them as where
7078
05:56:17,960 --> 05:56:20,960
I would like to add this particular hospital.
7079
05:56:20,960 --> 05:56:23,220
I have some logging output and generating some images,
7080
05:56:23,220 --> 05:56:25,200
which we'll take a look at a little bit later.
7081
05:56:25,200 --> 05:56:27,160
But here is the key idea.
7082
05:56:27,160 --> 05:56:30,240
So I'm going to just keep repeating this algorithm.
7083
05:56:30,240 --> 05:56:33,240
I could specify a maximum of how many times I want it to run,
7084
05:56:33,240 --> 05:56:37,200
or I could just run it up until it hits a local maximum or local minimum.
7085
05:56:37,200 --> 05:56:40,280
And now we'll basically consider all of the hospitals
7086
05:56:40,280 --> 05:56:41,480
that could potentially move.
7087
05:56:41,480 --> 05:56:43,960
So consider each of the two hospitals or more hospitals
7088
05:56:43,960 --> 05:56:45,440
if they're more than that.
7089
05:56:45,440 --> 05:56:49,120
And consider all of the places where that hospital could move to,
7090
05:56:49,120 --> 05:56:53,080
some neighboring square that we could move the hospital to.
7091
05:56:53,080 --> 05:56:58,480
And then see, is this going to be better than where we were currently?
7092
05:56:58,480 --> 05:57:00,440
So if it is going to be better, then we'll
7093
05:57:00,440 --> 05:57:02,840
go ahead and update our best neighbor and keep
7094
05:57:02,840 --> 05:57:05,840
track of this new best neighbor that we found.
7095
05:57:05,840 --> 05:57:08,360
And then afterwards, we can ask ourselves the question,
7096
05:57:08,360 --> 05:57:10,440
if best neighbor cost is greater than or equal
7097
05:57:10,440 --> 05:57:13,040
to the cost of the current set of hospitals,
7098
05:57:13,040 --> 05:57:18,120
meaning if the cost of our best neighbor is at least as high as the current cost,
7099
05:57:18,120 --> 05:57:21,360
meaning our best neighbor is no better than our current state,
7100
05:57:21,360 --> 05:57:23,600
well, then we shouldn't make any changes at all.
7101
05:57:23,600 --> 05:57:27,200
And we should just go ahead and return the current set of hospitals.
7102
05:57:27,200 --> 05:57:29,800
But otherwise, we can update our hospitals
7103
05:57:29,800 --> 05:57:32,520
in order to change them to one of the best neighbors.
7104
05:57:32,520 --> 05:57:34,440
And if there are multiple that are all equivalent,
7105
05:57:34,440 --> 05:57:38,400
I'm here using random.choice to say go ahead and choose one randomly.
7106
05:57:38,400 --> 05:57:41,720
So this is really just a Python implementation of that same idea
7107
05:57:41,720 --> 05:57:44,640
that we were just talking about, this idea of taking a current state,
7108
05:57:44,640 --> 05:57:48,120
some current set of hospitals, generating all of the neighbors,
7109
05:57:48,120 --> 05:57:50,480
looking at all of the ways we could take one hospital
7110
05:57:50,480 --> 05:57:53,320
and move it one square to the left or right or up or down,
7111
05:57:53,320 --> 05:57:56,160
and then figuring out, based on all of that information, which
7112
05:57:56,160 --> 05:57:59,360
is the best neighbor or the set of all the best neighbors,
7113
05:57:59,360 --> 05:58:02,040
and then choosing from one of those.
7114
05:58:02,040 --> 05:58:05,920
And each time, we go ahead and generate an image in order to do that.
7115
05:58:05,920 --> 05:58:08,920
And so now what we're doing is if we look down at the bottom,
7116
05:58:08,920 --> 05:58:12,840
I'm going to randomly generate a space with height 10 and width 20.
7117
05:58:12,840 --> 05:58:16,000
And I'll say go ahead and put three hospitals somewhere in the space.
7118
05:58:16,000 --> 05:58:18,720
I'll randomly generate 15 houses that I just go ahead
7119
05:58:18,720 --> 05:58:20,680
and add in random locations.
7120
05:58:20,680 --> 05:58:23,640
And now I'm going to run this hill climbing algorithm in order
7121
05:58:23,640 --> 05:58:27,200
to try and figure out where we should place those hospitals.
7122
05:58:27,200 --> 05:58:31,400
So we'll go ahead and run this program by running Python hospitals.
7123
05:58:31,400 --> 05:58:32,440
And we see that we started.
7124
05:58:32,440 --> 05:58:35,360
Our initial state had a cost of 72, but we
7125
05:58:35,360 --> 05:58:38,560
were able to continually find neighbors that were able to decrease that cost,
7126
05:58:38,560 --> 05:58:43,440
decrease to 69, 66, 63, so on and so forth, all the way down to 53,
7127
05:58:43,440 --> 05:58:46,140
as the best neighbor we were able to ultimately find.
7128
05:58:46,140 --> 05:58:48,280
And we can take a look at what that looked like
7129
05:58:48,280 --> 05:58:50,200
by just opening up these files.
7130
05:58:50,200 --> 05:58:53,280
So here, for example, was the initial configuration.
7131
05:58:53,280 --> 05:58:57,280
We randomly selected a location for each of these 15 different houses
7132
05:58:57,280 --> 05:59:01,560
and then randomly selected locations for one, two, three hospitals
7133
05:59:01,560 --> 05:59:04,880
that were just located somewhere inside of the state space.
7134
05:59:04,880 --> 05:59:07,680
And if you add up all the distances from each of the houses
7135
05:59:07,680 --> 05:59:11,360
to their nearest hospital, you get a total cost of about 72.
7136
05:59:11,360 --> 05:59:14,280
And so now the question is, what neighbors can we move to
7137
05:59:14,280 --> 05:59:16,120
that improve the situation?
7138
05:59:16,120 --> 05:59:18,360
And it looks like the first one the algorithm found
7139
05:59:18,360 --> 05:59:21,680
was by taking this house that was over there on the right
7140
05:59:21,680 --> 05:59:23,880
and just moving it to the left.
7141
05:59:23,880 --> 05:59:25,640
And that probably makes sense because if you
7142
05:59:25,640 --> 05:59:29,760
look at the houses in that general area, really these five houses look like
7143
05:59:29,760 --> 05:59:33,240
they're probably the ones that are going to be closest to this hospital over here.
7144
05:59:33,240 --> 05:59:36,640
Moving it to the left decreases the total distance, at least
7145
05:59:36,640 --> 05:59:40,440
to most of these houses, though it does increase that distance for one of them.
7146
05:59:40,440 --> 05:59:43,160
And so we're able to make these improvements to the situation
7147
05:59:43,160 --> 05:59:47,280
by continually finding ways that we can move these hospitals around
7148
05:59:47,280 --> 05:59:50,800
until we eventually settle at this particular state that
7149
05:59:50,800 --> 05:59:54,760
has a cost of 53, where we figured out a position for each of the hospitals.
7150
05:59:54,760 --> 05:59:57,200
And now none of the neighbors that we could move to
7151
05:59:57,200 --> 05:59:59,600
are actually going to improve the situation.
7152
05:59:59,600 --> 06:00:02,280
We can take this hospital and this hospital and that hospital
7153
06:00:02,280 --> 06:00:03,760
and look at each of the neighbors.
7154
06:00:03,760 --> 06:00:07,400
And none of those are going to be better than this particular configuration.
7155
06:00:07,400 --> 06:00:10,040
And again, that's not to say that this is the best we could do.
7156
06:00:10,040 --> 06:00:12,520
There might be some other configuration of hospitals
7157
06:00:12,520 --> 06:00:14,240
that is a global minimum.
7158
06:00:14,240 --> 06:00:18,480
And this might just be a local minimum that is the best of all of its neighbors,
7159
06:00:18,480 --> 06:00:21,880
but maybe not the best in the entire possible state space.
7160
06:00:21,880 --> 06:00:24,000
And you could search through the entire state space
7161
06:00:24,000 --> 06:00:27,440
by considering all of the possible configurations for hospitals.
7162
06:00:27,440 --> 06:00:29,560
But ultimately, that's going to be very time intensive,
7163
06:00:29,560 --> 06:00:31,720
especially as our state space gets bigger and there
7164
06:00:31,720 --> 06:00:33,600
might be more and more possible states.
7165
06:00:33,600 --> 06:00:36,200
It's going to take quite a long time to look through all of them.
7166
06:00:36,200 --> 06:00:39,160
And so being able to use these sort of local search algorithms
7167
06:00:39,160 --> 06:00:42,600
can often be quite good for trying to find the best solution we can do.
7168
06:00:42,600 --> 06:00:45,440
And especially if we don't care about doing the best possible
7169
06:00:45,440 --> 06:00:47,800
and we just care about doing pretty good and finding
7170
06:00:47,800 --> 06:00:50,160
a pretty good placement of those hospitals,
7171
06:00:50,160 --> 06:00:53,240
then these methods can be particularly powerful.
7172
06:00:53,240 --> 06:00:56,080
But of course, we can try and mitigate some of this concern
7173
06:00:56,080 --> 06:00:59,520
by instead of using hill climbing to use random restart,
7174
06:00:59,520 --> 06:01:02,200
this idea of rather than just hill climb one time,
7175
06:01:02,200 --> 06:01:04,280
we can hill climb multiple times and say,
7176
06:01:04,280 --> 06:01:07,280
try hill climbing a whole bunch of times on the exact same map
7177
06:01:07,280 --> 06:01:10,320
and figure out what is the best one that we've been able to find.
7178
06:01:10,320 --> 06:01:14,440
And so I've here implemented a function for random restart
7179
06:01:14,440 --> 06:01:17,600
that restarts some maximum number of times.
7180
06:01:17,600 --> 06:01:22,280
And what we're going to do is repeat that number of times this process of just
7181
06:01:22,280 --> 06:01:24,380
go ahead and run the hill climbing algorithm,
7182
06:01:24,380 --> 06:01:28,000
figure out what the cost is of getting from all the houses to the hospitals,
7183
06:01:28,000 --> 06:01:31,640
and then figure out is this better than we've done so far.
7184
06:01:31,640 --> 06:01:35,120
So I can try this exact same idea where instead of running hill climbing,
7185
06:01:35,120 --> 06:01:37,400
I'll go ahead and run random restart.
7186
06:01:37,400 --> 06:01:41,240
And I'll randomly restart maybe 20 times, for example.
7187
06:01:41,240 --> 06:01:44,240
And we'll go ahead and now I'll remove all the images
7188
06:01:44,240 --> 06:01:46,280
and then rerun the program.
7189
06:01:46,280 --> 06:01:49,200
And now we started by finding an initial state.
7190
06:01:49,200 --> 06:01:51,280
When we initially ran hill climbing, the best cost
7191
06:01:51,280 --> 06:01:53,000
we were able to find was 56.
7192
06:01:53,000 --> 06:01:56,960
Each of these iterations is a different iteration of the hill climbing
7193
06:01:56,960 --> 06:01:57,460
algorithm.
7194
06:01:57,460 --> 06:02:00,400
We're running hill climbing not one time, but 20 times here,
7195
06:02:00,400 --> 06:02:04,400
each time going until we find a local minimum in this case.
7196
06:02:04,400 --> 06:02:06,840
And we look and see each time did we do better
7197
06:02:06,840 --> 06:02:09,080
than we did the best time we've done so far.
7198
06:02:09,080 --> 06:02:11,180
So we went from 56 to 46.
7199
06:02:11,180 --> 06:02:12,720
This one was greater, so we ignored it.
7200
06:02:12,720 --> 06:02:16,440
This one was 41, which was less, so we went ahead and kept that one.
7201
06:02:16,440 --> 06:02:18,800
And for all of the remaining 16 times that we
7202
06:02:18,800 --> 06:02:21,860
tried to implement hill climbing and we tried to run the hill climbing
7203
06:02:21,860 --> 06:02:25,120
algorithm, we couldn't do any better than that 41.
7204
06:02:25,120 --> 06:02:28,000
Again, maybe there is a way to do better that we just didn't find,
7205
06:02:28,000 --> 06:02:31,760
but it looks like that way ended up being a pretty good solution
7206
06:02:31,760 --> 06:02:32,440
to the problem.
7207
06:02:32,440 --> 06:02:36,880
That was attempt number three, counting from zero.
7208
06:02:36,880 --> 06:02:39,720
So we can take a look at that, open up number three.
7209
06:02:39,720 --> 06:02:42,880
And this was the state that happened to have a cost of 41,
7210
06:02:42,880 --> 06:02:45,360
that after running the hill climbing algorithm
7211
06:02:45,360 --> 06:02:48,720
on some particular random initial configuration of hospitals,
7212
06:02:48,720 --> 06:02:51,600
this is what we found was the local minimum in terms
7213
06:02:51,600 --> 06:02:53,040
of trying to minimize the cost.
7214
06:02:53,040 --> 06:02:54,800
And it looks like we did pretty well.
7215
06:02:54,800 --> 06:02:56,980
This hospital is pretty close to this region.
7216
06:02:56,980 --> 06:02:58,860
This one is pretty close to these houses here.
7217
06:02:58,860 --> 06:03:01,120
This hospital looks about as good as we can do
7218
06:03:01,120 --> 06:03:03,760
for trying to capture those houses over on that side.
7219
06:03:03,760 --> 06:03:06,400
And so these sorts of algorithms can be quite useful
7220
06:03:06,400 --> 06:03:09,200
for trying to solve these problems.
7221
06:03:09,200 --> 06:03:12,400
But the real problem with many of these different types of hill climbing,
7222
06:03:12,400 --> 06:03:15,200
steepest-ascent, stochastic, first-choice, and so forth,
7223
06:03:15,200 --> 06:03:18,720
is that they never make a move that makes our situation worse.
7224
06:03:18,720 --> 06:03:21,360
They're always going to take our current state,
7225
06:03:21,360 --> 06:03:24,600
look at the neighbors, and consider can we do better than our current state
7226
06:03:24,600 --> 06:03:26,080
and move to one of those neighbors.
7227
06:03:26,080 --> 06:03:29,080
Which of those neighbors we choose might vary among these various different
7228
06:03:29,080 --> 06:03:32,560
types of algorithms, but we never go from a current position
7229
06:03:32,560 --> 06:03:35,560
to a position that is worse than our current position.
7230
06:03:35,560 --> 06:03:37,800
And ultimately, that's what we're going to need to do
7231
06:03:37,800 --> 06:03:40,920
if we want to be able to find a global maximum or a global minimum.
7232
06:03:40,920 --> 06:03:42,800
Because sometimes if we get stuck, we want
7233
06:03:42,800 --> 06:03:46,000
to find some way of dislodging ourselves from our local maximum
7234
06:03:46,000 --> 06:03:50,000
or local minimum in order to find the global maximum or the global minimum
7235
06:03:50,000 --> 06:03:52,840
or increase the probability that we do find it.
7236
06:03:52,840 --> 06:03:54,640
And so the most popular technique for trying
7237
06:03:54,640 --> 06:03:57,400
to approach the problem from that angle is a technique known
7238
06:03:57,400 --> 06:04:00,120
as simulated annealing, simulated because it's modeled
7239
06:04:00,120 --> 06:04:03,800
after a real physical process of annealing, where you can think about this
7240
06:04:03,800 --> 06:04:06,480
in terms of physics, a physical situation where
7241
06:04:06,480 --> 06:04:08,320
you have some system of particles.
7242
06:04:08,320 --> 06:04:10,200
And you might imagine that when you heat up
7243
06:04:10,200 --> 06:04:12,760
a particular physical system, there's a lot of energy there.
7244
06:04:12,760 --> 06:04:14,680
Things are moving around quite randomly.
7245
06:04:14,680 --> 06:04:17,640
But over time, as the system cools down, it eventually
7246
06:04:17,640 --> 06:04:20,280
settles into some final position.
7247
06:04:20,280 --> 06:04:23,220
And that's going to be the general idea of simulated annealing.
7248
06:04:23,220 --> 06:04:27,240
We're going to simulate that process of some high temperature system where
7249
06:04:27,240 --> 06:04:29,680
things are moving around randomly quite frequently,
7250
06:04:29,680 --> 06:04:32,920
but over time decreasing that temperature until we eventually
7251
06:04:32,920 --> 06:04:35,040
settle at our ultimate solution.
7252
06:04:35,040 --> 06:04:38,240
And the idea is going to be if we have some state space landscape that
7253
06:04:38,240 --> 06:04:42,400
looks like this and we begin at its initial state here,
7254
06:04:42,400 --> 06:04:44,680
if we're looking for a global maximum and we're
7255
06:04:44,680 --> 06:04:46,960
trying to maximize the value of the state,
7256
06:04:46,960 --> 06:04:50,160
our traditional hill climbing algorithms would just take the state
7257
06:04:50,160 --> 06:04:52,160
and look at the two neighbors and always
7258
06:04:52,160 --> 06:04:55,960
pick the one that is going to increase the value of the state.
7259
06:04:55,960 --> 06:04:58,880
But if we want some chance of being able to find the global maximum,
7260
06:04:58,880 --> 06:05:01,540
we can't always make good moves.
7261
06:05:01,540 --> 06:05:04,720
We have to sometimes make bad moves and allow ourselves
7262
06:05:04,720 --> 06:05:08,320
to make a move in a direction that actually seems for now
7263
06:05:08,320 --> 06:05:11,000
to make our situation worse such that later we
7264
06:05:11,000 --> 06:05:14,560
can find our way up to that global maximum in terms
7265
06:05:14,560 --> 06:05:16,160
of trying to solve that problem.
7266
06:05:16,160 --> 06:05:18,200
Of course, once we get up to this global maximum,
7267
06:05:18,200 --> 06:05:20,000
once we've done a whole lot of the searching,
7268
06:05:20,000 --> 06:05:22,360
then we probably don't want to be moving to states
7269
06:05:22,360 --> 06:05:24,080
that are worse than our current state.
7270
06:05:24,080 --> 06:05:26,200
And so this is where this metaphor for annealing
7271
06:05:26,200 --> 06:05:30,120
starts to come in, where we want to start making more random moves
7272
06:05:30,120 --> 06:05:33,440
and over time start to make fewer of those random moves based
7273
06:05:33,440 --> 06:05:36,160
on a particular temperature schedule.
7274
06:05:36,160 --> 06:05:38,240
So the basic outline looks something like this.
7275
06:05:38,240 --> 06:05:42,520
Early on in simulated annealing, we have a higher temperature state.
7276
06:05:42,520 --> 06:05:44,920
And what we mean by a higher temperature state
7277
06:05:44,920 --> 06:05:47,520
is that we are more likely to accept neighbors that
7278
06:05:47,520 --> 06:05:49,200
are worse than our current state.
7279
06:05:49,200 --> 06:05:50,520
We might look at our neighbors.
7280
06:05:50,520 --> 06:05:53,000
And if one of our neighbors is worse than the current state,
7281
06:05:53,000 --> 06:05:54,920
especially if it's not all that much worse,
7282
06:05:54,920 --> 06:05:57,200
if it's pretty close but just slightly worse,
7283
06:05:57,200 --> 06:05:59,960
then we might be more likely to accept that and go ahead
7284
06:05:59,960 --> 06:06:02,120
and move to that neighbor anyways.
7285
06:06:02,120 --> 06:06:04,560
But later on as we run simulated annealing,
7286
06:06:04,560 --> 06:06:06,440
we're going to decrease that temperature.
7287
06:06:06,440 --> 06:06:10,280
And at a lower temperature, we're going to be less likely to accept neighbors
7288
06:06:10,280 --> 06:06:12,800
that are worse than our current state.
7289
06:06:12,800 --> 06:06:15,300
Now to formalize this and put a little bit of pseudocode to it,
7290
06:06:15,300 --> 06:06:17,120
here is what that algorithm might look like.
7291
06:06:17,120 --> 06:06:19,080
We have a function called simulated annealing
7292
06:06:19,080 --> 06:06:21,560
that takes as input the problem we're trying to solve
7293
06:06:21,560 --> 06:06:24,320
and also potentially some maximum number of times
7294
06:06:24,320 --> 06:06:27,600
we might want to run the simulated annealing process, how many different
7295
06:06:27,600 --> 06:06:29,300
neighbors we're going to try and look for.
7296
06:06:29,300 --> 06:06:33,200
And that value is going to vary based on the problem you're trying to solve.
7297
06:06:33,200 --> 06:06:34,880
We'll, again, start with some current state
7298
06:06:34,880 --> 06:06:37,320
that will be equal to the initial state of the problem.
7299
06:06:37,320 --> 06:06:40,760
But now we need to repeat this process over and over
7300
06:06:40,760 --> 06:06:42,600
for max number of times.
7301
06:06:42,600 --> 06:06:45,880
Repeat some process some number of times where we're first
7302
06:06:45,880 --> 06:06:48,120
going to calculate a temperature.
7303
06:06:48,120 --> 06:06:51,160
And this temperature function takes the current time t
7304
06:06:51,160 --> 06:06:53,440
starting at 1 going all the way up to max
7305
06:06:53,440 --> 06:06:57,360
and then gives us some temperature that we can use in our computation,
7306
06:06:57,360 --> 06:07:01,120
where the idea is that this temperature is going to be higher early on
7307
06:07:01,120 --> 06:07:02,840
and it's going to be lower later on.
7308
06:07:02,840 --> 06:07:05,760
So there are a number of ways this temperature function could work.
7309
06:07:05,760 --> 06:07:07,680
One of the simplest ways is just to say it
7310
06:07:07,680 --> 06:07:10,760
is like the proportion of time that we still have remaining.
7311
06:07:10,760 --> 06:07:14,040
Out of max units of time, how much time do we have remaining?
7312
06:07:14,040 --> 06:07:16,160
You start off with a lot of that time remaining.
7313
06:07:16,160 --> 06:07:18,580
And as time goes on, the temperature is going to decrease
7314
06:07:18,580 --> 06:07:22,440
because you have less and less of that remaining time still available to you.
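This linear cooling schedule, where the temperature is just the fraction of allotted time still remaining, can be written directly. It is one common choice among many; the function name is illustrative.

```python
def temperature(t, maximum):
    """Linear cooling: the fraction of the allotted iterations remaining.

    Near 1 early on (t small), falling toward 0 as t approaches maximum.
    """
    return (maximum - t) / maximum
```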
7315
06:07:22,440 --> 06:07:25,200
So we calculate a temperature for the current time.
7316
06:07:25,200 --> 06:07:28,240
And then we pick a random neighbor of the current state.
7317
06:07:28,240 --> 06:07:31,360
No longer are we going to be picking the best neighbor that we possibly can
7318
06:07:31,360 --> 06:07:33,440
or just one of the better neighbors that we can.
7319
06:07:33,440 --> 06:07:34,840
We're going to pick a random neighbor.
7320
06:07:34,840 --> 06:07:35,500
It might be better.
7321
06:07:35,500 --> 06:07:36,280
It might be worse.
7322
06:07:36,280 --> 06:07:37,240
But we're going to calculate that.
7323
06:07:37,240 --> 06:07:40,900
We're going to calculate delta E, E for energy in this case,
7324
06:07:40,900 --> 06:07:45,320
which is just how much better is the neighbor than the current state.
7325
06:07:45,320 --> 06:07:47,840
So if delta E is positive, that means the neighbor
7326
06:07:47,840 --> 06:07:49,360
is better than our current state.
7327
06:07:49,360 --> 06:07:51,760
If delta E is negative, that means the neighbor
7328
06:07:51,760 --> 06:07:53,840
is worse than our current state.
7329
06:07:53,840 --> 06:07:56,120
And so we can then have a condition that looks like this.
7330
06:07:56,120 --> 06:07:59,720
If delta E is greater than 0, that means the neighbor state
7331
06:07:59,720 --> 06:08:01,920
is better than our current state.
7332
06:08:01,920 --> 06:08:05,760
And if ever that situation arises, we'll just go ahead and update current
7333
06:08:05,760 --> 06:08:06,560
to be that neighbor.
7334
06:08:06,560 --> 06:08:09,720
Same as before, move where we are currently to be the neighbor
7335
06:08:09,720 --> 06:08:11,920
because the neighbor is better than our current state.
7336
06:08:11,920 --> 06:08:13,240
We'll go ahead and accept that.
7337
06:08:13,240 --> 06:08:16,240
But now the difference is that whereas before, we never,
7338
06:08:16,240 --> 06:08:19,160
ever wanted to take a move that made our situation worse,
7339
06:08:19,160 --> 06:08:22,360
now we sometimes want to make a move that is actually
7340
06:08:22,360 --> 06:08:24,560
going to make our situation worse because sometimes we're
7341
06:08:24,560 --> 06:08:27,920
going to need to dislodge ourselves from a local minimum or local maximum
7342
06:08:27,920 --> 06:08:31,360
to increase the probability that we're able to find the global minimum
7343
06:08:31,360 --> 06:08:34,120
or the global maximum a little bit later.
7344
06:08:34,120 --> 06:08:35,120
And so how do we do that?
7345
06:08:35,120 --> 06:08:39,520
How do we decide to sometimes accept some state that might actually be worse?
7346
06:08:39,520 --> 06:08:43,160
Well, we're going to accept a worse state with some probability.
7347
06:08:43,160 --> 06:08:46,000
And that probability needs to be based on a couple of factors.
7348
06:08:46,000 --> 06:08:49,080
It needs to be based in part on the temperature,
7349
06:08:49,080 --> 06:08:52,320
where if the temperature is higher, we're more likely to move to a worse
7350
06:08:52,320 --> 06:08:52,920
neighbor.
7351
06:08:52,920 --> 06:08:56,680
And if the temperature is lower, we're less likely to move to a worse neighbor.
7352
06:08:56,680 --> 06:09:00,560
But it also, to some degree, should be based on delta E.
7353
06:09:00,560 --> 06:09:03,480
If the neighbor is much worse than the current state,
7354
06:09:03,480 --> 06:09:05,920
we probably want to be less likely to choose that
7355
06:09:05,920 --> 06:09:09,640
than if the neighbor is just a little bit worse than the current state.
7356
06:09:09,640 --> 06:09:12,080
So again, there are a couple of ways you could calculate this.
7357
06:09:12,080 --> 06:09:14,320
But it turns out one of the most popular is just
7358
06:09:14,320 --> 06:09:19,360
to calculate e to the power of delta E over T, where e is just a constant.
7359
06:09:19,360 --> 06:09:22,960
Delta E and T here are the energy difference and the temperature.
7360
06:09:22,960 --> 06:09:24,560
We calculate that value.
7361
06:09:24,560 --> 06:09:26,760
And that'll be some value between 0 and 1.
7362
06:09:26,760 --> 06:09:29,720
And that is the probability with which we should just say, all right,
7363
06:09:29,720 --> 06:09:31,220
let's go ahead and move to that neighbor.
7364
06:09:31,220 --> 06:09:33,560
And it turns out that if you do the math for this value,
7365
06:09:33,560 --> 06:09:36,240
when delta E is such that the neighbor is not
7366
06:09:36,240 --> 06:09:38,240
that much worse than the current state, that's
7367
06:09:38,240 --> 06:09:41,120
going to be more likely that we're going to go ahead and move to that state.
7368
06:09:41,120 --> 06:09:43,040
And likewise, when the temperature is lower,
7369
06:09:43,040 --> 06:09:47,120
we're going to be less likely to move to that neighboring state as well.
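The pseudocode just described can be turned into a runnable sketch. This applies it to a toy one-dimensional maximization problem; the `value`, `neighbors`, and `temperature` helpers are illustrative assumptions, not the lecture's own code.

```python
import math
import random

def value(x):
    # Toy landscape to maximize, with many local maxima.
    return -abs(x - 60) + 10 * math.sin(x)

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 100]

def temperature(t, maximum):
    # Higher early on, lower later: a linear cooling schedule.
    return (maximum - t) / maximum

def simulated_annealing(start, maximum=1000):
    current = start
    for t in range(1, maximum + 1):
        T = temperature(t, maximum)
        if T <= 0:
            break
        # Pick a random neighbor -- it might be better, it might be worse.
        neighbor = random.choice(neighbors(current))
        delta_e = value(neighbor) - value(current)  # positive => neighbor better
        # Always accept a better neighbor; accept a worse one only with
        # probability e^(delta_e / T), which shrinks as T falls.
        if delta_e > 0 or random.random() < math.exp(delta_e / T):
            current = neighbor
    return current
```

Note how the two factors from the transcript show up in `math.exp(delta_e / T)`: a large negative `delta_e` (a much worse neighbor) or a small `T` (late in the run) both drive the acceptance probability toward 0.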
7370
06:09:47,120 --> 06:09:49,800
So now this is the big picture for simulated annealing,
7371
06:09:49,800 --> 06:09:53,040
this process of taking the problem and going ahead and generating
7372
06:09:53,040 --> 06:09:55,240
random neighbors. We'll always move to a neighbor
7373
06:09:55,240 --> 06:09:56,960
if it's better than our current state.
7374
06:09:56,960 --> 06:09:59,520
But even if the neighbor is worse than our current state,
7375
06:09:59,520 --> 06:10:03,040
we'll sometimes move there depending on how much worse it is
7376
06:10:03,040 --> 06:10:04,680
and also based on the temperature.
7377
06:10:04,680 --> 06:10:07,600
And as a result, the hope, the goal of this whole process
7378
06:10:07,600 --> 06:10:11,640
is that as we begin to try and find our way to the global maximum
7379
06:10:11,640 --> 06:10:14,160
or the global minimum, we can dislodge ourselves
7380
06:10:14,160 --> 06:10:17,040
if we ever get stuck at a local maximum or local minimum
7381
06:10:17,040 --> 06:10:19,660
in order to eventually make our way to exploring
7382
06:10:19,660 --> 06:10:22,080
the part of the state space that is going to be the best.
7383
06:10:22,080 --> 06:10:25,600
And then as the temperature decreases, eventually we settle there
7384
06:10:25,600 --> 06:10:27,840
without moving around too much from what we've
7385
06:10:27,840 --> 06:10:31,440
found to be the globally best thing that we can do thus far.
7386
06:10:31,440 --> 06:10:35,320
So at the very end, we just return whatever the current state happens to be.
7387
06:10:35,320 --> 06:10:37,520
And that is the conclusion of this algorithm.
7388
06:10:37,520 --> 06:10:40,600
We've been able to figure out what the solution is.
7389
06:10:40,600 --> 06:10:44,000
And these types of algorithms have a lot of different applications.
7390
06:10:44,000 --> 06:10:46,400
Any time you can take a problem and formulate it
7391
06:10:46,400 --> 06:10:49,760
as something where you can explore a particular configuration
7392
06:10:49,760 --> 06:10:51,940
and then ask, are any of the neighbors better
7393
06:10:51,940 --> 06:10:54,920
than this current configuration and have some way of measuring that,
7394
06:10:54,920 --> 06:10:58,440
then there is an applicable case for these hill climbing, simulated annealing
7395
06:10:58,440 --> 06:10:59,800
types of algorithms.
7396
06:10:59,800 --> 06:11:02,800
So sometimes it can be for facility location type problems,
7397
06:11:02,800 --> 06:11:05,080
like for when you're trying to plan a city and figure out
7398
06:11:05,080 --> 06:11:06,480
where the hospitals should be.
7399
06:11:06,480 --> 06:11:08,720
But there are definitely other applications as well.
7400
06:11:08,720 --> 06:11:11,200
And one of the most famous problems in computer science
7401
06:11:11,200 --> 06:11:13,240
is the traveling salesman problem.
7402
06:11:13,240 --> 06:11:16,240
Traveling salesman problem generally is formulated like this.
7403
06:11:16,240 --> 06:11:19,360
I have a whole bunch of cities here indicated by these dots.
7404
06:11:19,360 --> 06:11:22,000
And what I'd like to do is find some route that
7405
06:11:22,000 --> 06:11:25,600
takes me through all of the cities and ends up back where I started.
7406
06:11:25,600 --> 06:11:29,120
So some route that starts here, goes through all these cities,
7407
06:11:29,120 --> 06:11:32,080
and ends up back where I originally started.
7408
06:11:32,080 --> 06:11:35,800
And what I might like to do is minimize the total distance
7409
06:11:35,800 --> 06:11:40,040
that I have to travel or the total cost of taking this entire path.
7410
06:11:40,040 --> 06:11:43,720
And you can imagine this is a problem that's very applicable in situations
7411
06:11:43,720 --> 06:11:46,840
like when delivery companies are trying to deliver things
7412
06:11:46,840 --> 06:11:48,640
to a whole bunch of different houses, they
7413
06:11:48,640 --> 06:11:51,040
want to figure out, how do I get from the warehouse
7414
06:11:51,040 --> 06:11:53,980
to all these various different houses and get back again,
7415
06:11:53,980 --> 06:11:57,920
all while using as little time, distance, and energy as possible.
7416
06:11:57,920 --> 06:12:00,840
So you might want to try to solve these sorts of problems.
7417
06:12:00,840 --> 06:12:03,800
But it turns out that solving this particular kind of problem
7418
06:12:03,800 --> 06:12:05,680
is very computationally difficult.
7419
06:12:05,680 --> 06:12:09,320
It is a very computationally expensive task to be able to figure it out.
7420
06:12:09,320 --> 06:12:12,920
This falls under the category of what are known as NP-complete problems,
7421
06:12:12,920 --> 06:12:16,100
problems for which there is no known efficient way to solve
7422
06:12:16,100 --> 06:12:17,560
these sorts of problems.
7423
06:12:17,560 --> 06:12:21,400
And so what we ultimately have to do is come up with some approximation,
7424
06:12:21,400 --> 06:12:25,040
some ways of trying to find a good solution, even if we're not
7425
06:12:25,040 --> 06:12:27,960
going to find the globally best solution that we possibly can,
7426
06:12:27,960 --> 06:12:30,920
at least not in a feasible or tractable amount of time.
7427
06:12:30,920 --> 06:12:34,040
And so what we could do is take the traveling salesman problem
7428
06:12:34,040 --> 06:12:38,160
and try to formulate it using local search and ask a question like, all right,
7429
06:12:38,160 --> 06:12:41,680
I can pick some state, some configuration, some route between all
7430
06:12:41,680 --> 06:12:42,800
of these nodes.
7431
06:12:42,800 --> 06:12:46,040
And I can measure the cost of that state, figure out what the distance is.
7432
06:12:46,040 --> 06:12:49,960
And I might now want to try to minimize that cost as much as possible.
7433
06:12:49,960 --> 06:12:51,920
And then the only question now is, what does it
7434
06:12:51,920 --> 06:12:54,080
mean to have a neighbor of this state?
7435
06:12:54,080 --> 06:12:55,920
What does it mean to take this particular route
7436
06:12:55,920 --> 06:12:59,040
and have some neighboring route that is close to it but slightly different
7437
06:12:59,040 --> 06:13:01,480
and such that it might have a different total distance?
7438
06:13:01,480 --> 06:13:03,440
And there are a number of different definitions
7439
06:13:03,440 --> 06:13:07,000
for what a neighbor of a traveling salesman configuration might look like.
7440
06:13:07,000 --> 06:13:09,280
But one way is just to say, a neighbor is
7441
06:13:09,280 --> 06:13:13,760
what happens if we pick two of these edges between nodes
7442
06:13:13,760 --> 06:13:16,640
and switch them effectively.
7443
06:13:16,640 --> 06:13:19,120
So for example, I might pick these two edges here,
7444
06:13:19,120 --> 06:13:23,200
these two that just happen to cross; this node goes here, this node goes there,
7445
06:13:23,200 --> 06:13:24,880
and go ahead and switch them.
7446
06:13:24,880 --> 06:13:26,920
And what that process will generally look like
7447
06:13:26,920 --> 06:13:31,080
is removing both of these edges from the graph, taking this node,
7448
06:13:31,080 --> 06:13:33,640
and connecting it to the node it wasn't connected to.
7449
06:13:33,640 --> 06:13:35,720
So connecting it up here instead.
7450
06:13:35,720 --> 06:13:37,800
We'll need to take these arrows that were originally
7451
06:13:37,800 --> 06:13:40,880
going this way and reverse them, so move them going the other way,
7452
06:13:40,880 --> 06:13:42,960
and then just fill in that last remaining blank,
7453
06:13:42,960 --> 06:13:45,360
add an arrow that goes in that direction instead.
7454
06:13:45,360 --> 06:13:48,400
So by taking two edges and just switching them,
7455
06:13:48,400 --> 06:13:51,480
I have been able to consider one possible neighbor
7456
06:13:51,480 --> 06:13:53,080
of this particular configuration.
7457
06:13:53,080 --> 06:13:55,160
And it looks like this neighbor is actually better.
7458
06:13:55,160 --> 06:13:57,960
It looks like this probably travels a shorter distance in order
7459
06:13:57,960 --> 06:14:00,480
to get through all the cities through this route
7460
06:14:00,480 --> 06:14:02,080
than the current state did.
7461
06:14:02,080 --> 06:14:05,720
And so you could imagine implementing this idea inside of a hill climbing
7462
06:14:05,720 --> 06:14:08,640
or simulated annealing algorithm, where we repeat this process
7463
06:14:08,640 --> 06:14:11,720
to try and take a state of this traveling salesman problem,
7464
06:14:11,720 --> 06:14:14,720
look at all the neighbors, and then move to the neighbors if they're better,
7465
06:14:14,720 --> 06:14:16,920
or maybe even move to the neighbors if they're worse,
7466
06:14:16,920 --> 06:14:20,120
until we eventually settle upon some best solution
7467
06:14:20,120 --> 06:14:21,760
that we've been able to find.
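This edge-swapping neighbor move (often called 2-opt) can be sketched as reversing a segment of the tour: that removes two edges, reconnects the endpoints differently, and flips the arrows in between, exactly as described. The coordinates and helper names below are made-up illustration data, not the lecture's code.

```python
import math
import random

def tour_length(tour, coords):
    """Total round-trip distance visiting cities in `tour` order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def two_opt_neighbor(tour, i, j):
    """Reverse tour[i..j]: swaps two edges and reverses the path between them."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def hill_climb_tsp(coords, iterations=1000):
    """Repeatedly try random 2-opt moves, keeping any that shorten the tour."""
    tour = list(range(len(coords)))
    random.shuffle(tour)
    for _ in range(iterations):
        i, j = sorted(random.sample(range(len(coords)), 2))
        candidate = two_opt_neighbor(tour, i, j)
        if tour_length(candidate, coords) < tour_length(tour, coords):
            tour = candidate
    return tour
```

For example, on four cities at the corners of a unit square, a tour that crosses the diagonals is longer than the perimeter, and a single 2-opt reversal uncrosses it.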
7468
06:14:21,760 --> 06:14:24,280
And it turns out that these types of approximation algorithms,
7469
06:14:24,280 --> 06:14:26,840
even if they don't always find the very best solution,
7470
06:14:26,840 --> 06:14:32,120
can often do pretty well at trying to find solutions that are helpful too.
7471
06:14:32,120 --> 06:14:36,240
So that then was a look at local search, a particular category of algorithms
7472
06:14:36,240 --> 06:14:38,640
that can be used for solving a particular type of problem,
7473
06:14:38,640 --> 06:14:41,160
where we don't really care about the path to the solution.
7474
06:14:41,160 --> 06:14:43,280
I didn't care about the steps I took to decide
7475
06:14:43,280 --> 06:14:44,760
where the hospitals should go.
7476
06:14:44,760 --> 06:14:46,600
I just cared about the solution itself.
7477
06:14:46,600 --> 06:14:49,040
I just care about where the hospitals should be,
7478
06:14:49,040 --> 06:14:53,520
or what the route through the traveling salesman journey really ought to be.
7479
06:14:53,520 --> 06:14:55,720
Another type of algorithm that might come up
7480
06:14:55,720 --> 06:14:59,120
falls under the category of linear programming problems.
7481
06:14:59,120 --> 06:15:01,640
And linear programming often comes up in the context
7482
06:15:01,640 --> 06:15:04,960
where we're trying to optimize for some mathematical function.
7483
06:15:04,960 --> 06:15:07,640
But oftentimes, linear programming will come up
7484
06:15:07,640 --> 06:15:10,000
when we might have real numbered values.
7485
06:15:10,000 --> 06:15:13,000
So it's not just discrete fixed values that we might have,
7486
06:15:13,000 --> 06:15:16,240
but any decimal values that we might want to be able to calculate.
7487
06:15:16,240 --> 06:15:19,680
And so linear programming is a family of types of problems
7488
06:15:19,680 --> 06:15:22,400
where we might have a situation that looks like this, where
7489
06:15:22,400 --> 06:15:26,640
the goal of linear programming is to minimize a cost function.
7490
06:15:26,640 --> 06:15:29,080
And you can invert the numbers and say try and maximize it,
7491
06:15:29,080 --> 06:15:32,560
but often we'll frame it as trying to minimize a cost function that
7492
06:15:32,560 --> 06:15:36,920
has some number of variables, x1, x2, x3, all the way up to xn,
7493
06:15:36,920 --> 06:15:38,840
just some number of variables that are involved,
7494
06:15:38,840 --> 06:15:41,320
things that I want to know the values to.
7495
06:15:41,320 --> 06:15:43,520
And this cost function might have coefficients
7496
06:15:43,520 --> 06:15:45,000
in front of those variables.
7497
06:15:45,000 --> 06:15:47,520
And this is what we would call a linear equation,
7498
06:15:47,520 --> 06:15:50,240
where we just have all of these variables that might be multiplied
7499
06:15:50,240 --> 06:15:52,040
by a coefficient and then added together.
7500
06:15:52,040 --> 06:15:53,880
We're not going to square anything or cube anything,
7501
06:15:53,880 --> 06:15:56,040
because that'll give us different types of equations.
7502
06:15:56,040 --> 06:15:59,760
With linear programming, we're just dealing with linear equations
7503
06:15:59,760 --> 06:16:03,800
in addition to linear constraints, where a constraint is going
7504
06:16:03,800 --> 06:16:07,440
to look something like if we sum up this particular equation that
7505
06:16:07,440 --> 06:16:10,400
is just some linear combination of all of these variables,
7506
06:16:10,400 --> 06:16:13,280
it is less than or equal to some bound b.
7507
06:16:13,280 --> 06:16:16,400
And we might have a whole number of these various different constraints
7508
06:16:16,400 --> 06:16:21,280
that we might place onto our linear programming exercise.
7509
06:16:21,280 --> 06:16:24,400
And likewise, just as we can have constraints that are saying this linear
7510
06:16:24,400 --> 06:16:27,160
equation is less than or equal to some bound b,
7511
06:16:27,160 --> 06:16:28,760
it might also be equal to something.
7512
06:16:28,760 --> 06:16:31,400
That if you want some sum of some combination of variables
7513
06:16:31,400 --> 06:16:33,960
to be equal to a value, you can specify that.
7514
06:16:33,960 --> 06:16:37,840
And we can also maybe specify that each variable has lower and upper bounds,
7515
06:16:37,840 --> 06:16:39,960
that it needs to be a positive number, for example,
7516
06:16:39,960 --> 06:16:42,960
or it needs to be a number that is less than 50, for example.
7517
06:16:42,960 --> 06:16:44,800
And there are a number of other choices that we
7518
06:16:44,800 --> 06:16:47,920
can make there for defining what the bounds of a variable are.
7519
06:16:47,920 --> 06:16:50,200
But it turns out that if you can take a problem
7520
06:16:50,200 --> 06:16:54,560
and formulate it in these terms, formulate the problem as your goal
7521
06:16:54,560 --> 06:16:56,800
is to minimize a cost function, and you're
7522
06:16:56,800 --> 06:17:00,440
minimizing that cost function subject to particular constraints,
7523
06:17:00,440 --> 06:17:03,880
subject to equations of the form shown here, where some sequence
7524
06:17:03,880 --> 06:17:07,840
of variables is less than a bound or is equal to some particular value,
7525
06:17:07,840 --> 06:17:10,320
then there are a number of algorithms that already
7526
06:17:10,320 --> 06:17:13,960
exist for solving these sorts of problems.
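As a concrete sketch of handing such a formulation to an existing solver, here is the factory example from this lecture expressed for SciPy's linprog (using SciPy is an assumption; any LP solver would do). Minimize 50·x1 + 80·x2 subject to 5·x1 + 2·x2 ≤ 20 (labor) and 10·x1 + 12·x2 ≥ 90 (output); the ≥ constraint is negated to fit linprog's ≤ convention.

```python
from scipy.optimize import linprog

result = linprog(
    c=[50, 80],                     # cost function coefficients to minimize
    A_ub=[[5, 2],                   # labor: 5*x1 + 2*x2 <= 20
          [-10, -12]],              # output: 10*x1 + 12*x2 >= 90, negated
    b_ub=[20, -90],
    bounds=[(0, None), (0, None)],  # hours can't be negative
)

print(result.x)    # optimal hours to run each machine
print(result.fun)  # the minimized total cost
```

The solver explores the constraint boundaries for us, so once a problem is phrased as a linear cost plus linear constraints, no hand-written search is needed.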
7527
06:17:13,960 --> 06:17:16,480
So let's go ahead and take a look at an example.
7528
06:17:16,480 --> 06:17:18,400
Here's an example of a problem that might come up
7529
06:17:18,400 --> 06:17:19,880
in the world of linear programming.
7530
06:17:19,880 --> 06:17:21,600
Often, this is going to come up when we're
7531
06:17:21,600 --> 06:17:23,320
trying to optimize for something.
7532
06:17:23,320 --> 06:17:25,360
And we want to be able to do some calculations,
7533
06:17:25,360 --> 06:17:27,880
and we have constraints on what we're trying to optimize.
7534
06:17:27,880 --> 06:17:29,760
And so it might be something like this.
7535
06:17:29,760 --> 06:17:34,040
In the context of a factory, we have two machines, x1 and x2.
7536
06:17:34,040 --> 06:17:36,480
x1 costs $50 an hour to run.
7537
06:17:36,480 --> 06:17:38,880
x2 costs $80 an hour to run.
7538
06:17:38,880 --> 06:17:41,960
And our goal, what we're trying to do, our objective,
7539
06:17:41,960 --> 06:17:45,040
is to minimize the total cost.
7540
06:17:45,040 --> 06:17:46,560
So that's what we'd like to do.
7541
06:17:46,560 --> 06:17:49,560
But we need to do so subject to certain constraints.
7542
06:17:49,560 --> 06:17:51,640
So there might be a labor constraint that x1
7543
06:17:51,640 --> 06:17:56,720
requires five units of labor per hour, x2 requires two units of labor per hour,
7544
06:17:56,720 --> 06:18:00,080
and we have a total of 20 units of labor that we have to spend.
7545
06:18:00,080 --> 06:18:01,040
So this is a constraint.
7546
06:18:01,040 --> 06:18:04,800
We have no more than 20 units of labor that we can spend,
7547
06:18:04,800 --> 06:18:08,120
and we have to spend it across x1 and x2, each of which
7548
06:18:08,120 --> 06:18:10,840
requires a different amount of labor.
7549
06:18:10,840 --> 06:18:13,120
And we might also have a constraint like this
7550
06:18:13,120 --> 06:18:16,760
that tells us x1 is going to produce 10 units of output per hour,
7551
06:18:16,760 --> 06:18:19,800
x2 is going to produce 12 units of output per hour,
7552
06:18:19,800 --> 06:18:22,640
and the company needs 90 units of output.
7553
06:18:22,640 --> 06:18:24,760
So we have some goal, something we need to achieve.
7554
06:18:24,760 --> 06:18:28,240
We need to achieve 90 units of output, but there are some constraints
7555
06:18:28,240 --> 06:18:31,060
that x1 can only produce 10 units of output per hour,
7556
06:18:31,060 --> 06:18:34,040
x2 produces 12 units of output per hour.
7557
06:18:34,040 --> 06:18:36,560
These types of problems come up quite frequently,
7558
06:18:36,560 --> 06:18:39,360
and you can start to notice patterns in these types of problems,
7559
06:18:39,360 --> 06:18:43,280
problems where I am trying to optimize for some goal, minimizing cost,
7560
06:18:43,280 --> 06:18:46,800
maximizing output, maximizing profits, or something like that.
7561
06:18:46,800 --> 06:18:50,040
And there are constraints that are placed on that process.
7562
06:18:50,040 --> 06:18:52,520
And so now we just need to formulate this problem
7563
06:18:52,520 --> 06:18:55,120
in terms of linear equations.
7564
06:18:55,120 --> 06:18:56,560
So let's start with this first point.
7565
06:18:56,560 --> 06:19:01,760
Two machines, x1 and x2; x1 costs $50 an hour, x2 costs $80 an hour.
7566
06:19:01,760 --> 06:19:05,360
Here we can come up with an objective function that might look like this.
7567
06:19:05,360 --> 06:19:07,280
This is our cost function, rather.
7568
06:19:07,280 --> 06:19:11,160
50 times x1 plus 80 times x2, where x1 is going
7569
06:19:11,160 --> 06:19:15,680
to be a variable representing how many hours do we run machine x1 for,
7570
06:19:15,680 --> 06:19:18,280
x2 is going to be a variable representing how many hours
7571
06:19:18,280 --> 06:19:20,280
are we running machine x2 for.
7572
06:19:20,280 --> 06:19:23,760
And what we're trying to minimize is this cost function, which
7573
06:19:23,760 --> 06:19:27,640
is just how much it costs to run each of these machines per hour summed up.
7574
06:19:27,640 --> 06:19:31,080
This is an example of a linear equation, just some combination
7575
06:19:31,080 --> 06:19:34,360
of these variables plus coefficients that are placed in front of them.
7576
06:19:34,360 --> 06:19:37,040
And I would like to minimize that total value.
7577
06:19:37,040 --> 06:19:40,360
But I need to do so subject to these constraints.
7578
06:19:40,360 --> 06:19:44,200
x1 requires 5 units of labor per hour, x2 requires 2,
7579
06:19:44,200 --> 06:19:46,800
and we have a total of 20 units of labor to spend.
7580
06:19:46,800 --> 06:19:50,200
And so that gives us a constraint of this form.
7581
06:19:50,200 --> 06:19:54,600
5 times x1 plus 2 times x2 is less than or equal to 20.
7582
06:19:54,600 --> 06:19:57,680
20 is the total number of units of labor we have to spend.
7583
06:19:57,680 --> 06:20:00,600
And that's spent across x1 and x2, each of which
7584
06:20:00,600 --> 06:20:05,120
requires a different number of units of labor per hour, for example.
7585
06:20:05,120 --> 06:20:07,360
And finally, we have this constraint here.
7586
06:20:07,360 --> 06:20:10,840
x1 produces 10 units of output per hour, x2 produces 12,
7587
06:20:10,840 --> 06:20:13,640
and we need 90 units of output.
7588
06:20:13,640 --> 06:20:15,920
And so this might look something like this.
7589
06:20:15,920 --> 06:20:20,080
That 10x1 plus 12x2, the total amount of output,
7590
06:20:20,080 --> 06:20:21,920
it needs to be at least 90.
7591
06:20:21,920 --> 06:20:25,240
We can do better than 90, but it needs to be at least 90.
7592
06:20:25,240 --> 06:20:27,320
And if you recall from my formulation before,
7593
06:20:27,320 --> 06:20:29,760
I said that generally speaking in linear programming,
7594
06:20:29,760 --> 06:20:33,520
we deal with equals constraints or less than or equal to constraints.
7595
06:20:33,520 --> 06:20:35,520
So we have a greater than or equal to sign here.
7596
06:20:35,520 --> 06:20:36,320
That's not a problem.
7597
06:20:36,320 --> 06:20:38,240
Whenever we have a greater than or equal to sign,
7598
06:20:38,240 --> 06:20:40,640
we can just multiply the equation by negative 1,
7599
06:20:40,640 --> 06:20:44,040
and that'll flip it around to a less than or equal to negative 90,
7600
06:20:44,040 --> 06:20:47,440
for example, instead of a greater than or equal to 90.
7601
06:20:47,440 --> 06:20:49,440
And that's going to be an equivalent expression
7602
06:20:49,440 --> 06:20:51,920
that we can use to represent this problem.
7603
06:20:51,920 --> 06:20:55,840
So now that we have this cost function and these constraints
7604
06:20:55,840 --> 06:20:58,920
that it's subject to, it turns out there are a number of algorithms
7605
06:20:58,920 --> 06:21:02,080
that can be used in order to solve these types of problems.
7606
06:21:02,080 --> 06:21:05,400
And these problems go a little bit more into geometry and linear algebra
7607
06:21:05,400 --> 06:21:06,840
than we're really going to get into.
7608
06:21:06,840 --> 06:21:09,640
But the most popular of these types of algorithms
7609
06:21:09,640 --> 06:21:12,640
are simplex, which was one of the first algorithms discovered
7610
06:21:12,640 --> 06:21:14,720
for trying to solve linear programs.
7611
06:21:14,720 --> 06:21:17,680
And later on, a class of interior point algorithms
7612
06:21:17,680 --> 06:21:20,240
can be used to solve this type of problem as well.
7613
06:21:20,240 --> 06:21:23,120
The key is not to understand exactly how these algorithms work,
7614
06:21:23,120 --> 06:21:27,080
but to realize that these algorithms exist for efficiently finding solutions
7615
06:21:27,080 --> 06:21:30,400
any time we have a problem of this particular form.
7616
06:21:30,400 --> 06:21:39,760
And so we can take a look, for example, at the production directory here,
7617
06:21:39,760 --> 06:21:43,560
where here I have a file called production.py, where here I'm
7618
06:21:43,560 --> 06:21:47,920
using scipy, which is a library for a lot of science-related functions
7619
06:21:47,920 --> 06:21:49,000
within Python.
7620
06:21:49,000 --> 06:21:52,760
And I can go ahead and just run this optimization function
7621
06:21:52,760 --> 06:21:54,560
in order to run a linear program.
7622
06:21:54,560 --> 06:21:58,000
.linprog here is going to try and solve this linear program for me,
7623
06:21:58,000 --> 06:22:01,720
where I provide to this expression, to this function call,
7624
06:22:01,720 --> 06:22:03,520
all of the data about my linear program.
7625
06:22:03,520 --> 06:22:05,480
So it needs to be in a particular format, which
7626
06:22:05,480 --> 06:22:07,200
might be a little confusing at first.
7627
06:22:07,200 --> 06:22:11,000
But this first argument to scipy.optimize.linprog
7628
06:22:11,000 --> 06:22:15,040
is the cost function, which is in this case just an array or a list that
7629
06:22:15,040 --> 06:22:20,200
has 50 and 80, because my original cost function was 50 times x1 plus 80
7630
06:22:20,200 --> 06:22:21,280
times x2.
7631
06:22:21,280 --> 06:22:25,040
So I just tell Python, 50 and 80, those are the coefficients
7632
06:22:25,040 --> 06:22:27,960
that I am now trying to optimize for.
7633
06:22:27,960 --> 06:22:30,920
And then I provide all of the constraints.
7634
06:22:30,920 --> 06:22:33,560
So the constraints, and I wrote them up above in comments,
7635
06:22:33,560 --> 06:22:39,280
is the constraint 1 is 5x1 plus 2x2 is less than or equal to 20.
7636
06:22:39,280 --> 06:22:44,600
And constraint 2 is negative 10x1 plus negative 12x2
7637
06:22:44,600 --> 06:22:47,120
is less than or equal to negative 90.
7638
06:22:47,120 --> 06:22:51,440
And so scipy expects these constraints to be in a particular format.
7639
06:22:51,440 --> 06:22:54,680
It first expects me to provide all of the coefficients
7640
06:22:54,680 --> 06:22:58,440
for the upper bound equations, ub just for upper bound,
7641
06:22:58,440 --> 06:23:00,480
where the coefficients of the first equation
7642
06:23:00,480 --> 06:23:03,960
are 5 and 2, because we have 5x1 and 2x2.
7643
06:23:03,960 --> 06:23:06,120
And the coefficients for the second equation
7644
06:23:06,120 --> 06:23:12,560
are negative 10 and negative 12, because I have negative 10x1 plus negative 12x2.
7645
06:23:12,560 --> 06:23:14,880
And then here, we provide it as a separate argument,
7646
06:23:14,880 --> 06:23:17,520
just to keep things separate, what the actual bound is.
7647
06:23:17,520 --> 06:23:20,160
What is the upper bound for each of these constraints?
7648
06:23:20,160 --> 06:23:22,440
Well, for the first constraint, the upper bound is 20.
7649
06:23:22,440 --> 06:23:24,120
That was constraint number 1.
7650
06:23:24,120 --> 06:23:28,200
And then for constraint number 2, the upper bound is negative 90.
7651
06:23:28,200 --> 06:23:30,240
So a bit of a cryptic way of representing it.
7652
06:23:30,240 --> 06:23:33,680
It's not quite as simple as just writing the mathematical equations.
7653
06:23:33,680 --> 06:23:36,800
What really is being expected here are all of the coefficients
7654
06:23:36,800 --> 06:23:39,000
and all of the numbers that are in these equations
7655
06:23:39,000 --> 06:23:42,120
by first providing the coefficients for the cost function,
7656
06:23:42,120 --> 06:23:45,880
then providing all the coefficients for the inequality constraints,
7657
06:23:45,880 --> 06:23:50,560
and then providing all of the upper bounds for those inequality constraints.
7658
06:23:50,560 --> 06:23:52,880
And once all of that information is there,
7659
06:23:52,880 --> 06:23:57,080
then we can run any of these interior point algorithms or the simplex algorithm.
7660
06:23:57,080 --> 06:23:59,000
Even if you don't understand how it works,
7661
06:23:59,000 --> 06:24:02,640
you can just run the function and figure out what the result should be.
7662
06:24:02,640 --> 06:24:04,520
And here, I said if the result is a success,
7663
06:24:04,520 --> 06:24:06,520
we were able to solve this problem.
7664
06:24:06,520 --> 06:24:10,640
Go ahead and print out what the value of x1 and x2 should be.
7665
06:24:10,640 --> 06:24:13,440
Otherwise, go ahead and print out no solution.
7666
06:24:13,440 --> 06:24:19,760
And so if I run this program by running python production.py,
7667
06:24:19,760 --> 06:24:21,440
it takes a second to calculate.
7668
06:24:21,440 --> 06:24:24,520
But then we see here is what the optimal solution should be.
7669
06:24:24,520 --> 06:24:26,960
x1 should run for 1.5 hours.
7670
06:24:26,960 --> 06:24:30,080
x2 should run for 6.25 hours.
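A sketch of the production.py program described here, assuming scipy is installed; the exact file contents aren't shown in the transcript, so the structure below is a reconstruction from the description:

```python
# Minimize 50*x1 + 80*x2 subject to:
#   5*x1 + 2*x2 <= 20      (no more than 20 units of labor)
#   10*x1 + 12*x2 >= 90    (flipped to -10*x1 - 12*x2 <= -90)
import scipy.optimize

result = scipy.optimize.linprog(
    [50, 80],                   # cost function coefficients: 50x1 + 80x2
    A_ub=[[5, 2], [-10, -12]],  # coefficients of the inequality constraints
    b_ub=[20, -90],             # upper bounds for those constraints
)

if result.success:
    print(f"x1: {round(result.x[0], 2)} hours")
    print(f"x2: {round(result.x[1], 2)} hours")
else:
    print("No solution")
```

Running this prints the optimum described above: x1 runs for 1.5 hours and x2 for 6.25 hours.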
7671
06:24:30,080 --> 06:24:33,160
And we were able to do this by just formulating the problem
7672
06:24:33,160 --> 06:24:36,120
as a linear equation that we were trying to optimize,
7673
06:24:36,120 --> 06:24:38,200
some cost that we were trying to minimize,
7674
06:24:38,200 --> 06:24:40,440
and then some constraints that were placed on that.
7675
06:24:40,440 --> 06:24:43,600
And many, many problems fall into this category of problems
7676
06:24:43,600 --> 06:24:47,400
that you can solve if you can just figure out how to use equations
7677
06:24:47,400 --> 06:24:51,200
and use these constraints to represent that general idea.
7678
06:24:51,200 --> 06:24:53,400
And that's a theme that's going to come up a couple of times today,
7679
06:24:53,400 --> 06:24:55,600
where we want to be able to take some problem
7680
06:24:55,600 --> 06:24:57,640
and reduce it down to some problem we know
7681
06:24:57,640 --> 06:25:01,920
how to solve, so that we can then
7682
06:25:01,920 --> 06:25:04,400
use existing methods
7683
06:25:04,400 --> 06:25:08,120
to find a solution more effectively or more efficiently.
7684
06:25:08,120 --> 06:25:11,400
And it turns out that these types of problems, where we have constraints,
7685
06:25:11,400 --> 06:25:13,040
show up in other ways too.
7686
06:25:13,040 --> 06:25:16,320
And there's an entire class of problems that's more generally just known
7687
06:25:16,320 --> 06:25:18,880
as constraint satisfaction problems.
7688
06:25:18,880 --> 06:25:21,600
And we're going to now take a look at how you might formulate a constraint
7689
06:25:21,600 --> 06:25:24,640
satisfaction problem and how you might go about solving a constraint
7690
06:25:24,640 --> 06:25:26,000
satisfaction problem.
7691
06:25:26,000 --> 06:25:28,920
But the basic idea of a constraint satisfaction problem
7692
06:25:28,920 --> 06:25:32,400
is we have some number of variables that need to take on some values.
7693
06:25:32,400 --> 06:25:35,720
And we need to figure out what values each of those variables should take on.
7694
06:25:35,720 --> 06:25:39,240
But those variables are subject to particular constraints
7695
06:25:39,240 --> 06:25:43,440
that are going to limit what values those variables can actually take on.
7696
06:25:43,440 --> 06:25:46,560
So let's take a look at a real-world example.
7697
06:25:46,560 --> 06:25:48,520
Let's look at exam scheduling, that I have
7698
06:25:48,520 --> 06:25:51,440
four students here, students 1, 2, 3, and 4.
7699
06:25:51,440 --> 06:25:53,960
Each of them is taking some number of different classes.
7700
06:25:53,960 --> 06:25:56,440
Classes here are going to be represented by letters.
7701
06:25:56,440 --> 06:26:00,760
So student 1 is enrolled in courses A, B, and C. Student 2
7702
06:26:00,760 --> 06:26:04,480
is enrolled in courses B, D, and E, so on and so forth.
7703
06:26:04,480 --> 06:26:07,240
And now, say a university, for example, is trying
7704
06:26:07,240 --> 06:26:10,080
to schedule exams for all of these courses.
7705
06:26:10,080 --> 06:26:13,960
But there are only three exam slots on Monday, Tuesday, and Wednesday.
7706
06:26:13,960 --> 06:26:17,240
And we have to schedule an exam for each of these courses.
7707
06:26:17,240 --> 06:26:19,280
But the constraint now, the constraint we
7708
06:26:19,280 --> 06:26:21,160
have to deal with with the scheduling, is
7709
06:26:21,160 --> 06:26:25,160
that we don't want anyone to have to take two exams on the same day.
7710
06:26:25,160 --> 06:26:29,560
We would like to try and minimize that or eliminate it if at all possible.
7711
06:26:29,560 --> 06:26:31,720
So how do we begin to represent this idea?
7712
06:26:31,720 --> 06:26:35,920
How do we structure this in a way that a computer with an AI algorithm
7713
06:26:35,920 --> 06:26:37,760
can begin to try and solve the problem?
7714
06:26:37,760 --> 06:26:41,240
Well, let's in particular just look at these classes that we might take
7715
06:26:41,240 --> 06:26:45,920
and represent each of the courses as some node inside of a graph.
7716
06:26:45,920 --> 06:26:49,560
And what we'll do is we'll create an edge between two nodes in this graph
7717
06:26:49,560 --> 06:26:54,360
if there is a constraint between those two nodes.
7718
06:26:54,360 --> 06:26:55,440
So what does this mean?
7719
06:26:55,440 --> 06:26:59,840
Well, we can start with student 1, who's enrolled in courses A, B, and C.
7720
06:26:59,840 --> 06:27:03,880
What that means is that A and B can't have an exam at the same time.
7721
06:27:03,880 --> 06:27:06,280
A and C can't have an exam at the same time.
7722
06:27:06,280 --> 06:27:09,160
And B and C also can't have an exam at the same time.
7723
06:27:09,160 --> 06:27:12,200
And I can represent that in this graph by just drawing edges.
7724
06:27:12,200 --> 06:27:15,200
One edge between A and B, one between B and C,
7725
06:27:15,200 --> 06:27:18,960
and then one between C and A. And that encodes now the idea
7726
06:27:18,960 --> 06:27:21,680
that between those nodes, there is a constraint.
7727
06:27:21,680 --> 06:27:23,800
And in particular, the constraint happens to be
7728
06:27:23,800 --> 06:27:25,760
that these two can't be equal to each other,
7729
06:27:25,760 --> 06:27:28,080
though there are other types of constraints that are possible,
7730
06:27:28,080 --> 06:27:31,240
depending on the type of problem that you're trying to solve.
7731
06:27:31,240 --> 06:27:34,000
And then we can do the same thing for each of the other students.
7732
06:27:34,000 --> 06:27:36,920
So for student 2, who's enrolled in courses B, D, and E,
7733
06:27:36,920 --> 06:27:39,080
well, that means B, D, and E, those all need
7734
06:27:39,080 --> 06:27:41,240
to have edges that connect each other as well.
7735
06:27:41,240 --> 06:27:44,520
Student 3 is enrolled in courses C, E, and F. So we'll go ahead
7736
06:27:44,520 --> 06:27:48,640
and take C, E, and F and connect those by drawing edges between them too.
7737
06:27:48,640 --> 06:27:52,240
And then finally, student 4 is enrolled in courses E, F, and G.
7738
06:27:52,240 --> 06:27:55,400
And we can represent that by drawing edges between E, F, and G,
7739
06:27:55,400 --> 06:27:57,440
although E and F already had an edge between them.
7740
06:27:57,440 --> 06:27:59,520
We don't need another one, because this constraint
7741
06:27:59,520 --> 06:28:03,400
is just encoding the idea that course E and course F cannot have
7742
06:28:03,400 --> 06:28:05,640
an exam on the same day.
7743
06:28:05,640 --> 06:28:09,360
So this then is what we might call the constraint graph.
7744
06:28:09,360 --> 06:28:13,040
There's some graphical representation of all of my variables,
7745
06:28:13,040 --> 06:28:16,960
so to speak, and the constraints between those possible variables.
7746
06:28:16,960 --> 06:28:19,800
Where in this particular case, each of the constraints
7747
06:28:19,800 --> 06:28:23,560
represents an inequality constraint, that an edge between B and D
7748
06:28:23,560 --> 06:28:27,040
means whatever value the variable B takes on cannot be the value
7749
06:28:27,040 --> 06:28:30,440
that the variable D takes on as well.
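That constraint graph can be built mechanically from the enrollments; here's a minimal sketch (the data comes from the example above, but the variable names are my own):

```python
# Build the exam-scheduling constraint graph: one node per course,
# one edge per pair of courses that share at least one student.
from itertools import combinations

enrollments = {
    1: {"A", "B", "C"},
    2: {"B", "D", "E"},
    3: {"C", "E", "F"},
    4: {"E", "F", "G"},
}

edges = set()
for courses in enrollments.values():
    # Any two courses with a common student can't share an exam slot.
    for x, y in combinations(sorted(courses), 2):
        edges.add(frozenset((x, y)))

print(sorted(tuple(sorted(edge)) for edge in edges))
```

Note that the shared E-F edge appears only once in the set, matching the observation that a duplicate edge adds no new information.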
7750
06:28:30,440 --> 06:28:33,720
So what then actually is a constraint satisfaction problem?
7751
06:28:33,720 --> 06:28:38,000
Well, a constraint satisfaction problem is just some set of variables, x1
7752
06:28:38,000 --> 06:28:42,360
all the way through xn, some set of domains for each of those variables.
7753
06:28:42,360 --> 06:28:45,120
So every variable needs to take on some values.
7754
06:28:45,120 --> 06:28:47,040
Maybe every variable has the same domain,
7755
06:28:47,040 --> 06:28:49,640
but maybe each variable has a slightly different domain.
7756
06:28:49,640 --> 06:28:52,800
And then there's a set of constraints, and we'll just call a set C,
7757
06:28:52,800 --> 06:28:55,600
that is some constraints that are placed upon these variables,
7758
06:28:55,600 --> 06:28:58,000
like x1 is not equal to x2.
7759
06:28:58,000 --> 06:29:02,120
But there could be other forms too, like maybe x1 equals x2 plus 1
7760
06:29:02,120 --> 06:29:05,760
if these variables are taking on numerical values in their domain,
7761
06:29:05,760 --> 06:29:06,400
for example.
7762
06:29:06,400 --> 06:29:10,720
The types of constraints are going to vary based on the types of problems.
7763
06:29:10,720 --> 06:29:14,080
And constraint satisfaction shows up all over the place as well,
7764
06:29:14,080 --> 06:29:16,400
in any situation where we have variables that
7765
06:29:16,400 --> 06:29:19,200
are subject to particular constraints.
7766
06:29:19,200 --> 06:29:23,200
So one popular game is Sudoku, for example, this 9 by 9 grid
7767
06:29:23,200 --> 06:29:25,600
where you need to fill in numbers in each of these cells,
7768
06:29:25,600 --> 06:29:29,880
but you want to make sure there's never a duplicate number in any row,
7769
06:29:29,880 --> 06:29:34,240
or in any column, or in any grid of 3 by 3 cells, for example.
7770
06:29:34,240 --> 06:29:37,840
So what might this look like as a constraint satisfaction problem?
7771
06:29:37,840 --> 06:29:41,880
Well, my variables are all of the empty squares in the puzzle.
7772
06:29:41,880 --> 06:29:45,560
Each is represented here as an (x, y) coordinate, for example,
7773
06:29:45,560 --> 06:29:48,080
as all of the squares where I need to plug in a value,
7774
06:29:48,080 --> 06:29:50,600
where I don't know what value it should take on.
7775
06:29:50,600 --> 06:29:54,760
The domain is just going to be all of the numbers from 1 through 9,
7776
06:29:54,760 --> 06:29:57,360
any value that I could fill in to one of these cells.
7777
06:29:57,360 --> 06:30:00,200
So that is going to be the domain for each of these variables.
7778
06:30:00,200 --> 06:30:02,800
And then the constraints are going to be of the form,
7779
06:30:02,800 --> 06:30:05,760
like this cell can't be equal to this cell, can't be equal to this cell,
7780
06:30:05,760 --> 06:30:08,360
and so on, and all of these need to be different, for example,
7781
06:30:08,360 --> 06:30:12,760
and same for all of the rows, and the columns, and the 3 by 3 squares as well.
7782
06:30:12,760 --> 06:30:17,920
So those constraints are going to enforce what values are actually allowed.
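Enumerating those Sudoku not-equal constraints is mechanical; a sketch, assuming (row, column) coordinates for the cells:

```python
# Generate Sudoku's binary "not equal" constraints: every pair of cells
# sharing a row, a column, or a 3x3 box must take on different values.
from itertools import combinations

cells = [(row, col) for row in range(9) for col in range(9)]

constraints = set()
for a, b in combinations(cells, 2):
    same_row = a[0] == b[0]
    same_col = a[1] == b[1]
    same_box = (a[0] // 3, a[1] // 3) == (b[0] // 3, b[1] // 3)
    if same_row or same_col or same_box:
        constraints.add((a, b))

# Each cell conflicts with 20 others (8 in its row, 8 in its column,
# and 4 more in its box), giving 81 * 20 / 2 = 810 constraints.
print(len(constraints))
```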
7783
06:30:17,920 --> 06:30:21,240
And we can formulate the same idea in the case of this exam scheduling
7784
06:30:21,240 --> 06:30:25,800
problem, where the variables we have are the different courses, a up through g.
7785
06:30:25,800 --> 06:30:29,560
The domain for each of these variables is going to be Monday, Tuesday,
7786
06:30:29,560 --> 06:30:30,120
and Wednesday.
7787
06:30:30,120 --> 06:30:33,560
Those are the possible values each of the variables can take on,
7788
06:30:33,560 --> 06:30:38,120
that in this case just represent when is the exam for that class.
7789
06:30:38,120 --> 06:30:41,600
And then the constraints are of this form, a is not equal to b,
7790
06:30:41,600 --> 06:30:45,920
a is not equal to c, meaning a and b can't have an exam on the same day,
7791
06:30:45,920 --> 06:30:48,240
a and c can't have an exam on the same day.
7792
06:30:48,240 --> 06:30:53,040
Or more formally, these two variables cannot take on the same value
7793
06:30:53,040 --> 06:30:56,080
within their domain.
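With only seven variables and three values each, this particular scheduling problem is small enough to check by brute force; a sketch using the constraints from the constraint graph above (smarter algorithms come later, this is just to make the formulation concrete):

```python
# Brute-force the exam-scheduling CSP: 7 courses, 3 days,
# and a not-equal constraint for every edge in the constraint graph.
from itertools import product

variables = ["A", "B", "C", "D", "E", "F", "G"]
domain = ["Monday", "Tuesday", "Wednesday"]

constraints = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("B", "D"), ("B", "E"), ("D", "E"),
    ("C", "E"), ("C", "F"), ("E", "F"),
    ("E", "G"), ("F", "G"),
]

solution = None
for values in product(domain, repeat=len(variables)):
    assignment = dict(zip(variables, values))
    if all(assignment[x] != assignment[y] for x, y in constraints):
        solution = assignment
        break

print(solution)
```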
7794
06:30:56,080 --> 06:31:00,040
So that then is this formulation of a constraint satisfaction problem
7795
06:31:00,040 --> 06:31:03,160
that we can begin to use to try and solve this problem.
7796
06:31:03,160 --> 06:31:05,400
And constraints can come in a number of different forms.
7797
06:31:05,400 --> 06:31:07,800
There are hard constraints, which are constraints
7798
06:31:07,800 --> 06:31:10,280
that must be satisfied for a correct solution.
7799
06:31:10,280 --> 06:31:14,240
So something like in the Sudoku puzzle, you cannot have this cell
7800
06:31:14,240 --> 06:31:17,360
and this cell that are in the same row take on the same value.
7801
06:31:17,360 --> 06:31:18,960
That is a hard constraint.
7802
06:31:18,960 --> 06:31:21,080
But problems can also have soft constraints,
7803
06:31:21,080 --> 06:31:24,040
where these are constraints that express some notion of preference,
7804
06:31:24,040 --> 06:31:27,840
that maybe a and b can't have an exam on the same day,
7805
06:31:27,840 --> 06:31:32,200
but maybe someone has a preference that a's exam is earlier than b's exam.
7806
06:31:32,200 --> 06:31:34,520
It doesn't need to be the case; it's just some expression
7807
06:31:34,520 --> 06:31:37,560
that some solution is better than another solution.
7808
06:31:37,560 --> 06:31:39,680
And in that case, you might formulate the problem
7809
06:31:39,680 --> 06:31:43,000
as trying to optimize for maximizing people's preferences.
7810
06:31:43,000 --> 06:31:46,880
You want people's preferences to be satisfied as much as possible.
7811
06:31:46,880 --> 06:31:49,840
In this case, though, we'll mostly just deal with hard constraints,
7812
06:31:49,840 --> 06:31:54,280
constraints that must be met in order to have a correct solution to the problem.
7813
06:31:54,280 --> 06:31:57,600
So we want to figure out some assignment of these variables
7814
06:31:57,600 --> 06:32:00,080
to their particular values that is ultimately
7815
06:32:00,080 --> 06:32:02,360
going to give us a solution to the problem
7816
06:32:02,360 --> 06:32:05,760
by allowing us to assign some day to each of the classes
7817
06:32:05,760 --> 06:32:09,240
such that we don't have any conflicts between classes.
7818
06:32:09,240 --> 06:32:11,960
So it turns out that we can classify the constraints
7819
06:32:11,960 --> 06:32:16,200
in a constraint satisfaction problem into a number of different categories.
7820
06:32:16,200 --> 06:32:18,440
The first of those categories are perhaps the simplest
7821
06:32:18,440 --> 06:32:21,880
of the types of constraints, which are known as unary constraints,
7822
06:32:21,880 --> 06:32:26,200
where unary constraint is a constraint that just involves a single variable.
7823
06:32:26,200 --> 06:32:28,680
For example, a unary constraint might be something like,
7824
06:32:28,680 --> 06:32:33,360
a does not equal Monday, meaning Course A cannot have its exam on Monday.
7825
06:32:33,360 --> 06:32:35,320
If for some reason the instructor for the course
7826
06:32:35,320 --> 06:32:38,440
isn't available on Monday, you might have a constraint in your problem
7827
06:32:38,440 --> 06:32:41,640
that looks like this, something that just has a single variable a in it,
7828
06:32:41,640 --> 06:32:44,480
and maybe says a is not equal to Monday, or a is equal to something,
7829
06:32:44,480 --> 06:32:47,280
or in the case of numbers greater than or less than something,
7830
06:32:47,280 --> 06:32:51,920
a constraint that just has one variable, we consider to be a unary constraint.
7831
06:32:51,920 --> 06:32:55,280
And this is in contrast to something like a binary constraint, which
7832
06:32:55,280 --> 06:32:58,320
is a constraint that involves two variables, for example.
7833
06:32:58,320 --> 06:33:01,440
So this would be a constraint like the ones we were looking at before.
7834
06:33:01,440 --> 06:33:06,680
Something like a does not equal b is an example of a binary constraint,
7835
06:33:06,680 --> 06:33:10,560
because it is a constraint that has two variables involved in it, a and b.
7836
06:33:10,560 --> 06:33:14,880
And we represented that using some arc or some edge that
7837
06:33:14,880 --> 06:33:17,960
connects variable a to variable b.
7838
06:33:17,960 --> 06:33:20,440
And using this knowledge of, OK, what is a unary constraint?
7839
06:33:20,440 --> 06:33:21,880
What is a binary constraint?
7840
06:33:21,880 --> 06:33:23,640
There are different types of things we can
7841
06:33:23,640 --> 06:33:27,000
say about a particular constraint satisfaction problem.
7842
06:33:27,000 --> 06:33:31,600
And one thing we can say is we can try and make the problem node consistent.
7843
06:33:31,600 --> 06:33:33,360
So what does node consistency mean?
7844
06:33:33,360 --> 06:33:36,800
Node consistency means that we have all of the values
7845
06:33:36,800 --> 06:33:41,480
in a variable's domain satisfying that variable's unary constraints.
7846
06:33:41,480 --> 06:33:45,120
So for each of the variables inside of our constraint satisfaction problem,
7847
06:33:45,120 --> 06:33:48,840
if all of the values satisfy the unary constraints
7848
06:33:48,840 --> 06:33:53,040
for that particular variable, we can say that the entire problem is node
7849
06:33:53,040 --> 06:33:56,040
consistent, or we can even say that a particular variable is
7850
06:33:56,040 --> 06:34:00,680
node consistent if we just want to make one node consistent within itself.
7851
06:34:00,680 --> 06:34:02,320
So what does that actually look like?
7852
06:34:02,320 --> 06:34:04,480
Let's now look at a simplified example, where
7853
06:34:04,480 --> 06:34:06,520
instead of having a whole bunch of different classes,
7854
06:34:06,520 --> 06:34:09,640
we just have two classes, a and b, each of which
7855
06:34:09,640 --> 06:34:12,360
has an exam on either Monday or Tuesday or Wednesday.
7856
06:34:12,360 --> 06:34:14,640
So this is the domain for the variable a,
7857
06:34:14,640 --> 06:34:17,160
and this is the domain for the variable b.
7858
06:34:17,160 --> 06:34:21,120
And now let's imagine we have these constraints, a not equal to Monday,
7859
06:34:21,120 --> 06:34:24,920
b not equal to Tuesday, b not equal to Monday, a not equal to b.
7860
06:34:24,920 --> 06:34:28,600
So those are the constraints that we have on this particular problem.
7861
06:34:28,600 --> 06:34:32,560
And what we can now try to do is enforce node consistency.
7862
06:34:32,560 --> 06:34:35,480
And node consistency just means we make sure
7863
06:34:35,480 --> 06:34:41,280
that all of the values for any variable's domain satisfy its unary constraints.
7864
06:34:41,280 --> 06:34:45,760
And so we could start by trying to make node a node consistent.
7865
06:34:45,760 --> 06:34:46,560
Is it consistent?
7866
06:34:46,560 --> 06:34:51,120
Does every value inside of a's domain satisfy its unary constraints?
7867
06:34:51,120 --> 06:34:55,800
Well, initially, we'll see that Monday does not satisfy a's unary constraints,
7868
06:34:55,800 --> 06:34:58,520
because we have a constraint, a unary constraint here,
7869
06:34:58,520 --> 06:35:00,640
that a is not equal to Monday.
7870
06:35:00,640 --> 06:35:03,240
But Monday is still in a's domain.
7871
06:35:03,240 --> 06:35:06,120
And so this is something that is not node consistent,
7872
06:35:06,120 --> 06:35:07,640
because we have Monday in the domain.
7873
06:35:07,640 --> 06:35:11,160
But this is not a valid value for this particular node.
7874
06:35:11,160 --> 06:35:13,400
And so how do we make this node consistent?
7875
06:35:13,400 --> 06:35:15,520
Well, to make the node consistent, what we'll do
7876
06:35:15,520 --> 06:35:18,840
is we'll just go ahead and remove Monday from a's domain.
7877
06:35:18,840 --> 06:35:21,240
Now a can only be on Tuesday or Wednesday,
7878
06:35:21,240 --> 06:35:25,400
because we had this constraint that said a is not equal to Monday.
7879
06:35:25,400 --> 06:35:28,520
And at this point now, a is node consistent.
7880
06:35:28,520 --> 06:35:31,680
For each of the values that a can take on, Tuesday and Wednesday,
7881
06:35:31,680 --> 06:35:36,640
there is no unary constraint that conflicts with that value.
7882
06:35:36,640 --> 06:35:39,120
There is no constraint that says that a can't be Tuesday.
7883
06:35:39,120 --> 06:35:43,000
There is no unary constraint that says that a cannot be on Wednesday.
7884
06:35:43,000 --> 06:35:44,800
And so now we can turn our attention to b.
7885
06:35:44,800 --> 06:35:47,520
b also has a domain, Monday, Tuesday, and Wednesday.
7886
06:35:47,520 --> 06:35:51,440
And we can begin to see whether those values satisfy
7887
06:35:51,440 --> 06:35:53,120
the unary constraints as well.
7888
06:35:53,120 --> 06:35:56,600
Well, here is a unary constraint, b is not equal to Tuesday.
7889
06:35:56,600 --> 06:35:59,800
And that does not appear to be satisfied by this domain of Monday, Tuesday,
7890
06:35:59,800 --> 06:36:03,160
and Wednesday, because Tuesday, this possible value
7891
06:36:03,160 --> 06:36:07,680
that the variable b could take on is not consistent with this unary constraint,
7892
06:36:07,680 --> 06:36:09,520
that b is not equal to Tuesday.
7893
06:36:09,520 --> 06:36:13,560
So to solve that problem, we'll go ahead and remove Tuesday from b's domain.
7894
06:36:13,560 --> 06:36:16,320
Now b's domain only contains Monday and Wednesday.
7895
06:36:16,320 --> 06:36:18,920
But as it turns out, there's yet another unary constraint
7896
06:36:18,920 --> 06:36:21,600
that we placed on the variable b, which is here.
7897
06:36:21,600 --> 06:36:23,840
b is not equal to Monday.
7898
06:36:23,840 --> 06:36:27,280
And that means that this value, Monday, inside of b's domain,
7899
06:36:27,280 --> 06:36:30,040
is not consistent with b's unary constraints,
7900
06:36:30,040 --> 06:36:33,120
because we have a constraint that says that b cannot be Monday.
7901
06:36:33,120 --> 06:36:35,400
And so we can remove Monday from b's domain.
7902
06:36:35,400 --> 06:36:38,600
And now we've made it through all of the unary constraints.
7903
06:36:38,600 --> 06:36:41,920
We've not yet considered this constraint, which is a binary constraint.
7904
06:36:41,920 --> 06:36:44,080
But we've considered all of the unary constraints,
7905
06:36:44,080 --> 06:36:47,360
all of the constraints that involve just a single variable.
7906
06:36:47,360 --> 06:36:51,960
And we've made sure that every node is consistent with those unary constraints.
7907
06:36:51,960 --> 06:36:55,640
So we can say that now we have enforced node consistency,
7908
06:36:55,640 --> 06:36:59,280
that for each of these possible nodes, we can pick any of these values
7909
06:36:59,280 --> 06:37:00,160
in the domain.
7910
06:37:00,160 --> 06:37:05,560
And there won't be a unary constraint that is violated as a result of it.
7911
06:37:05,560 --> 06:37:07,760
So node consistency is fairly easy to enforce.
7912
06:37:07,760 --> 06:37:10,540
We just take each node, make sure the values in the domain
7913
06:37:10,540 --> 06:37:12,400
satisfy the unary constraints.
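Enforcing node consistency on the two-class example above can be sketched like this; representing unary constraints as predicate functions is my own choice, not something shown in the lecture:

```python
# Enforce node consistency: drop from each variable's domain any value
# that violates one of that variable's unary constraints.
domains = {
    "A": {"Monday", "Tuesday", "Wednesday"},
    "B": {"Monday", "Tuesday", "Wednesday"},
}

# Unary constraints from the example: A != Monday, B != Tuesday, B != Monday.
unary = {
    "A": [lambda day: day != "Monday"],
    "B": [lambda day: day != "Tuesday", lambda day: day != "Monday"],
}

for var in domains:
    domains[var] = {
        value for value in domains[var]
        if all(constraint(value) for constraint in unary[var])
    }

print(domains)
```

After this loop, A's domain is Tuesday and Wednesday, and B's domain is just Wednesday, matching the walkthrough above.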
7914
06:37:12,400 --> 06:37:14,560
Where things get a little bit more interesting
7915
06:37:14,560 --> 06:37:17,480
is when we consider different types of consistency,
7916
06:37:17,480 --> 06:37:20,760
something like arc consistency, for example.
7917
06:37:20,760 --> 06:37:25,320
And arc consistency refers to when all of the values in a variable's domain
7918
06:37:25,320 --> 06:37:28,400
satisfy the variable's binary constraints.
7919
06:37:28,400 --> 06:37:31,640
So when we're looking at trying to make a arc consistent,
7920
06:37:31,640 --> 06:37:35,080
we're no longer just considering the unary constraints that involve a.
7921
06:37:35,080 --> 06:37:38,280
We're trying to consider all of the binary constraints
7922
06:37:38,280 --> 06:37:39,880
that involve a as well.
7923
06:37:39,880 --> 06:37:43,280
So any edge that connects a to another variable
7924
06:37:43,280 --> 06:37:47,560
inside of that constraint graph that we were taking a look at before.
7925
06:37:47,560 --> 06:37:50,360
Put a little bit more formally, arc consistency.
7926
06:37:50,360 --> 06:37:52,600
And arc really is just another word for an edge
7927
06:37:52,600 --> 06:37:55,840
that connects two of these nodes inside of our constraint graph.
7928
06:37:55,840 --> 06:37:59,040
We can define arc consistency a little more precisely like this.
7929
06:37:59,040 --> 06:38:03,520
In order to make some variable x arc consistent with respect
7930
06:38:03,520 --> 06:38:09,840
to some other variable y, we need to remove any element from x's domain
7931
06:38:09,840 --> 06:38:14,040
to make sure that every choice for x, every choice in x's domain,
7932
06:38:14,040 --> 06:38:17,440
has a possible choice for y.
7933
06:38:17,440 --> 06:38:19,200
So put another way, if I have a variable x
7934
06:38:19,200 --> 06:38:21,920
and I want to make x arc consistent, then
7935
06:38:21,920 --> 06:38:25,560
I'm going to look at all of the possible values that x can take on
7936
06:38:25,560 --> 06:38:28,200
and make sure that for all of those possible values,
7937
06:38:28,200 --> 06:38:31,440
there is still some choice that I can make for y,
7938
06:38:31,440 --> 06:38:34,800
if there's some arc between x and y, to make sure
7939
06:38:34,800 --> 06:38:39,320
that y has a possible option that I can choose as well.
7940
06:38:39,320 --> 06:38:42,800
So let's look at an example of that going back to this example from before.
7941
06:38:42,800 --> 06:38:45,720
We enforced node consistency already by saying
7942
06:38:45,720 --> 06:38:47,640
that a can only be on Tuesday or Wednesday
7943
06:38:47,640 --> 06:38:49,640
because we knew that a could not be on Monday.
7944
06:38:49,640 --> 06:38:51,960
And we also said that b's domain only
7945
06:38:51,960 --> 06:38:55,480
consists of Wednesday because we know that b does not equal Tuesday
7946
06:38:55,480 --> 06:38:58,400
and also b does not equal Monday.
7947
06:38:58,400 --> 06:39:01,440
So now let's begin to consider arc consistency.
7948
06:39:01,440 --> 06:39:05,000
Let's try and make a arc consistent with b.
7949
06:39:05,000 --> 06:39:08,560
And what that means is to make a arc consistent with respect to b
7950
06:39:08,560 --> 06:39:11,880
means that for any choice we make in a's domain,
7951
06:39:11,880 --> 06:39:16,520
there is some choice we can make in b's domain that is going to be consistent.
7952
06:39:16,520 --> 06:39:17,400
And we can try that.
7953
06:39:17,400 --> 06:39:20,680
For a, we can choose Tuesday as a possible value for a.
7954
06:39:20,680 --> 06:39:23,440
If I choose Tuesday for a, is there a value
7955
06:39:23,440 --> 06:39:26,360
for b that satisfies the binary constraint?
7956
06:39:26,360 --> 06:39:29,360
Well, yes, b equals Wednesday would satisfy this constraint
7957
06:39:29,360 --> 06:39:33,600
that a does not equal b because Tuesday does not equal Wednesday.
7958
06:39:33,600 --> 06:39:37,880
However, if we chose Wednesday for a, well, then
7959
06:39:37,880 --> 06:39:42,640
there is no choice in b's domain that satisfies this binary constraint.
7960
06:39:42,640 --> 06:39:47,320
There is no way I can choose something for b that satisfies a does not equal b
7961
06:39:47,320 --> 06:39:49,800
because I know b must be Wednesday.
7962
06:39:49,800 --> 06:39:52,080
And so if ever I run into a situation like this
7963
06:39:52,080 --> 06:39:55,480
where I see that here is a possible value for a such
7964
06:39:55,480 --> 06:39:59,600
that there is no choice of value for b that satisfies the binary constraint,
7965
06:39:59,600 --> 06:40:02,240
well, then this is not arc consistent.
7966
06:40:02,240 --> 06:40:05,560
And to make it arc consistent, I would need to take Wednesday
7967
06:40:05,560 --> 06:40:07,640
and remove it from a's domain.
7968
06:40:07,640 --> 06:40:11,240
Because Wednesday was not going to be a possible choice I can make for a
7969
06:40:11,240 --> 06:40:14,600
because it wasn't consistent with this binary constraint for b.
7970
06:40:14,600 --> 06:40:17,360
There was no way I could choose Wednesday for a
7971
06:40:17,360 --> 06:40:22,680
and still have an available solution by choosing something for b as well.
7972
06:40:22,680 --> 06:40:25,920
So here now, I've been able to enforce arc consistency.
7973
06:40:25,920 --> 06:40:28,320
And in doing so, I've actually solved this entire problem,
7974
06:40:28,320 --> 06:40:32,520
that given these constraints where a and b can have exams on either Monday
7975
06:40:32,520 --> 06:40:35,880
or Tuesday or Wednesday, the only solution, as it would appear,
7976
06:40:35,880 --> 06:40:40,400
is that a's exam must be on Tuesday and b's exam must be on Wednesday.
7977
06:40:40,400 --> 06:40:43,800
And that is the only option available to me.
7978
06:40:43,800 --> 06:40:46,720
So if we want to apply arc consistency to a larger graph,
7979
06:40:46,720 --> 06:40:49,600
not just looking at one particular pair of variables,
7980
06:40:49,600 --> 06:40:51,040
there are ways we can do that too.
7981
06:40:51,040 --> 06:40:53,880
And we can begin to formalize what the pseudocode would look like
7982
06:40:53,880 --> 06:40:57,400
for trying to write an algorithm that enforces arc consistency.
7983
06:40:57,400 --> 06:41:01,000
And we'll start by defining a function called revise.
7984
06:41:01,000 --> 06:41:03,800
Revise is going to take as input a CSP, otherwise
7985
06:41:03,800 --> 06:41:06,320
known as a constraint satisfaction problem,
7986
06:41:06,320 --> 06:41:08,960
and also two variables, x and y.
7987
06:41:08,960 --> 06:41:11,160
And what revise is going to do is it is going
7988
06:41:11,160 --> 06:41:15,240
to make x arc consistent with respect to y,
7989
06:41:15,240 --> 06:41:18,120
meaning remove anything from x's domain that
7990
06:41:18,120 --> 06:41:21,720
doesn't allow for a possible option for y.
7991
06:41:21,720 --> 06:41:22,800
How does this work?
7992
06:41:22,800 --> 06:41:25,120
Well, we'll go ahead and first keep track of whether or not
7993
06:41:25,120 --> 06:41:26,040
we've made a revision.
7994
06:41:26,040 --> 06:41:29,240
Revise is ultimately going to return true or false.
7995
06:41:29,240 --> 06:41:33,560
It'll return true in the event that we did make a revision to x's domain.
7996
06:41:33,560 --> 06:41:37,000
It'll return false if we didn't make any change to x's domain.
7997
06:41:37,000 --> 06:41:39,880
And we'll see in a moment why that's going to be helpful.
7998
06:41:39,880 --> 06:41:41,720
But we start by saying revised equals false.
7999
06:41:41,720 --> 06:41:43,920
We haven't made any changes.
8000
06:41:43,920 --> 06:41:46,560
Then we'll say, all right, let's go ahead and loop over all
8001
06:41:46,560 --> 06:41:49,040
of the possible values in x's domain.
8002
06:41:49,040 --> 06:41:53,200
So loop over x's domain for each little x in x's domain.
8003
06:41:53,200 --> 06:41:55,520
I want to make sure that for each of those choices,
8004
06:41:55,520 --> 06:42:00,040
I have some available choice in y that satisfies the binary constraints that
8005
06:42:00,040 --> 06:42:03,480
are defined inside of my CSP, inside of my constraint
8006
06:42:03,480 --> 06:42:05,040
satisfaction problem.
8007
06:42:05,040 --> 06:42:11,200
So if ever it's the case that there is no value y in y's domain that
8008
06:42:11,200 --> 06:42:15,760
satisfies the constraint for x and y, well, if that's the case,
8009
06:42:15,760 --> 06:42:19,840
that means that this value x shouldn't be in x's domain.
8010
06:42:19,840 --> 06:42:22,440
So we'll go ahead and delete x from x's domain.
8011
06:42:22,440 --> 06:42:26,000
And I'll set revised equal to true because I did change x's domain.
8012
06:42:26,000 --> 06:42:29,320
I changed x's domain by removing little x.
8013
06:42:29,320 --> 06:42:33,040
And I removed little x because it wasn't arc consistent.
8014
06:42:33,040 --> 06:42:35,720
There was no way I could choose a value for y
8015
06:42:35,720 --> 06:42:38,960
that would satisfy this xy constraint.
8016
06:42:38,960 --> 06:42:41,680
So in this case, we'll go ahead and set revised equal true.
8017
06:42:41,680 --> 06:42:44,800
And we'll do this again and again for every value in x's domain.
8018
06:42:44,800 --> 06:42:46,400
Sometimes it might be fine.
8019
06:42:46,400 --> 06:42:49,880
In other cases, it might not allow for a possible choice for y,
8020
06:42:49,880 --> 06:42:53,240
in which case we need to remove this value from x's domain.
8021
06:42:53,240 --> 06:42:56,920
And at the end, we just return revised to indicate whether or not
8022
06:42:56,920 --> 06:42:59,000
we actually made a change.
8023
06:42:59,000 --> 06:43:01,000
So this function, then, this revise function
8024
06:43:01,000 --> 06:43:04,760
is effectively an implementation of what you saw me do graphically a moment ago.
8025
06:43:04,760 --> 06:43:09,200
And it makes one variable, x, arc consistent with another variable,
8026
06:43:09,200 --> 06:43:10,960
in this case, y.
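A minimal Python sketch of that revise function might look like this, assuming a hypothetical CSP object exposing `domains` (variable to set of values) and `constraint(x, y)` returning a predicate over a pair of values; these names are illustrative, not the course's actual interface:

```python
def revise(csp, x, y):
    """Make x arc consistent with y; return True if x's domain changed."""
    revised = False
    constraint = csp.constraint(x, y)
    # Loop over a copy, since we may remove values from x's domain mid-loop.
    for x_value in set(csp.domains[x]):
        # x_value survives only if some y_value satisfies the binary constraint.
        if not any(constraint(x_value, y_value) for y_value in csp.domains[y]):
            csp.domains[x].remove(x_value)
            revised = True
    return revised
```

Run on the example above (A in {Tuesday, Wednesday}, B in {Wednesday}, A ≠ B), it would remove Wednesday from A's domain and return True.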
8027
06:43:10,960 --> 06:43:14,160
But generally speaking, when we want to enforce arc consistency,
8028
06:43:14,160 --> 06:43:17,760
we'll often want to enforce arc consistency not just for a single arc,
8029
06:43:17,760 --> 06:43:20,360
but for the entire constraint satisfaction problem.
8030
06:43:20,360 --> 06:43:22,880
And it turns out there's an algorithm to do that as well.
8031
06:43:22,880 --> 06:43:25,200
And that algorithm is known as AC3.
8032
06:43:25,200 --> 06:43:27,920
AC3 takes a constraint satisfaction problem.
8033
06:43:27,920 --> 06:43:32,160
And it enforces arc consistency across the entire problem.
8034
06:43:32,160 --> 06:43:33,120
How does it do that?
8035
06:43:33,120 --> 06:43:36,600
Well, it's going to basically maintain a queue or basically just a line
8036
06:43:36,600 --> 06:43:39,800
of all of the arcs that it needs to make consistent.
8037
06:43:39,800 --> 06:43:42,360
And over time, we might remove things from that queue
8038
06:43:42,360 --> 06:43:44,560
as we begin dealing with our consistency.
8039
06:43:44,560 --> 06:43:47,000
And we might need to add things to that queue as well
8040
06:43:47,000 --> 06:43:50,680
if there are more things we need to make arc consistent.
8041
06:43:50,680 --> 06:43:52,560
So we'll go ahead and start with a queue that
8042
06:43:52,560 --> 06:43:56,480
contains all of the arcs in the constraint satisfaction problem,
8043
06:43:56,480 --> 06:43:58,840
all of the edges that connect two nodes that
8044
06:43:58,840 --> 06:44:02,200
have some sort of binary constraint between them.
8045
06:44:02,200 --> 06:44:06,320
And now, as long as the queue is non-empty, there is work to be done.
8046
06:44:06,320 --> 06:44:10,040
The queue is all of the things that we need to make arc consistent.
8047
06:44:10,040 --> 06:44:13,600
So as long as the queue is non-empty, there's still things we have to do.
8048
06:44:13,600 --> 06:44:15,200
What do we have to do?
8049
06:44:15,200 --> 06:44:17,960
Well, we'll start by de-queuing from the queue,
8050
06:44:17,960 --> 06:44:19,640
remove something from the queue.
8051
06:44:19,640 --> 06:44:21,400
And strictly speaking, it doesn't need to be a queue,
8052
06:44:21,400 --> 06:44:23,440
but a queue is a traditional way of doing this.
8053
06:44:23,440 --> 06:44:27,480
We'll de-queue from the queue, and that'll give us an arc, x and y,
8054
06:44:27,480 --> 06:44:32,920
these two variables where I would like to make x arc consistent with y.
8055
06:44:32,920 --> 06:44:35,840
So how do we make x arc consistent with y?
8056
06:44:35,840 --> 06:44:38,200
Well, we can go ahead and just use that revise function
8057
06:44:38,200 --> 06:44:39,640
that we talked about a moment ago.
8058
06:44:39,640 --> 06:44:43,560
We called the revise function, passing as input the constraint satisfaction
8059
06:44:43,560 --> 06:44:46,320
problem, and also these variables x and y,
8060
06:44:46,320 --> 06:44:49,240
because I want to make x arc consistent with y.
8061
06:44:49,240 --> 06:44:52,440
In other words, remove any values from x's domain
8062
06:44:52,440 --> 06:44:55,880
that don't leave an available option for y.
8063
06:44:55,880 --> 06:44:57,920
And recall, what does revise return?
8064
06:44:57,920 --> 06:45:00,840
Well, it returns true if we actually made a change,
8065
06:45:00,840 --> 06:45:04,000
if we removed something from x's domain, because there
8066
06:45:04,000 --> 06:45:06,680
wasn't an available option for y, for example.
8067
06:45:06,680 --> 06:45:10,760
And it returns false if we didn't make any change to x's domain at all.
8068
06:45:10,760 --> 06:45:14,360
And it turns out if revise returns false, if we didn't make any changes,
8069
06:45:14,360 --> 06:45:15,800
well, then there's not a whole lot more work
8070
06:45:15,800 --> 06:45:17,080
to be done here for this arc.
8071
06:45:17,080 --> 06:45:20,600
We can just move ahead to the next arc that's in the queue.
8072
06:45:20,600 --> 06:45:24,120
But if we did make a change, if we did reduce x's domain
8073
06:45:24,120 --> 06:45:28,680
by removing values from x's domain, well, then what we might realize
8074
06:45:28,680 --> 06:45:31,160
is that this creates potential problems later on,
8075
06:45:31,160 --> 06:45:35,800
that it might mean that some arc that was arc consistent with x,
8076
06:45:35,800 --> 06:45:38,680
that node might no longer be arc consistent with x,
8077
06:45:38,680 --> 06:45:41,640
because while there used to be an option that we could choose for x,
8078
06:45:41,640 --> 06:45:44,560
now there might not be, because now we might have removed something
8079
06:45:44,560 --> 06:45:49,240
from x that was necessary for some other arc to be arc consistent.
8080
06:45:49,240 --> 06:45:52,040
And so if ever we did revise x's domain,
8081
06:45:52,040 --> 06:45:55,800
we're going to need to add some things to the queue, some additional arcs
8082
06:45:55,800 --> 06:45:57,320
that we might want to check.
8083
06:45:57,320 --> 06:45:58,640
How do we do that?
8084
06:45:58,640 --> 06:46:02,960
Well, the first thing we want to check is to make sure that x's domain is not empty.
8085
06:46:02,960 --> 06:46:07,160
If x's domain is empty, that means there are no available options for x at all.
8086
06:46:07,160 --> 06:46:10,360
And that means that there's no way you can solve the constraint satisfaction
8087
06:46:10,360 --> 06:46:10,860
problem.
8088
06:46:10,860 --> 06:46:13,240
If we've removed everything from x's domain,
8089
06:46:13,240 --> 06:46:15,600
we'll go ahead and just return false here to indicate there's
8090
06:46:15,600 --> 06:46:19,640
no way to solve the problem, because there's nothing left in x's domain.
8091
06:46:19,640 --> 06:46:23,640
But otherwise, if there are things left in x's domain,
8092
06:46:23,640 --> 06:46:26,920
but fewer things than before, well, then what we'll do
8093
06:46:26,920 --> 06:46:31,800
is we'll loop over each variable z that is in all of x's neighbors,
8094
06:46:31,800 --> 06:46:33,680
except for y, y we already handled.
8095
06:46:33,680 --> 06:46:37,120
But we'll consider all of x's other neighbors and ask ourselves,
8096
06:46:37,120 --> 06:46:41,040
all right, will that arc from each of those z's to x,
8097
06:46:41,040 --> 06:46:43,400
that arc might no longer be arc consistent,
8098
06:46:43,400 --> 06:46:46,840
because while for each z, there might have been a possible option
8099
06:46:46,840 --> 06:46:50,400
we could choose for x to correspond with each of z's possible values,
8100
06:46:50,400 --> 06:46:54,680
now there might not be, because we removed some elements from x's domain.
8101
06:46:54,680 --> 06:46:57,400
And so what we'll do here is we'll go ahead and enqueue,
8102
06:46:57,400 --> 06:47:02,800
adding something to the queue, this arc zx for all of those neighbors z.
8103
06:47:02,800 --> 06:47:05,320
So we need to add back some arcs to the queue
8104
06:47:05,320 --> 06:47:08,960
in order to continue to enforce arc consistency.
8105
06:47:08,960 --> 06:47:11,400
At the very end, if we make it through all this process,
8106
06:47:11,400 --> 06:47:13,760
then we can return true.
8107
06:47:13,760 --> 06:47:18,360
But this now is AC3, this algorithm for enforcing arc consistency
8108
06:47:18,360 --> 06:47:20,200
on a constraint satisfaction problem.
8109
06:47:20,200 --> 06:47:23,360
And the big idea is really just keep track of all of the arcs
8110
06:47:23,360 --> 06:47:25,600
that we might need to make arc consistent,
8111
06:47:25,600 --> 06:47:28,560
make it arc consistent by calling the revise function.
8112
06:47:28,560 --> 06:47:31,400
And if we did revise it, then there are some new arcs
8113
06:47:31,400 --> 06:47:33,600
that might need to be added to the queue in order
8114
06:47:33,600 --> 06:47:36,840
to make sure that everything is still arc consistent, even
8115
06:47:36,840 --> 06:47:40,680
after we've removed some of the elements from a particular variable's
8116
06:47:40,680 --> 06:47:42,000
domain.
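A Python sketch of AC3 under those same assumptions might look as follows; the revise helper is repeated here so the sketch is self-contained, and the `arcs()` and `neighbors()` methods on the hypothetical CSP object are assumptions for illustration:

```python
from collections import deque

def revise(csp, x, y):
    """Make x arc consistent with y; return True if x's domain changed."""
    revised = False
    constraint = csp.constraint(x, y)
    for x_value in set(csp.domains[x]):
        if not any(constraint(x_value, y_value) for y_value in csp.domains[y]):
            csp.domains[x].remove(x_value)
            revised = True
    return revised

def ac3(csp):
    """Enforce arc consistency across the whole CSP; False if unsolvable."""
    queue = deque(csp.arcs())                 # start with every arc in the problem
    while queue:
        x, y = queue.popleft()                # dequeue one arc to process
        if revise(csp, x, y):                 # x's domain shrank
            if not csp.domains[x]:
                return False                  # empty domain: no way to solve
            for z in csp.neighbors(x):
                if z != y:
                    queue.append((z, x))      # re-check arcs pointing at x
    return True
```

On the two-variable example, this reduces A's domain to just Tuesday while leaving B's as Wednesday.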
8117
06:47:42,000 --> 06:47:46,080
So what then would happen if we tried to enforce arc consistency
8118
06:47:46,080 --> 06:47:48,680
on a graph like this, on a graph where each of these variables
8119
06:47:48,680 --> 06:47:51,680
has a domain of Monday, Tuesday, and Wednesday?
8120
06:47:51,680 --> 06:47:55,400
Well, it turns out that by enforcing arc consistency on this graph,
8121
06:47:55,400 --> 06:47:57,520
while it can solve some types of problems,
8122
06:47:57,520 --> 06:47:59,680
nothing actually changes here.
8123
06:47:59,680 --> 06:48:03,200
For any particular arc, just considering two variables,
8124
06:48:03,200 --> 06:48:05,960
there's always a way for me to just, for any of the choices
8125
06:48:05,960 --> 06:48:08,840
I make for one of them, make a choice for the other one,
8126
06:48:08,840 --> 06:48:11,440
because there are three options, and I just need the two
8127
06:48:11,440 --> 06:48:12,720
to be different from each other.
8128
06:48:12,720 --> 06:48:15,040
So this is actually quite easy to just take an arc
8129
06:48:15,040 --> 06:48:17,160
and just declare that it is arc consistent,
8130
06:48:17,160 --> 06:48:19,920
because if I pick Monday for D, then I just
8131
06:48:19,920 --> 06:48:23,640
pick something that isn't Monday for B. In arc consistency,
8132
06:48:23,640 --> 06:48:28,680
we only consider consistency with respect to a binary constraint between two nodes,
8133
06:48:28,680 --> 06:48:32,600
and we're not really considering all of the rest of the nodes yet.
8134
06:48:32,600 --> 06:48:36,600
So just using AC3, the enforcement of arc consistency,
8135
06:48:36,600 --> 06:48:39,240
that can sometimes have the effect of reducing domains
8136
06:48:39,240 --> 06:48:42,880
to make it easier to find solutions, but it will not always actually
8137
06:48:42,880 --> 06:48:44,200
solve the problem.
8138
06:48:44,200 --> 06:48:48,360
We might still need to somehow search to try and find a solution.
8139
06:48:48,360 --> 06:48:52,000
And we can use classical traditional search algorithms to try to do so.
8140
06:48:52,000 --> 06:48:55,280
You'll recall that a search problem generally consists of these parts.
8141
06:48:55,280 --> 06:48:59,000
We have some initial state, some actions, a transition model
8142
06:48:59,000 --> 06:49:01,280
that takes me from one state to another state,
8143
06:49:01,280 --> 06:49:05,640
a goal test to tell me have I satisfied my objective correctly,
8144
06:49:05,640 --> 06:49:09,200
and then some path cost function, because in the case of like maze solving,
8145
06:49:09,200 --> 06:49:12,240
I was trying to get to my goal as quickly as possible.
8146
06:49:12,240 --> 06:49:16,840
So you could formulate a CSP, or a constraint satisfaction problem,
8147
06:49:16,840 --> 06:49:18,800
as one of these types of search problems.
8148
06:49:18,800 --> 06:49:22,240
The initial state will just be an empty assignment,
8149
06:49:22,240 --> 06:49:26,120
where an assignment is just a way for me to assign any particular variable
8150
06:49:26,120 --> 06:49:27,760
to any particular value.
8151
06:49:27,760 --> 06:49:30,960
So an empty assignment means no variables are assigned to any values
8152
06:49:30,960 --> 06:49:37,240
yet. Then the action I can take is adding some new variable equals value
8153
06:49:37,240 --> 06:49:40,000
pair to that assignment, saying for this assignment,
8154
06:49:40,000 --> 06:49:43,040
let me add a new value for this variable.
8155
06:49:43,040 --> 06:49:46,360
And the transition model just defines what happens when you take that action.
8156
06:49:46,360 --> 06:49:50,200
You get a new assignment that has that variable equal to that value inside
8157
06:49:50,200 --> 06:49:51,080
of it.
8158
06:49:51,080 --> 06:49:54,840
The goal test is just checking to make sure all the variables have been assigned
8159
06:49:54,840 --> 06:49:57,720
and making sure all the constraints have been satisfied.
8160
06:49:57,720 --> 06:50:00,680
And the path cost function is sort of irrelevant.
8161
06:50:00,680 --> 06:50:02,840
I don't really care about what the path really is.
8162
06:50:02,840 --> 06:50:06,280
I just care about finding some assignment that actually satisfies
8163
06:50:06,280 --> 06:50:07,640
all of the constraints.
8164
06:50:07,640 --> 06:50:09,640
So really, all the paths have the same cost.
8165
06:50:09,640 --> 06:50:12,240
I don't really care about the path to the goal.
8166
06:50:12,240 --> 06:50:17,280
I just care about the solution itself, much as we've talked about now before.
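Framed in Python, that formulation might be sketched like this; the `variables` list and the `constraints_satisfied` predicate are hypothetical stand-ins, not code from the course:

```python
def result(assignment, variable, value):
    """Transition model: a new assignment with one more variable = value pair."""
    new_assignment = dict(assignment)  # leave the previous assignment untouched
    new_assignment[variable] = value
    return new_assignment

def goal_test(assignment, variables, constraints_satisfied):
    """Complete (every variable assigned) and correct (no constraint violated)."""
    return set(assignment) == set(variables) and constraints_satisfied(assignment)
```

The initial state is just the empty assignment `{}`, and since every complete assignment is reached by the same number of actions, no meaningful path cost function is needed.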
8167
06:50:17,280 --> 06:50:20,440
The problem here, though, is that if we just implement this naive search
8168
06:50:20,440 --> 06:50:23,280
algorithm just by implementing like breadth-first search or depth-first
8169
06:50:23,280 --> 06:50:25,920
search, this is going to be very, very inefficient.
8170
06:50:25,920 --> 06:50:28,600
And there are ways we can take advantage of efficiencies
8171
06:50:28,600 --> 06:50:31,960
in the structure of a constraint satisfaction problem itself.
8172
06:50:31,960 --> 06:50:37,200
And one of the key ideas is that we can really just order these variables.
8173
06:50:37,200 --> 06:50:39,840
And it doesn't matter what order we assign variables in.
8174
06:50:39,840 --> 06:50:43,480
The assignment a equals 2 and then b equals 8
8175
06:50:43,480 --> 06:50:47,480
is identical to the assignment of b equals 8 and then a equals 2.
8176
06:50:47,480 --> 06:50:50,240
Switching the order doesn't really change anything
8177
06:50:50,240 --> 06:50:53,360
about the fundamental nature of that assignment.
8178
06:50:53,360 --> 06:50:56,240
And so there are some ways that we can try and revise
8179
06:50:56,240 --> 06:50:59,400
this idea of a search algorithm to apply it specifically
8180
06:50:59,400 --> 06:51:02,000
for a problem like a constraint satisfaction problem.
8181
06:51:02,000 --> 06:51:04,160
And it turns out the search algorithm we'll generally
8182
06:51:04,160 --> 06:51:06,880
use when talking about constraint satisfaction problems
8183
06:51:06,880 --> 06:51:09,400
is something known as backtracking search.
8184
06:51:09,400 --> 06:51:11,760
And the big idea of backtracking search is we'll
8185
06:51:11,760 --> 06:51:14,920
go ahead and make assignments from variables to values.
8186
06:51:14,920 --> 06:51:17,640
And if ever we get stuck, we arrive at a place
8187
06:51:17,640 --> 06:51:20,800
where there is no way we can make any forward progress while still
8188
06:51:20,800 --> 06:51:23,640
preserving the constraints that we need to enforce,
8189
06:51:23,640 --> 06:51:27,720
we'll go ahead and backtrack and try something else instead.
8190
06:51:27,720 --> 06:51:30,760
So the very basic sketch of what backtracking search looks like
8191
06:51:30,760 --> 06:51:32,000
is this.
8192
06:51:32,000 --> 06:51:35,800
Function called backtrack that takes as input an assignment
8193
06:51:35,800 --> 06:51:37,840
and a constraint satisfaction problem.
8194
06:51:37,840 --> 06:51:40,400
So initially, we don't have any assigned variables.
8195
06:51:40,400 --> 06:51:42,880
So when we begin backtracking search, this assignment
8196
06:51:42,880 --> 06:51:46,120
is just going to be the empty assignment with no variables inside of it.
8197
06:51:46,120 --> 06:51:49,400
But we'll see later this is going to be a recursive function.
8198
06:51:49,400 --> 06:51:53,320
So backtrack takes as input the assignment and the problem.
8199
06:51:53,320 --> 06:51:57,760
If the assignment is complete, meaning all of the variables have been assigned,
8200
06:51:57,760 --> 06:51:59,280
we just return that assignment.
8201
06:51:59,280 --> 06:52:00,960
That, of course, won't be true initially,
8202
06:52:00,960 --> 06:52:02,800
because we start with an empty assignment.
8203
06:52:02,800 --> 06:52:05,080
But over time, we might add things to that assignment.
8204
06:52:05,080 --> 06:52:08,040
So if ever the assignment actually is complete, then we're done.
8205
06:52:08,040 --> 06:52:10,880
Then just go ahead and return that assignment.
8206
06:52:10,880 --> 06:52:13,400
But otherwise, there is some work to be done.
8207
06:52:13,400 --> 06:52:17,480
So what we'll need to do is select an unassigned variable
8208
06:52:17,480 --> 06:52:18,760
for this particular problem.
8209
06:52:18,760 --> 06:52:21,520
So we need to take the problem, look at the variables that have already
8210
06:52:21,520 --> 06:52:26,400
been assigned, and pick a variable that has not yet been assigned.
8211
06:52:26,400 --> 06:52:28,280
And I'll go ahead and take that variable.
8212
06:52:28,280 --> 06:52:32,440
And then I need to consider all of the values in that variable's domain.
8213
06:52:32,440 --> 06:52:34,720
So we'll go ahead and call this domain values function.
8214
06:52:34,720 --> 06:52:37,600
We'll talk a little more about that later, that takes a variable
8215
06:52:37,600 --> 06:52:42,000
and just gives me back an ordered list of all of the values in its domain.
8216
06:52:42,000 --> 06:52:44,480
So I've taken a random unselected variable.
8217
06:52:44,480 --> 06:52:47,200
I'm going to loop over all of the possible values.
8218
06:52:47,200 --> 06:52:50,400
And the idea is, let me just try all of these values
8219
06:52:50,400 --> 06:52:53,120
as possible values for the variable.
8220
06:52:53,120 --> 06:52:56,880
So if the value is consistent with the assignment so far,
8221
06:52:56,880 --> 06:52:59,360
it doesn't violate any of the constraints,
8222
06:52:59,360 --> 06:53:02,720
well then let's go ahead and add variable equals value to the assignment
8223
06:53:02,720 --> 06:53:04,680
because it's so far consistent.
8224
06:53:04,680 --> 06:53:08,080
And now let's recursively call backtrack to try and make
8225
06:53:08,080 --> 06:53:10,880
the rest of the assignments also consistent.
8226
06:53:10,880 --> 06:53:13,920
So I'll go ahead and call backtrack on this new assignment
8227
06:53:13,920 --> 06:53:17,400
that I've added the variable equals value to.
8228
06:53:17,400 --> 06:53:20,720
And now I recursively call backtrack and see what the result is.
8229
06:53:20,720 --> 06:53:27,000
And if the result isn't a failure, well then let me just return that result.
8230
06:53:27,000 --> 06:53:30,120
And otherwise, what else could happen?
8231
06:53:30,120 --> 06:53:32,680
Well, if it turns out the result was a failure, well then
8232
06:53:32,680 --> 06:53:35,200
that means this value was probably a bad choice
8233
06:53:35,200 --> 06:53:37,680
for this particular variable because when I assigned
8234
06:53:37,680 --> 06:53:41,120
this variable equal to that value, eventually down the road
8235
06:53:41,120 --> 06:53:43,720
I ran into a situation where I violated constraints.
8236
06:53:43,720 --> 06:53:45,160
There was nothing more I could do.
8237
06:53:45,160 --> 06:53:48,800
So now I'll remove variable equals value from the assignment,
8238
06:53:48,800 --> 06:53:52,080
effectively backtracking to say, all right, that value didn't work.
8239
06:53:52,080 --> 06:53:55,200
Let's try another value instead.
8240
06:53:55,200 --> 06:53:57,000
And then at the very end, if we were never
8241
06:53:57,000 --> 06:54:00,760
able to return a complete assignment, we'll just go ahead and return failure
8242
06:54:00,760 --> 06:54:04,000
because that means that none of the values worked for this particular
8243
06:54:04,000 --> 06:54:05,560
variable.
8244
06:54:05,560 --> 06:54:07,760
This now is the idea for backtracking search,
8245
06:54:07,760 --> 06:54:10,840
to take each of the variables, try values for them,
8246
06:54:10,840 --> 06:54:14,200
and recursively try backtracking search, see if we can make progress.
8247
06:54:14,200 --> 06:54:16,000
And if ever we run into a dead end, we run
8248
06:54:16,000 --> 06:54:19,160
into a situation where there is no possible value we can choose
8249
06:54:19,160 --> 06:54:22,280
that satisfies the constraints, we return failure.
8250
06:54:22,280 --> 06:54:24,400
And that propagates up, and eventually we
8251
06:54:24,400 --> 06:54:29,080
make a different choice by going back and trying something else instead.
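A minimal Python sketch of that backtrack function might look as follows, assuming a hypothetical `csp` object with `variables`, `domains`, and a `consistent(assignment)` check; these names are illustrative, not the course's actual starter code:

```python
def backtrack(assignment, csp):
    """Return a complete, consistent assignment, or None to signal failure."""
    # If every variable has a value, the assignment is complete: return it.
    if len(assignment) == len(csp.variables):
        return assignment
    # Select some variable that has not yet been assigned.
    var = next(v for v in csp.variables if v not in assignment)
    # Try each value in that variable's domain in turn.
    for value in csp.domains[var]:
        new_assignment = {**assignment, var: value}
        if csp.consistent(new_assignment):    # no constraints violated so far
            result = backtrack(new_assignment, csp)
            if result is not None:
                return result
            # Otherwise that value led to a dead end down the road; fall
            # through and effectively backtrack by trying the next value.
    return None  # no value worked for this variable
```

Calling `backtrack({}, csp)` with the empty assignment kicks off the search.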
8252
06:54:29,080 --> 06:54:31,120
So let's put this algorithm into practice.
8253
06:54:31,120 --> 06:54:35,000
Let's actually try and use backtracking search to solve this problem now,
8254
06:54:35,000 --> 06:54:37,520
where I need to figure out how to assign each of these courses
8255
06:54:37,520 --> 06:54:41,080
to an exam slot on Monday or Tuesday or Wednesday in such a way
8256
06:54:41,080 --> 06:54:44,120
that it satisfies these constraints, that each of these edges
8257
06:54:44,120 --> 06:54:47,880
mean those two classes cannot have an exam on the same day.
8258
06:54:47,880 --> 06:54:50,080
So I can start by just starting at a node.
8259
06:54:50,080 --> 06:54:51,800
It doesn't really matter which I start with,
8260
06:54:51,800 --> 06:54:54,120
but in this case, I'll just start with A.
8261
06:54:54,120 --> 06:54:57,840
And I'll ask the question, all right, let me loop over the values in the domain.
8262
06:54:57,840 --> 06:55:00,200
And maybe in this case, I'll just start with Monday and say, all right,
8263
06:55:00,200 --> 06:55:02,120
let's go ahead and assign A to Monday.
8264
06:55:02,120 --> 06:55:04,800
We'll just go in order: Monday, Tuesday, Wednesday.
8265
06:55:04,800 --> 06:55:08,320
And now let's consider node B. So I've made an assignment to A,
8266
06:55:08,320 --> 06:55:11,480
so I recursively call backtrack with this new part of the assignment.
8267
06:55:11,480 --> 06:55:14,320
And now I'm looking to pick another unassigned variable like B.
8268
06:55:14,320 --> 06:55:16,320
And I'll say, all right, maybe I'll start with Monday,
8269
06:55:16,320 --> 06:55:18,960
because that's the very first value in B's domain.
8270
06:55:18,960 --> 06:55:22,240
And I ask, all right, does Monday violate any constraints?
8271
06:55:22,240 --> 06:55:23,440
And it turns out, yes, it does.
8272
06:55:23,440 --> 06:55:26,240
It violates this constraint here between A and B,
8273
06:55:26,240 --> 06:55:29,200
because A and B are now both on Monday, and that doesn't work,
8274
06:55:29,200 --> 06:55:33,600
because B can't be on the same day as A. So that doesn't work.
8275
06:55:33,600 --> 06:55:37,200
So we might instead try Tuesday, try the next value in B's domain.
8276
06:55:37,200 --> 06:55:39,960
And is that consistent with the assignment so far?
8277
06:55:39,960 --> 06:55:43,160
Well, yeah, B, Tuesday, A, Monday, that is consistent so far,
8278
06:55:43,160 --> 06:55:44,800
because they're not on the same day.
8279
06:55:44,800 --> 06:55:45,400
So that's good.
8280
06:55:45,400 --> 06:55:47,440
Now we can recursively call backtrack.
8281
06:55:47,440 --> 06:55:48,280
Try again.
8282
06:55:48,280 --> 06:55:51,400
Pick another unassigned variable, something like D, and say, all right,
8283
06:55:51,400 --> 06:55:53,160
let's go through its possible values.
8284
06:55:53,160 --> 06:55:55,600
Is Monday consistent with this assignment?
8285
06:55:55,600 --> 06:55:56,520
Well, yes, it is.
8286
06:55:56,520 --> 06:55:59,480
B and D are on different days, Monday versus Tuesday.
8287
06:55:59,480 --> 06:56:02,520
And A and B are also on different days, Monday versus Tuesday.
8288
06:56:02,520 --> 06:56:04,200
So that's fine so far, too.
8289
06:56:04,200 --> 06:56:05,440
We'll go ahead and try again.
8290
06:56:05,440 --> 06:56:09,080
Maybe we'll go to this variable here, E. Say, can we make that consistent?
8291
06:56:09,080 --> 06:56:10,680
Let's go through the possible values.
8292
06:56:10,680 --> 06:56:12,560
We've recursively called backtrack.
8293
06:56:12,560 --> 06:56:15,800
We might start with Monday and say, all right, that's not consistent,
8294
06:56:15,800 --> 06:56:19,120
because D and E now have exams on the same day.
8295
06:56:19,120 --> 06:56:21,760
So we might try Tuesday instead, going to the next one.
8296
06:56:21,760 --> 06:56:23,440
Ask, is that consistent?
8297
06:56:23,440 --> 06:56:27,240
Well, no, it's not, because B and E, those have exams on the same day.
8298
06:56:27,240 --> 06:56:29,760
And so we try, all right, is Wednesday consistent?
8299
06:56:29,760 --> 06:56:31,120
And it turns out, all right, yes, it is.
8300
06:56:31,120 --> 06:56:33,080
Wednesday is consistent, because D and E now
8301
06:56:33,080 --> 06:56:34,680
have exams on different days.
8302
06:56:34,680 --> 06:56:37,240
B and E now have exams on different days.
8303
06:56:37,240 --> 06:56:38,760
All seems to be well so far.
8304
06:56:38,760 --> 06:56:43,440
I recursively call backtrack, select another unassigned variable,
8305
06:56:43,440 --> 06:56:45,960
we'll say maybe choose C this time, and say, all right,
8306
06:56:45,960 --> 06:56:48,240
let's try the values that C could take on.
8307
06:56:48,240 --> 06:56:49,600
Let's start with Monday.
8308
06:56:49,600 --> 06:56:53,320
And it turns out that's not consistent, because now A and C both
8309
06:56:53,320 --> 06:56:55,040
have exams on the same day.
8310
06:56:55,040 --> 06:56:57,560
So I try Tuesday and say, that's not consistent either,
8311
06:56:57,560 --> 06:57:00,760
because B and C now have exams on the same day.
8312
06:57:00,760 --> 06:57:04,280
And then I say, all right, let's go ahead and try Wednesday.
8313
06:57:04,280 --> 06:57:08,120
But that's not consistent either, because C and E would both have
8314
06:57:08,120 --> 06:57:09,880
exams on the same day too.
8315
06:57:09,880 --> 06:57:13,200
So now we've gone through all the possible values for C, Monday, Tuesday,
8316
06:57:13,200 --> 06:57:14,080
and Wednesday.
8317
06:57:14,080 --> 06:57:15,440
And none of them are consistent.
8318
06:57:15,440 --> 06:57:18,360
There is no way we can have a consistent assignment.
8319
06:57:18,360 --> 06:57:21,480
Backtrack, in this case, will return a failure.
8320
06:57:21,480 --> 06:57:24,920
And so then we'd say, all right, we have to backtrack back to here.
8321
06:57:24,920 --> 06:57:28,800
Well, now for E, we've tried all of Monday, Tuesday, and Wednesday.
8322
06:57:28,800 --> 06:57:31,400
And none of those work, because Wednesday, which seemed to work,
8323
06:57:31,400 --> 06:57:33,480
turned out to be a failure.
8324
06:57:33,480 --> 06:57:36,200
So that means there's no possible way we can assign E.
8325
06:57:36,200 --> 06:57:37,240
So that's a failure too.
8326
06:57:37,240 --> 06:57:41,000
We have to go back up to D, which means that Monday assignment to D,
8327
06:57:41,000 --> 06:57:41,920
that must be wrong.
8328
06:57:41,920 --> 06:57:43,320
We must try something else.
8329
06:57:43,320 --> 06:57:47,880
So we can try, all right, what if instead of Monday, we try Tuesday?
8330
06:57:47,880 --> 06:57:49,640
Tuesday, it turns out, is not consistent,
8331
06:57:49,640 --> 06:57:51,960
because B and D now have an exam on the same day.
8332
06:57:51,960 --> 06:57:55,360
But Wednesday, as it turns out, works.
8333
06:57:55,360 --> 06:57:57,560
And now we can begin to make forward progress again.
8334
06:57:57,560 --> 06:58:00,640
We go back to E and say, all right, which of these values works?
8335
06:58:00,640 --> 06:58:03,800
Monday turns out to work by not violating any constraints.
8336
06:58:03,800 --> 06:58:05,440
Then we go up to C now.
8337
06:58:05,440 --> 06:58:08,080
Monday doesn't work, because it violates a constraint.
8338
06:58:08,080 --> 06:58:09,600
Violates two, actually.
8339
06:58:09,600 --> 06:58:12,160
Tuesday doesn't work, because it violates a constraint as well.
8340
06:58:12,160 --> 06:58:13,600
But Wednesday does work.
8341
06:58:13,600 --> 06:58:16,520
Then we can go to the next variable, F, and say, all right, does Monday work?
8342
06:58:16,520 --> 06:58:17,020
Well, no.
8343
06:58:17,020 --> 06:58:18,320
It violates a constraint.
8344
06:58:18,320 --> 06:58:19,800
But Tuesday does work.
8345
06:58:19,800 --> 06:58:21,880
And then finally, we can look at the last variable, G,
8346
06:58:21,880 --> 06:58:24,280
recursively calling backtrack one more time.
8347
06:58:24,280 --> 06:58:25,640
Monday is inconsistent.
8348
06:58:25,640 --> 06:58:27,320
That violates a constraint.
8349
06:58:27,320 --> 06:58:29,680
Tuesday also violates a constraint.
8350
06:58:29,680 --> 06:58:33,120
But Wednesday, that doesn't violate a constraint.
8351
06:58:33,120 --> 06:58:36,840
And so now at this point, we recursively call backtrack one last time.
8352
06:58:36,840 --> 06:58:40,240
We now have a satisfactory assignment of all of the variables.
8353
06:58:40,240 --> 06:58:42,640
And at this point, we can say that we are now done.
8354
06:58:42,640 --> 06:58:47,240
We have now been able to successfully assign a value
8355
06:58:47,240 --> 06:58:49,080
to each one of these variables in such a way
8356
06:58:49,080 --> 06:58:51,480
that we're not violating any constraints.
8357
06:58:51,480 --> 06:58:55,520
We're going to go ahead and have classes A and E have their exams on Monday.
8358
06:58:55,520 --> 06:58:58,560
Classes B and F can have their exams on Tuesday.
8359
06:58:58,560 --> 06:59:02,440
And classes C, D, and G can have their exams on Wednesday.
8360
06:59:02,440 --> 06:59:06,280
And there's no violated constraints that might come up there.
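That final schedule can be checked mechanically. Here's a small sketch that verifies it against the constraint edges; note that the full edge list is reconstructed from the walkthrough (only some pairs, like A-B or D-E, are stated explicitly), so treat it as an assumption:

```python
# Final schedule from the walkthrough: A and E on Monday,
# B and F on Tuesday, C, D, and G on Wednesday.
schedule = {
    "A": "Monday", "E": "Monday",
    "B": "Tuesday", "F": "Tuesday",
    "C": "Wednesday", "D": "Wednesday", "G": "Wednesday",
}

# Constraint edges: pairs of classes that cannot share an exam day.
# Reconstructed from the walkthrough, so this exact list is an assumption.
CONSTRAINTS = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "E"), ("C", "F"), ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G"),
]

# Collect any constrained pair scheduled on the same day.
violations = [(x, y) for x, y in CONSTRAINTS if schedule[x] == schedule[y]]
print(violations)  # prints [] -- no constraint is violated
```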
8361
06:59:06,280 --> 06:59:08,840
So that then was a graphical look at how this might work.
8362
06:59:08,840 --> 06:59:11,640
Let's now take a look at some code we could use to actually try
8363
06:59:11,640 --> 06:59:14,640
and solve this problem as well.
8364
06:59:14,640 --> 06:59:20,160
So here I'll go ahead and go into the scheduling directory.
8365
06:59:20,160 --> 06:59:21,160
We're here now.
8366
06:59:21,160 --> 06:59:24,120
We'll start by looking at schedule0.py.
8367
06:59:24,120 --> 06:59:25,160
We're here.
8368
06:59:25,160 --> 06:59:28,560
I define a list of variables, A, B, C, D, E, F, G.
8369
06:59:28,560 --> 06:59:31,280
Those are all different classes.
8370
06:59:31,280 --> 06:59:34,480
Then underneath that, I define my list of constraints.
8371
06:59:34,480 --> 06:59:36,520
So constraint A and B. That is a constraint
8372
06:59:36,520 --> 06:59:38,160
because they can't be on the same day.
8373
06:59:38,160 --> 06:59:40,960
Likewise, A and C, B and C, so on and so forth,
8374
06:59:40,960 --> 06:59:43,760
enforcing those exact same constraints.
8375
06:59:43,760 --> 06:59:47,640
And here then is what the backtracking function might look like.
8376
06:59:47,640 --> 06:59:50,800
First, if the assignment is complete, if I've
8377
06:59:50,800 --> 06:59:54,000
made an assignment of every variable to a value,
8378
06:59:54,000 --> 06:59:56,760
go ahead and just return that assignment.
8379
06:59:56,760 --> 07:00:00,160
Then we'll select an unassigned variable from that assignment.
8380
07:00:00,160 --> 07:00:03,280
Then for each of the possible values in the domain, Monday, Tuesday,
8381
07:00:03,280 --> 07:00:06,640
Wednesday, let's go ahead and create a new assignment that
8382
07:00:06,640 --> 07:00:09,200
assigns the variable to that value.
8383
07:00:09,200 --> 07:00:11,700
I'll call this consistent function, which I'll show you in a moment,
8384
07:00:11,700 --> 07:00:14,640
that just checks to make sure this new assignment is consistent.
8385
07:00:14,640 --> 07:00:17,160
But if it is consistent, we'll go ahead and call backtrack
8386
07:00:17,160 --> 07:00:20,480
to go ahead and continue trying to run backtracking search.
8387
07:00:20,480 --> 07:00:24,200
And as long as the result is not none, meaning it wasn't a failure,
8388
07:00:24,200 --> 07:00:26,920
we can go ahead and return that result.
8389
07:00:26,920 --> 07:00:31,160
But if we make it through all the values and nothing works, then it is a failure.
8390
07:00:31,160 --> 07:00:32,400
There's no solution.
8391
07:00:32,400 --> 07:00:35,200
We go ahead and return none here.
8392
07:00:35,200 --> 07:00:36,400
What do these functions do?
8393
07:00:36,400 --> 07:00:40,440
Select unassigned variable is just going to choose a variable not yet assigned.
8394
07:00:40,440 --> 07:00:42,440
So it's going to loop over all the variables.
8395
07:00:42,440 --> 07:00:46,400
And if it's not already assigned, we'll go ahead and just return that variable.
8396
07:00:46,400 --> 07:00:48,440
And what does the consistent function do?
8397
07:00:48,440 --> 07:00:51,840
Well, the consistent function goes through all the constraints.
8398
07:00:51,840 --> 07:00:56,240
And if we have a situation where we've assigned both of those values
8399
07:00:56,240 --> 07:00:59,040
to variables, but they are the same, well,
8400
07:00:59,040 --> 07:01:03,120
then that is a violation of the constraint, in which case we'll return false.
8401
07:01:03,120 --> 07:01:06,360
But if nothing is inconsistent, then the assignment is consistent
8402
07:01:06,360 --> 07:01:08,760
and we'll return true.
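Putting those pieces together, here's a minimal sketch of what a schedule0.py-style program might look like. The variables and three-day domain match the lecture; the full constraint list is reconstructed from the walkthrough, so treat it as an assumption:

```python
VARIABLES = ["A", "B", "C", "D", "E", "F", "G"]
DOMAIN = ["Monday", "Tuesday", "Wednesday"]

# Pairs of classes that cannot have exams on the same day
# (edge list reconstructed from the walkthrough; an assumption).
CONSTRAINTS = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "E"), ("C", "F"), ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G"),
]


def backtrack(assignment):
    """Run backtracking search to find a complete, consistent assignment."""
    # If every variable is assigned, we're done.
    if len(assignment) == len(VARIABLES):
        return assignment
    var = select_unassigned_variable(assignment)
    for value in DOMAIN:
        # Tentatively extend the assignment with var = value.
        new_assignment = dict(assignment)
        new_assignment[var] = value
        if consistent(new_assignment):
            result = backtrack(new_assignment)
            if result is not None:
                return result
    return None  # no value worked for this variable: failure


def select_unassigned_variable(assignment):
    """Return any variable not yet in the assignment."""
    for var in VARIABLES:
        if var not in assignment:
            return var


def consistent(assignment):
    """Check that no constrained pair of assigned variables shares a value."""
    for x, y in CONSTRAINTS:
        if x in assignment and y in assignment and assignment[x] == assignment[y]:
            return False
    return True


solution = backtrack(dict())
print(solution)
```

With this variable and value ordering, the first solution found matches the lecture's: A and E on Monday, B and F on Tuesday, C, D, and G on Wednesday.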
8403
07:01:08,760 --> 07:01:12,000
And then all the program does is it calls backtrack
8404
07:01:12,000 --> 07:01:15,440
on an empty assignment, an empty dictionary that has no variable assigned
8405
07:01:15,440 --> 07:01:18,680
and no values yet, save that inside a solution,
8406
07:01:18,680 --> 07:01:21,120
and then print out that solution.
8407
07:01:21,120 --> 07:01:27,160
So by running this now, I can run Python schedule0.py.
8408
07:01:27,160 --> 07:01:29,960
And what I get as a result of that is an assignment
8409
07:01:29,960 --> 07:01:31,560
of all these variables to values.
8410
07:01:31,560 --> 07:01:35,080
And it turns out we assign a to Monday as we would expect, b to Tuesday,
8411
07:01:35,080 --> 07:01:37,280
c to Wednesday, exactly the same type of thing
8412
07:01:37,280 --> 07:01:40,280
we were talking about before, an assignment of each of these variables
8413
07:01:40,280 --> 07:01:43,960
to values that doesn't violate any constraints.
8414
07:01:43,960 --> 07:01:45,800
And I had to do a fair amount of work in order
8415
07:01:45,800 --> 07:01:47,360
to implement this idea myself.
8416
07:01:47,360 --> 07:01:49,520
I had to write the backtrack function that went ahead
8417
07:01:49,520 --> 07:01:51,880
and went through this process of recursively trying
8418
07:01:51,880 --> 07:01:53,600
to do this backtracking search.
8419
07:01:53,600 --> 07:01:56,840
But it turns out that constraint satisfaction problems are so popular
8420
07:01:56,840 --> 07:02:00,960
that there exist many libraries that already implement this type of idea.
8421
07:02:00,960 --> 07:02:03,280
Again, as with before, the specific library
8422
07:02:03,280 --> 07:02:06,360
is not as important as the fact that libraries do exist.
8423
07:02:06,360 --> 07:02:09,600
This is just one example of a Python constraint library,
8424
07:02:09,600 --> 07:02:13,320
where now, rather than having to do all the work from scratch
8425
07:02:13,320 --> 07:02:15,960
inside of schedule1.py, I'm just taking advantage
8426
07:02:15,960 --> 07:02:19,200
of a library that implements a lot of these ideas already.
8427
07:02:19,200 --> 07:02:22,520
So here, I create a new problem, add variables to it
8428
07:02:22,520 --> 07:02:24,160
with particular domains.
8429
07:02:24,160 --> 07:02:27,200
I add a whole bunch of these individual constraints,
8430
07:02:27,200 --> 07:02:30,860
where I call addConstraint and pass in a function describing
8431
07:02:30,860 --> 07:02:32,160
what the constraint is.
8432
07:02:32,160 --> 07:02:35,240
And the constraint is basically a function that takes two variables, x
8433
07:02:35,240 --> 07:02:38,480
and y, and makes sure that x is not equal to y,
8434
07:02:38,480 --> 07:02:43,480
enforcing the idea that these two classes cannot have exams on the same day.
8435
07:02:43,480 --> 07:02:46,760
And then, for any constraint satisfaction problem,
8436
07:02:46,760 --> 07:02:50,640
I can call getSolutions to get all the solutions to that problem.
8437
07:02:50,640 --> 07:02:53,160
And then, for each of those solutions, print out
8438
07:02:53,160 --> 07:02:55,520
what that solution happens to be.
8439
07:02:55,520 --> 07:02:59,320
And if I run python schedule1.py, I can now see that
8440
07:02:59,320 --> 07:03:01,880
there are actually a number of different solutions
8441
07:03:01,880 --> 07:03:03,640
that can be used to solve the problem.
8442
07:03:03,640 --> 07:03:06,720
There are, in fact, six different solutions, assignments of variables
8443
07:03:06,720 --> 07:03:10,920
to values that will give me a satisfactory answer to this constraint
8444
07:03:10,920 --> 07:03:13,080
satisfaction problem.
8445
07:03:13,080 --> 07:03:17,200
So this then was an implementation of a very basic backtracking search method,
8446
07:03:17,200 --> 07:03:19,560
where really we just went through each of the variables,
8447
07:03:19,560 --> 07:03:22,480
picked one that wasn't assigned, tried the possible values
8448
07:03:22,480 --> 07:03:23,880
the variable could take on.
8449
07:03:23,880 --> 07:03:27,240
And then, if it worked, if it didn't violate any constraints,
8450
07:03:27,240 --> 07:03:28,960
then we kept trying other variables.
8451
07:03:28,960 --> 07:03:31,480
And if ever we hit a dead end, we had to backtrack.
8452
07:03:31,480 --> 07:03:34,080
But ultimately, we might be able to be a little bit more
8453
07:03:34,080 --> 07:03:36,280
intelligent about how we do this in order
8454
07:03:36,280 --> 07:03:39,520
to improve the efficiency of how we solve these sorts of problems.
8455
07:03:39,520 --> 07:03:41,640
And one thing we might imagine trying to do
8456
07:03:41,640 --> 07:03:44,280
is going back to this idea of inference, using the knowledge we
8457
07:03:44,280 --> 07:03:47,200
know to be able to draw conclusions in order
8458
07:03:47,200 --> 07:03:51,200
to make the rest of the problem solving process a little bit easier.
8459
07:03:51,200 --> 07:03:55,320
And let's now go back to where we got stuck in this problem the first time.
8460
07:03:55,320 --> 07:03:59,320
When we were solving this constraint satisfaction problem, we dealt with B.
8461
07:03:59,320 --> 07:04:03,040
And then we went on to D. And we went ahead and just assigned D to Monday,
8462
07:04:03,040 --> 07:04:05,240
because that seemed to work with the assignment so far.
8463
07:04:05,240 --> 07:04:07,600
It didn't violate any constraints.
8464
07:04:07,600 --> 07:04:11,480
But it turned out that later on that choice turned out to be a bad one,
8465
07:04:11,480 --> 07:04:15,040
that that choice wasn't consistent with the rest of the values
8466
07:04:15,040 --> 07:04:16,920
that we could take on here.
8467
07:04:16,920 --> 07:04:18,640
And the question is, is there anything we
8468
07:04:18,640 --> 07:04:21,600
could do to avoid getting into a situation like this,
8469
07:04:21,600 --> 07:04:25,240
avoid trying to go down a path that's ultimately not going to lead anywhere
8470
07:04:25,240 --> 07:04:28,360
by taking advantage of knowledge that we have initially?
8471
07:04:28,360 --> 07:04:30,680
And it turns out we do have that kind of knowledge.
8472
07:04:30,680 --> 07:04:33,720
We can look at just the structure of this graph so far.
8473
07:04:33,720 --> 07:04:37,720
And we can say that right now C's domain, for example,
8474
07:04:37,720 --> 07:04:41,360
contains values Monday, Tuesday, and Wednesday.
8475
07:04:41,360 --> 07:04:46,160
And based on those values, we can say that this graph is not arc consistent.
8476
07:04:46,160 --> 07:04:49,140
Recall that arc consistency is all about making sure
8477
07:04:49,140 --> 07:04:52,480
that for every possible value for a particular node,
8478
07:04:52,480 --> 07:04:55,600
there is some value that the neighboring node is still able to choose.
8479
07:04:55,600 --> 07:04:58,200
And as we can see here, Monday and Tuesday
8480
07:04:58,200 --> 07:05:01,640
are not going to be possible values that we can choose for C.
8481
07:05:01,640 --> 07:05:06,120
They're not going to be consistent with a node like B, for example,
8482
07:05:06,120 --> 07:05:09,800
because B is equal to Tuesday, which means that C cannot be Tuesday.
8483
07:05:09,800 --> 07:05:13,560
And because A is equal to Monday, C also cannot be Monday.
8484
07:05:13,560 --> 07:05:18,400
So using that information, by making C arc consistent with A and B,
8485
07:05:18,400 --> 07:05:21,600
we could remove Monday and Tuesday from C's domain
8486
07:05:21,600 --> 07:05:25,440
and just leave C with Wednesday, for example.
8487
07:05:25,440 --> 07:05:28,800
And if we continued to try and enforce arc consistency,
8488
07:05:28,800 --> 07:05:31,400
we'd see there are some other conclusions we can draw as well.
8489
07:05:31,400 --> 07:05:35,360
We see that B's only option is Tuesday and C's only option is Wednesday.
8490
07:05:35,360 --> 07:05:38,800
And so if we want to make E arc consistent,
8491
07:05:38,800 --> 07:05:42,160
well, E can't be Tuesday, because that wouldn't be arc consistent with B.
8492
07:05:42,160 --> 07:05:45,440
And E can't be Wednesday, because that wouldn't be arc consistent with C.
8493
07:05:45,440 --> 07:05:49,120
So we can go ahead and say E and just set that equal to Monday, for example.
8494
07:05:49,120 --> 07:05:51,560
And then we can begin to do this process again and again,
8495
07:05:51,560 --> 07:05:54,640
that in order to make D arc consistent with B and E,
8496
07:05:54,640 --> 07:05:56,120
then D would have to be Wednesday.
8497
07:05:56,120 --> 07:05:57,880
That's the only possible option.
8498
07:05:57,880 --> 07:06:01,480
And likewise, we can make the same judgments for F and G as well.
8499
07:06:01,480 --> 07:06:04,680
And it turns out that without having to do any additional search,
8500
07:06:04,680 --> 07:06:07,920
just by enforcing arc consistency, we were
8501
07:06:07,920 --> 07:06:10,920
able to actually figure out what the assignment of all the variables
8502
07:06:10,920 --> 07:06:14,360
should be without needing to backtrack at all.
8503
07:06:14,360 --> 07:06:18,360
And the way we did that is by interleaving this search process
8504
07:06:18,360 --> 07:06:22,920
and the inference step, by this step of trying to enforce arc consistency.
8505
07:06:22,920 --> 07:06:26,120
And the algorithm to do this is often called just the maintaining arc
8506
07:06:26,120 --> 07:06:30,840
consistency algorithm, which just enforces arc consistency every time
8507
07:06:30,840 --> 07:06:34,880
we make a new assignment of a value to an existing variable.
8508
07:06:34,880 --> 07:06:38,760
So sometimes we can enforce arc consistency using that AC3 algorithm
8509
07:06:38,760 --> 07:06:41,920
at the very beginning of the problem before we even begin searching
8510
07:06:41,920 --> 07:06:43,880
in order to limit the domain of the variables
8511
07:06:43,880 --> 07:06:45,640
in order to make it easier to search.
8512
07:06:45,640 --> 07:06:48,720
But we can also take advantage of the interleaving
8513
07:06:48,720 --> 07:06:52,560
of enforcing arc consistency with search such that every time in the search
8514
07:06:52,560 --> 07:06:56,720
process we make a new assignment, we go ahead and enforce arc consistency
8515
07:06:56,720 --> 07:06:59,440
as well to make sure that we're just eliminating
8516
07:06:59,440 --> 07:07:02,680
possible values from domains whenever possible.
8517
07:07:02,680 --> 07:07:03,840
And how do we do this?
8518
07:07:03,840 --> 07:07:06,440
Well, this is really equivalent to just every time
8519
07:07:06,440 --> 07:07:09,160
we make a new assignment to a variable x.
8520
07:07:09,160 --> 07:07:12,280
We'll go ahead and call our AC3 algorithm,
8521
07:07:12,280 --> 07:07:15,680
this algorithm that enforces arc consistency on a constraint satisfaction
8522
07:07:15,680 --> 07:07:16,680
problem.
8523
07:07:16,680 --> 07:07:18,640
And we go ahead and call that, starting it
8524
07:07:18,640 --> 07:07:22,120
with a queue, not of all of the arcs, which we did originally,
8525
07:07:22,120 --> 07:07:26,600
but just of all of the arcs that we want to make arc consistent with x,
8526
07:07:26,600 --> 07:07:28,920
this thing that we have just made an assignment to.
8527
07:07:28,920 --> 07:07:33,280
So all arcs (y, x), where y is a neighbor of x, something
8528
07:07:33,280 --> 07:07:36,560
that shares a constraint with x, for example.
8529
07:07:36,560 --> 07:07:40,760
And by maintaining arc consistency in the backtracking search process,
8530
07:07:40,760 --> 07:07:44,040
we can ultimately make our search process a little bit more efficient.
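As a concrete sketch, here is one way that AC3 routine might look in Python. The helper name `revise` and the `domains`/`neighbors` dictionaries are my own bookkeeping, not from the lecture's code:

```python
from collections import deque


def revise(domains, x, y):
    """Make x arc consistent with y: drop any value in x's domain that has
    no compatible (different-day) value left in y's domain."""
    revised = False
    for value in set(domains[x]):
        if not any(value != other for other in domains[y]):
            domains[x].remove(value)
            revised = True
    return revised


def ac3(domains, neighbors, arcs=None):
    """Enforce arc consistency, starting from the given arcs (or all arcs).
    Returns False if some domain is emptied, meaning no solution is possible."""
    queue = deque(arcs if arcs is not None
                  else [(x, y) for x in domains for y in neighbors[x]])
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y):
            if not domains[x]:
                return False
            # x's domain shrank, so arcs pointing at x must be rechecked.
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))
    return True
```

After assigning x = value, the maintaining-arc-consistency step amounts to setting `domains[x] = {value}` and calling `ac3(domains, neighbors, arcs=[(y, x) for y in neighbors[x]])`. In the lecture's example, once A is fixed to Monday and B to Tuesday, this prunes C's domain down to just Wednesday.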
8531
07:07:44,040 --> 07:07:47,680
And so this is the revised version of this backtrack function.
8532
07:07:47,680 --> 07:07:50,480
Same as before, the changes here are highlighted in yellow.
8533
07:07:50,480 --> 07:07:54,320
Every time we add a new variable equals value to our assignment,
8534
07:07:54,320 --> 07:07:56,520
we'll go ahead and run this inference procedure, which
8535
07:07:56,520 --> 07:07:57,920
might do a number of different things.
8536
07:07:57,920 --> 07:08:00,880
But one thing it could do is call the maintaining arc consistency
8537
07:08:00,880 --> 07:08:05,240
algorithm to make sure we're able to enforce arc consistency on the problem.
8538
07:08:05,240 --> 07:08:09,360
And we might be able to draw new inferences as a result of that process.
8539
07:08:09,360 --> 07:08:13,600
Get new guarantees of this variable needs to be equal to that value,
8540
07:08:13,600 --> 07:08:14,360
for example.
8541
07:08:14,360 --> 07:08:15,360
That might happen one time.
8542
07:08:15,360 --> 07:08:16,720
It might happen many times.
8543
07:08:16,720 --> 07:08:19,320
And so long as those inferences are not a failure,
8544
07:08:19,320 --> 07:08:22,200
as long as they don't lead to a situation where there is no possible way
8545
07:08:22,200 --> 07:08:26,080
to make forward progress, well, then we can go ahead and add those inferences,
8546
07:08:26,080 --> 07:08:28,120
those new pieces of knowledge
8547
07:08:28,120 --> 07:08:31,200
I know about what variables should be assigned to what values,
8548
07:08:31,200 --> 07:08:35,040
I can add those to the assignment in order to more quickly make forward
8549
07:08:35,040 --> 07:08:38,720
progress by taking advantage of information that I can just deduce,
8550
07:08:38,720 --> 07:08:41,960
information I know based on the rest of the structure
8551
07:08:41,960 --> 07:08:44,240
of the constraint satisfaction problem.
8552
07:08:44,240 --> 07:08:46,040
And the only other change I'll need to make now
8553
07:08:46,040 --> 07:08:49,240
is if it turns out this value doesn't work, well, then down here,
8554
07:08:49,240 --> 07:08:52,320
I'll go ahead and need to remove not only variable equals value,
8555
07:08:52,320 --> 07:08:54,920
but also any of those inferences that I made,
8556
07:08:54,920 --> 07:08:57,480
remove that from the assignment as well.
8557
07:08:57,480 --> 07:09:01,480
So here, then, we're often able to solve the problem by backtracking less
8558
07:09:01,480 --> 07:09:03,560
than we might originally have needed to, just
8559
07:09:03,560 --> 07:09:05,880
by taking advantage of the fact that every time we
8560
07:09:05,880 --> 07:09:08,480
make a new assignment of one variable to one value,
8561
07:09:08,480 --> 07:09:12,000
that might reduce the domains of other variables as well.
8562
07:09:12,000 --> 07:09:15,520
And we can use that information to begin to more quickly draw conclusions
8563
07:09:15,520 --> 07:09:19,440
in order to try and solve the problem more efficiently as well.
8564
07:09:19,440 --> 07:09:21,560
And it turns out there are other heuristics
8565
07:09:21,560 --> 07:09:25,240
we can use to try and improve the efficiency of our search process
8566
07:09:25,240 --> 07:09:25,920
as well.
8567
07:09:25,920 --> 07:09:28,800
And it really boils down to a couple of these functions
8568
07:09:28,800 --> 07:09:30,800
that I've talked about, but we haven't really
8569
07:09:30,800 --> 07:09:32,280
talked about how they're working.
8570
07:09:32,280 --> 07:09:37,000
And one of them is this function here, select unassigned variable,
8571
07:09:37,000 --> 07:09:40,360
where we're selecting some variable in the constraint satisfaction problem
8572
07:09:40,360 --> 07:09:42,240
that has not yet been assigned.
8573
07:09:42,240 --> 07:09:45,080
So far, I've sort of just been selecting variables randomly,
8574
07:09:45,080 --> 07:09:48,320
just like picking one variable and one unassigned variable in order
8575
07:09:48,320 --> 07:09:50,240
to decide, all right, this is the variable
8576
07:09:50,240 --> 07:09:53,240
that we're going to assign next, and then going from there.
8577
07:09:53,240 --> 07:09:55,480
But it turns out that by being a little bit intelligent,
8578
07:09:55,480 --> 07:09:57,720
by following certain heuristics, we might be
8579
07:09:57,720 --> 07:10:00,400
able to make the search process much more efficient just
8580
07:10:00,400 --> 07:10:05,560
by choosing very carefully which variable we should explore next.
8581
07:10:05,560 --> 07:10:09,320
So some of those heuristics include the minimum remaining values,
8582
07:10:09,320 --> 07:10:12,240
or MRV heuristic, which generally says that if I
8583
07:10:12,240 --> 07:10:14,880
have a choice between which variable I should select,
8584
07:10:14,880 --> 07:10:18,000
I should select the variable with the smallest domain,
8585
07:10:18,000 --> 07:10:21,480
the variable that has the fewest number of remaining values left.
8586
07:10:21,480 --> 07:10:24,640
With the idea being, if there are only two remaining values left,
8587
07:10:24,640 --> 07:10:27,720
well, I may as well prune one of them very quickly in order
8588
07:10:27,720 --> 07:10:30,920
to get to the other, because one of those two has got to be the solution,
8589
07:10:30,920 --> 07:10:33,640
if a solution does exist.
8590
07:10:33,640 --> 07:10:37,600
Sometimes minimum remaining values might not give a conclusive result
8591
07:10:37,600 --> 07:10:40,920
if all the nodes have the same number of remaining values, for example.
8592
07:10:40,920 --> 07:10:43,800
And in that case, another heuristic that can be helpful to look at
8593
07:10:43,800 --> 07:10:45,680
is the degree heuristic.
8594
07:10:45,680 --> 07:10:49,440
The degree of a node is the number of nodes that are attached to that node,
8595
07:10:49,440 --> 07:10:52,880
the number of nodes that are constrained by that particular node.
8596
07:10:52,880 --> 07:10:54,960
And if you imagine which variable should I choose,
8597
07:10:54,960 --> 07:10:57,240
should I choose a variable that has a high degree that
8598
07:10:57,240 --> 07:10:59,240
is connected to a lot of different things,
8599
07:10:59,240 --> 07:11:01,120
or a variable with a low degree that is not
8600
07:11:01,120 --> 07:11:03,120
connected to a lot of different things, well,
8601
07:11:03,120 --> 07:11:06,240
it can often make sense to choose the variable that
8602
07:11:06,240 --> 07:11:09,800
has the highest degree that is connected to the most other nodes
8603
07:11:09,800 --> 07:11:11,760
as the thing you would search first.
8604
07:11:11,760 --> 07:11:12,920
Why is that the case?
8605
07:11:12,920 --> 07:11:16,320
Well, it's because by choosing a variable with a high degree,
8606
07:11:16,320 --> 07:11:20,040
that is immediately going to constrain the rest of the variables more,
8607
07:11:20,040 --> 07:11:23,760
and it's more likely to be able to eliminate large sections of the state
8608
07:11:23,760 --> 07:11:26,960
space that you don't need to search through at all.
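A sketch of a `select_unassigned_variable` that applies both heuristics: smallest domain first (minimum remaining values), then highest degree as the tie-breaker. The `domains` and `neighbors` dictionaries are assumed bookkeeping, not from the lecture's code:

```python
def select_unassigned_variable(assignment, domains, neighbors):
    """Choose the next variable: minimum remaining values first,
    then highest degree (most neighbors) as the tie-breaker."""
    unassigned = [v for v in domains if v not in assignment]
    # min() compares the tuples: fewer remaining values wins; on a tie,
    # the negated degree makes the most-connected variable win.
    return min(unassigned, key=lambda v: (len(domains[v]), -len(neighbors[v])))
```

With A and B already assigned and C's domain pruned to just Wednesday, MRV picks C; at the very start, when every domain still has size 3, the degree tie-break picks the most-connected node, which in the lecture's graph is E.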
8609
07:11:26,960 --> 07:11:29,440
So what could this actually look like?
8610
07:11:29,440 --> 07:11:31,440
Let's go back to this search problem here.
8611
07:11:31,440 --> 07:11:34,320
In this particular case, I've made an assignment here.
8612
07:11:34,320 --> 07:11:35,720
I've made an assignment here.
8613
07:11:35,720 --> 07:11:38,840
And the question is, what should I look at next?
8614
07:11:38,840 --> 07:11:41,600
And according to the minimum remaining values heuristic,
8615
07:11:41,600 --> 07:11:44,320
what I should choose is the variable that has the fewest
8616
07:11:44,320 --> 07:11:46,240
remaining possible values.
8617
07:11:46,240 --> 07:11:48,160
And in this case, that's this node here, node
8618
07:11:48,160 --> 07:11:51,720
C, that only has one value left in its domain, which in this case
8619
07:11:51,720 --> 07:11:55,360
is Wednesday, which is a very reasonable choice of a next assignment
8620
07:11:55,360 --> 07:11:58,240
to make, because I know it's the only option, for example.
8621
07:11:58,240 --> 07:12:01,480
I know that the only possible option for C is Wednesday,
8622
07:12:01,480 --> 07:12:04,760
so I may as well make that assignment and then potentially explore
8623
07:12:04,760 --> 07:12:07,440
the rest of the space after that.
8624
07:12:07,440 --> 07:12:09,520
But meanwhile, at the very start of the problem,
8625
07:12:09,520 --> 07:12:12,960
when I didn't have any knowledge of what nodes should have what values yet,
8626
07:12:12,960 --> 07:12:16,840
I still had to pick what node should be the first one that I try and assign
8627
07:12:16,840 --> 07:12:17,640
a value to.
8628
07:12:17,640 --> 07:12:20,960
And I arbitrarily just chose the one at the top, node A originally.
8629
07:12:20,960 --> 07:12:23,480
But we can be more intelligent about that.
8630
07:12:23,480 --> 07:12:25,480
We can look at this particular graph.
8631
07:12:25,480 --> 07:12:28,240
All of them have domains of the same size, domain of size 3.
8632
07:12:28,240 --> 07:12:31,240
So minimum remaining values doesn't really help us there.
8633
07:12:31,240 --> 07:12:34,760
But we might notice that node E has the highest degree.
8634
07:12:34,760 --> 07:12:37,040
It is connected to the most things.
8635
07:12:37,040 --> 07:12:39,800
And so perhaps it makes sense to begin our search,
8636
07:12:39,800 --> 07:12:41,880
rather than starting at node A at the very top,
8637
07:12:41,880 --> 07:12:43,720
start with the node with the highest degree.
8638
07:12:43,720 --> 07:12:46,760
Start by searching from node E, because from there,
8639
07:12:46,760 --> 07:12:49,160
that's going to much more easily allow us to enforce
8640
07:12:49,160 --> 07:12:51,600
the constraints that are nearby, eliminating
8641
07:12:51,600 --> 07:12:55,400
large portions of the search space that I might not need to search through.
8642
07:12:55,400 --> 07:12:59,480
And in fact, by starting with E, we can immediately then assign other variables.
8643
07:12:59,480 --> 07:13:02,160
And following that, we can actually assign the rest of the variables
8644
07:13:02,160 --> 07:13:04,600
without needing to do any backtracking at all,
8645
07:13:04,600 --> 07:13:06,880
even if I'm not using this inference procedure.
8646
07:13:06,880 --> 07:13:09,360
Just by starting with a node that has a high degree,
8647
07:13:09,360 --> 07:13:12,560
that is going to very quickly restrict the possible values
8648
07:13:12,560 --> 07:13:14,960
that other nodes can take on.
8649
07:13:14,960 --> 07:13:17,360
So that then is how we can go about selecting
8650
07:13:17,360 --> 07:13:19,840
an unassigned variable in a particular order.
8651
07:13:19,840 --> 07:13:22,160
Rather than randomly picking a variable, if we're
8652
07:13:22,160 --> 07:13:24,200
a little bit intelligent about how we choose it,
8653
07:13:24,200 --> 07:13:26,960
we can make our search process much, much more efficient
8654
07:13:26,960 --> 07:13:30,040
by making sure we don't have to search through portions of the search space
8655
07:13:30,040 --> 07:13:32,640
that ultimately aren't going to matter.
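The two variable-ordering ideas described here — minimum remaining values, with degree as a tie-breaker — can be sketched in Python. This is a minimal illustration, not code from the course; the `domains` and `neighbors` dictionaries below are a hypothetical stand-in for the exam-scheduling graph.

```python
def select_unassigned_variable(assignment, domains, neighbors):
    """Pick the unassigned variable with the fewest remaining values
    (minimum remaining values heuristic), breaking ties by choosing
    the variable with the highest degree (most neighbors)."""
    unassigned = [v for v in domains if v not in assignment]
    return min(
        unassigned,
        key=lambda v: (len(domains[v]), -len(neighbors[v])),
    )

# Hypothetical graph: every domain has size 3, so MRV alone can't
# decide, and the degree heuristic picks E, the most-connected node.
domains = {v: {"Mon", "Tue", "Wed"} for v in "ABCDEFG"}
neighbors = {
    "A": {"B", "C"}, "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "E", "F"}, "D": {"B", "E"},
    "E": {"B", "C", "D", "F", "G"}, "F": {"C", "E", "G"},
    "G": {"E", "F"},
}
print(select_unassigned_variable({}, domains, neighbors))  # E
```

Sorting by the tuple `(domain size, -degree)` applies both heuristics in one comparison.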
8656
07:13:32,640 --> 07:13:34,600
The other variable we haven't really talked about,
8657
07:13:34,600 --> 07:13:37,880
the other function here, is this domain values function.
8658
07:13:37,880 --> 07:13:40,520
This domain values function that takes a variable
8659
07:13:40,520 --> 07:13:43,040
and gives me back a sequence of all of the values
8660
07:13:43,040 --> 07:13:45,880
inside of that variable's domain.
8661
07:13:45,880 --> 07:13:47,960
The naive way to approach it is what we did before,
8662
07:13:47,960 --> 07:13:51,880
which is just go in order, go Monday, then Tuesday, then Wednesday.
8663
07:13:51,880 --> 07:13:53,560
But the problem is that going in that order
8664
07:13:53,560 --> 07:13:55,760
might not be the most efficient order to search in,
8665
07:13:55,760 --> 07:13:59,560
that sometimes it might be more efficient to choose values
8666
07:13:59,560 --> 07:14:04,320
that are likely to be solutions first and then go to other values.
8667
07:14:04,320 --> 07:14:06,320
Now, how do you assess whether a value is
8668
07:14:06,320 --> 07:14:10,200
likelier to lead to a solution or less likely to lead to a solution?
8669
07:14:10,200 --> 07:14:15,160
Well, one thing you can take a look at is how many constraints get added,
8670
07:14:15,160 --> 07:14:17,880
how many things get removed from domains as you
8671
07:14:17,880 --> 07:14:21,520
make this new assignment of a variable to this particular value.
8672
07:14:21,520 --> 07:14:26,080
And the heuristic we can use here is the least constraining value heuristic,
8673
07:14:26,080 --> 07:14:28,960
which is the idea that we should return values in order
8674
07:14:28,960 --> 07:14:32,840
based on the number of choices that are ruled out for neighboring variables.
8675
07:14:32,840 --> 07:14:36,440
And I want to start with the least constraining value, the value that
8676
07:14:36,440 --> 07:14:40,080
rules out the fewest possible options.
8677
07:14:40,080 --> 07:14:43,280
And the idea there is that if all I care about doing
8678
07:14:43,280 --> 07:14:47,520
is finding a solution, if I start with a value that
8679
07:14:47,520 --> 07:14:51,400
rules out a lot of other choices, I'm ruling out a lot of possibilities
8680
07:14:51,400 --> 07:14:55,160
that maybe is going to make it less likely that this particular choice
8681
07:14:55,160 --> 07:14:56,480
leads to a solution.
8682
07:14:56,480 --> 07:14:58,640
Whereas on the other hand, if I have a variable
8683
07:14:58,640 --> 07:15:02,160
and I start by choosing a value that doesn't rule out very much,
8684
07:15:02,160 --> 07:15:05,320
well, then I still have a lot of space where there might be a solution
8685
07:15:05,320 --> 07:15:06,640
that I could ultimately find.
8686
07:15:06,640 --> 07:15:09,680
And this might seem a little bit counterintuitive and a little bit at odds
8687
07:15:09,680 --> 07:15:12,080
with what we were talking about before, where I said,
8688
07:15:12,080 --> 07:15:14,000
when you're picking a variable, you should
8689
07:15:14,000 --> 07:15:18,360
pick the variable that is going to have the fewest possible values remaining.
8690
07:15:18,360 --> 07:15:20,480
But here, I want to pick the value for the variable
8691
07:15:20,480 --> 07:15:22,160
that is the least constraining.
8692
07:15:22,160 --> 07:15:25,040
But the general idea is that when I am picking a variable,
8693
07:15:25,040 --> 07:15:27,720
I would like to prune large portions of the search space
8694
07:15:27,720 --> 07:15:30,960
by just choosing a variable that is going to allow me to quickly eliminate
8695
07:15:30,960 --> 07:15:32,560
possible options.
8696
07:15:32,560 --> 07:15:34,880
Whereas here, within a particular variable,
8697
07:15:34,880 --> 07:15:37,880
as I'm considering values that that variable could take on,
8698
07:15:37,880 --> 07:15:40,360
I would like to just find a solution.
8699
07:15:40,360 --> 07:15:42,640
And so what I want to do is ultimately choose
8700
07:15:42,640 --> 07:15:46,680
a value that still leaves open the possibility of me finding a solution
8701
07:15:46,680 --> 07:15:48,040
to be as likely as possible.
8702
07:15:48,040 --> 07:15:51,800
By not ruling out many options, I leave open the possibility
8703
07:15:51,800 --> 07:15:54,120
that I can still find a solution without needing
8704
07:15:54,120 --> 07:15:56,080
to go back later and backtrack.
8705
07:15:56,080 --> 07:15:59,360
So an example of that might be in this particular situation here,
8706
07:15:59,360 --> 07:16:03,360
if I'm trying to choose a value for node C here,
8707
07:16:03,360 --> 07:16:06,080
where C can be either Tuesday or Wednesday.
8708
07:16:06,080 --> 07:16:09,280
We know it can't be Monday because it conflicts with this domain here,
8709
07:16:09,280 --> 07:16:13,360
where we already know that A is Monday, so C must be Tuesday or Wednesday.
8710
07:16:13,360 --> 07:16:16,120
And the question is, should I try Tuesday first,
8711
07:16:16,120 --> 07:16:18,120
or should I try Wednesday first?
8712
07:16:18,120 --> 07:16:21,280
And if I try Tuesday, what gets ruled out?
8713
07:16:21,280 --> 07:16:25,760
Well, one option gets ruled out here, a second option gets ruled out here,
8714
07:16:25,760 --> 07:16:27,720
and a third option gets ruled out here.
8715
07:16:27,720 --> 07:16:30,920
So choosing Tuesday would rule out three possible options.
8716
07:16:30,920 --> 07:16:32,600
And what about choosing Wednesday?
8717
07:16:32,600 --> 07:16:35,140
Well, choosing Wednesday would rule out one option here,
8718
07:16:35,140 --> 07:16:37,400
and it would rule out one option there.
8719
07:16:37,400 --> 07:16:38,740
And so I have two choices.
8720
07:16:38,740 --> 07:16:41,480
I can choose Tuesday that rules out three options,
8721
07:16:41,480 --> 07:16:43,760
or Wednesday that rules out two options.
8722
07:16:43,760 --> 07:16:46,600
And according to the least constraining value heuristic,
8723
07:16:46,600 --> 07:16:49,300
what I should probably do is go ahead and choose Wednesday,
8724
07:16:49,300 --> 07:16:52,240
the one that rules out the fewest number of possible options,
8725
07:16:52,240 --> 07:16:55,040
leaving open as many chances as possible for me
8726
07:16:55,040 --> 07:16:58,320
to eventually find the solution inside of the state space.
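The least constraining value heuristic just walked through can also be sketched in Python. Again a hedged illustration, not course code: the `domains` snapshot below is hypothetical, set up so that Tuesday rules out three neighboring options and Wednesday only two, as in the example.

```python
def order_domain_values(var, assignment, domains, neighbors):
    """Order var's candidate values by the least constraining value
    heuristic: try first the value that rules out the fewest options
    in the domains of var's unassigned neighbors."""
    def ruled_out(value):
        # Count how many neighbors would lose this value.
        return sum(
            value in domains[n]
            for n in neighbors[var]
            if n not in assignment
        )
    return sorted(domains[var], key=ruled_out)

# Hypothetical snapshot: C can be Tue or Wed. Tuesday appears in
# three neighboring domains (B, E, F); Wednesday in only two (B, E).
domains = {
    "C": {"Tue", "Wed"},
    "B": {"Tue", "Wed"},
    "E": {"Tue", "Wed"},
    "F": {"Tue"},
}
neighbors = {"C": {"B", "E", "F"}}
print(order_domain_values("C", {}, domains, neighbors))  # ['Wed', 'Tue']
```

Wednesday comes first because it is the least constraining choice, matching the reasoning above.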
8727
07:16:58,320 --> 07:17:00,280
And ultimately, if you continue this process,
8728
07:17:00,280 --> 07:17:05,520
we will find the solution, an assignment of variables to values,
8729
07:17:05,520 --> 07:17:09,520
that allows us to give each of these exams, each of these classes,
8730
07:17:09,520 --> 07:17:12,240
an exam date that doesn't conflict with anyone
8731
07:17:12,240 --> 07:17:16,320
that happens to be enrolled in two classes at the same time.
8732
07:17:16,320 --> 07:17:18,400
So the big takeaway now with all of this is
8733
07:17:18,400 --> 07:17:21,760
that there are a number of different ways we can formulate a problem.
8734
07:17:21,760 --> 07:17:24,520
The ways we've looked at today are we can formulate a problem
8735
07:17:24,520 --> 07:17:27,840
as a local search problem, a problem where we're looking at a current node
8736
07:17:27,840 --> 07:17:30,720
and moving to a neighbor based on whether that neighbor is better
8737
07:17:30,720 --> 07:17:33,200
or worse than the current node that we are looking at.
8738
07:17:33,200 --> 07:17:35,640
We looked at formulating problems as linear programs,
8739
07:17:35,640 --> 07:17:38,920
where just by putting things in terms of equations and constraints,
8740
07:17:38,920 --> 07:17:41,880
we're able to solve problems a little bit more efficiently.
8741
07:17:41,880 --> 07:17:45,600
And we saw formulating a problem as a constraint satisfaction problem,
8742
07:17:45,600 --> 07:17:48,200
creating this graph of all of the constraints
8743
07:17:48,200 --> 07:17:51,320
that connect two variables that have some constraint between them,
8744
07:17:51,320 --> 07:17:54,080
and using that information to be able to figure out
8745
07:17:54,080 --> 07:17:56,360
what the solution should be.
8746
07:17:56,360 --> 07:17:58,320
And so the takeaway of all of this now is
8747
07:17:58,320 --> 07:18:00,800
that if we have some problem in artificial intelligence
8748
07:18:00,800 --> 07:18:03,200
that we would like to use AI to be able to solve,
8749
07:18:03,200 --> 07:18:05,540
whether that's trying to figure out where hospitals should be
8750
07:18:05,540 --> 07:18:07,880
or trying to solve the traveling salesman problem,
8751
07:18:07,880 --> 07:18:10,560
trying to optimize productions and costs and whatnot,
8752
07:18:10,560 --> 07:18:13,200
or trying to figure out how to satisfy certain constraints,
8753
07:18:13,200 --> 07:18:15,440
whether that's in a Sudoku puzzle, or whether that's
8754
07:18:15,440 --> 07:18:18,440
in trying to figure out how to schedule exams for a university,
8755
07:18:18,440 --> 07:18:21,200
or any number of a wide variety of types of problems,
8756
07:18:21,200 --> 07:18:24,920
if we can formulate that problem as one of these sorts of problems,
8757
07:18:24,920 --> 07:18:27,640
then we can use these known algorithms, these algorithms
8758
07:18:27,640 --> 07:18:30,640
for enforcing arc consistency and backtracking search,
8759
07:18:30,640 --> 07:18:33,240
these hill climbing and simulated annealing algorithms,
8760
07:18:33,240 --> 07:18:36,240
these simplex algorithms and interior point algorithms that
8761
07:18:36,240 --> 07:18:38,220
can be used to solve linear programs, that we
8762
07:18:38,220 --> 07:18:42,400
can use those techniques to begin to solve a whole wide variety of problems
8763
07:18:42,400 --> 07:18:46,600
all in this world of optimization inside of artificial intelligence.
8764
07:18:46,600 --> 07:18:49,760
This was an introduction to artificial intelligence with Python for today.
8765
07:18:49,760 --> 07:18:52,320
We will see you next time.
8766
07:18:52,320 --> 07:18:53,320
[MUSIC PLAYING]
8767
07:19:11,120 --> 07:19:11,620
All right.
8768
07:19:11,620 --> 07:19:13,360
Welcome back, everyone, to an introduction
8769
07:19:13,360 --> 07:19:15,440
to artificial intelligence with Python.
8770
07:19:15,440 --> 07:19:17,920
Now, so far in this class, we've used AI to solve
8771
07:19:17,920 --> 07:19:20,400
a number of different problems, giving AI instructions
8772
07:19:20,400 --> 07:19:24,520
for how to search for a solution, or how to satisfy certain constraints in order
8773
07:19:24,520 --> 07:19:27,480
to find its way from some input point to some output point
8774
07:19:27,480 --> 07:19:29,720
in order to solve some sort of problem.
8775
07:19:29,720 --> 07:19:31,760
Today, we're going to turn to the world of learning,
8776
07:19:31,760 --> 07:19:34,880
in particular the idea of machine learning, which generally refers
8777
07:19:34,880 --> 07:19:38,620
to the idea where we are not going to give the computer explicit instructions
8778
07:19:38,620 --> 07:19:42,160
for how to perform a task, but rather we are going to give the computer access
8779
07:19:42,160 --> 07:19:45,560
to information in the form of data, or patterns that it can learn from,
8780
07:19:45,560 --> 07:19:48,740
and let the computer try and figure out what those patterns are,
8781
07:19:48,740 --> 07:19:52,520
try and understand that data to be able to perform a task on its own.
8782
07:19:52,520 --> 07:19:54,860
Now, machine learning comes in a number of different forms,
8783
07:19:54,860 --> 07:19:56,120
and it's a very wide field.
8784
07:19:56,120 --> 07:20:00,000
So today, we'll explore some of the foundational algorithms and ideas
8785
07:20:00,000 --> 07:20:03,320
that are behind a lot of the different areas within machine learning.
8786
07:20:03,320 --> 07:20:07,200
And one of the most popular is the idea of supervised machine learning,
8787
07:20:07,200 --> 07:20:08,780
or just supervised learning.
8788
07:20:08,780 --> 07:20:11,480
And supervised learning is a particular type of task.
8789
07:20:11,480 --> 07:20:14,480
It refers to the task where we give the computer access
8790
07:20:14,480 --> 07:20:19,120
to a data set, where that data set consists of input-output pairs.
8791
07:20:19,120 --> 07:20:21,000
And what we would like the computer to do
8792
07:20:21,000 --> 07:20:23,360
is we would like our AI to be able to figure out
8793
07:20:23,360 --> 07:20:27,200
some function that maps inputs to outputs.
8794
07:20:27,200 --> 07:20:29,580
So we have a whole bunch of data that generally consists
8795
07:20:29,580 --> 07:20:32,000
of some kind of input, some evidence, some information
8796
07:20:32,000 --> 07:20:33,800
that the computer will have access to.
8797
07:20:33,800 --> 07:20:36,720
And we would like the computer, based on that input information,
8798
07:20:36,720 --> 07:20:40,000
to predict what some output is going to be.
8799
07:20:40,000 --> 07:20:43,280
And we'll give it some data for the computer to train its model on
8800
07:20:43,280 --> 07:20:46,220
and begin to understand how it is that this information works
8801
07:20:46,220 --> 07:20:49,520
and how it is that the inputs and outputs relate to each other.
8802
07:20:49,520 --> 07:20:51,280
But ultimately, we hope that our computer
8803
07:20:51,280 --> 07:20:54,400
will be able to figure out some function that, given those inputs,
8804
07:20:54,400 --> 07:20:56,640
is able to get those outputs.
8805
07:20:56,640 --> 07:20:59,480
There are a couple of different tasks within supervised learning.
8806
07:20:59,480 --> 07:21:02,840
The one we'll focus on and start with is known as classification.
8807
07:21:02,840 --> 07:21:07,200
And classification is the problem where, if I give you a whole bunch of inputs,
8808
07:21:07,200 --> 07:21:11,560
you need to figure out some way to map those inputs into discrete categories,
8809
07:21:11,560 --> 07:21:13,600
where you can decide what those categories are,
8810
07:21:13,600 --> 07:21:16,800
and it's the job of the computer to predict what those categories are
8811
07:21:16,800 --> 07:21:17,360
going to be.
8812
07:21:17,360 --> 07:21:19,960
So that might be, for example, I give you information
8813
07:21:19,960 --> 07:21:23,560
about a bank note, like a US dollar, and I'm asking you to predict for me,
8814
07:21:23,560 --> 07:21:26,480
does it belong to the category of authentic bank notes,
8815
07:21:26,480 --> 07:21:29,400
or does it belong to the category of counterfeit bank notes?
8816
07:21:29,400 --> 07:21:31,520
You need to categorize the input, and we want
8817
07:21:31,520 --> 07:21:33,840
to train the computer to figure out some function
8818
07:21:33,840 --> 07:21:36,160
to be able to do that calculation.
8819
07:21:36,160 --> 07:21:38,280
Another example might be the case of weather,
8820
07:21:38,280 --> 07:21:40,840
something we've talked about a little bit so far in this class,
8821
07:21:40,840 --> 07:21:43,440
where we would like to predict on a given day,
8822
07:21:43,440 --> 07:21:44,960
is it going to rain on that day?
8823
07:21:44,960 --> 07:21:46,600
Is it going to be cloudy on that day?
8824
07:21:46,600 --> 07:21:49,800
And before we've seen how we could do this, if we really give the computer
8825
07:21:49,800 --> 07:21:53,200
all the exact probabilities for if these are the conditions,
8826
07:21:53,200 --> 07:21:54,800
what's the probability of rain?
8827
07:21:54,800 --> 07:21:57,520
Oftentimes, we don't have access to that information, though.
8828
07:21:57,520 --> 07:22:00,440
But what we do have access to is a whole bunch of data.
8829
07:22:00,440 --> 07:22:02,640
So if we wanted to be able to predict something like,
8830
07:22:02,640 --> 07:22:04,560
is it going to rain or is it not going to rain,
8831
07:22:04,560 --> 07:22:07,880
we would give the computer historical information about days
8832
07:22:07,880 --> 07:22:10,320
when it was raining and days when it was not raining
8833
07:22:10,320 --> 07:22:14,200
and ask the computer to look for patterns in that data.
8834
07:22:14,200 --> 07:22:15,800
So what might that data look like?
8835
07:22:15,800 --> 07:22:18,320
Well, we could structure that data in a table like this.
8836
07:22:18,320 --> 07:22:21,440
This might be what our table looks like, where for any particular day,
8837
07:22:21,440 --> 07:22:24,720
going back, we have information about that day's humidity,
8838
07:22:24,720 --> 07:22:28,120
that day's air pressure, and then importantly, we have a label,
8839
07:22:28,120 --> 07:22:31,000
something where the human has said that on this particular day,
8840
07:22:31,000 --> 07:22:33,040
it was raining or it was not raining.
8841
07:22:33,040 --> 07:22:35,680
So you could fill in this table with a whole bunch of data.
8842
07:22:35,680 --> 07:22:39,280
And what makes this what we would call a supervised learning exercise
8843
07:22:39,280 --> 07:22:42,360
is that a human has gone in and labeled each of these data points,
8844
07:22:42,360 --> 07:22:45,920
said that on this day, when these were the values for the humidity and pressure,
8845
07:22:45,920 --> 07:22:49,280
that day was a rainy day and this day was a not rainy day.
8846
07:22:49,280 --> 07:22:51,760
And what we would like the computer to be able to do then
8847
07:22:51,760 --> 07:22:55,320
is to be able to figure out, given these inputs, given the humidity
8848
07:22:55,320 --> 07:22:58,360
and the pressure, can the computer predict what label
8849
07:22:58,360 --> 07:22:59,840
should be associated with that day?
8850
07:22:59,840 --> 07:23:02,840
Does that day look more like it's going to be a day that rains
8851
07:23:02,840 --> 07:23:06,600
or does it look more like a day when it's not going to rain?
8852
07:23:06,600 --> 07:23:10,440
Put a little bit more mathematically, you can think of this as a function
8853
07:23:10,440 --> 07:23:13,280
that takes two inputs, the inputs being the data points
8854
07:23:13,280 --> 07:23:16,520
that our computer will have access to, things like humidity and pressure.
8855
07:23:16,520 --> 07:23:18,400
So we could write a function f that takes
8856
07:23:18,400 --> 07:23:20,560
as input both humidity and pressure.
8857
07:23:20,560 --> 07:23:24,080
And then the output is going to be what category
8858
07:23:24,080 --> 07:23:27,520
we would ascribe to these particular input points, what label
8859
07:23:27,520 --> 07:23:29,240
we would associate with that input.
8860
07:23:29,240 --> 07:23:31,480
So we've seen a couple of example data points here,
8861
07:23:31,480 --> 07:23:34,160
where given this value for humidity and this value for pressure,
8862
07:23:34,160 --> 07:23:37,560
we predict, is it going to rain or is it not going to rain?
8863
07:23:37,560 --> 07:23:40,520
And that's information that we just gathered from the world.
8864
07:23:40,520 --> 07:23:44,000
We measured on various different days what the humidity and pressure were.
8865
07:23:44,000 --> 07:23:48,120
We observed whether or not we saw rain or no rain on that particular day.
8866
07:23:48,120 --> 07:23:51,880
And this function f is what we would like to approximate.
8867
07:23:51,880 --> 07:23:53,840
Now, the computer and we humans don't really
8868
07:23:53,840 --> 07:23:55,640
know exactly how this function f works.
8869
07:23:55,640 --> 07:23:57,920
It's probably quite a complex function.
8870
07:23:57,920 --> 07:24:01,000
So what we're going to do instead is attempt to estimate it.
8871
07:24:01,000 --> 07:24:03,960
We would like to come up with a hypothesis function
8872
07:24:03,960 --> 07:24:08,240
h, which is going to try to approximate what f does.
8873
07:24:08,240 --> 07:24:12,200
We want to come up with some function h that will also take the same inputs
8874
07:24:12,200 --> 07:24:15,720
and will also produce an output, rain or no rain.
8875
07:24:15,720 --> 07:24:20,080
And ideally, we'd like these two functions to agree as much as possible.
8876
07:24:20,080 --> 07:24:23,720
So the goal then of the supervised learning classification tasks
8877
07:24:23,720 --> 07:24:26,880
is going to be to figure out, what does that function h look like?
8878
07:24:26,880 --> 07:24:30,880
How can we begin to estimate, given all of this information, all of this data,
8879
07:24:30,880 --> 07:24:35,280
what category or what label should be assigned to a particular data point?
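Put concretely, the unknown function f maps inputs to observed labels, and the hypothesis h is some function we construct to approximate it. The data values and the toy rule below are made up purely to fix the idea; h here is a hand-written guess, not a learned model.

```python
# Labeled observations gathered from the world:
# (humidity, pressure) -> the label a human recorded for that day.
data = [
    ((0.93, 999.7), "rain"),
    ((0.49, 1015.5), "no rain"),
    ((0.79, 1031.1), "no rain"),
]

# A hypothesis h(humidity, pressure) that tries to approximate f.
# This particular rule is illustrative only.
def h(humidity, pressure):
    return "rain" if humidity > 0.7 and pressure < 1010 else "no rain"

# Ideally h agrees with f on as many observed data points as possible.
for (humidity, pressure), label in data:
    print(h(humidity, pressure) == label)
```

The classification task is then to find an h that agrees with the observed labels as often as possible, rather than writing it by hand.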
8880
07:24:35,280 --> 07:24:37,400
So where could you begin doing this?
8881
07:24:37,400 --> 07:24:39,960
Well, a reasonable thing to do, especially in this situation,
8882
07:24:39,960 --> 07:24:42,240
I have two numerical values, is I could try
8883
07:24:42,240 --> 07:24:47,040
to plot this on a graph that has two axes, an x-axis and a y-axis.
8884
07:24:47,040 --> 07:24:50,440
And in this case, we're just going to be using two numerical values as input.
8885
07:24:50,440 --> 07:24:54,120
But these same types of ideas scale as you add more and more inputs as well.
8886
07:24:54,120 --> 07:24:56,000
We'll be plotting things in two dimensions.
8887
07:24:56,000 --> 07:24:58,440
But as we soon see, you could add more inputs
8888
07:24:58,440 --> 07:25:00,720
and just imagine things in multiple dimensions.
8889
07:25:00,720 --> 07:25:04,040
And while we humans have trouble conceptualizing anything really
8890
07:25:04,040 --> 07:25:06,320
beyond three dimensions, at least visually,
8891
07:25:06,320 --> 07:25:08,800
a computer has no problem with trying to imagine things
8892
07:25:08,800 --> 07:25:11,280
in many, many more dimensions, that for a computer,
8893
07:25:11,280 --> 07:25:14,440
each dimension is just some separate number that it is keeping track of.
8894
07:25:14,440 --> 07:25:17,600
So it wouldn't be unreasonable for a computer to think in 10 dimensions
8895
07:25:17,600 --> 07:25:20,840
or 100 dimensions to be able to try to solve a problem.
8896
07:25:20,840 --> 07:25:22,320
But for now, we've got two inputs.
8897
07:25:22,320 --> 07:25:25,440
So we'll graph things along two axes, an x-axis, which will here
8898
07:25:25,440 --> 07:25:29,400
represent humidity, and a y-axis, which here represents pressure.
8899
07:25:29,400 --> 07:25:32,280
And what we might do is say, let's take all of the days
8900
07:25:32,280 --> 07:25:35,200
that were raining and just try to plot them on this graph
8901
07:25:35,200 --> 07:25:37,080
and see where they fall on this graph.
8902
07:25:37,080 --> 07:25:40,540
And here might be all of the rainy days, where each rainy day is
8903
07:25:40,540 --> 07:25:42,800
one of these blue dots here that corresponds
8904
07:25:42,800 --> 07:25:46,440
to a particular value for humidity and a particular value for pressure.
8905
07:25:46,440 --> 07:25:49,280
And then I might do the same thing with the days that were not rainy.
8906
07:25:49,280 --> 07:25:51,320
So take all the not rainy days, figure out
8907
07:25:51,320 --> 07:25:53,960
what their values were for each of these two inputs,
8908
07:25:53,960 --> 07:25:56,680
and go ahead and plot them on this graph as well.
8909
07:25:56,680 --> 07:25:58,080
And I've here plotted them in red.
8910
07:25:58,080 --> 07:26:00,320
So blue here stands for a rainy day.
8911
07:26:00,320 --> 07:26:02,800
Red here stands for a not rainy day.
8912
07:26:02,800 --> 07:26:04,880
And this then is the input that my computer
8913
07:26:04,880 --> 07:26:07,080
has access to.
8914
07:26:07,080 --> 07:26:09,560
And what I would like the computer to be able to do
8915
07:26:09,560 --> 07:26:13,440
is to train a model such that if I'm ever presented with a new input that
8916
07:26:13,440 --> 07:26:18,080
doesn't have a label associated with it, something like this white dot here,
8917
07:26:18,080 --> 07:26:21,440
I would like to predict, given those values for each of the two inputs,
8918
07:26:21,440 --> 07:26:24,800
should we classify it as a blue dot, a rainy day,
8919
07:26:24,800 --> 07:26:28,120
or should we classify it as a red dot, a not rainy day?
8920
07:26:28,120 --> 07:26:30,800
And if you're just looking at this picture graphically, trying to say,
8921
07:26:30,800 --> 07:26:34,080
all right, this white dot, does it look like it belongs to the blue category,
8922
07:26:34,080 --> 07:26:36,480
or does it look like it belongs to the red category,
8923
07:26:36,480 --> 07:26:40,360
I think most people would agree that it probably belongs to the blue category.
8924
07:26:40,360 --> 07:26:41,120
And why is that?
8925
07:26:41,120 --> 07:26:45,280
Well, it looks like it's close to other blue dots.
8926
07:26:45,280 --> 07:26:47,280
And that's not a very formal notion, but it's a notion
8927
07:26:47,280 --> 07:26:49,120
that we'll formalize in just a moment.
8928
07:26:49,120 --> 07:26:52,120
That because it seems to be close to this blue dot here,
8929
07:26:52,120 --> 07:26:54,280
nothing else is closer to it, then we might
8930
07:26:54,280 --> 07:26:56,640
say that it should be categorized as blue.
8931
07:26:56,640 --> 07:26:58,960
It should fall into that category of, I think
8932
07:26:58,960 --> 07:27:01,680
that day is going to be a rainy day based on that input.
8933
07:27:01,680 --> 07:27:04,840
Might not be totally accurate, but it's a pretty good guess.
8934
07:27:04,840 --> 07:27:08,040
And this type of algorithm is actually a very popular and common machine
8935
07:27:08,040 --> 07:27:11,640
learning algorithm known as nearest neighbor classification.
8936
07:27:11,640 --> 07:27:14,720
It's an algorithm for solving these classification-type problems.
8937
07:27:14,720 --> 07:27:18,360
And in nearest neighbor classification, it's going to perform this algorithm.
8938
07:27:18,360 --> 07:27:20,360
What it will do is, given an input, it will
8939
07:27:20,360 --> 07:27:24,560
choose the class of the nearest data point to that input.
8940
07:27:24,560 --> 07:27:27,800
By class, we just here mean category, like rain or no rain,
8941
07:27:27,800 --> 07:27:29,760
counterfeit or not counterfeit.
8942
07:27:29,760 --> 07:27:34,600
And we choose the category or the class based on the nearest data point.
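Nearest neighbor classification as just described fits in a few lines of Python. A minimal sketch with made-up data points (the feature scales here are toy values, not real weather measurements):

```python
import math

def nearest_neighbor_classify(point, data):
    """Return the class of the single labeled data point that is
    nearest to `point` by Euclidean distance."""
    _, label = min(data, key=lambda pair: math.dist(pair[0], point))
    return label

# Hypothetical labeled points: "rain" plays the role of the blue
# dots, "no rain" the red dots.
data = [
    ((0.90, 1000.0), "rain"),
    ((0.85, 1005.0), "rain"),
    ((0.40, 1020.0), "no rain"),
]
print(nearest_neighbor_classify((0.88, 1002.0), data))  # rain
```

The unlabeled point is classified purely by whichever single labeled point it sits closest to.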
8943
07:27:34,600 --> 07:27:36,480
So given all that data we just looked at,
8944
07:27:36,480 --> 07:27:39,960
is the nearest data point a blue point or is it a red point?
8945
07:27:39,960 --> 07:27:42,600
And depending on the answer to that question,
8946
07:27:42,600 --> 07:27:44,600
we were able to make some sort of judgment.
8947
07:27:44,600 --> 07:27:47,320
We were able to say something like, we think it's going to be blue
8948
07:27:47,320 --> 07:27:49,360
or we think it's going to be red.
8949
07:27:49,360 --> 07:27:51,480
So likewise, we could apply this to other data points
8950
07:27:51,480 --> 07:27:52,800
that we encounter as well.
8951
07:27:52,800 --> 07:27:56,800
If suddenly this data point comes about, well, its nearest data point is red.
8952
07:27:56,800 --> 07:28:00,240
So we would go ahead and classify this as a red point, not raining.
8953
07:28:00,240 --> 07:28:03,480
Things get a little bit trickier, though, when you look at a point
8954
07:28:03,480 --> 07:28:07,160
like this white point over here and you ask the same sort of question.
8955
07:28:07,160 --> 07:28:10,640
Should it belong to the category of blue points, the rainy days?
8956
07:28:10,640 --> 07:28:14,800
Or should it belong to the category of red points, the not rainy days?
8957
07:28:14,800 --> 07:28:18,760
Now, nearest neighbor classification would say the way you solve this problem
8958
07:28:18,760 --> 07:28:21,000
is look at which point is nearest to that point.
8959
07:28:21,000 --> 07:28:23,000
You look at this nearest point and say it's red.
8960
07:28:23,000 --> 07:28:24,240
It's a not rainy day.
8961
07:28:24,240 --> 07:28:27,080
And therefore, according to nearest neighbor classification,
8962
07:28:27,080 --> 07:28:30,400
I would say that this unlabeled point, well, that should also be red.
8963
07:28:30,400 --> 07:28:33,720
It should also be classified as a not rainy day.
8964
07:28:33,720 --> 07:28:37,080
But your intuition might think that that's a reasonable judgment to make,
8965
07:28:37,080 --> 07:28:39,280
that the closest thing is a not rainy day.
8966
07:28:39,280 --> 07:28:41,480
So may as well guess that it's a not rainy day.
8967
07:28:41,480 --> 07:28:44,640
But it's probably also reasonable to look at the bigger picture of things
8968
07:28:44,640 --> 07:28:49,480
to say, yes, it is true that the nearest point to it was a red point.
8969
07:28:49,480 --> 07:28:52,920
But it's surrounded by a whole bunch of other blue points.
8970
07:28:52,920 --> 07:28:55,160
So looking at the bigger picture, there's potentially
8971
07:28:55,160 --> 07:28:59,160
an argument to be made that this point should actually be blue.
8972
07:28:59,160 --> 07:29:01,440
And with only this data, we actually don't know for sure.
8973
07:29:01,440 --> 07:29:04,080
We are given some input, something we're trying to predict.
8974
07:29:04,080 --> 07:29:07,240
And we don't necessarily know what the output is going to be.
8975
07:29:07,240 --> 07:29:10,320
So in this case, which one is correct is difficult to say.
8976
07:29:10,320 --> 07:29:13,560
But oftentimes, considering more than just a single neighbor,
8977
07:29:13,560 --> 07:29:18,080
considering multiple neighbors can sometimes give us a better result.
8978
07:29:18,080 --> 07:29:21,800
And so there's a variant on the nearest neighbor classification algorithm
8979
07:29:21,800 --> 07:29:25,400
that is known as the K nearest neighbor classification algorithm,
8980
07:29:25,400 --> 07:29:28,320
where K is some parameter, some number that we choose,
8981
07:29:28,320 --> 07:29:30,920
for how many neighbors are we going to look at.
8982
07:29:30,920 --> 07:29:34,280
So one nearest neighbor classification is what we saw before.
8983
07:29:34,280 --> 07:29:37,600
Just pick the one nearest neighbor and use that category.
8984
07:29:37,600 --> 07:29:39,640
But with K nearest neighbor classification,
8985
07:29:39,640 --> 07:29:44,840
where K might be 3, or 5, or 7, meaning we look at the 3, or 5, or 7 closest
8986
07:29:44,840 --> 07:29:48,760
neighbors, the closest data points to that point. It works a little bit differently.
8987
07:29:48,760 --> 07:29:50,600
This algorithm, given an input, will
8988
07:29:50,600 --> 07:29:55,520
choose the most common class out of the K nearest data points to that input.
8989
07:29:55,520 --> 07:29:59,560
So if we look at the five nearest points, and three of them say it's raining,
8990
07:29:59,560 --> 07:30:01,360
and two of them say it's not raining, we'll
8991
07:30:01,360 --> 07:30:05,320
go with the three instead of the two, because each one effectively
8992
07:30:05,320 --> 07:30:09,280
gets one vote towards what they believe the category ought to be.
8993
07:30:09,280 --> 07:30:12,760
And ultimately, you choose the category that has the most votes
8994
07:30:12,760 --> 07:30:14,680
as a consequence of that.
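The voting version just described, K nearest neighbors, extends the earlier sketch: sort the labeled points by distance, take the K closest, and let each one vote. The data below is hypothetical, arranged so that with k=5 three "rain" votes outweigh two "no rain" votes.

```python
import math
from collections import Counter

def knn_classify(point, data, k=3):
    """Choose the most common class among the k labeled data points
    nearest to `point` (Euclidean distance); each neighbor gets one vote."""
    by_distance = sorted(data, key=lambda pair: math.dist(pair[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical points: three nearby "rain" days, two nearby
# "no rain" days, and one far-away "no rain" day.
data = [
    ((1.0, 1.0), "rain"), ((1.1, 1.0), "rain"), ((0.9, 1.1), "rain"),
    ((1.2, 0.8), "no rain"), ((0.8, 0.9), "no rain"),
    ((5.0, 5.0), "no rain"),
]
print(knn_classify((1.0, 1.0), data, k=5))  # rain
```

With k=1 this reduces to plain nearest neighbor classification; larger k considers the bigger picture, as in the white-dot example above.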
8995
07:30:14,680 --> 07:30:17,640
So K nearest neighbor classification, fairly straightforward one
8996
07:30:17,640 --> 07:30:18,880
to understand intuitively.
8997
07:30:18,880 --> 07:30:21,800
You just look at the neighbors and figure out what the answer might be.
8998
07:30:21,800 --> 07:30:24,120
And it turns out this can work very, very well
8999
07:30:24,120 --> 07:30:28,360
for solving a whole variety of different types of classification problems.
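The voting procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the `knn_classify` helper and the sample `(humidity, pressure)` data points are invented for the example.

```python
import math
from collections import Counter

def knn_classify(data, point, k):
    """Predict a label for `point` by majority vote among its k nearest neighbors.

    data: list of ((x1, x2), label) pairs
    point: (x1, x2) tuple to classify
    """
    # Sort the labeled points by Euclidean distance to the query point.
    neighbors = sorted(data, key=lambda item: math.dist(item[0], point))
    # Each of the k closest points gets one vote; the most common label wins.
    labels = [label for _, label in neighbors[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical data: three nearby "rain" points outvote two "no rain" points.
data = [
    ((0.8, 1.0), "rain"), ((0.7, 1.1), "rain"), ((0.9, 0.9), "rain"),
    ((0.2, 1.4), "no rain"), ((0.3, 1.3), "no rain"),
]
print(knn_classify(data, (0.75, 1.0), k=5))  # -> rain
```

With k=5, three of the five neighbors say "rain" and two say "no rain", so "rain" gets the most votes, exactly as in the example above.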
9000
07:30:28,360 --> 07:30:31,020
But not every model is going to work under every situation.
9001
07:30:31,020 --> 07:30:33,520
And so one of the things we'll take a look at today, especially
9002
07:30:33,520 --> 07:30:35,600
in the context of supervised machine learning,
9003
07:30:35,600 --> 07:30:38,400
is that there are a number of different approaches to machine learning,
9004
07:30:38,400 --> 07:30:40,760
a number of different algorithms that we can apply,
9005
07:30:40,760 --> 07:30:44,880
all solving the same type of problem, all solving some kind of classification
9006
07:30:44,880 --> 07:30:47,820
problem where we want to take inputs and organize them
9007
07:30:47,820 --> 07:30:49,080
into different categories.
9008
07:30:49,080 --> 07:30:51,280
And no one algorithm is necessarily always
9009
07:30:51,280 --> 07:30:53,440
going to be better than some other algorithm.
9010
07:30:53,440 --> 07:30:54,640
They each have their trade-offs.
9011
07:30:54,640 --> 07:30:57,480
And maybe depending on the data, one type of algorithm
9012
07:30:57,480 --> 07:30:59,360
is going to be better suited to trying to model
9013
07:30:59,360 --> 07:31:01,320
that information than some other algorithm.
9014
07:31:01,320 --> 07:31:04,440
And so this is what a lot of machine learning research ends up being about,
9015
07:31:04,440 --> 07:31:06,740
that when you're trying to apply machine learning techniques,
9016
07:31:06,740 --> 07:31:09,400
you're often looking not just at one particular algorithm,
9017
07:31:09,400 --> 07:31:11,160
but trying multiple different algorithms,
9018
07:31:11,160 --> 07:31:14,520
trying to see what is going to give you the best results for trying
9019
07:31:14,520 --> 07:31:18,720
to predict some function that maps inputs to outputs.
9020
07:31:18,720 --> 07:31:22,320
So what then are the drawbacks of K nearest neighbor classification?
9021
07:31:22,320 --> 07:31:23,560
Well, there are a couple.
9022
07:31:23,560 --> 07:31:27,200
One might be that in a naive approach, at least, it could be fairly slow
9023
07:31:27,200 --> 07:31:30,000
to have to go through and measure the distance between a point
9024
07:31:30,000 --> 07:31:32,240
and every single one of these points that exist here.
9025
07:31:32,240 --> 07:31:33,740
Now, there are ways of trying to get around that.
9026
07:31:33,740 --> 07:31:36,320
There are data structures that can help us more quickly
9027
07:31:36,320 --> 07:31:38,080
find these neighbors.
9028
07:31:38,080 --> 07:31:41,440
There are also techniques you can use to try and prune some of this data,
9029
07:31:41,440 --> 07:31:43,600
remove some of the data points so that you're only
9030
07:31:43,600 --> 07:31:47,320
left with the relevant data points just to make it a little bit easier.
9031
07:31:47,320 --> 07:31:49,840
But ultimately, what we might like to do is come up
9032
07:31:49,840 --> 07:31:53,320
with another way of trying to do this classification.
9033
07:31:53,320 --> 07:31:55,500
And one way of trying to do the classification
9034
07:31:55,500 --> 07:31:57,680
was looking at what are the neighboring points.
9035
07:31:57,680 --> 07:32:01,120
But another way might be to try to look at all of the data
9036
07:32:01,120 --> 07:32:05,240
and see if we can come up with some decision boundary, some boundary that
9037
07:32:05,240 --> 07:32:08,760
will separate the rainy days from the not rainy days.
9038
07:32:08,760 --> 07:32:11,720
And in the case of two dimensions, we can do that by drawing a line,
9039
07:32:11,720 --> 07:32:12,560
for example.
9040
07:32:12,560 --> 07:32:15,840
So what we might want to try to do is just find some line,
9041
07:32:15,840 --> 07:32:20,440
find some separator that divides the rainy days, the blue points over here,
9042
07:32:20,440 --> 07:32:22,960
from the not rainy days, the red points over there.
9043
07:32:22,960 --> 07:32:25,600
We're now trying a different approach in contrast
9044
07:32:25,600 --> 07:32:27,840
with the nearest neighbor approach, which just
9045
07:32:27,840 --> 07:32:31,320
looked at local data around the input data point that we cared about.
9046
07:32:31,320 --> 07:32:35,080
Now what we're doing is trying to use a technique known as linear regression
9047
07:32:35,080 --> 07:32:39,800
to find some sort of line that will separate the two halves from each other.
9048
07:32:39,800 --> 07:32:42,080
Now sometimes it'll actually be possible to come up
9049
07:32:42,080 --> 07:32:45,120
with some line that perfectly separates all the rainy days
9050
07:32:45,120 --> 07:32:46,520
from the not rainy days.
9051
07:32:46,520 --> 07:32:49,040
Realistically, though, this is probably cleaner
9052
07:32:49,040 --> 07:32:50,960
than many data sets will actually be.
9053
07:32:50,960 --> 07:32:52,400
Oftentimes, data is messier.
9054
07:32:52,400 --> 07:32:53,280
There are outliers.
9055
07:32:53,280 --> 07:32:56,760
There's random noise that happens inside of a particular system.
9056
07:32:56,760 --> 07:32:59,160
And what we'd like to do is still be able to figure out
9057
07:32:59,160 --> 07:33:00,560
what a line might look like.
9058
07:33:00,560 --> 07:33:04,960
So in practice, the data will not always be linearly separable.
9059
07:33:04,960 --> 07:33:07,680
Linearly separable refers to a data set
9060
07:33:07,680 --> 07:33:11,680
where I could draw a line just to separate the two halves of it perfectly.
9061
07:33:11,680 --> 07:33:13,520
Instead, you might have a situation like this,
9062
07:33:13,520 --> 07:33:16,960
where there are some rainy points that are on this side of the line
9063
07:33:16,960 --> 07:33:19,480
and some not rainy points that are on that side of the line.
9064
07:33:19,480 --> 07:33:23,800
And there may not be a line that perfectly separates
9065
07:33:23,800 --> 07:33:25,920
one half of the inputs from the other half,
9066
07:33:25,920 --> 07:33:29,440
that perfectly separates all the rainy days from the not rainy days.
9067
07:33:29,440 --> 07:33:33,000
But we can still say that this line does a pretty good job.
9068
07:33:33,000 --> 07:33:34,880
And we'll try to formalize a little bit later
9069
07:33:34,880 --> 07:33:38,400
what we mean when we say something like this line does a pretty good job
9070
07:33:38,400 --> 07:33:40,000
of trying to make that prediction.
9071
07:33:40,000 --> 07:33:42,640
But for now, let's just say we're looking for a line that
9072
07:33:42,640 --> 07:33:47,680
does as good of a job as we can at trying to separate one category of things
9073
07:33:47,680 --> 07:33:49,560
from another category of things.
9074
07:33:49,560 --> 07:33:53,080
So let's now try to formalize this a little bit more mathematically.
9075
07:33:53,080 --> 07:33:56,400
We want to come up with some sort of function, some way we can define this
9076
07:33:56,400 --> 07:33:57,200
line.
9077
07:33:57,200 --> 07:34:01,840
And our inputs are things like humidity and pressure in this case.
9078
07:34:01,840 --> 07:34:05,760
So our inputs we might call x1 is going to represent humidity,
9079
07:34:05,760 --> 07:34:08,320
and x2 is going to represent pressure.
9080
07:34:08,320 --> 07:34:11,440
These are inputs that we are going to provide to our machine learning
9081
07:34:11,440 --> 07:34:12,160
algorithm.
9082
07:34:12,160 --> 07:34:14,800
And given those inputs, we would like for our model
9083
07:34:14,800 --> 07:34:17,160
to be able to predict some sort of output.
9084
07:34:17,160 --> 07:34:20,360
And we are going to predict that using our hypothesis function, which
9085
07:34:20,360 --> 07:34:21,520
we called h.
9086
07:34:21,520 --> 07:34:26,600
Our hypothesis function is going to take as input x1 and x2, humidity
9087
07:34:26,600 --> 07:34:27,720
and pressure in this case.
9088
07:34:27,720 --> 07:34:29,680
And you can imagine if we didn't just have two inputs,
9089
07:34:29,680 --> 07:34:31,760
we had three or four or five inputs or more,
9090
07:34:31,760 --> 07:34:35,200
we could have this hypothesis function take all of those as input.
9091
07:34:35,200 --> 07:34:38,440
And we'll see examples of that a little bit later as well.
9092
07:34:38,440 --> 07:34:42,280
And now the question is, what does this hypothesis function do?
9093
07:34:42,280 --> 07:34:46,880
Well, it really just needs to measure, is this data point
9094
07:34:46,880 --> 07:34:51,560
on one side of the boundary, or is it on the other side of the boundary?
9095
07:34:51,560 --> 07:34:53,520
And how do we formalize that boundary?
9096
07:34:53,520 --> 07:34:55,920
Well, the boundary is generally going to be
9097
07:34:55,920 --> 07:34:59,600
a linear combination of these input variables,
9098
07:34:59,600 --> 07:35:01,120
at least in this particular case.
9099
07:35:01,120 --> 07:35:03,840
So what we're trying to do when we say linear combination
9100
07:35:03,840 --> 07:35:06,440
is take each of these inputs and multiply them
9101
07:35:06,440 --> 07:35:08,520
by some number that we're going to have to figure out.
9102
07:35:08,520 --> 07:35:11,600
We'll generally call that number a weight for how important
9103
07:35:11,600 --> 07:35:14,760
should these variables be in trying to determine the answer.
9104
07:35:14,760 --> 07:35:17,400
So we'll weight each of these variables with some weight,
9105
07:35:17,400 --> 07:35:19,880
and we might add a constant to it just to try and make
9106
07:35:19,880 --> 07:35:21,560
the function a little bit different.
9107
07:35:21,560 --> 07:35:23,240
And the result, we just need to compare.
9108
07:35:23,240 --> 07:35:26,300
Is it greater than 0, or is it less than 0 to say,
9109
07:35:26,300 --> 07:35:30,120
does it belong on one side of the line or the other side of the line?
9110
07:35:30,120 --> 07:35:33,960
So what that mathematical expression might look like is this.
9111
07:35:33,960 --> 07:35:38,920
We would take each of our variables, x1 and x2, and multiply them by some weight.
9112
07:35:38,920 --> 07:35:40,600
I don't yet know what that weight is, but it's
9113
07:35:40,600 --> 07:35:43,600
going to be some number, weight 1 and weight 2.
9114
07:35:43,600 --> 07:35:46,440
And maybe we just want to add some other weight 0 to it,
9115
07:35:46,440 --> 07:35:50,080
because the function might require us to shift the entire value up or down
9116
07:35:50,080 --> 07:35:51,480
by a certain amount.
9117
07:35:51,480 --> 07:35:52,720
And then we just compare.
9118
07:35:52,720 --> 07:35:55,760
If we do all this math, is it greater than or equal to 0?
9119
07:35:55,760 --> 07:35:58,720
If so, we might categorize that data point as a rainy day.
9120
07:35:58,720 --> 07:36:02,360
And otherwise, we might say, no rain.
9121
07:36:02,360 --> 07:36:05,160
So the key here, then, is that this expression
9122
07:36:05,160 --> 07:36:08,540
is how we are going to calculate whether it's a rainy day or not.
9123
07:36:08,540 --> 07:36:11,520
We're going to do a bunch of math where we take each of the variables,
9124
07:36:11,520 --> 07:36:14,560
multiply them by a weight, maybe add an extra weight to it,
9125
07:36:14,560 --> 07:36:17,000
see if the result is greater than or equal to 0.
9126
07:36:17,000 --> 07:36:19,160
And using that result of that expression,
9127
07:36:19,160 --> 07:36:22,580
we're able to determine whether it's raining or not raining.
9128
07:36:22,580 --> 07:36:26,000
This expression here is in this case going to refer to just some line.
9129
07:36:26,000 --> 07:36:29,040
If you were to plot that graphically, it would just be some line.
9130
07:36:29,040 --> 07:36:33,240
And what the line actually looks like depends upon these weights.
9131
07:36:33,240 --> 07:36:35,640
x1 and x2 are the inputs, but these weights
9132
07:36:35,640 --> 07:36:39,160
are really what determine the shape of that line, the slope of that line,
9133
07:36:39,160 --> 07:36:42,040
and what that line actually looks like.
9134
07:36:42,040 --> 07:36:45,200
So we then would like to figure out what these weights should be.
9135
07:36:45,200 --> 07:36:47,460
We can choose whatever weights we want, but we
9136
07:36:47,460 --> 07:36:51,280
want to choose weights in such a way that if you pass in a rainy day's
9137
07:36:51,280 --> 07:36:53,800
humidity and pressure, then you end up with a result that
9138
07:36:53,800 --> 07:36:55,240
is greater than or equal to 0.
9139
07:36:55,240 --> 07:36:57,960
And we would like it such that if we passed into our hypothesis
9140
07:36:57,960 --> 07:37:01,880
function a not rainy day's inputs, then the output that we get
9141
07:37:01,880 --> 07:37:03,880
should be not raining.
9142
07:37:03,880 --> 07:37:06,880
So before we get there, let's try and formalize this a little bit more
9143
07:37:06,880 --> 07:37:10,280
mathematically just to get a sense for how it is that you'll often see this
9144
07:37:10,280 --> 07:37:12,960
if you ever go further into supervised machine learning
9145
07:37:12,960 --> 07:37:14,320
and explore this idea.
9146
07:37:14,320 --> 07:37:16,480
One thing is that generally for these categories,
9147
07:37:16,480 --> 07:37:20,240
we'll sometimes just use the names of the categories like rain and not rain.
9148
07:37:20,240 --> 07:37:23,520
Often mathematically, if we're trying to do comparisons between these things,
9149
07:37:23,520 --> 07:37:25,960
it's easier just to deal in the world of numbers.
9150
07:37:25,960 --> 07:37:30,600
So we could just say 1 and 0, 1 for raining, 0 for not raining.
9151
07:37:30,600 --> 07:37:31,880
So we do all this math.
9152
07:37:31,880 --> 07:37:34,360
And if the result is greater than or equal to 0,
9153
07:37:34,360 --> 07:37:37,960
we'll go ahead and say our hypothesis function outputs 1, meaning raining.
9154
07:37:37,960 --> 07:37:41,040
And otherwise, it outputs 0, meaning not raining.
9155
07:37:41,040 --> 07:37:45,000
And oftentimes, this type of expression will instead
9156
07:37:45,000 --> 07:37:47,840
be expressed using vector mathematics.
9157
07:37:47,840 --> 07:37:50,240
And all a vector is, if you're not familiar with the term,
9158
07:37:50,240 --> 07:37:53,240
is a sequence of numerical values.
9159
07:37:53,240 --> 07:37:56,480
You could represent that in Python using a list of numerical values
9160
07:37:56,480 --> 07:37:59,160
or a tuple with numerical values.
9161
07:37:59,160 --> 07:38:02,920
And here, we have a couple of sequences of numerical values.
9162
07:38:02,920 --> 07:38:06,160
One of our vectors, one of our sequences of numerical values,
9163
07:38:06,160 --> 07:38:11,200
are all of these individual weights, w0, w1, and w2.
9164
07:38:11,200 --> 07:38:14,240
So we could construct what we'll call a weight vector,
9165
07:38:14,240 --> 07:38:16,400
and we'll see why this is useful in a moment,
9166
07:38:16,400 --> 07:38:19,960
called w, generally represented using a boldface w, that
9167
07:38:19,960 --> 07:38:23,080
is just a sequence of these three weights, weight 0, weight 1,
9168
07:38:23,080 --> 07:38:24,480
and weight 2.
9169
07:38:24,480 --> 07:38:26,720
And to be able to calculate, based on those weights,
9170
07:38:26,720 --> 07:38:30,440
whether we think a day is raining or not raining,
9171
07:38:30,440 --> 07:38:35,320
we're going to multiply each of those weights by one of our input variables.
9172
07:38:35,320 --> 07:38:39,480
That w2, this weight, is going to be multiplied by input variable x2.
9173
07:38:39,480 --> 07:38:42,640
w1 is going to be multiplied by input variable x1.
9174
07:38:42,640 --> 07:38:46,120
And w0, well, it's not being multiplied by anything.
9175
07:38:46,120 --> 07:38:48,120
But to make sure the vectors are the same length,
9176
07:38:48,120 --> 07:38:50,200
and we'll see why that's useful in just a second,
9177
07:38:50,200 --> 07:38:54,080
we'll just go ahead and say w0 is being multiplied by 1.
9178
07:38:54,080 --> 07:38:55,840
Because you can multiply something by 1,
9179
07:38:55,840 --> 07:38:58,040
and you end up getting the exact same number.
9180
07:38:58,040 --> 07:39:00,480
So in addition to the weight vector w, we'll
9181
07:39:00,480 --> 07:39:05,480
also have an input vector that we'll call x that has three values, 1,
9182
07:39:05,480 --> 07:39:11,080
again, because we're just multiplying w0 by 1 eventually, and then x1 and x2.
9183
07:39:11,080 --> 07:39:14,800
So here, then, we've represented two distinct vectors, a vector of weights
9184
07:39:14,800 --> 07:39:16,520
that we need to somehow learn.
9185
07:39:16,520 --> 07:39:18,640
The goal of our machine learning algorithm
9186
07:39:18,640 --> 07:39:21,160
is to learn what this weight vector is supposed to be.
9187
07:39:21,160 --> 07:39:23,440
We could choose any arbitrary set of numbers,
9188
07:39:23,440 --> 07:39:26,400
and it would produce a function that tries to predict rain or not rain,
9189
07:39:26,400 --> 07:39:28,080
but it probably wouldn't be very good.
9190
07:39:28,080 --> 07:39:32,120
What we want to do is come up with a good choice of these weights
9191
07:39:32,120 --> 07:39:34,920
so that we're able to do the accurate predictions.
9192
07:39:34,920 --> 07:39:38,720
And then this input vector represents a particular input
9193
07:39:38,720 --> 07:39:41,880
to the function, a data point for which we would like to estimate,
9194
07:39:41,880 --> 07:39:45,200
is that day a rainy day, or is that day a not rainy day?
9195
07:39:45,200 --> 07:39:47,040
And so that's going to vary just depending
9196
07:39:47,040 --> 07:39:49,200
on what input is provided to our function, what
9197
07:39:49,200 --> 07:39:51,240
it is that we are trying to estimate.
9198
07:39:51,240 --> 07:39:55,240
And then to do the calculation, we want to calculate this expression here,
9199
07:39:55,240 --> 07:39:59,200
and it turns out that expression is what we would call the dot product
9200
07:39:59,200 --> 07:40:00,320
of these two vectors.
9201
07:40:00,320 --> 07:40:04,600
The dot product of two vectors just means taking each of the terms
9202
07:40:04,600 --> 07:40:08,120
in the vectors and multiplying them together: w0 multiplied by 1,
9203
07:40:08,120 --> 07:40:11,720
w1 multiplied by x1, w2 multiplied by x2,
9204
07:40:11,720 --> 07:40:14,360
and that's why these vectors need to be the same length.
9205
07:40:14,360 --> 07:40:17,400
And then we just add all of the results together.
9206
07:40:17,400 --> 07:40:22,960
So the dot product of w and x, our weight vector and our input vector,
9207
07:40:22,960 --> 07:40:26,640
that's just going to be w0 times 1, or just w0,
9208
07:40:26,640 --> 07:40:30,680
plus w1 times x1, multiplying these two terms together,
9209
07:40:30,680 --> 07:40:35,760
plus w2 times x2, multiplying those terms together.
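The dot product just described is simple enough to write out directly: multiply the two vectors term by term and sum the products. A minimal sketch, with weight values made up purely for illustration:

```python
def dot(w, x):
    """Dot product of two equal-length vectors: sum of term-by-term products."""
    assert len(w) == len(x)  # this is why the vectors must be the same length
    return sum(wi * xi for wi, xi in zip(w, x))

w = [1.0, 2.0, -3.0]  # w0, w1, w2 (hypothetical weights)
x = [1.0, 0.5, 0.5]   # 1 (paired with w0), then x1 and x2
print(dot(w, x))      # w0*1 + w1*x1 + w2*x2 = 1.0 + 1.0 - 1.5 = 0.5
```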
9210
07:40:35,760 --> 07:40:38,120
So we have our weight vector, which we need to figure out.
9211
07:40:38,120 --> 07:40:39,960
We need our machine learning algorithm to figure out
9212
07:40:39,960 --> 07:40:41,200
what the weights should be.
9213
07:40:41,200 --> 07:40:44,280
We have the input vector representing the data point
9214
07:40:44,280 --> 07:40:47,560
that we're trying to predict a category for, predict a label for.
9215
07:40:47,560 --> 07:40:51,080
And we're able to do that calculation by taking this dot product, which
9216
07:40:51,080 --> 07:40:53,120
you'll often see represented in vector form.
9217
07:40:53,120 --> 07:40:54,880
But if you haven't seen vectors before, you
9218
07:40:54,880 --> 07:40:57,760
can think of it as identical to just this mathematical expression,
9219
07:40:57,760 --> 07:41:01,120
just doing the multiplication, adding the results together,
9220
07:41:01,120 --> 07:41:04,400
and then seeing whether the result is greater than or equal to 0 or not.
9221
07:41:04,400 --> 07:41:07,480
This expression here is identical to the expression
9222
07:41:07,480 --> 07:41:09,760
that we're calculating to see whether or not
9223
07:41:09,760 --> 07:41:14,200
that answer is greater than or equal to 0 in this case.
9224
07:41:14,200 --> 07:41:17,280
And so for that reason, you'll often see the hypothesis function
9225
07:41:17,280 --> 07:41:20,520
written as something like this, a simpler representation where
9226
07:41:20,520 --> 07:41:25,360
the hypothesis takes as input some input vector x, some humidity
9227
07:41:25,360 --> 07:41:26,960
and pressure for some day.
9228
07:41:26,960 --> 07:41:30,720
And we want to predict an output like rain or no rain or 1 or 0
9229
07:41:30,720 --> 07:41:33,360
if we choose to represent things numerically.
9230
07:41:33,360 --> 07:41:37,520
And the way we do that is by taking the dot product of the weights
9231
07:41:37,520 --> 07:41:38,640
and our input.
9232
07:41:38,640 --> 07:41:42,080
If it's greater than or equal to 0, we'll go ahead and say the output is 1.
9233
07:41:42,080 --> 07:41:44,960
Otherwise, the output is going to be 0.
9234
07:41:44,960 --> 07:41:49,080
And this hypothesis, we say, is parameterized by the weights.
9235
07:41:49,080 --> 07:41:51,280
Depending on what weights we choose, we'll
9236
07:41:51,280 --> 07:41:53,400
end up getting a different hypothesis.
9237
07:41:53,400 --> 07:41:55,480
If we choose the weights randomly, we're probably
9238
07:41:55,480 --> 07:41:57,480
not going to get a very good hypothesis function.
9239
07:41:57,480 --> 07:41:58,840
We'll get a 1 or a 0.
9240
07:41:58,840 --> 07:42:01,120
But it's probably not accurately going to reflect
9241
07:42:01,120 --> 07:42:04,280
whether we think a day is going to be rainy or not rainy.
9242
07:42:04,280 --> 07:42:06,860
But if we choose the weights right, we can often
9243
07:42:06,860 --> 07:42:09,960
do a pretty good job of trying to estimate whether we think
9244
07:42:09,960 --> 07:42:13,800
the output of the function should be a 1 or a 0.
9245
07:42:13,800 --> 07:42:16,080
And so the question, then, is how to figure out
9246
07:42:16,080 --> 07:42:19,800
what these weights should be, how to be able to tune those parameters.
9247
07:42:19,800 --> 07:42:21,800
And there are a number of ways you can do that.
9248
07:42:21,800 --> 07:42:25,680
One of the most common is known as the perceptron learning rule.
9249
07:42:25,680 --> 07:42:27,160
And we'll see more of this later.
9250
07:42:27,160 --> 07:42:29,120
But the idea of the perceptron learning rule,
9251
07:42:29,120 --> 07:42:30,860
and we're not going to get too deep into the mathematics,
9252
07:42:30,860 --> 07:42:33,240
we'll mostly just introduce it more conceptually,
9253
07:42:33,240 --> 07:42:37,640
is to say that given some data point that we would like to learn from,
9254
07:42:37,640 --> 07:42:41,520
some data point that has an input x and an output y, where
9255
07:42:41,520 --> 07:42:46,180
y is like 1 for rain or 0 for not rain, then we're going to update the weights.
9256
07:42:46,180 --> 07:42:48,100
And we'll look at the formula in just a moment.
9257
07:42:48,100 --> 07:42:51,880
But the big picture idea is that we can start with random weights,
9258
07:42:51,880 --> 07:42:53,720
but then learn from the data.
9259
07:42:53,720 --> 07:42:55,640
Take the data points one at a time.
9260
07:42:55,640 --> 07:42:58,600
And for each one of the data points, figure out, all right,
9261
07:42:58,600 --> 07:43:02,200
what parameters do we need to change inside of the weights
9262
07:43:02,200 --> 07:43:05,120
in order to better match that input point.
9263
07:43:05,120 --> 07:43:07,840
And so that is the value of having access to a lot of data
9264
07:43:07,840 --> 07:43:09,800
in the supervised machine learning algorithm,
9265
07:43:09,800 --> 07:43:13,080
is that you take each of the data points and maybe look at them multiple times
9266
07:43:13,080 --> 07:43:15,600
and constantly try and figure out whether you
9267
07:43:15,600 --> 07:43:19,920
need to shift your weights in order to better create some weight vector that
9268
07:43:19,920 --> 07:43:24,000
is able to correctly or more accurately try to estimate what the output should
9269
07:43:24,000 --> 07:43:25,840
be, whether we think it's going to be raining
9270
07:43:25,840 --> 07:43:28,640
or whether we think it's not going to be raining.
9271
07:43:28,640 --> 07:43:30,360
So what does that weight update look like?
9272
07:43:30,360 --> 07:43:32,240
Without going into too much of the mathematics,
9273
07:43:32,240 --> 07:43:35,960
we're going to update each of the weights to be the result of the original
9274
07:43:35,960 --> 07:43:39,360
weight plus some additional expression.
9275
07:43:39,360 --> 07:43:41,920
And to understand this expression, y, well,
9276
07:43:41,920 --> 07:43:44,720
y is what the actual output is.
9277
07:43:44,720 --> 07:43:50,200
And hypothesis of x, the input, that's going to be what we thought the output
9278
07:43:50,200 --> 07:43:51,000
was.
9279
07:43:51,000 --> 07:43:55,040
And so I can replace this by saying what the actual value was minus what
9280
07:43:55,040 --> 07:43:56,720
our estimate was.
9281
07:43:56,720 --> 07:44:01,360
And based on the difference between the actual value and what our estimate was,
9282
07:44:01,360 --> 07:44:04,120
we might want to change our hypothesis, change the way
9283
07:44:04,120 --> 07:44:06,240
that we do that estimation.
9284
07:44:06,240 --> 07:44:08,800
If the actual value and the estimate were the same thing,
9285
07:44:08,800 --> 07:44:11,440
meaning we were correctly able to predict what category
9286
07:44:11,440 --> 07:44:14,920
this data point belonged to, well, then actual value minus estimate,
9287
07:44:14,920 --> 07:44:18,280
that's just going to be 0, which means this whole term on the right-hand side
9288
07:44:18,280 --> 07:44:20,720
goes to 0, and the weight doesn't change.
9289
07:44:20,720 --> 07:44:24,120
Weight i, where i is like weight 1 or weight 2 or weight 0,
9290
07:44:24,120 --> 07:44:26,440
weight i just stays at weight i.
9291
07:44:26,440 --> 07:44:29,840
And none of the weights change if we were able to correctly predict
9292
07:44:29,840 --> 07:44:32,240
what category the input belonged to.
9293
07:44:32,240 --> 07:44:36,040
But if our hypothesis didn't correctly predict what category the input
9294
07:44:36,040 --> 07:44:40,320
belonged to, well, then maybe we need to make some changes, adjust
9295
07:44:40,320 --> 07:44:43,280
the weights so that we're better able to predict this kind of data
9296
07:44:43,280 --> 07:44:45,040
point in the future.
9297
07:44:45,040 --> 07:44:47,000
And what is the way we might do that?
9298
07:44:47,000 --> 07:44:51,080
Well, if the actual value was bigger than the estimate, then,
9299
07:44:51,080 --> 07:44:54,520
and for now we'll go ahead and assume that these x's are positive values,
9300
07:44:54,520 --> 07:44:57,360
then if the actual value was bigger than the estimate,
9301
07:44:57,360 --> 07:45:00,280
well, that means we need to increase the weight in order
9302
07:45:00,280 --> 07:45:02,480
to make it such that the output is bigger,
9303
07:45:02,480 --> 07:45:06,040
and therefore we're more likely to get to the right actual value.
9304
07:45:06,040 --> 07:45:08,400
And so if the actual value is bigger than the estimate,
9305
07:45:08,400 --> 07:45:11,320
then actual value minus estimate, that'll be a positive number.
9306
07:45:11,320 --> 07:45:14,200
And so you imagine we're just adding some positive number to the weight
9307
07:45:14,200 --> 07:45:16,680
just to increase it ever so slightly.
9308
07:45:16,680 --> 07:45:19,640
And likewise, the inverse case is true, that if the actual value
9309
07:45:19,640 --> 07:45:23,400
was less than the estimate, the actual value was 0,
9310
07:45:23,400 --> 07:45:26,400
but we estimated 1, meaning it actually was not raining,
9311
07:45:26,400 --> 07:45:28,520
but we predicted it was going to be raining.
9312
07:45:28,520 --> 07:45:31,120
Well, then we want to decrease the value of the weight,
9313
07:45:31,120 --> 07:45:33,880
because then in that case, we want to try and lower
9314
07:45:33,880 --> 07:45:36,520
the total value of computing that dot product in order
9315
07:45:36,520 --> 07:45:39,640
to make it less likely that we would predict that it would actually
9316
07:45:39,640 --> 07:45:40,920
be raining.
9317
07:45:40,920 --> 07:45:43,680
So no need to get too deep into the mathematics of that,
9318
07:45:43,680 --> 07:45:46,840
but the general idea is that every time we encounter some data point,
9319
07:45:46,840 --> 07:45:49,600
we can adjust these weights accordingly to try and make
9320
07:45:49,600 --> 07:45:53,760
the weights better line up with the actual data that we have access to.
9321
07:45:53,760 --> 07:45:56,520
And you can repeat this process with data point after data point
9322
07:45:56,520 --> 07:45:58,600
until eventually, hopefully, your algorithm
9323
07:45:58,600 --> 07:46:02,360
converges to some set of weights that do a pretty good job of trying
9324
07:46:02,360 --> 07:46:05,960
to figure out whether a day is going to be rainy or not rainy.
9325
07:46:05,960 --> 07:46:08,640
And just as a final point about this particular equation,
9326
07:46:08,640 --> 07:46:12,400
this value alpha here is generally what we'll call the learning rate.
9327
07:46:12,400 --> 07:46:15,040
It's just some parameter, some number we choose
9328
07:46:15,040 --> 07:46:18,600
for how quickly we're actually going to be updating these weight values.
9329
07:46:18,600 --> 07:46:20,360
So that if alpha is bigger, then we're going
9330
07:46:20,360 --> 07:46:22,280
to update these weight values by a lot.
9331
07:46:22,280 --> 07:46:25,280
And if alpha is smaller, then we'll update the weight values by less.
9332
07:46:25,280 --> 07:46:26,840
And you can choose a value of alpha.
9333
07:46:26,840 --> 07:46:29,080
Depending on the problem, different values
9334
07:46:29,080 --> 07:46:32,880
might suit the situation better or worse than others.
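The update rule just described can be sketched as a single step of code: nudge each weight by alpha times (actual minus estimate) times the corresponding input, so a correct prediction changes nothing and a wrong one shifts the weights toward the right answer. The starting weights, data point, and alpha value below are illustrative.

```python
def perceptron_update(weights, inputs, actual, alpha=0.1):
    """One perceptron learning step: w_i <- w_i + alpha * (actual - estimate) * x_i."""
    x = [1.0] + list(inputs)  # pair w0 with a constant input of 1
    total = sum(wi * xi for wi, xi in zip(weights, x))
    estimate = 1 if total >= 0 else 0
    # If estimate == actual, (actual - estimate) is 0 and the weights are unchanged.
    return [wi + alpha * (actual - estimate) * xi for wi, xi in zip(weights, x)]

# Start with zero weights; the dot product is 0, so the estimate is 1 ("rain").
# The actual label is 0 ("no rain"), so every weight is nudged downward.
new_weights = perceptron_update([0.0, 0.0, 0.0], (0.5, 1.0), actual=0, alpha=0.1)
print(new_weights)  # each weight decreases, lowering future dot products
```

Repeating this step over the data points, possibly several passes, is the training loop that (hopefully) converges to a good weight vector.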
9335
07:46:32,880 --> 07:46:36,360
So after all of that, after we've done this training process of take
9336
07:46:36,360 --> 07:46:38,800
all this data and using this learning rule,
9337
07:46:38,800 --> 07:46:43,160
look at all the pieces of data and use each piece of data as an indication
9338
07:46:43,160 --> 07:46:45,960
to us of do the weights stay the same, do we increase the weights,
9339
07:46:45,960 --> 07:46:48,880
do we decrease the weights, and if so, by how much?
9340
07:46:48,880 --> 07:46:52,840
What you end up with is effectively a threshold function.
9341
07:46:52,840 --> 07:46:56,120
And we can look at what the threshold function looks like like this.
9342
07:46:56,120 --> 07:46:58,800
On the x-axis here, we have the output of that function,
9343
07:46:58,800 --> 07:47:03,080
taking the dot product of the weights with the input.
9344
07:47:03,080 --> 07:47:05,880
And on the y-axis, we have what the output is going to be,
9345
07:47:05,880 --> 07:47:08,880
0, which in this case represented not raining,
9346
07:47:08,880 --> 07:47:11,880
and 1, which in this case represented raining.
9347
07:47:11,880 --> 07:47:16,480
And the way that our hypothesis function works is it calculates this value.
9348
07:47:16,480 --> 07:47:20,320
And if it's greater than 0 or greater than some threshold value,
9349
07:47:20,320 --> 07:47:22,400
then we declare that it's a rainy day.
9350
07:47:22,400 --> 07:47:25,280
And otherwise, we declare that it's a not rainy day.
9351
07:47:25,280 --> 07:47:28,600
And this then graphically is what that function looks like,
9352
07:47:28,600 --> 07:47:32,280
that initially when the value of this dot product is small, it's not raining,
9353
07:47:32,280 --> 07:47:33,800
it's not raining, it's not raining.
9354
07:47:33,800 --> 07:47:36,220
But as soon as it crosses that threshold,
9355
07:47:36,220 --> 07:47:39,600
we suddenly say, OK, now it's raining, now it's raining, now it's raining.
9356
07:47:39,600 --> 07:47:42,160
And the way to interpret this kind of representation
9357
07:47:42,160 --> 07:47:44,600
is that anything on this side of the line, that
9358
07:47:44,600 --> 07:47:47,960
would be the category of data points where we say, yes, it's raining.
9359
07:47:47,960 --> 07:47:49,880
Anything that falls on this side of the line
9360
07:47:49,880 --> 07:47:52,440
are the data points where we would say, it's not raining.
9361
07:47:52,440 --> 07:47:54,920
And again, we want to choose some value for the weights
9362
07:47:54,920 --> 07:47:57,840
that results in a function that does a pretty good job of trying
9363
07:47:57,840 --> 07:48:00,200
to do this estimation.
9364
07:48:00,200 --> 07:48:04,040
But one tricky thing with this type of hard threshold
9365
07:48:04,040 --> 07:48:07,240
is that it only leaves two possible outcomes.
9366
07:48:07,240 --> 07:48:09,800
We plug in some data as input.
9367
07:48:09,800 --> 07:48:13,080
And the output we get is raining or not raining.
9368
07:48:13,080 --> 07:48:15,840
And there's no room for anywhere in between.
9369
07:48:15,840 --> 07:48:17,080
And maybe that's what you want.
9370
07:48:17,080 --> 07:48:19,440
Maybe all you want is given some data point,
9371
07:48:19,440 --> 07:48:22,520
you would like to be able to classify it into one of two or more
9372
07:48:22,520 --> 07:48:24,920
of these various different categories.
9373
07:48:24,920 --> 07:48:28,200
But it might also be the case that you care about knowing
9374
07:48:28,200 --> 07:48:31,040
how strong that prediction is, for example.
9375
07:48:31,040 --> 07:48:34,040
So if we go back to this instance here, where we have rainy days
9376
07:48:34,040 --> 07:48:38,040
on this side of the line, not rainy days on that side of the line,
9377
07:48:38,040 --> 07:48:41,900
you might imagine that let's look now at these two white data points.
9378
07:48:41,900 --> 07:48:46,040
This data point here that we would like to predict a label or a category for.
9379
07:48:46,040 --> 07:48:48,380
And this data point over here that we would also
9380
07:48:48,380 --> 07:48:51,440
like to predict a label or a category for.
9381
07:48:51,440 --> 07:48:53,560
It seems likely that you could pretty confidently
9382
07:48:53,560 --> 07:48:56,360
say that this data point, that should be a rainy day.
9383
07:48:56,360 --> 07:48:58,400
Seems close to the other rainy days if we're
9384
07:48:58,400 --> 07:49:00,240
going by the nearest neighbor strategy.
9385
07:49:00,240 --> 07:49:04,720
It's on this side of the line if we're going by the strategy of just saying,
9386
07:49:04,720 --> 07:49:07,040
which side of the line does it fall on by figuring out
9387
07:49:07,040 --> 07:49:08,600
what those weights should be.
9388
07:49:08,600 --> 07:49:11,520
And if we're using the line strategy of just which side of the line
9389
07:49:11,520 --> 07:49:14,400
does it fall on, which side of this decision boundary,
9390
07:49:14,400 --> 07:49:18,240
well, we'd also say that this point here is also a rainy day
9391
07:49:18,240 --> 07:49:23,560
because it falls on the side of the line that corresponds to rainy days.
9392
07:49:23,560 --> 07:49:25,920
But it's likely that even in this case, we
9393
07:49:25,920 --> 07:49:29,680
would know that we don't feel nearly as confident about this data
9394
07:49:29,680 --> 07:49:33,120
point on the left as compared to this data point on the right.
9395
07:49:33,120 --> 07:49:35,520
That for this one on the right, we can feel very confident
9396
07:49:35,520 --> 07:49:37,000
that yes, it's a rainy day.
9397
07:49:37,000 --> 07:49:41,360
This one, it's pretty close to the line if we're judging just by distance.
9398
07:49:41,360 --> 07:49:44,200
And so you might be less sure.
9399
07:49:44,200 --> 07:49:48,320
But our threshold function doesn't allow for a notion of less sure
9400
07:49:48,320 --> 07:49:50,000
or more sure about something.
9401
07:49:50,000 --> 07:49:51,920
It's what we would call a hard threshold.
9402
07:49:51,920 --> 07:49:55,000
It's once you've crossed this line, then immediately we say,
9403
07:49:55,000 --> 07:49:57,480
yes, this is going to be a rainy day.
9404
07:49:57,480 --> 07:50:00,520
Anywhere before it, we're going to say it's not a rainy day.
9405
07:50:00,520 --> 07:50:03,160
And that may not be helpful in a number of cases.
9406
07:50:03,160 --> 07:50:06,440
One, this is not a particularly easy function to deal with.
9407
07:50:06,440 --> 07:50:08,640
As you get deeper into the world of machine learning
9408
07:50:08,640 --> 07:50:11,280
and are trying to do things like taking derivatives of these curves,
9409
07:51:11,280 --> 07:51:14,160
this type of function makes things challenging.
9410
07:50:14,160 --> 07:50:16,120
But the other challenge is that we don't really
9411
07:50:16,120 --> 07:50:17,960
have any notion of gradation between things.
9412
07:50:17,960 --> 07:50:21,400
We don't have a notion of yes, this is a very strong belief
9413
07:50:21,400 --> 07:50:25,560
that it's going to be raining as opposed to it's probably more likely than not
9414
07:50:25,560 --> 07:50:30,040
that it's going to be raining, but maybe not totally sure about that either.
9415
07:50:30,040 --> 07:50:32,560
So what we can do by taking advantage of a technique known
9416
07:50:32,560 --> 07:50:36,160
as logistic regression is instead of using this hard threshold
9417
07:50:36,160 --> 07:50:39,920
type of function, we can use instead a logistic function, something
9418
07:50:39,920 --> 07:50:41,840
we might call a soft threshold.
9419
07:50:41,840 --> 07:50:45,000
And that's going to transform this into looking something
9420
07:50:45,000 --> 07:50:48,160
a little more like this, something that more nicely curves.
9421
07:50:48,160 --> 07:50:52,760
And as a result, the possible output values are no longer just 0 and 1,
9422
07:50:52,760 --> 07:50:55,000
0 for not raining, 1 for raining.
9423
07:50:55,000 --> 07:50:59,320
But you can actually get any real numbered value between 0 and 1.
9424
07:50:59,320 --> 07:51:03,080
But if you're way over on this side, then you get a value of 0.
9425
07:51:03,080 --> 07:51:05,600
OK, it's not going to be raining, and we're pretty sure about that.
9426
07:51:05,600 --> 07:51:07,680
And if you're over on this side, you get a value of 1.
9427
07:51:07,680 --> 07:51:10,280
And yes, we're very sure that it's going to be raining.
9428
07:51:10,280 --> 07:51:13,040
But in between, you could get some real numbered value,
9429
07:51:13,040 --> 07:51:17,200
where a value like 0.7 might mean we think it's going to rain.
9430
07:51:17,200 --> 07:51:20,680
It's more probable that it's going to rain than not based on the data.
9431
07:51:20,680 --> 07:51:25,080
But we're not as confident as some of the other data points might be.
9432
07:51:25,080 --> 07:51:27,520
So one of the advantages of the soft threshold
9433
07:51:27,520 --> 07:51:30,880
is that it allows us to have an output that could be some real number that
9434
07:51:30,880 --> 07:51:34,400
potentially reflects some sort of probability, the likelihood that we
9435
07:51:34,400 --> 07:51:39,480
think that this particular data point belongs to that particular category.
9436
07:51:39,480 --> 07:51:43,920
And there are some other nice mathematical properties of that as well.
9437
07:51:43,920 --> 07:51:46,100
So that then is two different approaches to trying
9438
07:51:46,100 --> 07:51:48,680
to solve this type of classification problem.
9439
07:51:48,680 --> 07:51:51,400
One is this nearest neighbor type of approach,
9440
07:51:51,400 --> 07:51:54,800
where you just take a data point and look at the data points that are nearby
9441
07:51:54,800 --> 07:51:58,440
to try and estimate what category we think it belongs to.
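The nearest neighbor idea fits in a few lines. This is a minimal 1-nearest-neighbor sketch with made-up points (it uses `math.dist`, available in Python 3.8+):

```python
import math

def nearest_neighbor(point, data):
    """data: list of ((x, y), label) pairs.
    Return the label of the closest labeled point."""
    return min(data, key=lambda item: math.dist(point, item[0]))[1]

examples = [((1, 1), "rain"), ((1, 2), "rain"), ((5, 5), "no rain")]
print(nearest_neighbor((2, 1), examples))  # rain
```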
9442
07:51:58,440 --> 07:52:01,160
And the other approach is the approach of saying, all right,
9443
07:52:01,160 --> 07:52:03,600
let's just try and use linear regression,
9444
07:52:03,600 --> 07:52:06,400
figure out what these weights should be, adjust the weights in order
9445
07:52:06,400 --> 07:52:09,920
to figure out what line or what decision boundary is going
9446
07:52:09,920 --> 07:52:12,720
to best separate these two categories.
9447
07:52:12,720 --> 07:52:15,480
It turns out that another popular approach, a very popular approach
9448
07:52:15,480 --> 07:52:17,440
if you just have a data set and you want to start
9449
07:52:17,440 --> 07:52:20,800
trying to do some learning on it, is what we call the support vector machine.
9450
07:52:20,800 --> 07:52:23,600
And we're not going to go too much into the mathematics of the support vector
9451
07:52:23,600 --> 07:52:26,480
machine, but we'll at least explore it graphically to see what it is
9452
07:52:26,480 --> 07:52:27,600
that it looks like.
9453
07:52:27,600 --> 07:52:31,160
And the idea or the motivation behind the support vector machine
9454
07:52:31,160 --> 07:52:34,320
is the idea that there are actually a lot of different lines
9455
07:52:34,320 --> 07:52:37,000
that we could draw, a lot of different decision boundaries
9456
07:52:37,000 --> 07:52:39,240
that we could draw to separate two groups.
9457
07:52:39,240 --> 07:52:41,960
So for example, I had the red data points over here
9458
07:52:41,960 --> 07:52:43,640
and the blue data points over here.
9459
07:52:43,640 --> 07:52:47,560
One possible line I could draw is a line like this,
9460
07:52:47,560 --> 07:52:50,520
that this line here would separate the red points from the blue points.
9461
07:52:50,520 --> 07:52:51,600
And it does so perfectly.
9462
07:52:51,600 --> 07:52:54,000
All the red points are on one side of the line.
9463
07:52:54,000 --> 07:52:56,760
All the blue points are on the other side of the line.
9464
07:52:56,760 --> 07:52:59,800
But this should probably make you a little bit nervous.
9465
07:52:59,800 --> 07:53:02,240
If you come up with a model and the model comes up
9466
07:53:02,240 --> 07:53:03,760
with a line that looks like this.
9467
07:53:03,760 --> 07:53:06,440
And the reason why is that you worry about how well
9468
07:53:06,440 --> 07:53:10,520
it's going to generalize to other data points that are not necessarily
9469
07:53:10,520 --> 07:53:12,720
in the data set that we have access to.
9470
07:53:12,720 --> 07:53:15,440
For example, if there was a point that fell like right here,
9471
07:53:15,440 --> 07:53:19,400
for example, on the right side of the line, well, then based on that,
9472
07:53:19,400 --> 07:53:23,120
we might want to guess that it is, in fact, a red point,
9473
07:53:23,120 --> 07:53:25,760
but it falls on the side of the line where instead we
9474
07:53:25,760 --> 07:53:29,160
would estimate that it's a blue point instead.
9475
07:53:29,160 --> 07:53:32,680
And so based on that, this line is probably not a great choice
9476
07:53:32,680 --> 07:53:36,600
just because it is so close to these various data points.
9477
07:53:36,600 --> 07:53:38,640
We might instead prefer like a diagonal line
9478
07:53:38,640 --> 07:53:41,680
that just goes diagonally through the data set like we've seen before.
9479
07:53:41,680 --> 07:53:44,800
But there too, there's a lot of diagonal lines that we could draw as well.
9480
07:53:44,800 --> 07:53:48,960
For example, I could draw this diagonal line here, which also successfully
9481
07:53:48,960 --> 07:53:51,720
separates all the red points from all of the blue points.
9482
07:53:51,720 --> 07:53:54,480
From the perspective of something like just trying
9483
07:53:54,480 --> 07:53:56,400
to figure out some setting of weights that allows
9484
07:53:56,400 --> 07:53:58,680
us to predict the correct output, this line
9485
07:53:58,680 --> 07:54:02,000
will predict the correct output for this particular set of data
9486
07:54:02,000 --> 07:54:04,800
every single time because the red points are on one side,
9487
07:54:04,800 --> 07:54:06,400
the blue points are on the other.
9488
07:54:06,400 --> 07:54:08,640
But yet again, you should probably be a little nervous
9489
07:54:08,640 --> 07:54:11,480
because this line is so close to these red points,
9490
07:54:11,480 --> 07:54:15,280
even though we're able to correctly predict on the input data,
9491
07:54:15,280 --> 07:54:18,840
if there was a point that fell somewhere in this general area,
9492
07:54:18,840 --> 07:54:22,720
our algorithm, this model, would say that, yeah, we think it's a blue point,
9493
07:54:22,720 --> 07:54:26,880
when in actuality, it might belong to the red category instead
9494
07:54:26,880 --> 07:54:29,760
just because it looks like it's close to the other red points.
9495
07:54:29,760 --> 07:54:33,600
What we really want, to be able to generalize as best as possible
9496
07:54:33,600 --> 07:54:37,160
given this data, is to come up with a line like this that
9497
07:54:37,160 --> 07:54:39,240
seems like the intuitive line to draw.
9498
07:54:39,240 --> 07:54:41,680
And the reason why it's intuitive is because it
9499
07:54:41,680 --> 07:54:47,240
seems to be as far apart as possible from the red data and the blue data.
9500
07:54:47,240 --> 07:54:49,280
So that if we generalize a little bit and assume
9501
07:54:49,280 --> 07:54:51,920
that maybe we have some points that are different from the input
9502
07:54:51,920 --> 07:54:54,120
but still slightly further away, we can still
9503
07:54:54,120 --> 07:54:58,000
say that something on this side probably red, something on that side
9504
07:54:58,000 --> 07:55:01,480
probably blue, and we can make those judgments that way.
9505
07:55:01,480 --> 07:55:04,040
And that is what support vector machines are designed to do.
9506
07:55:04,040 --> 07:55:08,520
They're designed to try and find what we call the maximum margin separator,
9507
07:55:08,520 --> 07:55:10,720
where the maximum margin separator is just
9508
07:55:10,720 --> 07:55:14,680
some boundary that maximizes the distance between the groups of points
9509
07:55:14,680 --> 07:55:16,600
rather than come up with some boundary that's
9510
07:55:16,600 --> 07:55:19,480
very close to one set or the other, where in the case
9511
07:55:19,480 --> 07:55:20,720
before, we wouldn't have cared.
9512
07:55:20,720 --> 07:55:24,000
As long as we're categorizing the input well, that seems to be all we need to do.
9513
07:55:24,000 --> 07:55:28,720
The support vector machine will try and find this maximum margin separator,
9514
07:55:28,720 --> 07:55:31,920
some way of trying to maximize that particular distance.
9515
07:55:31,920 --> 07:55:35,520
And it does so by finding what we call the support vectors, which
9516
07:55:35,520 --> 07:55:37,600
are the vectors that are closest to the line,
9517
07:55:37,600 --> 07:55:40,880
and trying to maximize the distance between the line
9518
07:55:40,880 --> 07:55:42,760
and those particular points.
9519
07:55:42,760 --> 07:55:44,520
And it works that way in two dimensions.
9520
07:55:44,520 --> 07:55:46,520
It also works in higher dimensions, where we're not
9521
07:55:46,520 --> 07:55:49,560
looking for some line that separates the two data points,
9522
07:55:49,560 --> 07:55:52,640
but instead looking for what we generally call a hyperplane,
9523
07:55:52,640 --> 07:55:57,760
some decision boundary, effectively, that separates one set of data
9524
07:55:57,760 --> 07:55:59,080
from the other set of data.
9525
07:55:59,080 --> 07:56:00,800
And this ability of support vector machines
9526
07:56:00,800 --> 07:56:04,000
to work in higher dimensions actually has a number of other applications
9527
07:56:04,000 --> 07:56:04,600
as well.
9528
07:56:04,600 --> 07:56:07,520
But one is that it helpfully deals with cases
9529
07:56:07,520 --> 07:56:10,560
where data may not be linearly separable.
9530
07:56:10,560 --> 07:56:12,720
So we talked about linear separability before,
9531
07:56:12,720 --> 07:56:16,880
this idea that you can take data and just draw a line or some linear
9532
07:56:16,880 --> 07:56:20,040
combination of the inputs that allows us to perfectly separate
9533
07:56:20,040 --> 07:56:21,560
the two sets from each other.
9534
07:56:21,560 --> 07:56:24,880
There are some data sets that are not linearly separable.
9535
07:56:24,880 --> 07:56:26,560
And for some, that's true even in two dimensions.
9536
07:56:26,560 --> 07:56:29,760
You would not be able to find a good line at all
9537
07:56:29,760 --> 07:56:32,200
that would try to do that kind of separation.
9538
07:56:32,200 --> 07:56:34,320
Something like this, for example.
9539
07:56:34,320 --> 07:56:37,440
Or if you imagine here are the red points and the blue points
9540
07:56:37,440 --> 07:56:38,720
around it.
9541
07:56:38,720 --> 07:56:43,480
If you try to find a line that divides the red points from the blue points,
9542
07:56:43,480 --> 07:56:45,920
it's actually going to be difficult, if not impossible,
9543
07:56:45,920 --> 07:56:49,480
to do. Whatever line you choose, well, if you draw a line here,
9544
07:56:49,480 --> 07:56:52,160
then you misclassify all of these blue points that should actually
9545
07:56:52,160 --> 07:56:53,360
be blue and not red.
9546
07:56:53,360 --> 07:56:56,160
Anywhere else you draw a line, there's going to be a lot of error,
9547
07:56:56,160 --> 07:56:58,200
a lot of mistakes, a lot of what we'll soon
9548
07:56:58,200 --> 07:57:02,360
call loss to that line that you draw, a lot of points
9549
07:57:02,360 --> 07:57:04,960
that you're going to categorize incorrectly.
9550
07:57:04,960 --> 07:57:08,080
What we really want is to be able to find a better decision boundary that
9551
07:57:08,080 --> 07:57:12,680
may not be just a straight line through this two dimensional space.
9552
07:57:12,680 --> 07:57:14,760
And what support vector machines can do is
9553
07:57:14,760 --> 07:57:16,860
they can begin to operate in higher dimensions
9554
07:57:16,860 --> 07:57:19,840
and be able to find some other decision boundary,
9555
07:57:19,840 --> 07:57:21,800
like the circle in this case, that actually
9556
07:57:21,800 --> 07:57:24,880
is able to separate one of these sets of data
9557
07:57:24,880 --> 07:57:26,800
from the other set of data a lot better.
9558
07:57:26,800 --> 07:57:30,400
So oftentimes in data sets where the data is not linearly separable,
9559
07:57:30,400 --> 07:57:33,080
support vector machines by working in higher dimensions
9560
07:57:33,080 --> 07:57:37,240
can actually figure out a way to solve that kind of problem effectively.
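A sketch of the higher-dimension idea for circle-like data: add a third feature, the squared distance from the origin, and a flat threshold in that new dimension separates the groups that no line in the original plane could. The points are made up, and this illustrates the feature-lifting idea rather than a full support vector machine:

```python
# "Red" points inside a circle, "blue" points outside it (illustrative data).
inner = [(0.5, 0.0), (-0.3, 0.4), (0.0, -0.6)]
outer = [(2.0, 0.1), (-1.8, 1.2), (0.3, -2.5)]

def lift(point):
    """Map (x, y) into three dimensions: (x, y, x**2 + y**2)."""
    x, y = point
    return (x, y, x * x + y * y)

# In the lifted space, the plane z = 1 separates the two groups:
# every inner point has z < 1, every outer point has z > 1.
print(all(lift(p)[2] < 1 for p in inner))  # True
print(all(lift(p)[2] > 1 for p in outer))  # True
```

Back in the original two dimensions, that flat plane corresponds to a circular decision boundary, like the one in the lecture's picture.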
9561
07:57:37,240 --> 07:57:39,600
So that then, three different approaches to trying
9562
07:57:39,600 --> 07:57:41,280
to solve these sorts of problems.
9563
07:57:41,280 --> 07:57:42,960
We've seen support vector machines.
9564
07:57:42,960 --> 07:57:46,640
We've seen trying to use linear regression and the perceptron learning
9565
07:57:46,640 --> 07:57:49,840
rule to be able to figure out how to categorize inputs and outputs.
9566
07:57:49,840 --> 07:57:51,520
We've seen the nearest neighbor approach.
9567
07:57:51,520 --> 07:57:54,160
No one approach is necessarily better than any other, again.
9568
07:57:54,160 --> 07:57:57,440
It's going to depend on the data set, the information you have access to.
9569
07:57:57,440 --> 07:58:00,560
It's going to depend on what the function looks like that you're ultimately
9570
07:58:00,560 --> 07:58:01,280
trying to predict.
9571
07:58:01,280 --> 07:58:04,080
And this is where a lot of research and experimentation
9572
07:58:04,080 --> 07:58:06,600
can be involved in trying to figure out how it
9573
07:58:06,600 --> 07:58:09,640
is to best perform that kind of estimation.
9574
07:58:09,640 --> 07:58:12,180
But classification is only one of the tasks
9575
07:58:12,180 --> 07:58:14,720
that you might encounter in supervised machine learning.
9576
07:58:14,720 --> 07:58:17,720
Because in classification, what we're trying to predict
9577
07:58:17,720 --> 07:58:19,520
is some discrete category.
9578
07:58:19,520 --> 07:58:22,800
We're trying to predict red or blue, rain or not rain,
9579
07:58:22,800 --> 07:58:24,920
authentic or counterfeit.
9580
07:58:24,920 --> 07:58:28,360
But sometimes what we want to predict is a real numbered value.
9581
07:58:28,360 --> 07:58:31,280
And for that, we have a related problem, not classification,
9582
07:58:31,280 --> 07:58:33,440
but instead known as regression.
9583
07:58:33,440 --> 07:58:35,880
And regression is the supervised learning problem
9584
07:58:35,880 --> 07:58:39,680
where we try and learn a function mapping inputs to outputs same as before.
9585
07:58:39,680 --> 07:58:43,000
But instead of the outputs being discrete categories, things
9586
07:58:43,000 --> 07:58:46,160
like rain or not rain, in a regression problem,
9587
07:58:46,160 --> 07:58:50,520
the output values are generally continuous values, some real number
9588
07:58:50,520 --> 07:58:51,960
that we would like to predict.
9589
07:58:51,960 --> 07:58:53,480
This happens all the time as well.
9590
07:58:53,480 --> 07:58:55,680
You might imagine that a company might take this approach
9591
07:58:55,680 --> 07:58:58,080
if it's trying to figure out, for instance, what
9592
07:58:58,080 --> 07:58:59,840
the effect of its advertising is.
9593
07:58:59,840 --> 07:59:02,800
How do advertising dollars spent translate
9594
07:59:02,800 --> 07:59:05,960
into sales for the company's product, for example?
9595
07:59:05,960 --> 07:59:08,960
And so they might like to try to predict some function that
9596
07:59:08,960 --> 07:59:11,680
takes as input the amount of money spent on advertising.
9597
07:59:11,680 --> 07:59:13,160
And here, we're just going to use one input.
9598
07:59:13,160 --> 07:59:15,900
But again, you could scale this up to many more inputs as well
9599
07:59:15,900 --> 07:59:18,720
if you have a lot of different kinds of data you have access to.
9600
07:59:18,720 --> 07:59:21,400
And the goal is to learn a function that given this amount of spending
9601
07:59:21,400 --> 07:59:23,800
on advertising, we're going to get this amount in sales.
9602
07:59:23,800 --> 07:59:27,040
And you might judge, based on having access to a whole bunch of data,
9603
07:59:27,040 --> 07:59:30,760
like for every past month, here is how much we spent on advertising,
9604
07:59:30,760 --> 07:59:32,320
and here is what sales were.
9605
07:59:32,320 --> 07:59:36,280
And we would like to predict some sort of hypothesis function
9606
07:59:36,280 --> 07:59:39,200
that, again, given the amount spent on advertising,
9607
07:59:39,200 --> 07:59:43,000
we can predict, in this case, some real number, some number estimate
9608
07:59:43,000 --> 07:59:47,800
of how much sales we expect that company to do in this month
9609
07:59:47,800 --> 07:59:49,880
or in this quarter or whatever unit of time
9610
07:59:49,880 --> 07:59:51,920
we're choosing to measure things in.
9611
07:59:51,920 --> 07:59:54,760
And so again, the approach to solving this type of problem,
9612
07:59:54,760 --> 07:59:58,760
we could try using a linear regression type approach where we take this data
9613
07:59:58,760 --> 07:59:59,960
and we just plot it.
9614
07:59:59,960 --> 08:00:02,680
On the x-axis, we have advertising dollars spent.
9615
08:00:02,680 --> 08:00:04,440
On the y-axis, we have sales.
9616
08:00:04,440 --> 08:00:07,080
And we might just want to try and draw a line that
9617
08:00:07,080 --> 08:00:09,600
does a pretty good job of trying to estimate
9618
08:00:09,600 --> 08:00:12,880
this relationship between advertising and sales.
9619
08:00:12,880 --> 08:00:14,760
And in this case, unlike before, we're not
9620
08:00:14,760 --> 08:00:17,960
trying to separate the data points into discrete categories.
9621
08:00:17,960 --> 08:00:19,760
But instead, in this case, we're just trying
9622
08:00:19,760 --> 08:00:24,360
to find a line that approximates this relationship between advertising
9623
08:00:24,360 --> 08:00:27,760
and sales so that if we want to figure out what the estimated sales are
9624
08:00:27,760 --> 08:00:31,440
for a particular advertising budget, you just look it up in this line,
9625
08:00:31,440 --> 08:00:33,360
figure out for this amount of advertising,
9626
08:00:33,360 --> 08:00:35,680
we would have this amount of sales and just try
9627
08:00:35,680 --> 08:00:37,440
and make the estimate that way.
9628
08:00:37,440 --> 08:00:39,720
And so you can try and come up with a line, again,
9629
08:00:39,720 --> 08:00:42,760
figuring out how to modify the weights using various different techniques
9630
08:00:42,760 --> 08:00:47,800
to try and make it so that this line fits as well as possible.
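For a single input, one standard way to fit that line is the closed-form least-squares solution, which minimizes the total squared error. A sketch; the advertising and sales numbers are invented for illustration:

```python
# Least-squares fit of a line: sales = slope * spending + intercept.

def fit_line(xs, ys):
    """Closed-form simple linear regression (minimizes squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

spending = [1000, 2000, 3000, 4000]   # advertising dollars per month
sales    = [2500, 4500, 6500, 8500]   # sales per month

w1, w0 = fit_line(spending, sales)
# Use the fitted line to estimate sales for a new advertising budget.
print(w1 * 2500 + w0)  # 5500.0
```

With more inputs, the same idea generalizes to adjusting a whole vector of weights, as in the earlier classification setup.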
9631
08:00:47,800 --> 08:00:51,040
So with all of these approaches, then, to trying to solve machine learning
9632
08:00:51,040 --> 08:00:54,840
style problems, the question becomes, how do we evaluate these approaches?
9633
08:00:54,840 --> 08:00:58,160
How do we evaluate the various different hypotheses
9634
08:00:58,160 --> 08:00:59,280
that we could come up with?
9635
08:00:59,280 --> 08:01:02,800
Because each of these algorithms will give us some sort of hypothesis,
9636
08:01:02,800 --> 08:01:05,520
some function that maps inputs to outputs,
9637
08:01:05,520 --> 08:01:09,640
and we want to know, how well does that function work?
9638
08:01:09,640 --> 08:01:11,920
And you can think of evaluating these hypotheses
9639
08:01:11,920 --> 08:01:16,400
and trying to get a better hypothesis as kind of like an optimization problem.
9640
08:01:16,400 --> 08:01:19,400
In an optimization problem, as you recall from before,
9641
08:01:19,400 --> 08:01:23,000
we were either trying to maximize some objective function
9642
08:01:23,000 --> 08:01:26,440
by trying to find a global maximum, or we
9643
08:01:26,440 --> 08:01:30,240
were trying to minimize some cost function by trying to find some global
9644
08:01:30,240 --> 08:01:31,040
minimum.
9645
08:01:31,040 --> 08:01:34,800
And in the case of evaluating these hypotheses, one thing we might say
9646
08:01:34,800 --> 08:01:38,200
is that this cost function, the thing we're trying to minimize,
9647
08:01:38,200 --> 08:01:42,120
we might be trying to minimize what we would call a loss function.
9648
08:01:42,120 --> 08:01:44,560
And what a loss function is, is it is a function
9649
08:01:44,560 --> 08:01:49,120
that is going to estimate for us how poorly our function performs.
9650
08:01:49,120 --> 08:01:51,160
More formally, it's like a loss of utility
9651
08:01:51,160 --> 08:01:55,680
by whenever we predict something that is wrong, that is a loss of utility.
9652
08:01:55,680 --> 08:01:59,360
That's going to add to the output of our loss function.
9653
08:01:59,360 --> 08:02:01,120
And you could come up with any loss function
9654
08:02:01,120 --> 08:02:03,960
that you want, just some mathematical way of estimating,
9655
08:02:03,960 --> 08:02:06,960
given each of these data points, given what the actual output is,
9656
08:02:06,960 --> 08:02:10,040
and given what our projected output is, our estimate,
9657
08:02:10,040 --> 08:02:12,800
you could calculate some sort of numerical loss for it.
9658
08:02:12,800 --> 08:02:14,920
But there are a couple of popular loss functions
9659
08:02:14,920 --> 08:02:18,160
that are worth discussing, just so that you've seen them before.
9660
08:02:18,160 --> 08:02:21,680
When it comes to discrete categories, things like rain or not rain,
9661
08:02:21,680 --> 08:02:26,520
counterfeit or not counterfeit, one approach is the 0, 1 loss function.
9662
08:02:26,520 --> 08:02:29,520
And the way that works is for each of the data points,
9663
08:02:29,520 --> 08:02:32,720
our loss function takes as input what the actual output is,
9664
08:02:32,720 --> 08:02:35,240
like whether it was actually raining or not raining,
9665
08:02:35,240 --> 08:02:37,560
and takes our prediction into account.
9666
08:02:37,560 --> 08:02:41,920
Did we predict, given this data point, that it was raining or not raining?
9667
08:02:41,920 --> 08:02:45,800
And if the actual value equals the prediction, well, then the 0, 1 loss
9668
08:02:45,800 --> 08:02:47,480
function will just say the loss is 0.
9669
08:02:47,480 --> 08:02:51,800
There was no loss of utility, because we were able to predict correctly.
9670
08:02:51,800 --> 08:02:54,760
And otherwise, if the actual value was not the same thing
9671
08:02:54,760 --> 08:02:58,160
as what we predicted, well, then in that case, our loss is 1.
9672
08:02:58,160 --> 08:03:01,800
We lost something, lost some utility, because what we predicted
9673
08:03:01,800 --> 08:03:05,480
was the output of the function, was not what it actually was.
9674
08:03:05,480 --> 08:03:07,360
And the goal, then, in a situation like this
9675
08:03:07,360 --> 08:03:11,160
would be to come up with some hypothesis that minimizes
9676
08:03:11,160 --> 08:03:14,520
the total empirical loss, the total amount that we've lost,
9677
08:03:14,520 --> 08:03:17,960
if you add up for all these data points what the actual output is
9678
08:03:17,960 --> 08:03:21,000
and what your hypothesis would have predicted.
9679
08:03:21,000 --> 08:03:24,520
So in this case, for example, if we go back to classifying days as raining
9680
08:03:24,520 --> 08:03:27,600
or not raining, and we came up with this decision boundary,
9681
08:03:27,600 --> 08:03:29,560
how would we evaluate this decision boundary?
9682
08:03:29,560 --> 08:03:33,360
How much better is it than drawing the line here or drawing the line there?
9683
08:03:33,360 --> 08:03:35,680
Well, we could take each of the input data points,
9684
08:03:35,680 --> 08:03:38,680
and each input data point has a label, whether it was raining
9685
08:03:38,680 --> 08:03:40,120
or whether it was not raining.
9686
08:03:40,120 --> 08:03:41,960
And we could compare it to the prediction,
9687
08:03:41,960 --> 08:03:44,440
whether we predicted it would be raining or not raining,
9688
08:03:44,440 --> 08:03:47,920
and assign it a numerical value as a result.
9689
08:03:47,920 --> 08:03:51,560
So for example, these points over here, they were all rainy days,
9690
08:03:51,560 --> 08:03:53,360
and we predicted they would be raining, because they
9691
08:03:53,360 --> 08:03:55,080
fall on the bottom side of the line.
9692
08:03:55,080 --> 08:03:58,400
So they have a loss of 0, nothing lost from those situations.
9693
08:03:58,400 --> 08:04:01,080
And likewise, same is true for some of these points over here,
9694
08:04:01,080 --> 08:04:05,160
where it was not raining and we predicted it would not be raining either.
9695
08:04:05,160 --> 08:04:09,760
Where we do have loss are points like this point here and that point there,
9696
08:04:09,760 --> 08:04:13,000
where we predicted that it would not be raining,
9697
08:04:13,000 --> 08:04:14,680
but in actuality, it's a blue point.
9698
08:04:14,680 --> 08:04:15,760
It was raining.
9699
08:04:15,760 --> 08:04:18,840
Or likewise here, we predicted that it would be raining,
9700
08:04:18,840 --> 08:04:20,760
but in actuality, it's a red point.
9701
08:04:20,760 --> 08:04:21,960
It was not raining.
9702
08:04:21,960 --> 08:04:25,160
And so as a result, we miscategorized these data points
9703
08:04:25,160 --> 08:04:27,120
that we were trying to train on.
9704
08:04:27,120 --> 08:04:29,240
And as a result, there is some loss here.
9705
08:04:29,240 --> 08:04:33,000
One loss here, there, here, and there, for a total loss of 4,
9706
08:04:33,000 --> 08:04:34,840
for example, in this case.
9707
08:04:34,840 --> 08:04:37,680
And that might be how we would estimate or how we would say
9708
08:04:37,680 --> 08:04:41,560
that this line is better than a line that goes somewhere else
9709
08:04:41,560 --> 08:04:45,680
or a line that's further down, because this line might minimize the loss.
9710
08:04:45,680 --> 08:04:50,040
So there is no way to do better than just these four points of loss
9711
08:04:50,040 --> 08:04:54,000
if you're just drawing a straight line through our space.
9712
08:04:54,000 --> 08:04:56,280
So the 0-1 loss function checks:
9713
08:04:56,280 --> 08:04:57,040
Did we get it right?
9714
08:04:57,040 --> 08:04:57,960
Did we get it wrong?
9715
08:04:57,960 --> 08:05:00,600
If we got it right, the loss is 0, nothing lost.
9716
08:05:00,600 --> 08:05:04,400
If we got it wrong, then our loss function for that data point says 1.
9717
08:05:04,400 --> 08:05:07,680
And we add up all of those losses across all of our data points
9718
08:05:07,680 --> 08:05:10,240
to get some sort of empirical loss, how much we
9719
08:05:10,240 --> 08:05:13,360
have lost across all of these original data points
9720
08:05:13,360 --> 08:05:16,360
that our algorithm had access to.
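The 0-1 loss just described can be sketched in a few lines of Python. This is a minimal illustration with made-up labels (the function name and data are mine, not from the course code):

```python
def zero_one_loss(actual, predicted):
    """0-1 loss: add 1 for each misclassified point, 0 for each correct one."""
    return sum(0 if a == p else 1 for a, p in zip(actual, predicted))

# Four of these ten labels are predicted incorrectly, matching the
# "total loss of 4" from the example above.
actual    = ["rain", "rain", "rain", "rain", "no", "no", "no", "no", "rain", "no"]
predicted = ["rain", "rain", "rain", "no",   "no", "no", "no", "rain", "no", "rain"]
print(zero_one_loss(actual, predicted))  # 4
```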
9721
08:05:16,360 --> 08:05:19,480
There are other forms of loss as well that work especially well when
9722
08:05:19,480 --> 08:05:21,920
we deal with more real-valued cases, cases
9723
08:05:21,920 --> 08:05:24,840
like the mapping between advertising budget and amount
9724
08:05:24,840 --> 08:05:26,680
that we do in sales, for example.
9725
08:05:26,680 --> 08:05:30,720
Because in that case, you care not just that you get the number exactly right,
9726
08:05:30,720 --> 08:05:33,600
but you care how close you were to the actual value.
9727
08:05:33,600 --> 08:05:37,640
If the actual value is you did like $2,800 in sales
9728
08:05:37,640 --> 08:05:40,880
and you predicted that you would do $2,900 in sales,
9729
08:05:40,880 --> 08:05:42,160
maybe that's pretty good.
9730
08:05:42,160 --> 08:05:45,320
That's much better than if you had predicted you'd do $1,000 in sales,
9731
08:05:45,320 --> 08:05:46,480
for example.
9732
08:05:46,480 --> 08:05:48,760
And so we would like our loss function to be
9733
08:05:48,760 --> 08:05:53,200
able to take that into account as well, take into account not just
9734
08:05:53,200 --> 08:05:57,640
whether the actual value and the expected value are exactly the same,
9735
08:05:57,640 --> 08:06:01,800
but also take into account how far apart they were.
9736
08:06:01,800 --> 08:06:05,360
And so for that, one approach is what we call L1 loss.
9737
08:06:05,360 --> 08:06:08,040
L1 loss doesn't just look at whether actual and predicted
9738
08:06:08,040 --> 08:06:11,980
are equal to each other, but we take the absolute value
9739
08:06:11,980 --> 08:06:15,000
of the actual value minus the predicted value.
9740
08:06:15,000 --> 08:06:19,280
In other words, we just ask how far apart were the actual and predicted
9741
08:06:19,280 --> 08:06:23,000
values, and we sum that up across all of the data points
9742
08:06:23,000 --> 08:06:26,800
to be able to get what our answer ultimately is.
9743
08:06:26,800 --> 08:06:29,600
So what might this actually look like for our data set?
9744
08:06:29,600 --> 08:06:31,520
Well, if we go back to this representation
9745
08:06:31,520 --> 08:06:35,640
where we had advertising along the x-axis, sales along the y-axis,
9746
08:06:35,640 --> 08:06:38,840
our line was our prediction, our estimate for any given
9747
08:06:38,840 --> 08:06:42,920
amount of advertising, what we predicted sales was going to be.
9748
08:06:42,920 --> 08:06:48,240
And our L1 loss is just how far apart vertically along the sales axis
9749
08:06:48,240 --> 08:06:51,000
our prediction was from each of the data points.
9750
08:06:51,000 --> 08:06:53,240
So we could figure out exactly how far apart
9751
08:06:53,240 --> 08:06:55,200
our prediction was from each of the data points
9752
08:06:55,200 --> 08:06:59,120
and figure out as a result of that what our loss is overall
9753
08:06:59,120 --> 08:07:02,160
for this particular hypothesis just by adding up
9754
08:07:02,160 --> 08:07:05,440
all of these various different individual losses for each of these data
9755
08:07:05,440 --> 08:07:06,040
points.
9756
08:07:06,040 --> 08:07:08,720
And our goal then is to try and minimize that loss,
9757
08:07:08,720 --> 08:07:13,480
to try and come up with some line that minimizes what the empirical loss is
9758
08:07:13,480 --> 08:07:16,200
by judging how far away our estimated amount of sales
9759
08:07:16,200 --> 08:07:18,920
is from the actual amount of sales.
9760
08:07:18,920 --> 08:07:21,080
And turns out there are other loss functions as well.
9761
08:07:21,080 --> 08:07:23,680
One that's quite popular is the L2 loss.
9762
08:07:23,680 --> 08:07:26,760
The L2 loss, instead of just using the absolute value,
9763
08:07:26,760 --> 08:07:30,280
like how far away the actual value is from the predicted value,
9764
08:07:30,280 --> 08:07:33,280
it uses the square of actual minus predicted.
9765
08:07:33,280 --> 08:07:36,160
So how far apart are the actual and predicted value?
9766
08:07:36,160 --> 08:07:41,520
And it squares that value, effectively penalizing much more harshly anything
9767
08:07:41,520 --> 08:07:43,120
that is a worse prediction.
9768
08:07:43,120 --> 08:07:45,560
So you imagine if you have two data points
9769
08:07:45,560 --> 08:07:50,080
that you predict as being one value away from their actual value,
9770
08:07:50,080 --> 08:07:53,760
as opposed to one data point that you predict as being two away
9771
08:07:53,760 --> 08:07:56,840
from its actual value, the L2 loss function
9772
08:07:56,840 --> 08:08:00,120
will more harshly penalize that one that is two away,
9773
08:08:00,120 --> 08:08:03,040
because it's going to square however much the difference is
9774
08:08:03,040 --> 08:08:05,360
between the actual value and the predicted value.
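The L1 and L2 losses described above can be sketched directly; the dollar figures below are illustrative, echoing the sales example (these helper functions are mine, not the course's):

```python
def l1_loss(actual, predicted):
    """L1 loss: sum of absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def l2_loss(actual, predicted):
    """L2 loss: sum of squared differences, penalizing bigger misses more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Two predictions each off by 1 vs. one prediction off by 2:
print(l1_loss([2800, 2800], [2799, 2801]))  # 2
print(l1_loss([2800], [2802]))              # 2  (L1 treats these the same)
print(l2_loss([2800, 2800], [2799, 2801]))  # 2
print(l2_loss([2800], [2802]))              # 4  (L2 penalizes the one big miss)
```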
9775
08:08:05,360 --> 08:08:07,280
And depending on the situation, you might
9776
08:08:07,280 --> 08:08:10,440
want to choose a loss function depending on what you care about minimizing.
9777
08:08:10,440 --> 08:08:14,040
If you really care about minimizing the error on more outlier cases,
9778
08:08:14,040 --> 08:08:15,880
then you might want to consider something like this.
9779
08:08:15,880 --> 08:08:18,040
But if you've got a lot of outliers, and you don't necessarily
9780
08:08:18,040 --> 08:08:21,560
care about modeling them, then maybe an L1 loss function is preferable.
9781
08:08:21,560 --> 08:08:23,720
But there are trade-offs here that you need to decide,
9782
08:08:23,720 --> 08:08:26,560
based on a particular set of data.
9783
08:08:26,560 --> 08:08:29,480
But what you do run the risk of with any of these loss functions,
9784
08:08:29,480 --> 08:08:33,320
with anything that we're trying to do, is a problem known as overfitting.
9785
08:08:33,320 --> 08:08:36,320
And overfitting is a big problem that you can encounter in machine learning,
9786
08:08:36,320 --> 08:08:41,280
which happens anytime a model fits too closely with a data set,
9787
08:08:41,280 --> 08:08:44,360
and as a result, fails to generalize.
9788
08:08:44,360 --> 08:08:48,040
We would like our model to be able to accurately predict
9789
08:08:48,040 --> 08:08:52,280
data and inputs and output pairs for the data that we have access to.
9790
08:08:52,280 --> 08:08:55,160
But the reason we wanted to do so is because we
9791
08:08:55,160 --> 08:08:59,520
want our model to generalize well to data that we haven't seen before.
9792
08:08:59,520 --> 08:09:01,760
I would like to take data from the past year
9793
08:09:01,760 --> 08:09:03,760
of whether it was raining or not raining,
9794
08:09:03,760 --> 08:09:06,360
and use that data to generalize it towards the future.
9795
08:09:06,360 --> 08:09:09,080
Say, in the future, is it going to be raining or not raining?
9796
08:09:09,080 --> 08:09:12,520
Or if I have a whole bunch of data on what counterfeit and not counterfeit
9797
08:09:12,520 --> 08:09:16,560
US dollar bills look like in the past when people have encountered them,
9798
08:09:16,560 --> 08:09:19,440
I'd like to train a computer to be able to, in the future,
9799
08:09:19,440 --> 08:09:24,840
generalize to other dollar bills that I might see as well.
9800
08:09:24,840 --> 08:09:28,080
And the problem with overfitting is that if you try and tie yourself
9801
08:09:28,080 --> 08:09:32,240
too closely to the data set that you're training your model on,
9802
08:09:32,240 --> 08:09:35,000
you can end up not generalizing very well.
9803
08:09:35,000 --> 08:09:36,120
So what does this look like?
9804
08:09:36,120 --> 08:09:38,520
Well, we might imagine the rainy day and not rainy day
9805
08:09:38,520 --> 08:09:41,640
example again from here, where the blue points indicate rainy days
9806
08:09:41,640 --> 08:09:43,920
and the red points indicate not rainy days.
9807
08:09:43,920 --> 08:09:47,160
And we decided that we felt pretty comfortable with drawing a line
9808
08:09:47,160 --> 08:09:52,000
like this as the decision boundary between rainy days and not rainy days.
9809
08:09:52,000 --> 08:09:55,000
So we can pretty comfortably say that points on this side
9810
08:09:55,000 --> 08:09:57,960
more likely to be rainy days, points on that side more
9811
08:09:57,960 --> 08:09:59,800
likely to be not rainy days.
9812
08:09:59,800 --> 08:10:04,360
But the loss, the empirical loss, isn't zero in this particular case
9813
08:10:04,360 --> 08:10:07,040
because we didn't categorize everything perfectly.
9814
08:10:07,040 --> 08:10:10,600
There was this one outlier, this one day that it wasn't raining,
9815
08:10:10,600 --> 08:10:13,520
but yet our model still predicts that it is raining.
9816
08:10:13,520 --> 08:10:15,640
But that doesn't necessarily mean our model is bad.
9817
08:10:15,640 --> 08:10:18,760
It just means the model isn't 100% accurate.
9818
08:10:18,760 --> 08:10:21,620
If you really wanted to try and find a hypothesis that
9819
08:10:21,620 --> 08:10:25,000
resulted in minimizing the loss, you could come up
9820
08:10:25,000 --> 08:10:26,500
with a different decision boundary.
9821
08:10:26,500 --> 08:10:30,040
It wouldn't be a line, but it would look something like this.
9822
08:10:30,040 --> 08:10:34,040
This decision boundary does separate all of the red points
9823
08:10:34,040 --> 08:10:37,720
from all of the blue points because the red points fall
9824
08:10:37,720 --> 08:10:40,320
on this side of this decision boundary, the blue points
9825
08:10:40,320 --> 08:10:42,640
fall on the other side of the decision boundary.
9826
08:10:42,640 --> 08:10:47,480
But this, we would probably argue, is not as good of a prediction.
9827
08:10:47,480 --> 08:10:50,400
Even though it seems to be more accurate based
9828
08:10:50,400 --> 08:10:53,120
on all of the available training data that we
9829
08:10:53,120 --> 08:10:55,520
have for training this machine learning model,
9830
08:10:55,520 --> 08:10:58,280
we might say that it's probably not going to generalize well.
9831
08:10:58,280 --> 08:11:00,680
That if there were other data points like here and there,
9832
08:11:00,680 --> 08:11:03,600
we might still want to consider those to be rainy days
9833
08:11:03,600 --> 08:11:06,600
because we think this was probably just an outlier.
9834
08:11:06,600 --> 08:11:10,400
So if the only thing you care about is minimizing the loss on the data
9835
08:11:10,400 --> 08:11:13,280
you have available to you, you run the risk of overfitting.
9836
08:11:13,280 --> 08:11:15,480
And this can happen in the classification case.
9837
08:11:15,480 --> 08:11:18,400
It can also happen in the regression case,
9838
08:11:18,400 --> 08:11:21,720
that here we predicted what we thought was a pretty good line relating
9839
08:11:21,720 --> 08:11:24,600
advertising to sales, trying to predict what sales were going
9840
08:11:24,600 --> 08:11:26,840
to be for a given amount of advertising.
9841
08:11:26,840 --> 08:11:29,560
But I could come up with a line that does a better job of predicting
9842
08:11:29,560 --> 08:11:32,640
the training data, and it would be something that looks like this,
9843
08:11:32,640 --> 08:11:35,560
just connecting all of the various different data points.
9844
08:11:35,560 --> 08:11:37,680
And now there is no loss at all.
9845
08:11:37,680 --> 08:11:41,520
Now I've perfectly predicted, given any advertising, what sales are.
9846
08:11:41,520 --> 08:11:45,360
And for all the data available to me, it's going to be accurate.
9847
08:11:45,360 --> 08:11:47,920
But it's probably not going to generalize very well.
9848
08:11:47,920 --> 08:11:52,920
I have overfit my model on the training data that is available to me.
9849
08:11:52,920 --> 08:11:54,960
And so in general, we want to avoid overfitting.
9850
08:11:54,960 --> 08:11:58,480
We'd like strategies to make sure that we haven't overfit our model
9851
08:11:58,480 --> 08:12:00,120
to a particular data set.
9852
08:12:00,120 --> 08:12:02,720
And there are a number of ways that you could try to do this.
9853
08:12:02,720 --> 08:12:05,760
One way is by examining what it is that we're optimizing for.
9854
08:12:05,760 --> 08:12:10,000
In an optimization problem, all we do is we say, there is some cost,
9855
08:12:10,000 --> 08:12:12,520
and I want to minimize that cost.
9856
08:12:12,520 --> 08:12:17,360
And so far, we've defined that cost function, the cost of a hypothesis,
9857
08:12:17,360 --> 08:12:21,160
just as being equal to the empirical loss of that hypothesis,
9858
08:12:21,160 --> 08:12:25,120
like how far away are the actual data points, the outputs,
9859
08:12:25,120 --> 08:12:29,440
away from what I predicted them to be based on that particular hypothesis.
9860
08:12:29,440 --> 08:12:32,400
And if all you're trying to do is minimize cost, meaning minimizing
9861
08:12:32,400 --> 08:12:36,960
the loss in this case, then the result is going to be that you might overfit,
9862
08:12:36,960 --> 08:12:41,360
that to minimize cost, you're going to try and find a way to perfectly match
9863
08:12:41,360 --> 08:12:42,760
all the input data.
9864
08:12:42,760 --> 08:12:46,000
And that might happen as a result of overfitting
9865
08:12:46,000 --> 08:12:48,560
on that particular input data.
9866
08:12:48,560 --> 08:12:52,600
So in order to address this, you could add something to the cost function.
9867
08:12:52,600 --> 08:12:56,440
What counts as cost will be not just loss, but also
9868
08:12:56,440 --> 08:12:59,560
some measure of the complexity of the hypothesis.
9869
08:12:59,560 --> 08:13:02,040
Where the complexity of the hypothesis is something
9870
08:13:02,040 --> 08:13:06,160
that you would need to define, for how complicated our line looks.
9871
08:13:06,160 --> 08:13:08,600
This is sort of an Occam's razor-style approach
9872
08:13:08,600 --> 08:13:12,080
where we want to give preference to a simpler decision boundary,
9873
08:13:12,080 --> 08:13:15,920
like a straight line, for example, some simpler curve, as opposed
9874
08:13:15,920 --> 08:13:19,760
to something far more complex that might represent the training data better
9875
08:13:19,760 --> 08:13:21,400
but might not generalize as well.
9876
08:13:21,400 --> 08:13:26,280
We'll generally say that a simpler solution is probably the better solution
9877
08:13:26,280 --> 08:13:31,280
and probably the one that is more likely to generalize well to other inputs.
9878
08:13:31,280 --> 08:13:34,960
So we measure what the loss is, but we also measure the complexity.
9879
08:13:34,960 --> 08:13:38,720
And now that all gets taken into account when we consider the overall cost,
9880
08:13:38,720 --> 08:13:42,000
that yes, something might have less loss if it better predicts the training
9881
08:13:42,000 --> 08:13:45,400
data, but if it's much more complex, it still
9882
08:13:45,400 --> 08:13:48,080
might not be the best option that we have.
9883
08:13:48,080 --> 08:13:51,880
And we need to come up with some balance between loss and complexity.
9884
08:13:51,880 --> 08:13:54,120
And for that reason, you'll often see this represented
9885
08:13:54,120 --> 08:13:58,400
as multiplying the complexity by some parameter that we have to choose,
9886
08:13:58,400 --> 08:14:02,520
parameter lambda in this case, where we're saying if lambda is a greater
9887
08:14:02,520 --> 08:14:06,840
value, then we really want to penalize more complex hypotheses.
9888
08:14:06,840 --> 08:14:10,560
Whereas if lambda is smaller, we're going to penalize more complex hypotheses
9889
08:14:10,560 --> 08:14:14,560
only a little bit, and it's up to the machine learning programmer
9890
08:14:14,560 --> 08:14:17,400
to decide where they want to set that value of lambda
9891
08:14:17,400 --> 08:14:21,360
for how much do I want to penalize a more complex hypothesis that
9892
08:14:21,360 --> 08:14:23,360
might fit the data a little better.
9893
08:14:23,360 --> 08:14:25,920
And again, there's no one right answer to a lot of these things,
9894
08:14:25,920 --> 08:14:29,320
but depending on the data set, depending on the data you have available to you
9895
08:14:29,320 --> 08:14:32,240
and the problem you're trying to solve, your choice of these parameters
9896
08:14:32,240 --> 08:14:34,600
may vary, and you may need to experiment a little bit
9897
08:14:34,600 --> 08:14:38,720
to figure out what the right choice of that is ultimately going to be.
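The cost function just described, loss plus a lambda-weighted complexity penalty, can be sketched numerically. The specific loss and complexity values here are made up purely for illustration:

```python
def cost(loss, complexity, lam):
    """Regularized cost: empirical loss plus lambda times complexity.
    With lam = 0, this reduces to plain loss minimization."""
    return loss + lam * complexity

# A wiggly boundary fits the training data perfectly (loss 0) but is complex;
# a straight line misses a few points (loss 4) but is simple.
wiggly = cost(loss=0, complexity=10, lam=1.0)
line   = cost(loss=4, complexity=1,  lam=1.0)
print(line < wiggly)  # True: with this lambda, the simpler line wins
```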
9898
08:14:38,720 --> 08:14:41,600
This process, then, of considering not only loss,
9899
08:14:41,600 --> 08:14:45,920
but also some measure of the complexity is known as regularization.
9900
08:14:45,920 --> 08:14:49,680
Regularization is the process of penalizing a hypothesis that
9901
08:14:49,680 --> 08:14:54,200
is more complex in order to favor a simpler hypothesis that is more
9902
08:14:54,200 --> 08:14:56,600
likely to generalize well, more likely to be
9903
08:14:56,600 --> 08:15:01,120
able to apply to other situations that are dealing with other input points
9904
08:15:01,120 --> 08:15:04,480
unlike the ones that we've necessarily seen before.
9905
08:15:04,480 --> 08:15:08,440
So oftentimes, you'll see us add some regularizing term
9906
08:15:08,440 --> 08:15:14,120
to what we're trying to minimize in order to avoid this problem of overfitting.
9907
08:15:14,120 --> 08:15:17,240
Now, another way of making sure we don't overfit
9908
08:15:17,240 --> 08:15:20,400
is to run some experiments and to see whether or not
9909
08:15:20,400 --> 08:15:25,320
we are able to generalize our model that we've created to other data sets
9910
08:15:25,320 --> 08:15:26,160
as well.
9911
08:15:26,160 --> 08:15:28,480
And it's for that reason that oftentimes when you're
9912
08:15:28,480 --> 08:15:30,720
doing a machine learning experiment, when you've got some data
9913
08:15:30,720 --> 08:15:33,360
and you want to try and come up with some function that predicts,
9914
08:15:33,360 --> 08:15:36,120
given some input, what the output is going to be,
9915
08:15:36,120 --> 08:15:39,720
you don't necessarily want to do your training on all of the data
9916
08:15:39,720 --> 08:15:42,120
you have available to you. Instead, you could employ
9917
08:15:42,120 --> 08:15:45,360
a method known as holdout cross-validation,
9918
08:15:45,360 --> 08:15:48,400
where in holdout cross-validation, we split up our data.
9919
08:15:48,400 --> 08:15:53,400
We split up our data into a training set and a testing set.
9920
08:15:53,400 --> 08:15:55,240
The training set is the set of data that we're
9921
08:15:55,240 --> 08:15:57,800
going to use to train our machine learning model.
9922
08:15:57,800 --> 08:16:00,460
And the testing set is the set of data that we're
9923
08:16:00,460 --> 08:16:04,160
going to use in order to test to see how well our machine learning
9924
08:16:04,160 --> 08:16:06,600
model actually performed.
9925
08:16:06,600 --> 08:16:08,680
So the learning happens on the training set.
9926
08:16:08,680 --> 08:16:10,520
We figure out what the parameters should be.
9927
08:16:10,520 --> 08:16:12,600
We figure out what the right model is.
9928
08:16:12,600 --> 08:16:15,280
And then we see, all right, now that we've trained the model,
9929
08:16:15,280 --> 08:16:17,920
we'll see how well it does at predicting things
9930
08:16:17,920 --> 08:16:22,200
inside of the testing set, some set of data that we haven't seen before.
9931
08:16:22,200 --> 08:16:24,040
And the hope then is that we're going to be
9932
08:16:24,040 --> 08:16:26,360
able to predict the testing set pretty well
9933
08:16:26,360 --> 08:16:29,380
if we're able to generalize based on the training
9934
08:16:29,380 --> 08:16:31,000
data that's available to us.
9935
08:16:31,000 --> 08:16:32,760
If we've overfit the training data, though,
9936
08:16:32,760 --> 08:16:36,360
and we're not able to generalize, well, then when we look at the testing set,
9937
08:16:36,360 --> 08:16:38,000
it's likely going to be the case that we're not
9938
08:16:38,000 --> 08:16:42,000
going to predict things in the testing set nearly as effectively.
9939
08:16:42,000 --> 08:16:44,160
So this is one method of cross-validation,
9940
08:16:44,160 --> 08:16:46,720
validating to make sure that the work we have done
9941
08:16:46,720 --> 08:16:49,680
is actually going to generalize to other data sets as well.
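Holdout cross-validation as described can be sketched with scikit-learn's `train_test_split`. The data below is a synthetic stand-in (a linearly separable toy problem), since no real data set is attached at this point:

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two integer features, label 1 when their sum
# exceeds 10 (linearly separable, so a perceptron can learn it)
X = [[i, j] for i in range(10) for j in range(10)]
y = [1 if i + j > 10 else 0 for i, j in X]

# Hold out half of the data for testing; learning happens only on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = Perceptron()
model.fit(X_train, y_train)

# Accuracy on data the model never saw during training
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```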
9942
08:16:49,680 --> 08:16:52,520
And there are other statistical techniques we can use as well.
9943
08:16:52,520 --> 08:16:55,800
One of the downsides of this simple holdout cross-validation
9944
08:16:55,800 --> 08:17:00,160
is if you say I just split it 50-50, I train using 50% of the data
9945
08:17:00,160 --> 08:17:04,000
and test using the other 50%, or you could choose other percentages as well,
9946
08:17:04,000 --> 08:17:08,560
is that there is a fair amount of data that I am now not using to train,
9947
08:17:08,560 --> 08:17:12,560
with which I might otherwise be able to get a better model, for example.
9948
08:17:12,560 --> 08:17:16,440
So one approach is known as k-fold cross-validation.
9949
08:17:16,440 --> 08:17:20,640
In k-fold cross-validation, rather than just divide things into two sets
9950
08:17:20,640 --> 08:17:24,920
and run one experiment, we divide things into k different sets.
9951
08:17:24,920 --> 08:17:27,720
So maybe I divide things up into 10 different sets
9952
08:17:27,720 --> 08:17:30,320
and then run 10 different experiments.
9953
08:17:30,320 --> 08:17:33,680
So if I split up my data into 10 different sets of data,
9954
08:17:33,680 --> 08:17:37,360
then what I'll do is each time for each of my 10 experiments,
9955
08:17:37,360 --> 08:17:40,360
I will hold out one of those sets of data, where I'll say,
9956
08:17:40,360 --> 08:17:43,240
let me train my model on these nine sets,
9957
08:17:43,240 --> 08:17:47,000
and then test to see how well it predicts on set number 10.
9958
08:17:47,000 --> 08:17:50,120
And then pick another set of nine sets to train on,
9959
08:17:50,120 --> 08:17:52,240
and then test it on the other one that I held out,
9960
08:17:52,240 --> 08:17:55,400
where each time I train the model on everything
9961
08:17:55,400 --> 08:17:57,840
minus the one set that I'm holding out, and then
9962
08:17:57,840 --> 08:18:02,040
test to see how well our model performs on the set that I did hold out.
9963
08:18:02,040 --> 08:18:04,240
And what you end up getting is 10 different results,
9964
08:18:04,240 --> 08:18:07,400
10 different answers for how accurately our model worked.
9965
08:18:07,400 --> 08:18:09,800
And oftentimes, you could just take the average of those 10
9966
08:18:09,800 --> 08:18:14,040
to get an approximation for how well we think our model performs overall.
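The k-fold procedure just described is what scikit-learn's `cross_val_score` automates; a minimal sketch on the same kind of synthetic, linearly separable stand-in data:

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: label 1 when the two features sum past 10
X = [[i, j] for i in range(10) for j in range(10)]
y = [1 if i + j > 10 else 0 for i, j in X]

# 10-fold cross-validation: each fold is held out once while the model
# trains on the other nine, yielding 10 accuracy scores to average
scores = cross_val_score(Perceptron(), X, y, cv=10)
print(f"Mean accuracy across 10 folds: {scores.mean():.2f}")
```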
9967
08:18:14,040 --> 08:18:18,200
But the key idea is separating the training data from the testing data,
9968
08:18:18,200 --> 08:18:20,600
because you want to test your model on data
9969
08:18:20,600 --> 08:18:23,360
that is different from what you trained the model on.
9970
08:18:23,360 --> 08:18:25,360
Because during training, you want to avoid overfitting.
9971
08:18:25,360 --> 08:18:26,880
You want to be able to generalize.
9972
08:18:26,880 --> 08:18:29,480
And the way you test whether you're able to generalize
9973
08:18:29,480 --> 08:18:32,520
is by looking at some data that you haven't seen before
9974
08:18:32,520 --> 08:18:36,200
and seeing how well we're actually able to perform.
9975
08:18:36,200 --> 08:18:38,960
And so if we want to actually implement any of these techniques
9976
08:18:38,960 --> 08:18:42,720
inside of a programming language like Python, there are a number of ways we could do that.
9977
08:18:42,720 --> 08:18:45,000
We could write this from scratch on our own,
9978
08:18:45,000 --> 08:18:46,760
but there are libraries out there that allow
9979
08:18:46,760 --> 08:18:50,240
us to take advantage of existing implementations of these algorithms,
9980
08:18:50,240 --> 08:18:53,000
that we can use the same types of algorithms
9981
08:18:53,000 --> 08:18:54,880
in a lot of different situations.
9982
08:18:54,880 --> 08:18:58,280
And so there's a very popular library known as scikit-learn,
9983
08:18:58,280 --> 08:19:01,520
which allows us in Python to be able to very quickly get
9984
08:19:01,520 --> 08:19:03,920
set up with a lot of these different machine learning models.
9985
08:19:03,920 --> 08:19:06,440
This library has already-written algorithms
9986
08:19:06,440 --> 08:19:09,360
for nearest neighbor classification, for doing perceptron learning,
9987
08:19:09,360 --> 08:19:12,800
for doing a bunch of other types of inference and supervised learning
9988
08:19:12,800 --> 08:19:14,360
that we haven't yet talked about.
9989
08:19:14,360 --> 08:19:19,760
But using it, we can begin to try actually testing how these methods work
9990
08:19:19,760 --> 08:19:22,240
and how accurately they perform.
9991
08:19:22,240 --> 08:19:24,480
So let's go ahead and take a look at one approach
9992
08:19:24,480 --> 08:19:26,840
to trying to solve this type of problem.
9993
08:19:26,840 --> 08:19:30,360
All right, so I'm first going to pull up banknotes.csv, which
9994
08:19:30,360 --> 08:19:33,020
is a whole bunch of data provided by UC Irvine, which
9995
08:19:33,020 --> 08:19:36,080
is information about various different banknotes.
9996
08:19:36,080 --> 08:19:38,360
People took pictures of various different banknotes
9997
08:19:38,360 --> 08:19:41,440
and measured various different properties of those banknotes.
9998
08:19:41,440 --> 08:19:45,120
And in particular, some human categorized each of those banknotes
9999
08:19:45,120 --> 08:19:48,720
as either a counterfeit banknote or as not counterfeit.
10000
08:19:48,720 --> 08:19:52,480
And so what you're looking at here is each row represents one banknote.
10001
08:19:52,480 --> 08:19:55,960
This is formatted as a CSV spreadsheet, with just comma-separated values
10002
08:19:55,960 --> 08:19:58,680
separating each of these various different fields.
10003
08:19:58,680 --> 08:20:03,000
We have four different input values for each of these data points,
10004
08:20:03,000 --> 08:20:06,400
just information, some measurement that was made on the banknote.
10005
08:20:06,400 --> 08:20:09,280
And what those measurements exactly are aren't as important as the fact
10006
08:20:09,280 --> 08:20:11,280
that we do have access to this data.
10007
08:20:11,280 --> 08:20:14,880
But more importantly, we have access for each of these data points
10008
08:20:14,880 --> 08:20:19,160
to a label, where 0 indicates something like this was not a counterfeit bill,
10009
08:20:19,160 --> 08:20:20,840
meaning it was an authentic bill.
10010
08:20:20,840 --> 08:20:25,440
And a data point labeled 1 means that it is a counterfeit bill,
10011
08:20:25,440 --> 08:20:29,080
at least according to the human researcher who labeled this particular data.
10012
08:20:29,080 --> 08:20:31,280
So we have a whole bunch of data representing
10013
08:20:31,280 --> 08:20:33,860
a whole bunch of different data points, each of which
10014
08:20:33,860 --> 08:20:35,600
has these various different measurements that
10015
08:20:35,600 --> 08:20:38,000
were made on that particular bill, and each of which
10016
08:20:38,000 --> 08:20:44,200
has an output value, 0 or 1, 0 meaning it was a genuine bill, 1 meaning
10017
08:20:44,200 --> 08:20:46,000
it was a counterfeit bill.
10018
08:20:46,000 --> 08:20:48,560
And what we would like to do is use supervised learning
10019
08:20:48,560 --> 08:20:51,600
to begin to predict or model some sort of function that
10020
08:20:51,600 --> 08:20:55,480
can take these four values as input and predict what the output would be.
10021
08:20:55,480 --> 08:20:58,600
We want our learning algorithm to find some sort of pattern
10022
08:20:58,600 --> 08:21:01,040
that is able to predict based on these measurements, something
10023
08:21:01,040 --> 08:21:03,640
that you could measure just by taking a photo of a bill,
10024
08:21:03,640 --> 08:21:09,200
predict whether that bill is authentic or whether that bill is counterfeit.
10025
08:21:09,200 --> 08:21:10,560
And so how can we do that?
10026
08:21:10,560 --> 08:21:13,700
Well, I'm first going to open up banknote0.py
10027
08:21:13,700 --> 08:21:15,960
and see how it is that we do this.
10028
08:21:15,960 --> 08:21:18,960
I'm first importing a lot of things from scikit-learn,
10029
08:21:18,960 --> 08:21:23,480
but importantly, I'm going to set my model equal to the perceptron model,
10030
08:21:23,480 --> 08:21:25,360
which is one of those models that we talked about before.
10031
08:21:25,360 --> 08:21:28,080
We're just going to try and figure out some setting of weights
10032
08:21:28,080 --> 08:21:31,880
that is able to divide our data into two different groups.
10033
08:21:31,880 --> 08:21:36,200
Then I'm going to go ahead and read data in from my file, banknotes.csv.
10034
08:21:36,200 --> 08:21:39,600
And basically, for every row, I'm going to separate that row
10035
08:21:39,600 --> 08:21:44,400
into the first four values of that row, which is the evidence for that row.
10036
08:21:44,400 --> 08:21:49,860
And then the label, where if the final column in that row is a 0,
10037
08:21:49,860 --> 08:21:51,240
the label is authentic.
10038
08:21:51,240 --> 08:21:53,680
And otherwise, it's going to be counterfeit.
10039
08:21:53,680 --> 08:21:56,820
So I'm effectively reading data in from the CSV file,
10040
08:21:56,820 --> 08:22:00,280
dividing into a whole bunch of rows where each row has some evidence,
10041
08:22:00,280 --> 08:22:04,320
those four input values that are going to be inputs to my hypothesis function.
10042
08:22:04,320 --> 08:22:07,680
And then the label, the output, whether it is authentic or counterfeit,
10043
08:22:07,680 --> 08:22:10,120
that is the thing that I am then trying to predict.
10044
08:22:10,120 --> 08:22:12,880
So the next step is that I would like to split up my data set
10045
08:22:12,880 --> 08:22:15,960
into a training set and a testing set, some set of data
10046
08:22:15,960 --> 08:22:18,320
that I would like to train my machine learning model on,
10047
08:22:18,320 --> 08:22:21,040
and some set of data that I would like to use to test that model,
10048
08:22:21,040 --> 08:22:22,440
see how well it performed.
10049
08:22:22,440 --> 08:22:25,360
So what I'll do is I'll go ahead and figure out length of the data,
10050
08:22:25,360 --> 08:22:27,080
how many data points do I have.
10051
08:22:27,080 --> 08:22:30,400
I'll go ahead and take half of them and save that number as a variable called holdout.
10052
08:22:30,400 --> 08:22:33,440
That is how many items I'm going to hold out for my data set
10053
08:22:33,440 --> 08:22:35,320
to save for the testing phase.
10054
08:22:35,320 --> 08:22:38,180
I'll randomly shuffle the data so it's in some random order.
10055
08:22:38,180 --> 08:22:43,360
And then I'll say my testing set will be all of the data up to the holdout.
10056
08:22:43,360 --> 08:22:47,720
So I'll take holdout many data items, and that will be my testing set.
10057
08:22:47,720 --> 08:22:51,000
My training data will be everything else, the information
10058
08:22:51,000 --> 08:22:53,800
that I'm going to train my model on.
10059
08:22:53,800 --> 08:22:58,960
And then I'll say I need to divide my training data into two different sets.
10060
08:22:58,960 --> 08:23:03,680
I need to divide it into my x values, where x here represents the inputs.
10061
08:23:03,680 --> 08:23:06,600
So the x values, the ones that I'm going to train on,
10062
08:23:06,600 --> 08:23:09,040
are basically for every row in my training set,
10063
08:23:09,040 --> 08:23:12,080
I'm going to get the evidence for that row, those four values,
10064
08:23:12,080 --> 08:23:14,840
where it's basically a vector of four numbers, where
10065
08:23:14,840 --> 08:23:16,920
that is going to be all of the input.
10066
08:23:16,920 --> 08:23:18,400
And then I need the y values.
10067
08:23:18,400 --> 08:23:20,400
What are the outputs that I want to learn from,
10068
08:23:20,400 --> 08:23:23,780
the labels that belong to each of these various different input points?
10069
08:23:23,780 --> 08:23:26,600
Well, that's going to be the same thing for each row in the training data.
10070
08:23:26,600 --> 08:23:29,360
But this time, I take that row and get what its label is,
10071
08:23:29,360 --> 08:23:31,640
whether it is authentic or counterfeit.
10072
08:23:31,640 --> 08:23:36,200
So I end up with one list of all of these vectors of my input data,
10073
08:23:36,200 --> 08:23:38,720
and one list, which follows the same order,
10074
08:23:38,720 --> 08:23:42,640
but is all of the labels that correspond with each of those vectors.
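Separating the training rows into those two parallel lists can look something like this; the `training` list of dicts is the same assumed structure as above, not the course's literal code.

```python
# Two parallel lists: X_training holds the four-number evidence
# vectors, y_training holds the label for the row in the same position.
training = [
    {"evidence": [3.62, 8.67, -2.81, -0.45], "label": "Authentic"},
    {"evidence": [-2.34, 5.21, 1.10, 0.64], "label": "Counterfeit"},
]

X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
```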
10075
08:23:42,640 --> 08:23:46,720
And then to train my model, which in this case is just this perceptron model,
10076
08:23:46,720 --> 08:23:49,960
I just call model.fit, pass in the training data,
10077
08:23:49,960 --> 08:23:52,640
and what the labels for those training data are.
10078
08:23:52,640 --> 08:23:54,960
And scikit-learn will take care of fitting the model,
10079
08:23:54,960 --> 08:23:57,080
will do the entire algorithm for me.
10080
08:23:57,080 --> 08:24:01,240
And then when it's done, I can then test to see how well that model performed.
10081
08:24:01,240 --> 08:24:04,200
So I can say, let me get all of these input vectors
10082
08:24:04,200 --> 08:24:05,880
for what I want to test on.
10083
08:24:05,880 --> 08:24:09,800
So for each row in my testing data set, go ahead and get the evidence.
10084
08:24:09,800 --> 08:24:13,400
And the y values, those are what the actual values were
10085
08:24:13,400 --> 08:24:17,520
for each of the rows in the testing data set, what the actual label is.
10086
08:24:17,520 --> 08:24:19,800
But then I'm going to generate some predictions.
10087
08:24:19,800 --> 08:24:22,280
I'm going to use this model and try and predict,
10088
08:24:22,280 --> 08:24:26,840
based on the testing vectors, I want to predict what the output is.
10089
08:24:26,840 --> 08:24:31,160
And my goal then is to now compare y testing with predictions.
10090
08:24:31,160 --> 08:24:34,360
I want to see how well my predictions, based on the model,
10091
08:24:34,360 --> 08:24:38,240
actually reflect what the y values were, the outputs
10092
08:24:38,240 --> 08:24:39,480
that were actually labeled.
10093
08:24:39,480 --> 08:24:44,320
Because I now have this label data, I can assess how well the algorithm worked.
10094
08:24:44,320 --> 08:24:47,060
And so now I can just compute how well we did.
10095
08:24:47,060 --> 08:24:49,960
This zip function basically just lets
10096
08:24:49,960 --> 08:24:53,440
me loop through two different lists in parallel, one pair at a time.
10097
08:24:53,440 --> 08:24:57,160
So for each actual value and for each predicted value,
10098
08:24:57,160 --> 08:24:59,200
if the actual is the same thing as what I predicted,
10099
08:24:59,200 --> 08:25:01,400
I'll go ahead and increment the counter by one.
10100
08:25:01,400 --> 08:25:04,760
Otherwise, I'll increment my incorrect counter by one.
10101
08:25:04,760 --> 08:25:06,880
And so at the end, I can print out, here are the results,
10102
08:25:06,880 --> 08:25:09,380
here's how many I got right, here's how many I got wrong,
10103
08:25:09,380 --> 08:25:12,560
and here was my overall accuracy, for example.
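The counting step with zip can be sketched like this, using small made-up label lists in place of the real testing data:

```python
# Walk the actual labels and the predicted labels in parallel,
# counting how many predictions matched.
y_testing = ["Authentic", "Counterfeit", "Authentic", "Authentic"]
predictions = ["Authentic", "Counterfeit", "Counterfeit", "Authentic"]

correct = 0
incorrect = 0
for actual, predicted in zip(y_testing, predictions):
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

print(f"Correct: {correct}")      # Correct: 3
print(f"Incorrect: {incorrect}")  # Incorrect: 1
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")  # 75.00%
```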
10104
08:25:12,560 --> 08:25:14,000
So I can go ahead and run this.
10105
08:25:14,000 --> 08:25:17,720
I can run python banknote0.py.
10106
08:25:17,720 --> 08:25:20,000
And it's going to train on half the data set
10107
08:25:20,000 --> 08:25:21,760
and then test on half the data set.
10108
08:25:21,760 --> 08:25:24,040
And here are the results for my perceptron model.
10109
08:25:24,040 --> 08:25:29,020
In this case, it was able to correctly classify 679 bills
10110
08:25:29,020 --> 08:25:33,400
as either authentic or counterfeit and incorrectly classified seven of them,
10111
08:25:33,400 --> 08:25:37,000
for an overall accuracy of close to 99%.
10112
08:25:37,000 --> 08:25:40,160
So on this particular data set, using this perceptron model,
10113
08:25:40,160 --> 08:25:44,240
we were able to predict very well what the output was going to be.
10114
08:25:44,240 --> 08:25:46,600
And we can try different models, too; scikit-learn
10115
08:25:46,600 --> 08:25:50,880
makes it very easy to swap out one model for another.
10116
08:25:50,880 --> 08:25:55,640
So instead of the perceptron model, I can use a support vector machine
10117
08:25:55,640 --> 08:25:59,440
via the SVC, otherwise known as a support vector classifier,
10118
08:25:59,440 --> 08:26:01,880
which uses a support vector machine to classify things
10119
08:26:01,880 --> 08:26:03,640
into two different groups.
10120
08:26:03,640 --> 08:26:07,120
And now see, all right, how well does this perform?
10121
08:26:07,120 --> 08:26:10,560
And all right, this time, we were able to correctly predict 682
10122
08:26:10,560 --> 08:26:15,200
and incorrectly predicted four, for an accuracy of 99.4%.
10123
08:26:15,200 --> 08:26:20,680
And we could even try the k-neighbors classifier as the model instead.
10124
08:26:20,680 --> 08:26:24,160
And this takes a parameter, n_neighbors, for how many neighbors
10125
08:26:24,160 --> 08:26:25,160
do you want to look at?
10126
08:26:25,160 --> 08:26:27,480
Let's just look at one neighbor, the one nearest neighbor,
10127
08:26:27,480 --> 08:26:29,000
and use that to predict.
10128
08:26:29,000 --> 08:26:31,080
Go ahead and run this as well.
10129
08:26:31,080 --> 08:26:33,520
And it looks like, based on the k-neighbors classifier,
10130
08:26:33,520 --> 08:26:36,400
looking at just one neighbor, we were able to correctly classify
10131
08:26:36,400 --> 08:26:40,360
685 data points, incorrectly classified one.
10132
08:26:40,360 --> 08:26:43,560
Maybe let's try three neighbors instead, instead of just using one neighbor.
10133
08:26:43,560 --> 08:26:45,360
Do more of a k-nearest neighbors approach,
10134
08:26:45,360 --> 08:26:48,640
where I look at the three nearest neighbors and see how that performs.
10135
08:26:48,640 --> 08:26:54,240
And that one, in this case, seems to have gotten 100% of the predictions
10136
08:26:54,240 --> 08:26:58,280
correct, classifying every banknote as either authentic
10137
08:26:58,280 --> 08:27:00,280
or counterfeit.
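Because every scikit-learn classifier shares the same fit/predict interface, swapping models really is a one-line change. A minimal sketch with an invented, tiny two-feature dataset (the real banknotes data has four features):

```python
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Tiny invented dataset: two well-separated clusters.
X_training = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_training = ["Counterfeit", "Counterfeit", "Authentic", "Authentic"]

# Each model is trained and queried through the same interface.
results = {}
for model in (Perceptron(), SVC(), KNeighborsClassifier(n_neighbors=1)):
    model.fit(X_training, y_training)
    results[type(model).__name__] = model.predict([[5, 5]])[0]
    print(type(model).__name__, "->", results[type(model).__name__])
```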
10138
08:27:00,280 --> 08:27:02,400
And we could run these experiments multiple times,
10139
08:27:02,400 --> 08:27:05,120
because I'm randomly reorganizing the data every time.
10140
08:27:05,120 --> 08:27:07,640
We're technically training these on slightly different data sets.
10141
08:27:07,640 --> 08:27:10,440
And so you might want to run multiple experiments to really see
10142
08:27:10,440 --> 08:27:12,200
how well they're actually going to perform.
10143
08:27:12,200 --> 08:27:14,160
But in short, they all perform very well.
10144
08:27:14,160 --> 08:27:16,560
And while some of them perform slightly better than others here,
10145
08:27:16,560 --> 08:27:19,160
that might not always be the case for every data set.
10146
08:27:19,160 --> 08:27:22,180
But you can begin to test now by very quickly putting together
10147
08:27:22,180 --> 08:27:24,720
these machine learning models using scikit-learn
10148
08:27:24,720 --> 08:27:27,120
to be able to train on some training set and then
10149
08:27:27,120 --> 08:27:29,920
test on some testing set as well.
10150
08:27:29,920 --> 08:27:33,040
And this splitting up into training groups and testing groups and testing
10151
08:27:33,040 --> 08:27:37,000
happens so often that scikit-learn has functions built in to do it.
10152
08:27:37,000 --> 08:27:39,040
I did it all by hand just now.
10153
08:27:39,040 --> 08:27:41,520
But if we take a look at banknote1.py, we
10154
08:27:41,520 --> 08:27:45,920
take advantage of some other features that exist in scikit-learn,
10155
08:27:45,920 --> 08:27:48,320
where we can really simplify a lot of our logic,
10156
08:27:48,320 --> 08:27:52,440
that there is a function built into scikit-learn called train_test_split,
10157
08:27:52,440 --> 08:27:56,080
which will automatically split data into a training group and a testing group.
10158
08:27:56,080 --> 08:27:59,680
I just have to say what proportion should be in the testing group, something
10159
08:27:59,680 --> 08:28:02,920
like 0.5, half the data inside the testing group.
10160
08:28:02,920 --> 08:28:05,320
Then I can fit the model on the training data,
10161
08:28:05,320 --> 08:28:08,800
make the predictions on the testing data, and then just count up.
10162
08:28:08,800 --> 08:28:11,760
And scikit-learn has some nice methods for just counting up
10163
08:28:11,760 --> 08:28:15,040
how many times our testing data match the predictions,
10164
08:28:15,040 --> 08:28:18,280
how many times our testing data didn't match the predictions.
10165
08:28:18,280 --> 08:28:21,600
So very quickly, you can write programs with not all that many lines of code.
10166
08:28:21,600 --> 08:28:25,480
It's maybe like 40 lines of code to get through all of these predictions.
10167
08:28:25,480 --> 08:28:28,440
And then as a result, see how well we're able to do.
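A sketch of the train_test_split version, again with invented toy data; the stratify and random_state arguments are added here only to make this tiny example reproducible and are assumptions, not necessarily what the course file does.

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Invented toy data standing in for the banknotes evidence and labels.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["Counterfeit"] * 3 + ["Authentic"] * 3

# Put half the data in the testing group.
X_training, X_testing, y_training, y_testing = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

model = Perceptron()
model.fit(X_training, y_training)
predictions = model.predict(X_testing)

# NumPy comparisons count matches and mismatches directly.
correct = (predictions == y_testing).sum()
incorrect = (predictions != y_testing).sum()
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")
```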
10168
08:28:28,440 --> 08:28:31,520
So these types of libraries can allow us, without really knowing
10169
08:28:31,520 --> 08:28:33,920
the implementation details of these algorithms,
10170
08:28:33,920 --> 08:28:36,920
to be able to use the algorithms in a very practical way
10171
08:28:36,920 --> 08:28:40,120
to be able to solve these types of problems.
10172
08:28:40,120 --> 08:28:42,880
So that then was supervised learning, this task
10173
08:28:42,880 --> 08:28:45,960
of given a whole set of data, some input output pairs,
10174
08:28:45,960 --> 08:28:50,040
we would like to learn some function that maps those inputs to those outputs.
10175
08:28:50,040 --> 08:28:52,560
But turns out there are other forms of learning as well.
10176
08:28:52,560 --> 08:28:55,840
And another popular type of machine learning, especially nowadays,
10177
08:28:55,840 --> 08:28:58,080
is known as reinforcement learning.
10178
08:28:58,080 --> 08:29:00,920
And the idea of reinforcement learning is rather than just
10179
08:29:00,920 --> 08:29:04,160
being given a whole data set at the beginning of input output pairs,
10180
08:29:04,160 --> 08:29:07,600
reinforcement learning is all about learning from experience.
10181
08:29:07,600 --> 08:29:10,320
In reinforcement learning, our agent, whether it's
10182
08:29:10,320 --> 08:29:13,000
like a physical robot that's trying to make actions in the world
10183
08:29:13,000 --> 08:29:16,680
or just some virtual agent that is a program running somewhere,
10184
08:29:16,680 --> 08:29:20,480
our agent is going to be given a set of rewards or punishments
10185
08:29:20,480 --> 08:29:22,040
in the form of numerical values.
10186
08:29:22,040 --> 08:29:24,360
But you can think of them as reward or punishment.
10187
08:29:24,360 --> 08:29:28,440
And based on that, it learns what actions to take in the future,
10188
08:29:28,440 --> 08:29:32,400
that our agent, our AI, will be put in some sort of environment.
10189
08:29:32,400 --> 08:29:33,640
It will make some actions.
10190
08:29:33,640 --> 08:29:36,280
And based on the actions that it makes, it learns something.
10191
08:29:36,280 --> 08:29:38,480
It either gets a reward when it does something well,
10192
08:29:38,480 --> 08:29:40,640
it gets a punishment when it does something poorly,
10193
08:29:40,640 --> 08:29:44,640
and it learns what to do or what not to do in the future
10194
08:29:44,640 --> 08:29:47,880
based on those individual experiences.
10195
08:29:47,880 --> 08:29:50,400
And so what this will often look like is it will often
10196
08:29:50,400 --> 08:29:54,000
start with some agent, some AI, which might, again, be a physical robot,
10197
08:29:54,000 --> 08:29:56,200
if you're imagining a physical robot moving around,
10198
08:29:56,200 --> 08:29:58,120
but it can also just be a program.
10199
08:29:58,120 --> 08:30:01,160
And our agent is situated in their environment,
10200
08:30:01,160 --> 08:30:04,040
where the environment is where they're going to make their actions,
10201
08:30:04,040 --> 08:30:06,760
and it's what's going to give them rewards or punishments
10202
08:30:06,760 --> 08:30:09,080
for the various actions that they take.
10203
08:30:09,080 --> 08:30:12,160
So for example, the environment is going to start off
10204
08:30:12,160 --> 08:30:14,920
by putting our agent inside of a state.
10205
08:30:14,920 --> 08:30:17,280
Our agent has some state that, in a game,
10206
08:30:17,280 --> 08:30:19,840
might be the state of the game that the agent is playing.
10207
08:30:19,840 --> 08:30:21,800
In a world that the agent is exploring might
10208
08:30:21,800 --> 08:30:24,760
be some position inside of a grid representing the world
10209
08:30:24,760 --> 08:30:25,720
that they're exploring.
10210
08:30:25,720 --> 08:30:28,000
But the agent is in some sort of state.
10211
08:30:28,000 --> 08:30:32,080
And in that state, the agent needs to choose to take an action.
10212
08:30:32,080 --> 08:30:34,600
The agent likely has multiple actions they can choose from,
10213
08:30:34,600 --> 08:30:36,240
but they pick an action.
10214
08:30:36,240 --> 08:30:39,240
So they take an action in a particular state.
10215
08:30:39,240 --> 08:30:42,080
And as a result of that, the agent will generally
10216
08:30:42,080 --> 08:30:44,960
get two things in response as we model them.
10217
08:30:44,960 --> 08:30:47,680
The agent gets a new state that they find themselves in.
10218
08:30:47,680 --> 08:30:50,040
After being in this state, taking one action,
10219
08:30:50,040 --> 08:30:52,120
they end up in some other state.
10220
08:30:52,120 --> 08:30:55,300
And they're also given some sort of numerical reward,
10221
08:30:55,300 --> 08:30:58,560
positive meaning reward, meaning it was a good thing,
10222
08:30:58,560 --> 08:31:00,920
negative generally meaning they did something bad,
10223
08:31:00,920 --> 08:31:03,200
they received some sort of punishment.
10224
08:31:03,200 --> 08:31:06,100
And that is all the information the agent has.
10225
08:31:06,100 --> 08:31:08,160
It's told what state it's in.
10226
08:31:08,160 --> 08:31:10,040
It makes some sort of action.
10227
08:31:10,040 --> 08:31:12,040
And based on that, it ends up in another state.
10228
08:31:12,040 --> 08:31:14,440
And it ends up getting some particular reward.
10229
08:31:14,440 --> 08:31:17,440
And it needs to learn, based on that information, what actions
10230
08:31:17,440 --> 08:31:19,640
to begin to take in the future.
10231
08:31:19,640 --> 08:31:21,640
And so you could imagine generalizing this to a lot
10232
08:31:21,640 --> 08:31:22,880
of different situations.
10233
08:31:22,880 --> 08:31:26,240
This is oftentimes how you train those robots you may have seen that
10234
08:31:26,240 --> 08:31:29,040
are now able to walk around the way humans do.
10235
08:31:29,040 --> 08:31:32,400
It would be quite difficult to program the robot in exactly the right way
10236
08:31:32,400 --> 08:31:34,240
to get it to walk the way humans do.
10237
08:31:34,240 --> 08:31:36,840
You could instead train it through reinforcement learning,
10238
08:31:36,840 --> 08:31:40,320
give it some sort of numerical reward every time it does something good,
10239
08:31:40,320 --> 08:31:43,640
like take steps forward, and punish it every time it does something
10240
08:31:43,640 --> 08:31:46,520
bad, like fall over, and then let the AI just
10241
08:31:46,520 --> 08:31:48,880
learn based on that sequence of rewards, based
10242
08:31:48,880 --> 08:31:51,260
on trying to take various different actions.
10243
08:31:51,260 --> 08:31:54,480
You can begin to have the agent learn what to do in the future
10244
08:31:54,480 --> 08:31:56,120
and what not to do.
10245
08:31:56,120 --> 08:31:59,480
So in order to begin to formalize this, the first thing we need to do
10246
08:31:59,480 --> 08:32:03,620
is formalize this notion of what we mean about states and actions and rewards,
10247
08:32:03,620 --> 08:32:05,720
like what does this world look like?
10248
08:32:05,720 --> 08:32:07,920
And oftentimes, we'll formulate this world
10249
08:32:07,920 --> 08:32:11,720
as what's known as a Markov decision process, similar in spirit
10250
08:32:11,720 --> 08:32:14,360
to Markov chains, which you might recall from before.
10251
08:32:14,360 --> 08:32:16,940
But a Markov decision process is a model that we
10252
08:32:16,940 --> 08:32:19,700
can use for decision making, for an agent trying
10253
08:32:19,700 --> 08:32:21,500
to make decisions in its environment.
10254
08:32:21,500 --> 08:32:25,200
And it's a model that allows us to represent the various different states
10255
08:32:25,200 --> 08:32:28,840
that an agent can be in, the various different actions that they can take,
10256
08:32:28,840 --> 08:32:35,120
and also what the reward is for taking one action as opposed to another action.
10257
08:32:35,120 --> 08:32:37,520
So what then does it actually look like?
10258
08:32:37,520 --> 08:32:40,580
Well, if you recall a Markov chain from before,
10259
08:32:40,580 --> 08:32:43,200
a Markov chain looked a little something like this,
10260
08:32:43,200 --> 08:32:45,760
where we had a whole bunch of these individual states,
10261
08:32:45,760 --> 08:32:48,760
and each state immediately transitioned to another state
10262
08:32:48,760 --> 08:32:50,840
based on some probability distribution.
10263
08:32:50,840 --> 08:32:54,000
We saw this in the context of the weather before, where if it was sunny,
10264
08:32:54,000 --> 08:32:56,720
we said with some probability, it'll be sunny the next day.
10265
08:32:56,720 --> 08:32:59,840
With some other probability, it'll be rainy, for example.
10266
08:32:59,840 --> 08:33:02,320
But we could also imagine generalizing this.
10267
08:33:02,320 --> 08:33:04,000
It's not just sun and rain anymore.
10268
08:33:04,000 --> 08:33:07,160
We just have these states, where one state leads to another state
10269
08:33:07,160 --> 08:33:09,760
according to some probability distribution.
10270
08:33:09,760 --> 08:33:12,280
But in this original model, there was no agent
10271
08:33:12,280 --> 08:33:14,440
that had any control over this process.
10272
08:33:14,440 --> 08:33:17,720
It was just entirely probability based, where with some probability,
10273
08:33:17,720 --> 08:33:18,960
we moved to this next state.
10274
08:33:18,960 --> 08:33:22,400
But maybe it's going to be some other state with some other probability.
10275
08:33:22,400 --> 08:33:26,280
What we'll now have is the ability for the agent in this state
10276
08:33:26,280 --> 08:33:29,480
to choose from a set of actions, where maybe instead of just one path
10277
08:33:29,480 --> 08:33:33,240
forward, they have three different choices of actions that each lead
10278
08:33:33,240 --> 08:33:34,120
down different paths.
10279
08:33:34,120 --> 08:33:36,480
And even this is a bit of an oversimplification,
10280
08:33:36,480 --> 08:33:39,240
because in each of these states, you might imagine more branching points
10281
08:33:39,240 --> 08:33:42,040
where there are more decisions that can be taken as well.
10282
08:33:42,040 --> 08:33:46,360
So we've extended the Markov chain to say that from a state,
10283
08:33:46,360 --> 08:33:48,360
you now have available action choices.
10284
08:33:48,360 --> 08:33:50,760
And each of those actions might be associated
10285
08:33:50,760 --> 08:33:55,880
with its own probability distribution of going to various different states.
10286
08:33:55,880 --> 08:33:58,840
Then in addition, we'll add another extension,
10287
08:33:58,840 --> 08:34:01,840
where any time you move from a state, taking an action,
10288
08:34:01,840 --> 08:34:07,000
going into this other state, we can associate a reward with that outcome,
10289
08:34:07,000 --> 08:34:10,120
saying either r is positive, meaning some positive reward,
10290
08:34:10,120 --> 08:34:13,320
or r is negative, meaning there was some sort of punishment.
10291
08:34:13,320 --> 08:34:16,440
And this then is what we'll consider to be a Markov decision process.
10292
08:34:16,440 --> 08:34:18,960
That a Markov decision process has some initial set
10293
08:34:18,960 --> 08:34:21,600
of states, states in the world that we can be in.
10294
08:34:21,600 --> 08:34:24,560
We have some set of actions that, given a state,
10295
08:34:24,560 --> 08:34:28,040
I can say, what are the actions that are available to me in that state,
10296
08:34:28,040 --> 08:34:30,560
an action that I can choose from?
10297
08:34:30,560 --> 08:34:32,480
Then we have some transition model.
10298
08:34:32,480 --> 08:34:36,160
The transition model before just said that, given my current state,
10299
08:34:36,160 --> 08:34:39,880
what is the probability that I end up in that next state or this other state?
10300
08:34:39,880 --> 08:34:44,080
The transition model now has effectively two things we're conditioning on.
10301
08:34:44,080 --> 08:34:48,080
We're saying, given that I'm in this state and that I take this action,
10302
08:34:48,080 --> 08:34:52,280
what's the probability that I end up in this next state?
10303
08:34:52,280 --> 08:34:56,120
Now maybe we live in a very deterministic world in this Markov decision process.
10304
08:34:56,120 --> 08:34:58,120
We're given a state and given an action.
10305
08:34:58,120 --> 08:35:00,680
We know for sure what next state we'll end up in.
10306
08:35:00,680 --> 08:35:02,480
But maybe there's some randomness in the world
10307
08:35:02,480 --> 08:35:04,560
that when you're in a state and you take an action,
10308
08:35:04,560 --> 08:35:07,200
you might not always end up in the exact same state.
10309
08:35:07,200 --> 08:35:09,800
There might be some probabilities involved there as well.
10310
08:35:09,800 --> 08:35:14,200
The Markov decision process can handle both of those possible cases.
10311
08:35:14,200 --> 08:35:18,280
And then finally, we have a reward function, generally called r,
10312
08:35:18,280 --> 08:35:21,960
that in this case says, what is the reward for being in this state,
10313
08:35:21,960 --> 08:35:26,760
taking this action, and then getting to s prime, this next state?
10314
08:35:26,760 --> 08:35:27,960
So I'm in this original state.
10315
08:35:27,960 --> 08:35:28,720
I take this action.
10316
08:35:28,720 --> 08:35:29,880
I get to this next state.
10317
08:35:29,880 --> 08:35:32,600
What is the reward for doing that process?
10318
08:35:32,600 --> 08:35:35,360
And you can add up these rewards every time you take an action
10319
08:35:35,360 --> 08:35:38,080
to get the total amount of rewards that an agent might
10320
08:35:38,080 --> 08:35:41,440
get from interacting in a particular environment
10321
08:35:41,440 --> 08:35:44,080
modeled using this Markov decision process.
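One way to represent such a Markov decision process in code is with plain dictionaries for the transition model and the reward function. The states, actions, and reward numbers below are invented for illustration; they are not the lecture's grid world.

```python
# Transition model: maps (state, action) to a probability
# distribution over next states, i.e. P(s' | s, a).
transitions = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"overheated": 1.0},
}

# Reward function: maps (state, action, next_state) to a number;
# positive is a reward, negative a punishment.
rewards = {
    ("cool", "slow", "cool"): 1,
    ("cool", "fast", "cool"): 2,
    ("cool", "fast", "hot"): 2,
    ("hot", "slow", "cool"): 1,
    ("hot", "slow", "hot"): 1,
    ("hot", "fast", "overheated"): -10,
}

def actions(state):
    """The set of actions available in a given state."""
    return {a for (s, a) in transitions if s == state}
```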
10322
08:35:44,080 --> 08:35:46,760
So what might this actually look like in practice?
10323
08:35:46,760 --> 08:35:49,200
Well, let's just create a little simulated world here
10324
08:35:49,200 --> 08:35:52,160
where I have this agent that is just trying to navigate its way.
10325
08:35:52,160 --> 08:35:55,040
This agent is this yellow dot here, like a robot in the world,
10326
08:35:55,040 --> 08:35:57,160
trying to navigate its way through this grid.
10327
08:35:57,160 --> 08:36:00,160
And ultimately, it's trying to find its way to the goal.
10328
08:36:00,160 --> 08:36:04,280
And if it gets to the green goal, then it's going to get some sort of reward.
10329
08:36:04,280 --> 08:36:08,200
But then we might also have some red squares that are places
10330
08:36:08,200 --> 08:36:11,280
where you get some sort of punishment, some bad place where we don't want
10331
08:36:11,280 --> 08:36:12,400
the agent to go.
10332
08:36:12,400 --> 08:36:14,940
And if it ends up in the red square, then our agent
10333
08:36:14,940 --> 08:36:18,240
is going to get some sort of punishment as a result of that.
10334
08:36:18,240 --> 08:36:21,560
But the agent originally doesn't know all of these details.
10335
08:36:21,560 --> 08:36:24,280
It doesn't know that these states are associated with punishments.
10336
08:36:24,280 --> 08:36:27,120
But maybe it does know that this state is associated with a reward.
10337
08:36:27,120 --> 08:36:28,120
Maybe it doesn't.
10338
08:36:28,120 --> 08:36:30,680
But it just needs to sort of interact with the environment
10339
08:36:30,680 --> 08:36:33,960
to try and figure out what to do and what not to do.
10340
08:36:33,960 --> 08:36:35,800
So the first thing the agent might do is,
10341
08:36:35,800 --> 08:36:39,120
given no additional information, if it doesn't know what the punishments are,
10342
08:36:39,120 --> 08:36:43,080
it doesn't know where the rewards are, it just might try and take an action.
10343
08:36:43,080 --> 08:36:45,640
And it takes an action and ends up realizing
10344
08:36:45,640 --> 08:36:47,560
that it got some sort of punishment.
10345
08:36:47,560 --> 08:36:49,760
And so what does it learn from that experience?
10346
08:36:49,760 --> 08:36:53,480
Well, it might learn that when you're in this state in the future,
10347
08:36:53,480 --> 08:36:57,200
don't take the action of moving to the right; that is a bad action to take.
10348
08:36:57,200 --> 08:36:59,840
That in the future, if you ever find yourself back in the state,
10349
08:36:59,840 --> 08:37:02,200
don't take this action of going to the right
10350
08:37:02,200 --> 08:37:05,280
when you're in this particular state, because that leads to punishment.
10351
08:37:05,280 --> 08:37:06,780
That might be the intuition at least.
10352
08:37:06,780 --> 08:37:08,560
And so you could try doing other actions.
10353
08:37:08,560 --> 08:37:11,160
You move up, all right, that didn't lead to any immediate rewards.
10354
08:37:11,160 --> 08:37:12,840
Maybe try something else.
10355
08:37:12,840 --> 08:37:14,680
Then maybe try something else.
10356
08:37:14,680 --> 08:37:17,160
And all right, now you found that you got another punishment.
10357
08:37:17,160 --> 08:37:18,840
And so you learn something from that experience.
10358
08:37:18,840 --> 08:37:20,800
So the next time you do this whole process,
10359
08:37:20,800 --> 08:37:22,960
you know that if you ever end up in this square,
10360
08:37:22,960 --> 08:37:26,040
you shouldn't take the down action, because being in this state
10361
08:37:26,040 --> 08:37:30,800
and taking that action ultimately leads to some sort of punishment,
10362
08:37:30,800 --> 08:37:33,040
a negative reward, in other words.
10363
08:37:33,040 --> 08:37:34,080
And this process repeats.
10364
08:37:34,080 --> 08:37:37,200
You might imagine just letting our agent explore the world,
10365
08:37:37,200 --> 08:37:41,200
learning over time what states tend to correspond with poor actions,
10367
08:37:43,960 --> 08:37:47,240
until eventually, if it tries enough things randomly,
10368
08:37:47,240 --> 08:37:50,600
it might find that eventually when you get to this state,
10369
08:37:50,600 --> 08:37:53,120
if you take the up action in this state, you
10370
08:37:53,120 --> 08:37:56,120
actually get a reward from that.
10371
08:37:56,120 --> 08:37:59,800
And what it can learn from that is that if you're in this state,
10372
08:37:59,800 --> 08:38:02,560
you should take the up action, because that leads to a reward.
10373
08:38:02,560 --> 08:38:05,160
And over time, you can also learn that if you're in this state,
10374
08:38:05,160 --> 08:38:08,520
you should take the left action, because that leads to this state that also
10375
08:38:08,520 --> 08:38:10,280
lets you eventually get to the reward.
10376
08:38:10,280 --> 08:38:14,080
So you begin to learn over time not only which actions
10377
08:38:14,080 --> 08:38:18,360
are good in particular states, but also which actions are bad,
10378
08:38:18,360 --> 08:38:20,680
such that once you know some sequence of good actions that
10379
08:38:20,680 --> 08:38:24,960
leads you to some sort of reward, our agent can just follow those
10380
08:38:24,960 --> 08:38:27,680
instructions, follow the experience that it has learned.
10381
08:38:27,680 --> 08:38:30,240
We didn't tell the agent what the goal was.
10382
08:38:30,240 --> 08:38:32,800
We didn't tell the agent where the punishments were.
10383
08:38:32,800 --> 08:38:35,680
But the agent can begin to learn from this experience
10384
08:38:35,680 --> 08:38:40,720
and learn to begin to perform these sorts of tasks better in the future.
10385
08:38:40,720 --> 08:38:43,840
And so let's now try to formalize this idea, formalize the idea
10386
08:38:43,840 --> 08:38:47,440
that we would like to be able to learn in this state taking this action,
10387
08:38:47,440 --> 08:38:49,120
is that a good thing or a bad thing?
10388
08:38:49,120 --> 08:38:51,760
There are lots of different models for reinforcement learning.
10389
08:38:51,760 --> 08:38:53,600
We're just going to look at one of them today.
10390
08:38:53,600 --> 08:38:57,280
And the one that we're going to look at is a method known as Q-learning.
10391
08:38:57,280 --> 08:38:59,880
And what Q-learning is all about is about learning
10392
08:38:59,880 --> 08:39:05,440
a function Q(s, a), that takes inputs s and a, where s is a state
10393
08:39:05,440 --> 08:39:07,760
and a is an action that you take in that state.
10394
08:39:07,760 --> 08:39:12,280
And what this Q function is going to do is it is going to estimate the value.
10395
08:39:12,280 --> 08:39:18,880
How much reward will I get from taking this action in this state?
10396
08:39:18,880 --> 08:39:21,800
Originally, we don't know what this Q function should be.
10397
08:39:21,800 --> 08:39:24,800
But over time, based on experience, based on trying things out
10398
08:39:24,800 --> 08:39:28,160
and seeing what the result is, I would like to try and learn
10399
08:39:28,160 --> 08:39:32,680
what Q(s, a) is for any particular state and any particular action
10400
08:39:32,680 --> 08:39:34,680
that I might take in that state.
10401
08:39:34,680 --> 08:39:35,800
So what is the approach?
10402
08:39:35,800 --> 08:39:40,960
Well, the approach originally is we'll start with Q(s, a) equal to 0 for all
10403
08:39:40,960 --> 08:39:43,840
states s and for all actions a. That initially,
10404
08:39:43,840 --> 08:39:47,200
before I've ever started anything, before I've had any experiences,
10405
08:39:47,200 --> 08:39:50,760
I don't know the value of taking any action in any given state.
10406
08:39:50,760 --> 08:39:55,240
So I'm going to assume that the value is just 0 all across the board.
10407
08:39:55,240 --> 08:39:59,720
But then as I interact with the world, as I experience rewards or punishments,
10408
08:39:59,720 --> 08:40:03,400
or maybe I go to a cell where I don't get either reward or a punishment,
10409
08:40:03,400 --> 08:40:07,240
I want to somehow update my estimate of Q(s, a).
10410
08:40:07,240 --> 08:40:10,160
I want to continually update my estimate of Q(s, a)
10411
08:40:10,160 --> 08:40:13,680
based on the experiences and rewards and punishments that I've received,
10412
08:40:13,680 --> 08:40:17,160
such that in the future, my knowledge of what actions are good
10413
08:40:17,160 --> 08:40:19,160
in what states will be better.
10414
08:40:19,160 --> 08:40:22,240
So when we take an action and receive some sort of reward,
10415
08:40:22,240 --> 08:40:25,680
I want to estimate the new value of Q SA.
10416
08:40:25,680 --> 08:40:28,360
And I estimate that based on a couple of different things.
10417
08:40:28,360 --> 08:40:32,040
I estimate it based on the reward that I'm getting from taking this action
10418
08:40:32,040 --> 08:40:33,760
and getting into the next state.
10419
08:40:33,760 --> 08:40:37,520
But assuming the situation isn't over, assuming there are still
10420
08:40:37,520 --> 08:40:40,000
future actions that I might take as well,
10421
08:40:40,000 --> 08:40:44,640
I also need to take into account the expected future rewards.
10422
08:40:44,640 --> 08:40:47,520
That if you imagine an agent interacting with the environment,
10423
08:40:47,520 --> 08:40:49,960
then sometimes you'll take an action and get a reward,
10424
08:40:49,960 --> 08:40:52,920
but then you can keep taking more actions and get more rewards,
10425
08:40:52,920 --> 08:40:55,240
that these both are relevant, both the current reward
10426
08:40:55,240 --> 08:40:58,520
I'm getting from this current step and also my future reward.
10427
08:40:58,520 --> 08:41:01,160
And it might be the case that I'll want to take a step that
10428
08:41:01,160 --> 08:41:05,080
doesn't immediately lead to a reward, because later on down the line,
10429
08:41:05,080 --> 08:41:07,600
I know it will lead to more rewards as well.
10430
08:41:07,600 --> 08:41:10,480
So there's a balancing act between current rewards
10431
08:41:10,480 --> 08:41:13,400
that the agent experiences and future rewards
10432
08:41:13,400 --> 08:41:16,800
that the agent experiences as well.
10433
08:41:16,800 --> 08:41:19,560
And then we need to update QSA.
10434
08:41:19,560 --> 08:41:22,560
So we estimate the value of QSA based on the current reward
10435
08:41:22,560 --> 08:41:24,360
and the expected future rewards.
10436
08:41:24,360 --> 08:41:26,920
And then we need to update this Q function
10437
08:41:26,920 --> 08:41:29,480
to take into account this new estimate.
10438
08:41:29,480 --> 08:41:31,680
Now, as we go through this process,
10439
08:41:31,680 --> 08:41:35,120
we'll already have an estimate for what we think the value is.
10440
08:41:35,120 --> 08:41:37,120
Now we have a new estimate, and then somehow we
10441
08:41:37,120 --> 08:41:39,520
need to combine these two estimates together,
10442
08:41:39,520 --> 08:41:43,040
and we'll look at more formal ways that we can actually begin to do that.
10443
08:41:43,040 --> 08:41:45,720
So to actually show you what this formula looks like,
10444
08:41:45,720 --> 08:41:47,760
here is the approach we'll take with Q learning.
10445
08:41:47,760 --> 08:41:52,440
We're going to, again, start with Q of S and A being equal to 0 for all states.
10446
08:41:52,440 --> 08:41:59,760
And then every time we take an action A in state S and observe a reward R,
10447
08:41:59,760 --> 08:42:04,160
we're going to update our value, our estimate, for Q of SA.
10448
08:42:04,160 --> 08:42:06,720
And the idea is that we're going to figure out
10449
08:42:06,720 --> 08:42:12,120
what the new value estimate is minus what our existing value estimate is.
10450
08:42:12,120 --> 08:42:15,720
And so we have some preconceived notion for what the value is
10451
08:42:15,720 --> 08:42:17,400
for taking this action in this state.
10452
08:42:17,400 --> 08:42:21,400
Maybe our expectation is we currently think the value is 10.
10453
08:42:21,400 --> 08:42:24,480
But then we're going to estimate what we now think it's going to be.
10454
08:42:24,480 --> 08:42:27,200
Maybe the new value estimate is something like 20.
10455
08:42:27,200 --> 08:42:30,520
So there's a delta of 10 that our new value estimate
10456
08:42:30,520 --> 08:42:35,200
is 10 points higher than what our current value estimate happens to be.
10457
08:42:35,200 --> 08:42:37,120
And so we have a couple of options here.
10458
08:42:37,120 --> 08:42:40,020
We need to decide how much we want to adjust
10459
08:42:40,020 --> 08:42:42,800
our current expectation of what the value is
10460
08:42:42,800 --> 08:42:45,640
of taking this action in this particular state.
10461
08:42:45,640 --> 08:42:49,560
And what that difference is, how much we add or subtract
10462
08:42:49,560 --> 08:42:52,720
from our existing notion of how much do we expect the value to be,
10463
08:42:52,720 --> 08:42:56,680
is dependent on this parameter alpha, also called a learning rate.
10464
08:42:56,680 --> 08:43:01,200
And alpha represents, in effect, how much we value new information
10465
08:43:01,200 --> 08:43:04,680
compared to how much we value old information.
10466
08:43:04,680 --> 08:43:08,320
An alpha value of 1 means we really value new information.
10467
08:43:08,320 --> 08:43:10,520
But if we have a new estimate, then it doesn't
10468
08:43:10,520 --> 08:43:12,160
matter what our old estimate is.
10469
08:43:12,160 --> 08:43:14,080
We're only going to consider our new estimate
10470
08:43:14,080 --> 08:43:18,240
because we always just want to take into consideration our new information.
10471
08:43:18,240 --> 08:43:21,960
So the way that works is that if you imagine alpha being 1,
10472
08:43:21,960 --> 08:43:25,760
well, then we're taking the old value of QSA
10473
08:43:25,760 --> 08:43:29,840
and then adding 1 times the new value minus the old value.
10474
08:43:29,840 --> 08:43:31,800
And that just leaves us with the new value.
10475
08:43:31,800 --> 08:43:34,940
So when alpha is 1, all we take into consideration
10476
08:43:34,940 --> 08:43:37,600
is what our new estimate happens to be.
10477
08:43:37,600 --> 08:43:40,400
But over time, as we go through a lot of experiences,
10478
08:43:40,400 --> 08:43:42,520
we already have some existing information.
10479
08:43:42,520 --> 08:43:46,000
We might have tried taking this action nine times already.
10480
08:43:46,000 --> 08:43:48,000
And now we just tried it a 10th time.
10481
08:43:48,000 --> 08:43:51,160
And we don't only want to consider this 10th experience.
10482
08:43:51,160 --> 08:43:54,800
I also want to consider the fact that my prior nine experiences, those
10483
08:43:54,800 --> 08:43:55,640
were meaningful, too.
10484
08:43:55,640 --> 08:43:58,240
And that's data I don't necessarily want to lose.
10485
08:43:58,240 --> 08:44:01,080
And so this alpha controls that decision,
10486
08:44:01,080 --> 08:44:03,400
controls how important is the new information.
10487
08:44:03,400 --> 08:44:06,480
0 would mean ignore all the new information.
10488
08:44:06,480 --> 08:44:09,320
Just keep this Q value the same.
10489
08:44:09,320 --> 08:44:13,120
1 means replace the old information entirely with the new information.
10490
08:44:13,120 --> 08:44:17,920
And somewhere in between, keep some sort of balance between these two values.
10491
08:44:17,920 --> 08:44:21,000
We can put this equation a little bit more formally as well.
10492
08:44:21,000 --> 08:44:23,880
The old value estimate is our old estimate
10493
08:44:23,880 --> 08:44:27,600
for what the value is of taking this action in a particular state.
10494
08:44:27,600 --> 08:44:30,040
That's just Q of S and A.
10495
08:44:30,040 --> 08:44:33,120
So we have it once here, and we're going to add something to it.
10496
08:44:33,120 --> 08:44:35,580
We're going to add alpha times the new value estimate
10497
08:44:35,580 --> 08:44:37,680
minus the old value estimate.
10498
08:44:37,680 --> 08:44:42,280
But the old value estimate, we just look up by calling this Q function.
10499
08:44:42,280 --> 08:44:44,240
And what then is the new value estimate?
10500
08:44:44,240 --> 08:44:46,440
Based on this experience we have just taken,
10501
08:44:46,440 --> 08:44:48,800
what is our new estimate for the value of taking
10502
08:44:48,800 --> 08:44:51,480
this action in this particular state?
10503
08:44:51,480 --> 08:44:54,000
Well, it's going to be composed of two parts.
10504
08:44:54,000 --> 08:44:56,940
It's going to be composed of what reward did I just
10505
08:44:56,940 --> 08:45:00,000
get from taking this action in this state.
10506
08:45:00,000 --> 08:45:03,320
And then it's going to be, what can I expect my future rewards
10507
08:45:03,320 --> 08:45:05,600
to be from this point forward?
10508
08:45:05,600 --> 08:45:10,200
So it's going to be R, some reward I'm getting right now,
10509
08:45:10,200 --> 08:45:14,280
plus whatever I estimate I'm going to get in the future.
10510
08:45:14,280 --> 08:45:16,760
And how do I estimate what I'm going to get in the future?
10511
08:45:16,760 --> 08:45:19,960
Well, it's a bit of another call to this Q function.
10512
08:45:19,960 --> 08:45:23,940
It's going to be take the maximum across all possible actions
10513
08:45:23,940 --> 08:45:27,920
I could take next and say, all right, of all of these possible actions
10514
08:45:27,920 --> 08:45:31,480
I could take, which one is going to have the highest reward?
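Putting those pieces together, the update being described can be sketched in Python. This is a minimal illustration, not the lecture's actual code; the table q, the learning rate alpha, and the helper available_actions are assumed names.

```python
# A minimal sketch of the Q-learning update: Q(s, a) gets nudged toward
# (reward now) + (best estimated future reward), scaled by alpha.
# q, alpha, and available_actions are illustrative assumptions.
from collections import defaultdict

q = defaultdict(float)   # Q(s, a) starts at 0 for every state-action pair
alpha = 0.5              # learning rate: how much we value new information

def best_future_reward(state, available_actions):
    """Max of Q(state, a) over all actions we could take next; 0 if none remain."""
    actions = available_actions(state)
    if not actions:
        return 0
    return max(q[(state, a)] for a in actions)

def update(old_state, action, new_state, reward, available_actions):
    """Q(s, a) <- old estimate + alpha * (new estimate - old estimate)."""
    old = q[(old_state, action)]
    new_estimate = reward + best_future_reward(new_state, available_actions)
    q[(old_state, action)] = old + alpha * (new_estimate - old)
```

With alpha = 1 this would replace the old estimate entirely; with alpha = 0 it would ignore the new information, matching the discussion of the learning rate above.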
10515
08:45:31,480 --> 08:45:33,600
And so this then looks a little bit complicated.
10516
08:45:33,600 --> 08:45:35,400
This is going to be our notion for how we're
10517
08:45:35,400 --> 08:45:37,680
going to perform this kind of update.
10518
08:45:37,680 --> 08:45:41,680
I have some estimate, some old estimate, for what the value is
10519
08:45:41,680 --> 08:45:44,040
of taking this action in this state.
10520
08:45:44,040 --> 08:45:46,920
And I'm going to update it based on new information
10521
08:45:46,920 --> 08:45:48,680
that I experience some reward.
10522
08:45:48,680 --> 08:45:51,240
I predict what my future reward is going to be.
10523
08:45:51,240 --> 08:45:54,600
And using that I update what I estimate the reward will
10524
08:45:54,600 --> 08:45:57,880
be for taking this action in this particular state.
10525
08:45:57,880 --> 08:46:00,760
And there are other additions you might make to this algorithm as well.
10526
08:46:00,760 --> 08:46:03,200
Sometimes it might not be the case that future rewards
10527
08:46:03,200 --> 08:46:05,760
you want to weight equally with current rewards.
10528
08:46:05,760 --> 08:46:10,360
Maybe you want an agent that values reward now over reward later.
10529
08:46:10,360 --> 08:46:13,940
And so sometimes you can even add another term in here, some other parameter,
10530
08:46:13,940 --> 08:46:17,800
where you discount future rewards and say future rewards are not
10531
08:46:17,800 --> 08:46:19,840
as valuable as rewards immediately.
10532
08:46:19,840 --> 08:46:21,640
That getting reward in the current time step
10533
08:46:21,640 --> 08:46:24,520
is better than waiting a year and getting rewards later.
10534
08:46:24,520 --> 08:46:26,200
But that's something up to the programmer
10535
08:46:26,200 --> 08:46:29,240
to decide what that parameter ought to be.
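That discounting parameter is conventionally a factor between 0 and 1, usually called gamma; the lecture only calls it "another parameter," so the name here is an assumption.

```python
# Sketch of the discounted variant: a factor gamma (an assumed name) shrinks
# future rewards relative to the immediate reward.
gamma = 0.9  # gamma = 1 weights future rewards equally; gamma = 0 ignores them

def new_value_estimate(reward, best_future_reward):
    # current reward plus discounted estimate of future rewards
    return reward + gamma * best_future_reward
```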
10536
08:46:29,240 --> 08:46:32,060
But the big picture idea of this entire formula
10537
08:46:32,060 --> 08:46:35,600
is to say that every time we experience some new reward,
10538
08:46:35,600 --> 08:46:36,840
we take that into account.
10539
08:46:36,840 --> 08:46:40,760
We update our estimate of how good is this action.
10540
08:46:40,760 --> 08:46:44,040
And then in the future, we can make decisions based on that algorithm.
10541
08:46:44,040 --> 08:46:48,160
Once we have some good estimate for every state and for every action,
10542
08:46:48,160 --> 08:46:50,920
what the value is of taking that action, then we
10543
08:46:50,920 --> 08:46:54,920
can do something like implement a greedy decision making policy.
10544
08:46:54,920 --> 08:46:57,960
That if I am in a state and I want to know what action
10545
08:46:57,960 --> 08:47:00,160
should I take in that state, well, then I
10546
08:47:00,160 --> 08:47:05,600
consider for all of my possible actions, what is the value of QSA?
10547
08:47:05,600 --> 08:47:08,960
What is my estimated value of taking that action in that state?
10548
08:47:08,960 --> 08:47:12,920
And I will just pick the action that has the highest value
10549
08:47:12,920 --> 08:47:15,360
after I evaluate that expression.
10550
08:47:15,360 --> 08:47:17,560
So I pick the action that has the highest value.
10551
08:47:17,560 --> 08:47:19,960
And based on that, that tells me what action I should take.
10552
08:47:19,960 --> 08:47:24,880
At any given state that I'm in, I can just greedily say across all my actions,
10553
08:47:24,880 --> 08:47:27,960
this action gives me the highest expected value.
10554
08:47:27,960 --> 08:47:33,320
And so I'll go ahead and choose that action as the action that I take as well.
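The greedy decision-making policy just described can be sketched as follows, assuming the learned Q values live in a dictionary named q (an assumed name, not the lecture's code).

```python
# Greedy policy sketch: in a given state, evaluate Q(state, a) for every
# available action and pick the highest-valued one.
from collections import defaultdict

q = defaultdict(float)  # assumed table of learned Q-value estimates

def greedy_action(state, actions):
    """Choose the action with the highest estimated value in this state."""
    return max(actions, key=lambda a: q[(state, a)])
```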
10555
08:47:33,320 --> 08:47:36,160
But there is a downside to this kind of approach.
10556
08:47:36,160 --> 08:47:38,760
And that downside comes up in a situation like this,
10557
08:47:38,760 --> 08:47:44,080
where we know that there is some solution that gets me to the reward.
10558
08:47:44,080 --> 08:47:46,400
And our agent has been able to figure that out.
10559
08:47:46,400 --> 08:47:49,920
But it might not necessarily be the best way or the fastest way.
10560
08:47:49,920 --> 08:47:52,640
If the agent is allowed to explore a little bit more,
10561
08:47:52,640 --> 08:47:55,160
it might find that it can get the reward faster
10562
08:47:55,160 --> 08:47:59,600
by taking some other route instead, by going through this particular path
10563
08:47:59,600 --> 08:48:04,240
that is a faster way to get to that ultimate goal.
10564
08:48:04,240 --> 08:48:07,640
And maybe we would like for the agent to be able to figure that out as well.
10565
08:48:07,640 --> 08:48:11,560
But if the agent always takes the actions that it knows to be best,
10566
08:48:11,560 --> 08:48:13,840
well, when it gets to this particular square,
10567
08:48:13,840 --> 08:48:17,680
it doesn't know that this is a good action because it's never really tried it.
10568
08:48:17,680 --> 08:48:21,840
But it knows that going down eventually leads its way to this reward.
10569
08:48:21,840 --> 08:48:25,160
So it might learn in the future that it should just always take this route
10570
08:48:25,160 --> 08:48:29,760
and it's never going to explore and go along that route instead.
10571
08:48:29,760 --> 08:48:32,080
So in reinforcement learning, there is this tension
10572
08:48:32,080 --> 08:48:35,360
between exploration and exploitation.
10573
08:48:35,360 --> 08:48:40,240
And exploitation generally refers to using knowledge that the AI already has.
10574
08:48:40,240 --> 08:48:43,520
The AI already knows that this is a move that leads to reward.
10575
08:48:43,520 --> 08:48:45,400
So we'll go ahead and use that move.
10576
08:48:45,400 --> 08:48:49,280
And exploration is all about exploring other actions
10577
08:48:49,280 --> 08:48:51,720
that we may not have explored as thoroughly before
10578
08:48:51,720 --> 08:48:54,920
because maybe one of these actions, even if I don't know anything about it,
10579
08:48:54,920 --> 08:49:00,200
might lead to better rewards faster or to more rewards in the future.
10580
08:49:00,200 --> 08:49:04,440
And so an agent that only ever exploits information and never explores
10581
08:49:04,440 --> 08:49:07,680
might be able to get reward, but it might not maximize its rewards
10582
08:49:07,680 --> 08:49:10,800
because it doesn't know what other possibilities are out there,
10583
08:49:10,800 --> 08:49:15,840
possibilities that we only know about by taking advantage of exploration.
10584
08:49:15,840 --> 08:49:17,640
And so how can we try and address this?
10585
08:49:17,640 --> 08:49:21,480
Well, one possible solution is known as the Epsilon greedy algorithm,
10586
08:49:21,480 --> 08:49:26,000
where we set Epsilon equal to how often we want to just make a random move,
10587
08:49:26,000 --> 08:49:29,600
where occasionally we will just make a random move in order to say,
10588
08:49:29,600 --> 08:49:33,200
let's try to explore and see what happens.
10589
08:49:33,200 --> 08:49:38,200
And then the logic of the algorithm will be with probability 1 minus Epsilon,
10590
08:49:38,200 --> 08:49:40,760
choose the estimated best move.
10591
08:49:40,760 --> 08:49:43,600
In a greedy case, we'd always choose the best move.
10592
08:49:43,600 --> 08:49:46,960
But in Epsilon greedy, we're most of the time
10593
08:49:46,960 --> 08:49:50,480
going to choose the estimated best move.
10594
08:49:50,480 --> 08:49:53,040
But sometimes with probability Epsilon, we're
10595
08:49:53,040 --> 08:49:56,120
going to choose a random move instead.
10596
08:49:56,120 --> 08:49:58,760
So every time we're faced with the ability to take an action,
10597
08:49:58,760 --> 08:50:00,880
sometimes we're going to choose the best move.
10598
08:50:00,880 --> 08:50:03,840
Sometimes we're just going to choose a random move.
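The epsilon-greedy rule just described can be sketched like this; the q table is an assumed dictionary of learned estimates, carried over from the earlier discussion.

```python
# Epsilon-greedy sketch: with probability epsilon make a random (exploratory)
# move; otherwise exploit the estimated best move. q is an assumed Q table.
import random
from collections import defaultdict

q = defaultdict(float)  # assumed Q-value estimates, (state, action) -> value
epsilon = 0.1           # probability of choosing a random move

def choose_action(state, actions, eps=epsilon):
    """With probability eps explore; with probability 1 - eps exploit."""
    if random.random() < eps:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: q[(state, a)])       # exploit
```

Decreasing eps over time, as suggested below, shifts the agent from exploration early on toward exploitation once it is more confident in its estimates.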
10599
08:50:03,840 --> 08:50:07,520
So this type of algorithm can be quite powerful in a reinforcement learning
10600
08:50:07,520 --> 08:50:11,480
context by not always just choosing the best possible move right now,
10601
08:50:11,480 --> 08:50:14,480
but sometimes, especially early on, allowing yourself
10602
08:50:14,480 --> 08:50:18,160
to make random moves that allow you to explore various different possible
10603
08:50:18,160 --> 08:50:20,920
states and actions more, and maybe over time,
10604
08:50:20,920 --> 08:50:23,280
you might decrease your value of Epsilon.
10605
08:50:23,280 --> 08:50:25,240
More and more often, choosing the best move
10606
08:50:25,240 --> 08:50:27,440
after you're more confident that you've explored
10607
08:50:27,440 --> 08:50:30,640
what all of the possibilities actually are.
10608
08:50:30,640 --> 08:50:32,160
So we can put this into practice.
10609
08:50:32,160 --> 08:50:34,760
And one very common application of reinforcement learning
10610
08:50:34,760 --> 08:50:38,320
is in game playing, that if you want to teach an agent how to play a game,
10611
08:50:38,320 --> 08:50:41,200
you just let the agent play the game a whole bunch.
10612
08:50:41,200 --> 08:50:44,160
And then the reward signal happens at the end of the game.
10613
08:50:44,160 --> 08:50:47,360
When the game is over, if our AI won the game,
10614
08:50:47,360 --> 08:50:49,640
it gets a reward of like 1, for example.
10615
08:50:49,640 --> 08:50:53,040
And if it lost the game, it gets a reward of negative 1.
10616
08:50:53,040 --> 08:50:56,080
And from that, it begins to learn what actions are good
10617
08:50:56,080 --> 08:50:57,080
and what actions are bad.
10618
08:50:57,080 --> 08:50:59,560
You don't have to tell the AI what's good and what's bad,
10619
08:50:59,560 --> 08:51:01,840
but the AI figures it out based on that reward.
10620
08:51:01,840 --> 08:51:04,960
Winning the game is some signal, losing the game is some signal,
10621
08:51:04,960 --> 08:51:07,240
and based on all of that, it begins to figure out
10622
08:51:07,240 --> 08:51:09,960
what decisions it should actually make.
10623
08:51:09,960 --> 08:51:13,040
So one very simple game, which you may have played before, is a game called
10624
08:51:13,040 --> 08:51:13,800
Nim.
10625
08:51:13,800 --> 08:51:16,360
And in the game of Nim, you've got a whole bunch of objects
10626
08:51:16,360 --> 08:51:18,520
in a whole bunch of different piles, where here I've
10627
08:51:18,520 --> 08:51:20,880
represented each pile as an individual row.
10628
08:51:20,880 --> 08:51:22,720
So you've got one object in the first pile,
10629
08:51:22,720 --> 08:51:26,280
three in the second pile, five in the third pile, seven in the fourth pile.
10630
08:51:26,280 --> 08:51:28,360
And the game of Nim is a two player game
10631
08:51:28,360 --> 08:51:31,880
where players take turns removing objects from piles.
10632
08:51:31,880 --> 08:51:34,160
And the rule is that on any given turn, you
10633
08:51:34,160 --> 08:51:39,120
are allowed to remove as many objects as you want from any one of these piles,
10634
08:51:39,120 --> 08:51:40,240
any one of these rows.
10635
08:51:40,240 --> 08:51:42,160
You have to remove at least one object, but you
10636
08:51:42,160 --> 08:51:46,800
remove as many as you want from exactly one of the piles.
10637
08:51:46,800 --> 08:51:50,720
And whoever takes the last object loses.
10638
08:51:50,720 --> 08:51:54,600
So player one might remove four from this pile here.
10639
08:51:54,600 --> 08:51:57,640
Player two might remove four from this pile here.
10640
08:51:57,640 --> 08:52:00,520
So now we've got four piles left, one, three, one, and three.
10641
08:52:00,520 --> 08:52:03,960
Player one might remove the entirety of the second pile.
10642
08:52:03,960 --> 08:52:09,840
Player two, if they're being strategic, might remove two from the third pile.
10643
08:52:09,840 --> 08:52:13,080
Now we've got three piles left, each with one object left.
10644
08:52:13,080 --> 08:52:15,360
Player one might remove one from one pile.
10645
08:52:15,360 --> 08:52:17,720
Player two removes one from the other pile.
10646
08:52:17,720 --> 08:52:22,120
And now player one is left with choosing this one object from the last pile,
10647
08:52:22,120 --> 08:52:24,640
at which point player one loses the game.
10648
08:52:24,640 --> 08:52:25,920
So fairly simple game.
10649
08:52:25,920 --> 08:52:28,960
Piles of objects, any turn you choose how many objects
10650
08:52:28,960 --> 08:52:33,240
to remove from a pile, whoever removes the last object loses.
10651
08:52:33,240 --> 08:52:36,980
And this is the type of game you could encode into an AI fairly easily,
10652
08:52:36,980 --> 08:52:39,480
because the states are really just four numbers.
10653
08:52:39,480 --> 08:52:43,080
Every state is just how many objects in each of the four piles.
10654
08:52:43,080 --> 08:52:45,440
And the actions are things like, how many
10655
08:52:45,440 --> 08:52:49,040
am I going to remove from each one of these individual piles?
10656
08:52:49,040 --> 08:52:51,440
And the reward happens at the end, that if you
10657
08:52:51,440 --> 08:52:53,920
were the player that had to remove the last object,
10658
08:52:53,920 --> 08:52:55,920
then you get some sort of punishment.
10659
08:52:55,920 --> 08:52:57,760
But if you were not, and the other player
10660
08:52:57,760 --> 08:53:01,760
had to remove the last object, well, then you get some sort of reward.
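One possible encoding of those states and actions, shown for illustration; the names and representation here are assumptions, not necessarily the lecture's implementation.

```python
# Nim sketch: a state is a tuple of pile sizes; an action is a pair
# (pile index, number of objects to remove). Names are assumptions.
initial_state = (1, 3, 5, 7)

def available_actions(piles):
    """All (pile, count) pairs: take at least 1 object from exactly one pile."""
    return [(i, n) for i, pile in enumerate(piles)
                   for n in range(1, pile + 1)]

def apply_action(piles, action):
    """Return the new state after removing `count` objects from one pile."""
    i, count = action
    return tuple(p - count if j == i else p for j, p in enumerate(piles))
```

Because a state is just a tuple of numbers, it can be used directly as a key into a Q table like the one sketched earlier.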
10661
08:53:01,760 --> 08:53:04,600
So we could actually try and show a demonstration of this,
10662
08:53:04,600 --> 08:53:08,720
that I've implemented an AI to play the game of Nim.
10663
08:53:08,720 --> 08:53:11,640
All right, so here, what we're going to do is create an AI
10664
08:53:11,640 --> 08:53:15,160
as a result of training the AI on some number of games,
10665
08:53:15,160 --> 08:53:18,360
that the AI is going to play against itself, where the idea is the AI will
10666
08:53:18,360 --> 08:53:22,000
play games against itself, learn from each of those experiences,
10667
08:53:22,000 --> 08:53:23,640
and learn what to do in the future.
10668
08:53:23,640 --> 08:53:26,280
And then I, the human, will play against the AI.
10669
08:53:26,280 --> 08:53:28,320
So initially, we'll say train zero times,
10670
08:53:28,320 --> 08:53:32,120
meaning we're not going to let the AI play any practice games against itself
10671
08:53:32,120 --> 08:53:34,040
in order to learn from its experiences.
10672
08:53:34,040 --> 08:53:36,840
We're just going to see how well it plays.
10673
08:53:36,840 --> 08:53:38,400
And it looks like there are four piles.
10674
08:53:38,400 --> 08:53:41,520
I can choose how many I remove from any one of the piles.
10675
08:53:41,520 --> 08:53:46,920
So maybe from pile three, I will remove five objects, for example.
10676
08:53:46,920 --> 08:53:50,280
So now, the AI chose to take one item from pile zero.
10677
08:53:50,280 --> 08:53:53,000
So I'm left with these piles now, for example.
10678
08:53:53,000 --> 08:53:55,440
And so here, I could choose maybe to say, I
10679
08:53:55,440 --> 08:54:00,200
would like to remove from pile two, I'll remove all five of them,
10680
08:54:00,200 --> 08:54:01,720
for example.
10681
08:54:01,720 --> 08:54:04,240
And so the AI chose to take two away from pile one.
10682
08:54:04,240 --> 08:54:08,360
Now I'm left with one pile that has one object, one pile that has two objects.
10683
08:54:08,360 --> 08:54:11,880
So from pile three, I will remove two objects.
10684
08:54:11,880 --> 08:54:15,240
And now I've left the AI with no choice but to take that last one.
10685
08:54:15,240 --> 08:54:17,680
And so the game is over, and I was able to win.
10686
08:54:17,680 --> 08:54:20,120
But I did so because the AI was really just playing randomly.
10687
08:54:20,120 --> 08:54:23,040
It didn't have any prior experience that it was using in order
10688
08:54:23,040 --> 08:54:24,840
to make these sorts of judgments.
10689
08:54:24,840 --> 08:54:29,120
Now let me let the AI train itself on 10,000 games.
10690
08:54:29,120 --> 08:54:32,920
I'm going to let the AI play 10,000 games of Nim against itself.
10691
08:54:32,920 --> 08:54:36,120
Every time it wins or loses, it's going to learn from that experience
10692
08:54:36,120 --> 08:54:39,760
and learn in the future what to do and what not to do.
10693
08:54:39,760 --> 08:54:42,560
So here then, I'll go ahead and run this again.
10694
08:54:42,560 --> 08:54:45,720
And now you see the AI running through a whole bunch of training games,
10695
08:54:45,720 --> 08:54:47,680
10,000 training games against itself.
10696
08:54:47,680 --> 08:54:50,440
And now it's going to let me make these sorts of decisions.
10697
08:54:50,440 --> 08:54:52,560
So now I'm going to play against the AI.
10698
08:54:52,560 --> 08:54:55,880
Maybe I'll remove one from pile three.
10699
08:54:55,880 --> 08:54:59,520
And the AI took everything from pile three, so I'm left with three piles.
10700
08:54:59,520 --> 08:55:04,640
I'll go ahead and from pile two maybe remove three items.
10701
08:55:04,640 --> 08:55:07,280
And the AI removes one item from pile zero.
10702
08:55:07,280 --> 08:55:10,480
I'm left with two piles, each of which has two items in it.
10703
08:55:10,480 --> 08:55:14,400
I'll remove one from pile one, I guess.
10704
08:55:14,400 --> 08:55:17,520
And the AI took two from pile two, leaving me with no choice
10705
08:55:17,520 --> 08:55:20,440
but to take one away from pile one.
10706
08:55:20,440 --> 08:55:24,600
So it seems like after playing 10,000 games of Nim against itself,
10707
08:55:24,600 --> 08:55:28,960
the AI has learned something about what states and what actions tend to be good
10708
08:55:28,960 --> 08:55:31,280
and has begun to learn some sort of pattern for how
10709
08:55:31,280 --> 08:55:33,720
to predict what actions are going to be good
10710
08:55:33,720 --> 08:55:37,200
and what actions are going to be bad in any given state.
10711
08:55:37,200 --> 08:55:39,880
So reinforcement learning can be a very powerful technique
10712
08:55:39,880 --> 08:55:42,480
for achieving these sorts of game-playing agents, agents
10713
08:55:42,480 --> 08:55:45,960
that are able to play a game well just by learning from experience,
10714
08:55:45,960 --> 08:55:47,880
whether that's playing against other people
10715
08:55:47,880 --> 08:55:51,880
or by playing against itself and learning from those experiences as well.
10716
08:55:51,880 --> 08:55:55,440
Now, Nim is a bit of an easy game to use reinforcement learning for
10717
08:55:55,440 --> 08:55:57,040
because there are so few states.
10718
08:55:57,040 --> 08:55:59,960
There are only as many states as there are configurations
10719
08:55:59,960 --> 08:56:02,080
of objects in each of these various different piles.
10720
08:56:02,080 --> 08:56:06,120
You might imagine that it's going to be harder if you think of a game like chess
10721
08:56:06,120 --> 08:56:09,960
or games where there are many, many more states and many, many more actions
10722
08:56:09,960 --> 08:56:11,840
that you can imagine taking, where it's not
10723
08:56:11,840 --> 08:56:15,600
going to be as easy to learn for every state and for every action
10724
08:56:15,600 --> 08:56:17,520
what the value is going to be.
10725
08:56:17,520 --> 08:56:20,040
So oftentimes in that case, we can't necessarily
10726
08:56:20,040 --> 08:56:23,960
learn exactly what the value is for every state and for every action,
10727
08:56:23,960 --> 08:56:25,400
but we can approximate it.
10728
08:56:25,400 --> 08:56:28,720
Much as we saw with minimax, where we could use a depth-limiting approach
10729
08:56:28,720 --> 08:56:31,760
to stop calculating at a certain point in time,
10730
08:56:31,760 --> 08:56:34,360
we can do a similar type of approximation known
10731
08:56:34,360 --> 08:56:37,640
as function approximation in a reinforcement learning context
10732
08:56:37,640 --> 08:56:42,800
where instead of learning a value of q for every state and every action,
10733
08:56:42,800 --> 08:56:46,160
we just have some function that estimates what the value is
10734
08:56:46,160 --> 08:56:49,000
for taking this action in this particular state that
10735
08:56:49,000 --> 08:56:53,240
might be based on various different features of the state
10736
08:56:53,240 --> 08:56:55,960
that the agent happens to be in, where you might have
10737
08:56:55,960 --> 08:56:58,400
to choose what those features actually are.
10738
08:56:58,400 --> 08:57:02,400
But you can begin to learn some patterns that generalize beyond one
10739
08:57:02,400 --> 08:57:05,840
specific state and one specific action, so that you can learn
10740
08:57:05,840 --> 08:57:08,480
if certain features tend to be good things or bad things.
10741
08:57:08,480 --> 08:57:11,760
Reinforcement learning can allow you, using a very similar mechanism,
10742
08:57:11,760 --> 08:57:14,320
to generalize beyond one particular state and say,
10743
08:57:14,320 --> 08:57:17,080
if this other state looks kind of like this state,
10744
08:57:17,080 --> 08:57:20,000
then maybe the similar types of actions that worked in one state
10745
08:57:20,000 --> 08:57:23,240
will also work in another state as well.
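A bare-bones sketch of that function approximation idea, using a hand-chosen feature function and a linear combination of weights; the feature choices and parameter values here are illustrative assumptions, not the lecture's code.

```python
# Function approximation sketch: instead of storing Q(s, a) for every pair,
# estimate it as a weighted sum of features of the (state, action) pair.
def features(state, action):
    """Map a (state, action) pair to a small numeric feature vector."""
    return [1.0, float(state), float(action)]  # bias term + toy features

weights = [0.0, 0.5, -0.2]  # learned parameters, shown fixed for illustration

def q_estimate(state, action):
    """Approximate Q(s, a) as a dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features(state, action)))
```

Because similar states produce similar feature vectors, this kind of estimate generalizes across states the agent has never visited, which a lookup table cannot do.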
10746
08:57:23,240 --> 08:57:25,360
And so this type of approach can be quite helpful
10747
08:57:25,360 --> 08:57:27,680
as you begin to deal with reinforcement learning that
10748
08:57:27,680 --> 08:57:31,600
exist in larger and larger state spaces where it's just not feasible
10749
08:57:31,600 --> 08:57:36,120
to explore all of the possible states that could actually exist.
10750
08:57:36,120 --> 08:57:39,400
So there, then, are two of the main categories of machine learning.
10751
08:57:39,400 --> 08:57:42,760
Supervised learning, where you have labeled input and output pairs,
10752
08:57:42,760 --> 08:57:46,480
and reinforcement learning, where an agent learns from rewards or punishments
10753
08:57:46,480 --> 08:57:47,480
that it receives.
10754
08:57:47,480 --> 08:57:49,640
The third major category of machine learning
10755
08:57:49,640 --> 08:57:53,400
that we'll just touch on briefly is known as unsupervised learning.
10756
08:57:53,400 --> 08:57:56,520
And unsupervised learning happens when we have data
10757
08:57:56,520 --> 08:57:59,280
without any additional feedback, without labels,
10758
08:57:59,280 --> 08:58:02,760
that in the supervised learning case, all of our data had labels.
10759
08:58:02,760 --> 08:58:06,520
We labeled the data point with whether that was a rainy day or not rainy day.
10760
08:58:06,520 --> 08:58:09,560
And using those labels, we were able to infer what the pattern was.
10761
08:58:09,560 --> 08:58:13,160
Or we labeled data as a counterfeit banknote or not a counterfeit.
10762
08:58:13,160 --> 08:58:16,720
And using those labels, we were able to draw inferences and patterns
10763
08:58:16,720 --> 08:58:20,840
to figure out what does a banknote look like versus not.
10764
08:58:20,840 --> 08:58:25,240
In unsupervised learning, we don't have any access to any of those labels.
10765
08:58:25,240 --> 08:58:28,320
But we still would like to learn some of those patterns.
10766
08:58:28,320 --> 08:58:31,920
And one of the tasks that you might want to perform in unsupervised learning
10767
08:58:31,920 --> 08:58:34,680
is something like clustering, where clustering is just
10768
08:58:34,680 --> 08:58:37,800
the task of, given some set of objects, organizing them
10769
08:58:37,800 --> 08:58:42,160
into distinct clusters, groups of objects that are similar to one another.
10770
08:58:42,160 --> 08:58:44,440
And there's lots of applications for clustering.
10771
08:58:44,440 --> 08:58:47,480
It comes up in genetic research, where you might have
10772
08:58:47,480 --> 08:58:50,840
a whole bunch of different genes and you want to cluster them into similar genes
10773
08:58:50,840 --> 08:58:54,480
if you're trying to analyze them across a population or across species.
10774
08:58:54,480 --> 08:58:57,480
It comes up with images, if you want to take all the pixels of an image,
10775
08:58:57,480 --> 08:58:59,520
cluster them into different parts of the image.
10776
08:58:59,520 --> 08:59:03,400
It comes up a lot in market research if you want to divide your consumers
10777
08:59:03,400 --> 08:59:06,640
into different groups so you know which groups to target with certain types
10778
08:59:06,640 --> 08:59:10,240
of product advertisements, for example, and a number of other contexts
10779
08:59:10,240 --> 08:59:13,280
as well in which clustering can be very applicable.
10780
08:59:13,280 --> 08:59:17,880
One technique for clustering is an algorithm known as k-means clustering.
10781
08:59:17,880 --> 08:59:20,240
And what k-means clustering is going to do
10782
08:59:20,240 --> 08:59:24,720
is it is going to divide all of our data points into k different clusters.
10783
08:59:24,720 --> 08:59:28,640
And it's going to do so by repeating this process of assigning points
10784
08:59:28,640 --> 08:59:32,640
to clusters and then moving around those clusters' centers.
10785
08:59:32,640 --> 08:59:36,600
We're going to define a cluster by its center, the middle of the cluster,
10786
08:59:36,600 --> 08:59:39,760
and then assign points to that cluster based on which
10787
08:59:39,760 --> 08:59:42,360
center is closest to that point.
10788
08:59:42,360 --> 08:59:44,560
And I'll show you an example of that now.
10789
08:59:44,560 --> 08:59:47,960
Here, for example, I have a whole bunch of unlabeled data,
10790
08:59:47,960 --> 08:59:51,760
just various data points that are in some sort of graphical space.
10791
08:59:51,760 --> 08:59:55,560
And I would like to group them into various different clusters.
10792
08:59:55,560 --> 08:59:57,400
But I don't know how to do that originally.
10793
08:59:57,400 --> 09:00:00,400
And let's say I want to assign like three clusters to this group.
10794
09:00:00,400 --> 09:00:03,400
And you have to choose how many clusters you want in k-means clustering
10795
09:00:03,400 --> 09:00:06,800
though you could try multiple values and see how well those perform.
10796
09:00:06,800 --> 09:00:09,960
But I'll start just by randomly picking some places
10797
09:00:09,960 --> 09:00:12,040
to put the centers of those clusters.
10798
09:00:12,040 --> 09:00:15,600
Maybe I have a blue cluster, a red cluster, and a green cluster.
10799
09:00:15,600 --> 09:00:18,040
And I'm going to start with the centers of those clusters
10800
09:00:18,040 --> 09:00:20,600
just being in these three locations here.
10801
09:00:20,600 --> 09:00:23,040
And what k-means clustering tells us to do
10802
09:00:23,040 --> 09:00:25,720
is once I have the centers of the clusters,
10803
09:00:25,720 --> 09:00:32,440
assign every point to a cluster based on which cluster center it is closest to.
10804
09:00:32,440 --> 09:00:35,920
So we end up with something like this, where all of these points
10805
09:00:35,920 --> 09:00:40,240
are closer to the blue cluster center than any other cluster center.
10806
09:00:40,240 --> 09:00:43,880
All of these points here are closer to the green cluster
10807
09:00:43,880 --> 09:00:45,800
center than any other cluster center.
10808
09:00:45,800 --> 09:00:48,360
And then these two points plus these points over here,
10809
09:00:48,360 --> 09:00:53,200
those are all closest to the red cluster center instead.
10810
09:00:53,200 --> 09:00:57,000
So here then is one possible assignment of all these points
10811
09:00:57,000 --> 09:00:58,880
to three different clusters.
10812
09:00:58,880 --> 09:01:01,560
But it's not great, in that it seems like in this red cluster,
10813
09:01:01,560 --> 09:01:02,960
these points are kind of far apart.
10814
09:01:02,960 --> 09:01:05,800
In this green cluster, these points are kind of far apart.
10815
09:01:05,800 --> 09:01:08,760
It might not be my ideal choice of how I would cluster
10816
09:01:08,760 --> 09:01:10,640
these various different data points.
10817
09:01:10,640 --> 09:01:13,720
But k-means clustering is an iterative process
10818
09:01:13,720 --> 09:01:16,360
that after I do this, there is a next step, which
10819
09:01:16,360 --> 09:01:19,920
is that after I've assigned all of the points to the cluster center
10820
09:01:19,920 --> 09:01:24,280
that it is nearest to, we are going to re-center the clusters,
10821
09:01:24,280 --> 09:01:27,560
meaning take the cluster centers, these diamond shapes here,
10822
09:01:27,560 --> 09:01:30,420
and move them to the middle, or the average,
10823
09:01:30,420 --> 09:01:33,960
effectively, of all of the points that are in that cluster.
10824
09:01:33,960 --> 09:01:36,080
So we'll take this blue point, this blue center,
10825
09:01:36,080 --> 09:01:39,800
and go ahead and move it to the middle or to the center of all
10826
09:01:39,800 --> 09:01:41,920
of the points that were assigned to the blue cluster,
10827
09:01:41,920 --> 09:01:43,920
moving it slightly to the right in this case.
10828
09:01:43,920 --> 09:01:45,040
And we'll do the same thing for red.
10829
09:01:45,040 --> 09:01:49,840
We'll move the cluster center to the middle of all of these points,
10830
09:01:49,840 --> 09:01:51,560
weighted by how many points there are.
10831
09:01:51,560 --> 09:01:55,040
There are more points over here, so the red center ends up
10832
09:01:55,040 --> 09:01:56,720
moving a little bit further that way.
10833
09:01:56,720 --> 09:01:59,300
And likewise, for the green center, there are many more points
10834
09:01:59,300 --> 09:02:01,200
on this side of the green center.
10835
09:02:01,200 --> 09:02:04,440
So the green center ends up being pulled a little bit further
10836
09:02:04,440 --> 09:02:06,160
in this direction.
10837
09:02:06,160 --> 09:02:10,020
So we re-center all of the clusters, and then we repeat the process.
10838
09:02:10,020 --> 09:02:14,480
We go ahead and now reassign all of the points to the cluster center
10839
09:02:14,480 --> 09:02:16,080
that they are now closest to.
10840
09:02:16,080 --> 09:02:18,560
And now that we've moved around the cluster centers,
10841
09:02:18,560 --> 09:02:20,640
these cluster assignments might change.
10842
09:02:20,640 --> 09:02:23,840
That this point originally was closer to the red cluster center,
10843
09:02:23,840 --> 09:02:26,840
but now it's actually closer to the blue cluster center.
10844
09:02:26,840 --> 09:02:28,520
Same goes for this point as well.
10845
09:02:28,520 --> 09:02:31,620
And these three points that were originally closer to the green cluster
10846
09:02:31,620 --> 09:02:36,600
center are now closer to the red cluster center instead.
10847
09:02:36,600 --> 09:02:41,320
So we can reassign what colors or which clusters each of these data points
10848
09:02:41,320 --> 09:02:43,960
belongs to, and then repeat the process again,
10849
09:02:43,960 --> 09:02:47,520
moving each of these cluster means, the middles of the clusters,
10850
09:02:47,520 --> 09:02:52,720
to the mean, the average, of all of the other points that happen to be there,
10851
09:02:52,720 --> 09:02:54,320
and repeat the process again.
10852
09:02:54,320 --> 09:02:57,060
Go ahead and assign each of the points to the cluster
10853
09:02:57,060 --> 09:02:58,440
that they are closest to.
10854
09:02:58,440 --> 09:03:01,600
So once we reach a point where we've assigned all the points
10855
09:03:01,600 --> 09:03:05,140
to the cluster that they are nearest to, and nothing changed,
10856
09:03:05,140 --> 09:03:07,840
we've reached a sort of equilibrium in this situation,
10857
09:03:07,840 --> 09:03:09,960
where no points are changing their allegiance.
10858
09:03:09,960 --> 09:03:12,800
And as a result, we can declare this algorithm is now over.
10859
09:03:12,800 --> 09:03:15,840
And we now have some assignment of each of these points
10860
09:03:15,840 --> 09:03:17,160
into three different clusters.
10861
09:03:17,160 --> 09:03:19,480
And it looks like we did a pretty good job of trying
10862
09:03:19,480 --> 09:03:22,880
to identify which points are more similar to one another
10863
09:03:22,880 --> 09:03:24,640
than they are to points in other groups.
10864
09:03:24,640 --> 09:03:27,920
So we have the green cluster down here, this blue cluster here,
10865
09:03:27,920 --> 09:03:30,480
and then this red cluster over there as well.
10866
09:03:30,480 --> 09:03:33,400
And we did so without any access to some labels
10867
09:03:33,400 --> 09:03:35,960
to tell us what these various different clusters were.
10868
09:03:35,960 --> 09:03:38,800
We just used an algorithm in an unsupervised sense
10869
09:03:38,800 --> 09:03:41,760
without any of those labels to figure out which points
10870
09:03:41,760 --> 09:03:43,640
belonged to which categories.
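The procedure just walked through can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the lecture; the function name `k_means`, the random initialization from the data points, and the use of squared Euclidean distance are my own choices.

```python
import random

def k_means(points, k, iterations=100):
    """Cluster 2D points into k groups by repeatedly assigning each point
    to the nearest cluster center and re-centering each cluster at the
    mean (the average) of its assigned points."""
    # Start by randomly picking k of the points as initial cluster centers.
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster whose center is closest.
        clusters = [[] for _ in range(k)]
        for (x, y) in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for (cx, cy) in centers]
            clusters[distances.index(min(distances))].append((x, y))
        # Re-centering step: move each center to the average of its points.
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append((sum(x for x, _ in cluster) / len(cluster),
                                    sum(y for _, y in cluster) / len(cluster)))
            else:
                new_centers.append(old)  # keep an empty cluster's old center
        # Equilibrium: if no center moved, no point will change allegiance,
        # so the algorithm is over.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

A production version (for example scikit-learn's `KMeans`) would also handle multiple random restarts and smarter initialization.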
10871
09:03:43,640 --> 09:03:47,680
And again, lots of applications for this type of clustering technique.
10872
09:03:47,680 --> 09:03:50,760
And there are many more algorithms in each of these various different fields
10873
09:03:50,760 --> 09:03:54,240
within machine learning, supervised and reinforcement and unsupervised.
10874
09:03:54,240 --> 09:03:57,240
But those are many of the big picture foundational ideas
10875
09:03:57,240 --> 09:04:00,320
that underlie a lot of these techniques, where these are the problems
10876
09:04:00,320 --> 09:04:01,520
that we're trying to solve.
10877
09:04:01,520 --> 09:04:03,640
And we try and solve those problems using
10878
09:04:03,640 --> 09:04:06,560
a number of different methods of trying to take data and learn
10879
09:04:06,560 --> 09:04:08,800
patterns in that data, whether that's trying
10880
09:04:08,800 --> 09:04:10,960
to find neighboring data points that are similar
10881
09:04:10,960 --> 09:04:13,840
or trying to minimize some sort of loss function
10882
09:04:13,840 --> 09:04:17,080
or any number of other techniques that allow us to begin to try
10883
09:04:17,080 --> 09:04:19,360
to solve these sorts of problems.
10884
09:04:19,360 --> 09:04:21,360
That then was a look at some of the principles
10885
09:04:21,360 --> 09:04:23,800
that are at the foundation of modern machine learning,
10886
09:04:23,800 --> 09:04:26,760
this ability to take data and learn from that data
10887
09:04:26,760 --> 09:04:28,840
so that the computer can perform a task even
10888
09:04:28,840 --> 09:04:31,240
if it hasn't explicitly been given instructions
10889
09:04:31,240 --> 09:04:32,440
in order to do so.
10890
09:04:32,440 --> 09:04:35,200
Next time, we'll continue this conversation about machine learning,
10891
09:04:35,200 --> 09:04:38,600
looking at other techniques we can use for solving these sorts of problems.
10892
09:04:38,600 --> 09:04:41,320
We'll see you then.
10893
09:04:41,320 --> 09:05:01,360
All right, welcome back, everyone, to an introduction
10894
09:05:01,360 --> 09:05:03,360
to artificial intelligence with Python.
10895
09:05:03,360 --> 09:05:05,640
Now, last time, we took a look at machine learning,
10896
09:05:05,640 --> 09:05:09,280
a set of techniques that computers can use in order to take a set of data
10897
09:05:09,280 --> 09:05:11,480
and learn some patterns inside of that data,
10898
09:05:11,480 --> 09:05:14,560
learn how to perform a task even if we the programmers didn't
10899
09:05:14,560 --> 09:05:18,760
give the computer explicit instructions for how to perform that task.
10900
09:05:18,760 --> 09:05:21,780
Today, we transition to one of the most popular techniques and tools
10901
09:05:21,780 --> 09:05:24,600
within machine learning, that of neural networks.
10902
09:05:24,600 --> 09:05:27,600
And neural networks were inspired as early as the 1940s
10903
09:05:27,600 --> 09:05:30,600
by researchers who were thinking about how it is that humans learn,
10904
09:05:30,600 --> 09:05:33,000
studying neuroscience in the human brain and trying
10905
09:05:33,000 --> 09:05:36,160
to see whether or not we could apply those same ideas to computers
10906
09:05:36,160 --> 09:05:39,440
as well and model computer learning off of human learning.
10907
09:05:39,440 --> 09:05:41,480
So how is the brain structured?
10908
09:05:41,480 --> 09:05:45,080
Well, very simply put, the brain consists of a whole bunch of neurons.
10909
09:05:45,080 --> 09:05:47,480
And those neurons are connected to one another
10910
09:05:47,480 --> 09:05:49,800
and communicate with one another in some way.
10911
09:05:49,800 --> 09:05:52,800
In particular, if you think about the structure of a biological neural
10912
09:05:52,800 --> 09:05:55,800
network, something like this, there are a couple of key properties
10913
09:05:55,800 --> 09:05:57,320
that scientists observed.
10914
09:05:57,320 --> 09:05:59,680
One was that these neurons are connected to each other
10915
09:05:59,680 --> 09:06:01,880
and receive electrical signals from one another,
10916
09:06:01,880 --> 09:06:06,120
that one neuron can propagate electrical signals to another neuron.
10917
09:06:06,120 --> 09:06:09,080
And another point is that neurons process those input signals
10918
09:06:09,080 --> 09:06:12,760
and then can be activated, that a neuron becomes activated at a certain point
10919
09:06:12,760 --> 09:06:16,840
and then can propagate further signals onto neurons in the future.
10920
09:06:16,840 --> 09:06:18,600
And so the question then became, could we
10921
09:06:18,600 --> 09:06:22,160
take this biological idea of how it is that humans learn with brains
10922
09:06:22,160 --> 09:06:25,440
and with neurons and apply that to a machine as well,
10923
09:06:25,440 --> 09:06:29,520
in effect designing an artificial neural network, or an ANN,
10924
09:06:29,520 --> 09:06:31,760
which will be a mathematical model for learning
10925
09:06:31,760 --> 09:06:34,860
that is inspired by these biological neural networks?
10926
09:06:34,860 --> 09:06:37,360
And what artificial neural networks will allow us to do
10927
09:06:37,360 --> 09:06:40,600
is they will first be able to model some sort of mathematical function.
10928
09:06:40,600 --> 09:06:42,440
Every time you look at a neural network, which
10929
09:06:42,440 --> 09:06:44,520
we'll see more of later today, each one of them
10930
09:06:44,520 --> 09:06:46,720
is really just some mathematical function that
10931
09:06:46,720 --> 09:06:50,160
is mapping certain inputs to particular outputs based
10932
09:06:50,160 --> 09:06:53,240
on the structure of the network, that depending on where we place
10933
09:06:53,240 --> 09:06:55,800
particular units inside of this neural network,
10934
09:06:55,800 --> 09:06:59,640
that's going to determine how it is that the network is going to function.
10935
09:06:59,640 --> 09:07:01,800
And in particular, artificial neural networks
10936
09:07:01,800 --> 09:07:05,760
are going to lend themselves to a way that we can learn what the network's
10937
09:07:05,760 --> 09:07:07,240
parameters should be.
10938
09:07:07,240 --> 09:07:08,920
We'll see more on that in just a moment.
10939
09:07:08,920 --> 09:07:11,000
But in effect, we want a model such that it
10940
09:07:11,000 --> 09:07:13,500
is easy for us to be able to write some code that
10941
09:07:13,500 --> 09:07:16,040
allows for the network to be able to figure out
10942
09:07:16,040 --> 09:07:18,480
how to model the right mathematical function given
10943
09:07:18,480 --> 09:07:20,840
a particular set of input data.
10944
09:07:20,840 --> 09:07:23,120
So in order to create our artificial neural network,
10945
09:07:23,120 --> 09:07:25,360
instead of using biological neurons, we're just
10946
09:07:25,360 --> 09:07:28,240
going to use what we're going to call units, units inside of a neural
10947
09:07:28,240 --> 09:07:31,460
network, which we can represent kind of like a node in a graph, which
10948
09:07:31,460 --> 09:07:34,520
will here be represented just by a blue circle like this.
10949
09:07:34,520 --> 09:07:37,520
And these artificial units, these artificial neurons,
10950
09:07:37,520 --> 09:07:39,320
can be connected to one another.
10951
09:07:39,320 --> 09:07:41,560
So here, for instance, we have two units that
10952
09:07:41,560 --> 09:07:46,240
are connected by this edge inside of this graph, effectively.
10953
09:07:46,240 --> 09:07:48,020
And so what we're going to do now is think
10954
09:07:48,020 --> 09:07:51,680
of this idea as some sort of mapping from inputs to outputs.
10955
09:07:51,680 --> 09:07:54,800
So we have one unit that is connected to another unit
10956
09:07:54,800 --> 09:07:58,460
where we might think of this side as the input and that side as the output.
10957
09:07:58,460 --> 09:08:00,680
And what we're trying to do then is to figure out
10958
09:08:00,680 --> 09:08:04,000
how to solve a problem, how to model some sort of mathematical function.
10959
09:08:04,000 --> 09:08:05,680
And this might take the form of something
10960
09:08:05,680 --> 09:08:08,640
we saw last time, which was something like we have certain inputs,
10961
09:08:08,640 --> 09:08:10,680
like variables x1 and x2.
10962
09:08:10,680 --> 09:08:13,800
And given those inputs, we want to perform some sort of task,
10963
09:08:13,800 --> 09:08:16,820
a task like predicting whether or not it's going to rain.
10964
09:08:16,820 --> 09:08:20,120
And ideally, we'd like some way, given these inputs, x1 and x2,
10965
09:08:20,120 --> 09:08:23,120
which stand for some sort of variables to do with the weather,
10966
09:08:23,120 --> 09:08:27,000
we would like to be able to predict, in this case, a Boolean classification.
10967
09:08:27,000 --> 09:08:30,160
Is it going to rain, or is it not going to rain?
10968
09:08:30,160 --> 09:08:33,360
And we did this last time by way of a mathematical function.
10969
09:08:33,360 --> 09:08:36,880
We defined some function, h, for our hypothesis function,
10970
09:08:36,880 --> 09:08:41,160
that took as input x1 and x2, the two inputs that we cared about processing,
10971
09:08:41,160 --> 09:08:44,080
in order to determine whether we thought it was going to rain
10972
09:08:44,080 --> 09:08:46,160
or whether we thought it was not going to rain.
10973
09:08:46,160 --> 09:08:48,840
The question then becomes, what does this hypothesis function
10974
09:08:48,840 --> 09:08:51,400
do in order to make that determination?
10975
09:08:51,400 --> 09:08:56,520
And we decided last time to use a linear combination of these input variables
10976
09:08:56,520 --> 09:08:58,160
to determine what the output should be.
10977
09:08:58,160 --> 09:09:02,680
So our hypothesis function was equal to something like this.
10978
09:09:02,680 --> 09:09:07,560
Weight 0 plus weight 1 times x1 plus weight 2 times x2.
10979
09:09:07,560 --> 09:09:11,960
So what's going on here is that x1 and x2, those are input variables,
10980
09:09:11,960 --> 09:09:15,040
the inputs to this hypothesis function.
10981
09:09:15,040 --> 09:09:17,960
And each of those input variables is being multiplied
10982
09:09:17,960 --> 09:09:20,400
by some weight, which is just some number.
10983
09:09:20,400 --> 09:09:25,240
So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2.
10984
09:09:25,240 --> 09:09:27,660
And we have this additional weight, weight 0,
10985
09:09:27,660 --> 09:09:30,160
that doesn't get multiplied by an input variable at all,
10986
09:09:30,160 --> 09:09:32,040
that just serves to either move the function up
10987
09:09:32,040 --> 09:09:33,840
or move the function's value down.
10988
09:09:33,840 --> 09:09:36,160
You can think of this as either a weight that's just
10989
09:09:36,160 --> 09:09:38,720
multiplied by some dummy value, like the number 1.
10990
09:09:38,720 --> 09:09:41,840
It's multiplied by 1, and so it's not multiplied by anything.
10991
09:09:41,840 --> 09:09:43,840
Or sometimes, you'll see in the literature,
10992
09:09:43,840 --> 09:09:46,280
people call this variable weight 0 a bias,
10993
09:09:46,280 --> 09:09:48,780
so that you can think of these variables as slightly different.
10994
09:09:48,780 --> 09:09:50,920
We have weights that are multiplied by the input,
10995
09:09:50,920 --> 09:09:54,560
and we separately add some bias to the result as well.
10996
09:09:54,560 --> 09:09:56,240
You'll hear both of those terminologies used
10997
09:09:56,240 --> 09:09:59,960
when people talk about neural networks and machine learning.
10998
09:09:59,960 --> 09:10:02,160
So in effect, what we've done here is that in order
10999
09:10:02,160 --> 09:10:06,240
to define a hypothesis function, we just need to decide and figure out
11000
09:10:06,240 --> 09:10:08,640
what these weights should be to determine
11001
09:10:08,640 --> 09:10:12,520
what values to multiply by our inputs to get some sort of result.
11002
09:10:12,520 --> 09:10:14,600
Of course, at the end of this, what we need to do
11003
09:10:14,600 --> 09:10:18,120
is make some sort of classification, like rainy or not rainy.
11004
09:10:18,120 --> 09:10:20,400
And to do that, we use some sort of function
11005
09:10:20,400 --> 09:10:22,400
that defines some sort of threshold.
11006
09:10:22,400 --> 09:10:25,040
And so we saw, for instance, the step function,
11007
09:10:25,040 --> 09:10:30,120
which is defined as 1 if the result of multiplying the weights by the inputs
11008
09:10:30,120 --> 09:10:32,360
is at least 0, otherwise it's 0.
11009
09:10:32,360 --> 09:10:34,200
And you can think of this line down the middle
11010
09:10:34,200 --> 09:10:35,560
as kind of like a dotted line.
11011
09:10:35,560 --> 09:10:38,560
Effectively, it stays at 0 all the way up to one point,
11012
09:10:38,560 --> 09:10:41,320
and then the function steps or jumps up to 1.
11013
09:10:41,320 --> 09:10:43,720
So it's 0 before it reaches some threshold,
11014
09:10:43,720 --> 09:10:46,960
and then it's 1 after it reaches a particular threshold.
11015
09:10:46,960 --> 09:10:49,040
And so this was one way we could define what
11016
09:10:49,040 --> 09:10:51,680
will come to call an activation function, a function that
11017
09:10:51,680 --> 09:10:56,400
determines when it is that this output becomes active, changes to 1
11018
09:10:56,400 --> 09:10:58,280
instead of being a 0.
11019
09:10:58,280 --> 09:11:02,120
But we also saw that if we didn't just want a purely binary classification,
11020
09:11:02,120 --> 09:11:04,800
we didn't want purely 1 or 0, but we wanted
11021
09:11:04,800 --> 09:11:07,880
to allow for some in-between real numbered values,
11022
09:11:07,880 --> 09:11:09,300
we could use a different function.
11023
09:11:09,300 --> 09:11:11,720
And there are a number of choices, but the one that we looked at
11024
09:11:11,720 --> 09:11:15,760
was the logistic sigmoid function that has sort of an s-shaped curve,
11025
09:11:15,760 --> 09:11:18,160
where we could represent this as a probability that
11026
09:11:18,160 --> 09:11:20,920
may be somewhere in between; maybe the probability of rain
11027
09:11:20,920 --> 09:11:22,640
is something like 0.5.
11028
09:11:22,640 --> 09:11:25,680
Maybe a little bit later, the probability of rain is 0.8.
11029
09:11:25,680 --> 09:11:29,600
And so rather than just have a binary classification of 0 or 1,
11030
09:11:29,600 --> 09:11:32,320
we could allow for numbers that are in between as well.
11031
09:11:32,320 --> 09:11:35,000
And it turns out there are many other different types of activation
11032
09:11:35,000 --> 09:11:37,560
functions, where an activation function just
11033
09:11:37,560 --> 09:11:41,000
takes the output of multiplying the weights together and adding that bias,
11034
09:11:41,000 --> 09:11:43,800
and then figuring out what the actual output should be.
11035
09:11:43,800 --> 09:11:48,040
Another popular one is the rectified linear unit, otherwise known as ReLU.
11036
09:11:48,040 --> 09:11:50,480
And the way that works is that it just takes its input
11037
09:11:50,480 --> 09:11:52,920
and takes the maximum of that input and 0.
11038
09:11:52,920 --> 09:11:55,120
So if it's positive, it remains unchanged.
11039
09:11:55,120 --> 09:11:59,000
But if it's negative, it goes ahead and levels out at 0.
11040
09:11:59,000 --> 09:12:02,400
And there are other activation functions that we could choose as well.
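The three activation functions described so far, the step function, the logistic sigmoid, and ReLU, can be sketched directly in Python. This is an illustrative sketch, not code from the lecture:

```python
import math

def step(x):
    # Hard threshold: the output jumps from 0 to 1 once x reaches 0
    return 1 if x >= 0 else 0

def sigmoid(x):
    # Logistic sigmoid: an s-shaped curve squeezing any real number
    # into a value between 0 and 1, readable as a probability
    return 1 / (1 + math.exp(-x))

def relu(x):
    # Rectified linear unit: the maximum of the input and 0, so positive
    # inputs pass through unchanged and negative inputs level out at 0
    return max(0, x)
```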
11041
09:12:02,400 --> 09:12:04,720
But in short, each of these activation functions,
11042
09:12:04,720 --> 09:12:07,880
you can just think of as a function that gets applied
11043
09:12:07,880 --> 09:12:10,360
to the result of all of this computation.
11044
09:12:10,360 --> 09:12:15,400
We take some function g and apply it to the result of all of that calculation.
11045
09:12:15,400 --> 09:12:17,380
And this then is what we saw last time, the way
11046
09:12:17,380 --> 09:12:20,920
of defining some hypothesis function that takes in inputs,
11047
09:12:20,920 --> 09:12:23,980
calculate some linear combination of those inputs,
11048
09:12:23,980 --> 09:12:28,760
and then passes it through some sort of activation function to get our output.
11049
09:12:28,760 --> 09:12:32,800
And this actually turns out to be the model for the simplest of neural
11050
09:12:32,800 --> 09:12:36,440
networks, that we're going to instead represent this mathematical idea
11051
09:12:36,440 --> 09:12:39,320
graphically by using a structure like this.
11052
09:12:39,320 --> 09:12:42,000
Here then is a neural network that has two inputs.
11053
09:12:42,000 --> 09:12:44,400
We can think of this as x1 and this as x2.
11054
09:12:44,400 --> 09:12:48,120
And then one output, which you can think of as classifying whether or not
11055
09:12:48,120 --> 09:12:50,960
we think it's going to rain or not rain, for example,
11056
09:12:50,960 --> 09:12:52,640
in this particular instance.
11057
09:12:52,640 --> 09:12:54,600
And so how exactly does this model work?
11058
09:12:54,600 --> 09:12:57,640
Well, each of these two inputs represents one of our input variables,
11059
09:12:57,640 --> 09:12:59,660
x1 and x2.
11060
09:12:59,660 --> 09:13:05,040
And notice that these inputs are connected to this output via these edges,
11061
09:13:05,040 --> 09:13:06,980
which are going to be defined by their weights.
11062
09:13:06,980 --> 09:13:10,840
So these edges each have a weight associated with them, weight 1 and weight
11063
09:13:10,840 --> 09:13:12,000
2.
11064
09:13:12,000 --> 09:13:14,320
And then this output unit, what it's going to do
11065
09:13:14,320 --> 09:13:17,680
is it is going to calculate an output based on those inputs
11066
09:13:17,680 --> 09:13:19,240
and based on those weights.
11067
09:13:19,240 --> 09:13:23,320
This output unit is going to multiply all the inputs by their weights,
11068
09:13:23,320 --> 09:13:26,760
add in this bias term, which you can think of as an extra w0 term
11069
09:13:26,760 --> 09:13:31,120
that gets added into it, and then we pass it through an activation function.
11070
09:13:31,120 --> 09:13:34,640
So this then is just a graphical way of representing the same idea
11071
09:13:34,640 --> 09:13:36,800
we saw last time just mathematically.
11072
09:13:36,800 --> 09:13:40,120
And we're going to call this a very simple neural network.
11073
09:13:40,120 --> 09:13:42,000
And we'd like for this neural network to be
11074
09:13:42,000 --> 09:13:44,560
able to learn how to calculate some function,
11075
09:13:44,560 --> 09:13:46,960
that we want some function for the neural network to learn.
11076
09:13:46,960 --> 09:13:50,600
And the neural network is going to learn what should the values of w0,
11077
09:13:50,600 --> 09:13:52,240
w1, and w2 be?
11078
09:13:52,240 --> 09:13:54,560
What should the activation function be in order
11079
09:13:54,560 --> 09:13:57,240
to get the result that we would expect?
11080
09:13:57,240 --> 09:13:59,440
So we can actually take a look at an example of this.
11081
09:13:59,440 --> 09:14:02,440
What then is a very simple function that we might calculate?
11082
09:14:02,440 --> 09:14:06,040
Well, if we recall back from when we were looking at propositional logic,
11083
09:14:06,040 --> 09:14:07,920
one of the simplest functions we looked at
11084
09:14:07,920 --> 09:14:12,160
was something like the or function that takes two inputs, x and y,
11085
09:14:12,160 --> 09:14:16,640
and outputs 1, otherwise known as true, if either one of the inputs
11086
09:14:16,640 --> 09:14:22,280
or both of them are 1, and outputs 0 if both of the inputs are 0 or false.
11087
09:14:22,280 --> 09:14:23,800
So this then is the or function.
11088
09:14:23,800 --> 09:14:25,880
And this was the truth table for the or function,
11089
09:14:25,880 --> 09:14:29,840
that as long as either of the inputs are 1, the output of the function is 1,
11090
09:14:29,840 --> 09:14:34,520
and the only case where the output is 0 is where both of the inputs are 0.
11091
09:14:34,520 --> 09:14:38,200
So the question is, how could we take this and train a neural network
11092
09:14:38,200 --> 09:14:40,600
to be able to learn this particular function?
11093
09:14:40,600 --> 09:14:42,440
What would those weights look like?
11094
09:14:42,440 --> 09:14:44,280
Well, we could do something like this.
11095
09:14:44,280 --> 09:14:45,960
Here's our neural network.
11096
09:14:45,960 --> 09:14:48,920
And I'll propose that in order to calculate the or function,
11097
09:14:48,920 --> 09:14:52,880
we're going to use a value of 1 for each of the weights.
11098
09:14:52,880 --> 09:14:55,800
And we'll use a bias of negative 1.
11099
09:14:55,800 --> 09:14:59,520
And then we'll just use this step function as our activation function.
11100
09:14:59,520 --> 09:15:00,720
How then does this work?
11101
09:15:00,720 --> 09:15:04,240
Well, if I wanted to calculate something like 0 or 0,
11102
09:15:04,240 --> 09:15:08,000
which we know to be 0 because false or false is false, then what are we going
11103
09:15:08,000 --> 09:15:08,720
to do?
11104
09:15:08,720 --> 09:15:12,400
Well, our output unit is going to calculate this input multiplied
11105
09:15:12,400 --> 09:15:14,680
by the weight, 0 times 1, that's 0.
11106
09:15:14,680 --> 09:15:17,440
Same thing here, 0 times 1, that's 0.
11107
09:15:17,440 --> 09:15:21,360
And we'll add to that the bias minus 1.
11108
09:15:21,360 --> 09:15:23,800
So that'll give us a result of negative 1.
11109
09:15:23,800 --> 09:15:26,920
If we plot that on our activation function, negative 1 is here.
11110
09:15:26,920 --> 09:15:30,520
It's before the threshold, which means the output is 0.
11111
09:15:30,520 --> 09:15:32,400
It's only 1 after the threshold.
11112
09:15:32,400 --> 09:15:34,720
Since negative 1 is before the threshold,
11113
09:15:34,720 --> 09:15:38,440
the output that this unit provides is going to be 0.
11114
09:15:38,440 --> 09:15:43,480
And that's what we would expect it to be, that 0 or 0 should be 0.
11115
09:15:43,480 --> 09:15:47,360
What if instead we had had 1 or 0, where this is the number 1?
11116
09:15:47,360 --> 09:15:50,720
Well, in this case, in order to calculate what the output is going to be,
11117
09:15:50,720 --> 09:15:55,720
we again have to do this weighted sum, 1 times 1, that's 1.
11118
09:15:55,720 --> 09:15:57,400
0 times 1, that's 0.
11119
09:15:57,400 --> 09:15:59,480
Sum of that so far is 1.
11120
09:15:59,480 --> 09:16:00,880
Add negative 1 to that.
11121
09:16:00,880 --> 09:16:02,440
Well, then the result is 0.
11122
09:16:02,440 --> 09:16:05,600
And if we plot 0 on the step function, 0 ends up being here.
11123
09:16:05,600 --> 09:16:07,320
It's just at the threshold.
11124
09:16:07,320 --> 09:16:11,440
And so the output here is going to be 1, because the output of 1 or 0,
11125
09:16:11,440 --> 09:16:12,240
that's 1.
11126
09:16:12,240 --> 09:16:13,960
So that's what we would expect as well.
11127
09:16:13,960 --> 09:16:17,800
And just for one more example, if I had 1 or 1, what would the result be?
11128
09:16:17,800 --> 09:16:19,520
Well, 1 times 1 is 1.
11129
09:16:19,520 --> 09:16:20,520
1 times 1 is 1.
11130
09:16:20,520 --> 09:16:22,120
The sum of those is 2.
11131
09:16:22,120 --> 09:16:23,440
I add the bias term to that.
11132
09:16:23,440 --> 09:16:24,720
I get the number 1.
11133
09:16:24,720 --> 09:16:27,000
1 plotted on this graph is way over there.
11134
09:16:27,000 --> 09:16:28,760
That's well beyond the threshold.
11135
09:16:28,760 --> 09:16:31,040
And so this output is going to be 1 as well.
11136
09:16:31,040 --> 09:16:34,160
The output is always 0 or 1, depending on whether or not
11137
09:16:34,160 --> 09:16:35,480
we're past the threshold.
11138
09:16:35,480 --> 09:16:39,840
And this neural network then models the OR function, a very simple function,
11139
09:16:39,840 --> 09:16:40,720
definitely.
11140
09:16:40,720 --> 09:16:42,520
But it still is able to model it correctly.
11141
09:16:42,520 --> 09:16:48,000
If I give it the inputs, it will tell me what x1 or x2 happens to be.
11142
09:16:48,000 --> 09:16:50,840
And you could imagine trying to do this for other functions as well.
11143
09:16:50,840 --> 09:16:55,440
A function like the AND function, for instance, that takes two inputs
11144
09:16:55,440 --> 09:16:59,360
and calculates whether both x and y are true.
11145
09:16:59,360 --> 09:17:04,080
So if x is 1 and y is 1, then the output of x and y is 1.
11146
09:17:04,080 --> 09:17:07,160
But in all the other cases, the output is 0.
11147
09:17:07,160 --> 09:17:10,400
How could we model that inside of a neural network as well?
11148
09:17:10,400 --> 09:17:13,200
Well, it turns out we could do it in the same way,
11149
09:17:13,200 --> 09:17:16,480
except instead of negative 1 as the bias,
11150
09:17:16,480 --> 09:17:20,000
we can use negative 2 as the bias instead.
11151
09:17:20,000 --> 09:17:21,360
What does that end up looking like?
11152
09:17:21,360 --> 09:17:25,960
Well, if I had 1 and 1, that should be 1, because true and true
11153
09:17:25,960 --> 09:17:27,080
is equal to true.
11154
09:17:27,080 --> 09:17:29,000
Well, I take 1 times 1, that's 1.
11155
09:17:29,000 --> 09:17:30,200
1 times 1 is 1.
11156
09:17:30,200 --> 09:17:32,240
I get a total sum of 2 so far.
11157
09:17:32,240 --> 09:17:35,880
Now I add the bias of negative 2, and I get the value 0.
11158
09:17:35,880 --> 09:17:38,480
And 0, when I plot it on the activation function,
11159
09:17:38,480 --> 09:17:42,480
is just past that threshold, and so the output is going to be 1.
11160
09:17:42,480 --> 09:17:46,800
But if I had any other input, for example, like 1 and 0,
11161
09:17:46,800 --> 09:17:51,040
well, the weighted sum of these is 1 plus 0 is going to be 1.
11162
09:17:51,040 --> 09:17:53,960
Minus 2 is going to give us negative 1, and negative 1
11163
09:17:53,960 --> 09:17:58,760
is not past that threshold, and so the output is going to be 0.
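The OR and AND units just described can be sketched in a few lines of Python (my own illustration, not code from the course): a weighted sum of the inputs plus a bias, passed through a step activation that outputs 1 once the threshold of 0 is reached.

```python
# Step activation: output 1 at or past the threshold of 0, else 0.
def step(value):
    return 1 if value >= 0 else 0

# A single unit: weighted sum of the inputs plus a bias,
# passed through the activation function.
def unit(x1, x2, w1, w2, bias):
    return step(w1 * x1 + w2 * x2 + bias)

# Weights of 1 with a bias of -1 model OR; a bias of -2 models AND.
def or_gate(x1, x2):
    return unit(x1, x2, 1, 1, -1)

def and_gate(x1, x2):
    return unit(x1, x2, 1, 1, -2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "OR:", or_gate(a, b), "AND:", and_gate(a, b))
```

Walking through OR with inputs 1 and 0: the weighted sum is 1, adding the bias of negative 1 gives 0, which is at the threshold, so the output is 1, just as in the example above.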
11164
09:17:58,760 --> 09:18:01,400
So those then are some very simple functions
11165
09:18:01,400 --> 09:18:05,720
that we can model using a neural network that has two inputs and one output,
11166
09:18:05,720 --> 09:18:08,720
where our goal is to be able to figure out what those weights should be
11167
09:18:08,720 --> 09:18:11,080
in order to determine what the output should be.
11168
09:18:11,080 --> 09:18:14,480
And you could imagine generalizing this to calculate more complex functions
11169
09:18:14,480 --> 09:18:17,160
as well, that maybe, given the humidity and the pressure,
11170
09:18:17,160 --> 09:18:20,000
we want to calculate what's the probability that it's going to rain,
11171
09:18:20,000 --> 09:18:20,840
for example.
11172
09:18:20,840 --> 09:18:22,880
Or we might want to do a regression-style problem.
11173
09:18:22,880 --> 09:18:26,440
We're given some amount of advertising, and given what month it is maybe,
11174
09:18:26,440 --> 09:18:28,480
we want to predict what our expected sales are
11175
09:18:28,480 --> 09:18:30,400
going to be for that particular month.
11176
09:18:30,400 --> 09:18:34,000
So you could imagine these inputs and outputs being different as well.
11177
09:18:34,000 --> 09:18:36,360
And it turns out that in some problems, we're not just
11178
09:18:36,360 --> 09:18:39,840
going to have two inputs, and the nice thing about these neural networks
11179
09:18:39,840 --> 09:18:42,480
is that we can compose multiple units together,
11180
09:18:42,480 --> 09:18:46,280
make our networks more complex just by adding more units
11181
09:18:46,280 --> 09:18:48,720
into this particular neural network.
11182
09:18:48,720 --> 09:18:52,880
So the network we've been looking at has two inputs and one output.
11183
09:18:52,880 --> 09:18:56,280
But we could just as easily say, let's go ahead and have three inputs in there,
11184
09:18:56,280 --> 09:18:58,600
or have even more inputs, where we could arbitrarily
11185
09:18:58,600 --> 09:19:02,520
decide however many inputs there are to our problem, all going
11186
09:19:02,520 --> 09:19:06,480
to be calculating some sort of output that we care about figuring out
11187
09:19:06,480 --> 09:19:07,720
the value of.
11188
09:19:07,720 --> 09:19:10,480
How then does the math work for figuring out that output?
11189
09:19:10,480 --> 09:19:12,440
Well, it's going to work in a very similar way.
11190
09:19:12,440 --> 09:19:16,760
In the case of two inputs, we had two weights indicated by these edges,
11191
09:19:16,760 --> 09:19:20,280
and we multiplied the weights by the numbers, adding this bias term.
11192
09:19:20,280 --> 09:19:22,840
And we'll do the same thing in the other cases as well.
11193
09:19:22,840 --> 09:19:25,160
If I have three inputs, you'll imagine multiplying
11194
09:19:25,160 --> 09:19:27,920
each of these three inputs by each of these weights.
11195
09:19:27,920 --> 09:19:31,080
If I had five inputs instead, we're going to do the same thing.
11196
09:19:31,080 --> 09:19:35,920
Here I'm saying sum up from 1 to 5, xi multiplied by weight i.
11197
09:19:35,920 --> 09:19:38,880
So take each of the five input variables, multiply them
11198
09:19:38,880 --> 09:19:41,880
by their corresponding weight, and then add the bias to that.
11199
09:19:41,880 --> 09:19:45,280
So this would be a case where there are five inputs into this neural network,
11200
09:19:45,280 --> 09:19:46,000
for example.
11201
09:19:46,000 --> 09:19:48,200
But there could be more, arbitrarily many nodes
11202
09:19:48,200 --> 09:19:51,120
that we want inside of this neural network, where each time we're just
11203
09:19:51,120 --> 09:19:54,960
going to sum up all of those input variables multiplied by their weight
11204
09:19:54,960 --> 09:19:57,760
and then add the bias term at the very end.
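That summation generalizes directly (again, a sketch of the idea rather than course code): take each input xi multiplied by weight wi, sum them all, and add the bias at the end, for however many inputs there happen to be.

```python
def step(value):
    return 1 if value >= 0 else 0

# Weighted sum over any number of inputs: sum of x_i * w_i, plus the bias,
# passed through an activation function.
def unit_output(inputs, weights, bias, activation=step):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)

# Five inputs, five weights, one bias -- the same rule as the two-input case.
print(unit_output([1, 0, 1, 0, 1], [1, 1, 1, 1, 1], -3))  # prints 1
```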
11205
09:19:57,760 --> 09:20:00,240
And so this allows us to be able to represent problems
11206
09:20:00,240 --> 09:20:05,440
that have even more inputs just by growing the size of our neural network.
11207
09:20:05,440 --> 09:20:08,480
Now, the next question we might ask is a question about how it
11208
09:20:08,480 --> 09:20:10,840
is that we train these neural networks.
11209
09:20:10,840 --> 09:20:13,160
In the case of the OR function and the AND function,
11210
09:20:13,160 --> 09:20:16,080
they were simple enough functions that I could just tell you,
11211
09:20:16,080 --> 09:20:17,480
like here, what the weights should be.
11212
09:20:17,480 --> 09:20:19,360
And you could probably reason through it yourself
11213
09:20:19,360 --> 09:20:23,280
what the weights should be in order to calculate the output that you want.
11214
09:20:23,280 --> 09:20:26,000
But in general, with functions like predicting sales
11215
09:20:26,000 --> 09:20:27,960
or predicting whether or not it's going to rain,
11216
09:20:27,960 --> 09:20:30,640
these are much trickier functions to be able to figure out.
11217
09:20:30,640 --> 09:20:33,160
We would like the computer to have some mechanism
11218
09:20:33,160 --> 09:20:36,040
of calculating what it is that the weights should be,
11219
09:20:36,040 --> 09:20:39,280
how it is to set the weights so that our neural network is
11220
09:20:39,280 --> 09:20:41,800
able to accurately model the function that we
11221
09:20:41,800 --> 09:20:43,320
care about trying to estimate.
11222
09:20:43,320 --> 09:20:45,400
And it turns out that the strategy for doing this,
11223
09:20:45,400 --> 09:20:49,600
inspired by the domain of calculus, is a technique called gradient descent.
11224
09:20:49,600 --> 09:20:52,320
And what gradient descent is, it is an algorithm
11225
09:20:52,320 --> 09:20:55,920
for minimizing loss when you're training a neural network.
11226
09:20:55,920 --> 09:20:59,920
And recall that loss refers to how bad our hypothesis
11227
09:20:59,920 --> 09:21:03,520
function happens to be, that we can define certain loss functions.
11228
09:21:03,520 --> 09:21:06,720
And we saw some examples of loss functions last time that just give us
11229
09:21:06,720 --> 09:21:09,360
a number for any particular hypothesis, saying,
11230
09:21:09,360 --> 09:21:11,360
how poorly does it model the data?
11231
09:21:11,360 --> 09:21:13,200
How many examples does it get wrong?
11232
09:21:13,200 --> 09:21:17,640
How much worse or less bad is it as compared to other hypothesis functions
11233
09:21:17,640 --> 09:21:19,120
that we might define?
11234
09:21:19,120 --> 09:21:22,640
And this loss function is just a mathematical function.
11235
09:21:22,640 --> 09:21:24,360
And when you have a mathematical function,
11236
09:21:24,360 --> 09:21:26,160
in calculus what you could do is calculate
11237
09:21:26,160 --> 09:21:29,280
something known as the gradient, which you can think of as like a slope.
11238
09:21:29,280 --> 09:21:32,960
It's the direction the loss function is moving at any particular point.
11239
09:21:32,960 --> 09:21:36,200
And what it's going to tell us is, in which direction
11240
09:21:36,200 --> 09:21:41,120
should we be moving these weights in order to minimize the amount of loss?
11241
09:21:41,120 --> 09:21:43,920
And so generally speaking, we won't get into the calculus of it.
11242
09:21:43,920 --> 09:21:46,240
But the high level idea for gradient descent
11243
09:21:46,240 --> 09:21:47,880
is going to look something like this.
11244
09:21:47,880 --> 09:21:51,080
If we want to train a neural network, we'll go ahead and start just
11245
09:21:51,080 --> 09:21:52,840
by choosing the weights randomly.
11246
09:21:52,840 --> 09:21:56,180
Just pick random weights for all of the weights in the neural network.
11247
09:21:56,180 --> 09:21:58,440
And then we'll use the input data that we have access
11248
09:21:58,440 --> 09:22:00,560
to in order to train the network, in order
11249
09:22:00,560 --> 09:22:02,880
to figure out what the weights should actually be.
11250
09:22:02,880 --> 09:22:05,360
So we'll repeat this process again and again.
11251
09:22:05,360 --> 09:22:08,200
The first step is we're going to calculate the gradient based
11252
09:22:08,200 --> 09:22:09,320
on all of the data points.
11253
09:22:09,320 --> 09:22:11,240
So we'll look at all the data and figure out
11254
09:22:11,240 --> 09:22:13,760
what the gradient is at the place where we currently
11255
09:22:13,760 --> 09:22:15,760
are for the current setting of the weights, which
11256
09:22:15,760 --> 09:22:19,440
means in which direction should we move the weights in order
11257
09:22:19,440 --> 09:22:24,480
to minimize the total amount of loss, in order to make our solution better.
11258
09:22:24,480 --> 09:22:26,820
And once we've calculated that gradient, which direction
11259
09:22:26,820 --> 09:22:29,120
we should move in the loss function, well,
11260
09:22:29,120 --> 09:22:32,240
then we can just update those weights according to the gradient.
11261
09:22:32,240 --> 09:22:35,200
Take a small step in the direction of those weights
11262
09:22:35,200 --> 09:22:37,800
in order to try to make our solution a little bit better.
11263
09:22:37,800 --> 09:22:40,200
And the size of the step that we take, that's going to vary.
11264
09:22:40,200 --> 09:22:43,120
And you can choose that when you're training a particular neural network.
11265
09:22:43,120 --> 09:22:46,240
But in short, the idea is going to be take all the data points,
11266
09:22:46,240 --> 09:22:48,840
figure out based on those data points in what direction
11267
09:22:48,840 --> 09:22:52,240
the weights should move, and then move the weights one small step
11268
09:22:52,240 --> 09:22:53,160
in that direction.
11269
09:22:53,160 --> 09:22:55,600
And if you repeat that process over and over again,
11270
09:22:55,600 --> 09:22:58,760
adjusting the weights a little bit at a time based on all the data points,
11271
09:22:58,760 --> 09:23:02,320
eventually you should end up with a pretty good solution
11272
09:23:02,320 --> 09:23:04,300
to trying to solve this sort of problem.
11273
09:23:04,300 --> 09:23:06,760
At least that's what we would hope to happen.
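That loop can be sketched on a toy problem (an illustration of the idea, not the course's code): fit a line y = w*x + b to some points by repeatedly computing the gradient of the squared loss over all of the data points and taking one small step.

```python
# Gradient descent sketch: start from some arbitrary weights, then
# repeatedly compute the gradient of the loss over ALL data points
# and move the weights one small step in the direction that reduces loss.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # points lying on y = 2x + 1

w, b = 0.5, 0.0          # initial (effectively random) weights
learning_rate = 0.05     # the size of each step, chosen by us

for _ in range(2000):
    # Gradient of mean squared loss with respect to w and b,
    # computed using every data point.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Update the weights according to the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```

The step size (learning rate) is exactly the knob mentioned above: too large and the updates overshoot, too small and training takes many more iterations.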
11274
09:23:06,760 --> 09:23:08,600
Now, if you look at this algorithm, a good question
11275
09:23:08,600 --> 09:23:10,920
to ask anytime you're analyzing an algorithm
11276
09:23:10,920 --> 09:23:14,640
is what is going to be the expensive part of doing the calculation?
11277
09:23:14,640 --> 09:23:17,160
What's going to take a lot of work to try to figure out?
11278
09:23:17,160 --> 09:23:19,680
What is going to be expensive to calculate?
11279
09:23:19,680 --> 09:23:22,060
And in particular, in the case of gradient descent,
11280
09:23:22,060 --> 09:23:26,240
the really expensive part is this all data points part right here,
11281
09:23:26,240 --> 09:23:30,200
having to take all of the data points and using all of those data points
11282
09:23:30,200 --> 09:23:34,000
figure out what the gradient is at this particular setting of all
11283
09:23:34,000 --> 09:23:34,500
of the weights.
11284
09:23:34,500 --> 09:23:37,040
Because odds are in a big machine learning problem
11285
09:23:37,040 --> 09:23:39,580
where you're trying to solve a big problem with a lot of data,
11286
09:23:39,580 --> 09:23:41,960
you have a lot of data points in order to calculate.
11287
09:23:41,960 --> 09:23:44,840
And figuring out the gradient based on all of those data points
11288
09:23:44,840 --> 09:23:46,160
is going to be expensive.
11289
09:23:46,160 --> 09:23:47,680
And you'll have to do it many times.
11290
09:23:47,680 --> 09:23:50,640
You'll likely repeat this process again and again and again,
11291
09:23:50,640 --> 09:23:54,320
going through all the data points, taking one small step over and over
11292
09:23:54,320 --> 09:23:57,780
as you try and figure out what the optimal setting of those weights
11293
09:23:57,780 --> 09:23:59,280
happens to be.
11294
09:23:59,280 --> 09:24:02,120
It turns out that we would ideally like to be
11295
09:24:02,120 --> 09:24:04,000
able to train our neural networks faster,
11296
09:24:04,000 --> 09:24:07,680
to be able to more quickly converge to some sort of solution that
11297
09:24:07,680 --> 09:24:10,000
is going to be a good solution to the problem.
11298
09:24:10,000 --> 09:24:13,000
So in that case, there are alternatives to just standard gradient descent,
11299
09:24:13,000 --> 09:24:15,280
which looks at all of the data points at once.
11300
09:24:15,280 --> 09:24:18,640
We can employ a method like stochastic gradient descent,
11301
09:24:18,640 --> 09:24:22,320
which will randomly just choose one data point at a time
11302
09:24:22,320 --> 09:24:25,240
to calculate the gradient based on, instead of calculating it
11303
09:24:25,240 --> 09:24:27,160
based on all of the data points.
11304
09:24:27,160 --> 09:24:30,180
So the idea there is that we have some setting of the weights.
11305
09:24:30,180 --> 09:24:31,560
We pick a data point.
11306
09:24:31,560 --> 09:24:34,720
And based on that one data point, we figure out in which direction
11307
09:24:34,720 --> 09:24:37,400
should we move all of the weights and move them one small step in that
11308
09:24:37,400 --> 09:24:39,800
direction, then take another data point and do that again
11309
09:24:39,800 --> 09:24:41,600
and repeat this process again and again,
11310
09:24:41,600 --> 09:24:44,240
maybe looking at each of the data points multiple times,
11311
09:24:44,240 --> 09:24:48,640
but each time only using one data point to calculate the gradient,
11312
09:24:48,640 --> 09:24:51,720
to calculate which direction we should move in.
11313
09:24:51,720 --> 09:24:55,040
Now, just using one data point instead of all of the data points
11314
09:24:55,040 --> 09:24:58,920
probably gives us a less accurate estimate of what the gradient actually
11315
09:24:58,920 --> 09:24:59,760
is.
11316
09:24:59,760 --> 09:25:01,880
But on the plus side, it's going to be much faster
11317
09:25:01,880 --> 09:25:04,520
to be able to calculate, that we can much more quickly calculate
11318
09:25:04,520 --> 09:25:07,200
what the gradient is based on one data point,
11319
09:25:07,200 --> 09:25:09,880
instead of calculating based on all of the data points
11320
09:25:09,880 --> 09:25:13,400
and having to do all of that computational work again and again.
11321
09:25:13,400 --> 09:25:16,120
So there are trade-offs here between looking at all of the data points
11322
09:25:16,120 --> 09:25:18,160
and just looking at one data point.
11323
09:25:18,160 --> 09:25:21,000
And it turns out that a middle ground that is also quite popular
11324
09:25:21,000 --> 09:25:24,600
is a technique called mini-batch gradient descent, where the idea there
11325
09:25:24,600 --> 09:25:28,080
is instead of looking at all of the data versus just a single point,
11326
09:25:28,080 --> 09:25:32,080
we instead divide our data set up into small batches, groups of data points,
11327
09:25:32,080 --> 09:25:34,840
where you can decide how big a particular batch is.
11328
09:25:34,840 --> 09:25:37,680
But in short, you're just going to look at a small number of points
11329
09:25:37,680 --> 09:25:41,280
at any given time, hopefully getting a more accurate estimate of the gradient,
11330
09:25:41,280 --> 09:25:44,960
but also not requiring all of the computational effort needed
11331
09:25:44,960 --> 09:25:48,800
to look at every single one of these data points.
11332
09:25:48,800 --> 09:25:50,960
So gradient descent, then, is this technique
11333
09:25:50,960 --> 09:25:53,800
that we can use in order to train these neural networks,
11334
09:25:53,800 --> 09:25:56,680
in order to figure out what the setting of all of these weights
11335
09:25:56,680 --> 09:25:59,520
should be if we want some way to try and get
11336
09:25:59,520 --> 09:26:02,760
an accurate notion of how it is that this function should work,
11337
09:26:02,760 --> 09:26:08,400
some way of modeling how to transform the inputs into particular outputs.
11338
09:26:08,400 --> 09:26:11,320
Now, so far, the networks that we've taken a look at
11339
09:26:11,320 --> 09:26:13,600
have all been structured similar to this.
11340
09:26:13,600 --> 09:26:17,080
We have some number of inputs, maybe two or three or five or more.
11341
09:26:17,080 --> 09:26:21,240
And then we have one output that is just predicting like rain or no rain
11342
09:26:21,240 --> 09:26:23,680
or just predicting one particular value.
11343
09:26:23,680 --> 09:26:25,600
But often in machine learning problems, we
11344
09:26:25,600 --> 09:26:27,840
don't just care about one output.
11345
09:26:27,840 --> 09:26:31,040
We might care about an output that has multiple different values
11346
09:26:31,040 --> 09:26:32,320
associated with it.
11347
09:26:32,320 --> 09:26:35,040
So in the same way that we could take a neural network
11348
09:26:35,040 --> 09:26:40,160
and add units to the input layer, we can likewise add outputs
11349
09:26:40,160 --> 09:26:41,760
to the output layer as well.
11350
09:26:41,760 --> 09:26:44,760
Instead of just one output, you could imagine we have two outputs,
11351
09:26:44,760 --> 09:26:47,120
or we could have four outputs, for example,
11352
09:26:47,120 --> 09:26:50,880
where in each case, as we add more inputs or add more outputs,
11353
09:26:50,880 --> 09:26:54,360
if we want to keep this network fully connected between these two layers,
11354
09:26:54,360 --> 09:26:58,840
we just need to add more weights, that now each of these input nodes
11355
09:26:58,840 --> 09:27:02,820
has four weights associated with each of the four outputs.
11356
09:27:02,820 --> 09:27:06,320
And that's true for each of these various different input nodes.
11357
09:27:06,320 --> 09:27:09,120
So as we add nodes, we add more weights in order
11358
09:27:09,120 --> 09:27:11,480
to make sure that each of the inputs can somehow
11359
09:27:11,480 --> 09:27:14,800
be connected to each of the outputs so that each output
11360
09:27:14,800 --> 09:27:19,760
value can be calculated based on what the value of the input happens to be.
11361
09:27:19,760 --> 09:27:23,720
So what might a case be where we want multiple different output values?
11362
09:27:23,720 --> 09:27:26,600
Well, you might consider that in the case of weather predicting,
11363
09:27:26,600 --> 09:27:30,720
for example, we might not just care whether it's raining or not raining.
11364
09:27:30,720 --> 09:27:33,500
There might be multiple different categories of weather
11365
09:27:33,500 --> 09:27:35,600
that we would like to categorize the weather into.
11366
09:27:35,600 --> 09:27:39,360
With just a single output variable, we can do a binary classification,
11367
09:27:39,360 --> 09:27:42,920
like rain or no rain, for instance, 1 or 0.
11368
09:27:42,920 --> 09:27:45,600
But it doesn't allow us to do much more than that.
11369
09:27:45,600 --> 09:27:47,560
With multiple output variables, I might be
11370
09:27:47,560 --> 09:27:50,480
able to use each one to predict something a little different.
11371
09:27:50,480 --> 09:27:54,040
Maybe I want to categorize the weather into one of four different categories,
11372
09:27:54,040 --> 09:27:58,000
something like is it going to be raining or sunny or cloudy or snowy.
11373
09:27:58,000 --> 09:27:59,960
And I now have four output variables that
11374
09:27:59,960 --> 09:28:03,800
can be used to represent maybe the probability that it is
11375
09:28:03,800 --> 09:28:08,520
rainy as opposed to sunny as opposed to cloudy or as opposed to snowy.
11376
09:28:08,520 --> 09:28:10,560
How then would this neural network work?
11377
09:28:10,560 --> 09:28:13,320
Well, we have some input variables that represent some data
11378
09:28:13,320 --> 09:28:15,240
that we have collected about the weather.
11379
09:28:15,240 --> 09:28:18,760
Each of those inputs gets multiplied by each of these various different weights.
11380
09:28:18,760 --> 09:28:20,980
We have more multiplications to do, but these
11381
09:28:20,980 --> 09:28:24,040
are fairly quick mathematical operations to perform.
11382
09:28:24,040 --> 09:28:25,800
And then what we get is after passing them
11383
09:28:25,800 --> 09:28:28,400
through some sort of activation function in the outputs,
11384
09:28:28,400 --> 09:28:32,160
we end up getting some sort of number, where that number, you might imagine,
11385
09:28:32,160 --> 09:28:36,000
you could interpret as a probability, like a probability that it is one
11386
09:28:36,000 --> 09:28:38,360
category as opposed to another category.
11387
09:28:38,360 --> 09:28:40,640
So here we're saying that based on the inputs,
11388
09:28:40,640 --> 09:28:45,000
we think there is a 10% chance that it's raining, a 60% chance that it's sunny,
11389
09:28:45,000 --> 09:28:48,720
a 20% chance of cloudy, a 10% chance that it's snowy.
11390
09:28:48,720 --> 09:28:52,800
And given that output, if these represent a probability distribution,
11391
09:28:52,800 --> 09:28:55,920
well, then you could just pick whichever one has the highest value,
11392
09:28:55,920 --> 09:28:58,720
in this case, sunny, and say that, well, most likely, we
11393
09:28:58,720 --> 09:29:04,040
think that this categorization of inputs means that the output
11395
09:29:04,040 --> 09:29:05,120
should be sunny.
11395
09:29:05,120 --> 09:29:09,800
And that is what we would expect the weather to be in this particular instance.
11396
09:29:09,800 --> 09:29:13,760
And so this allows us to do these sort of multi-class classifications,
11397
09:29:13,760 --> 09:29:17,620
where instead of just having a binary classification, 1 or 0,
11398
09:29:17,620 --> 09:29:20,680
we can have as many different categories as we want.
11399
09:29:20,680 --> 09:29:23,640
And we can have our neural network output these probabilities
11400
09:29:23,640 --> 09:29:27,700
over which categories are more likely than other categories.
11401
09:29:27,700 --> 09:29:30,820
And using that data, we're able to draw some sort of inference
11402
09:29:30,820 --> 09:29:33,200
on what it is that we should do.
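One common way to turn several raw output values into a probability distribution and pick the most likely category is a softmax, which is an assumption on my part here, since the lecture only says the outputs pass through some activation function:

```python
import math

# Softmax (an assumed choice, not named in the lecture): turn raw
# output-unit values into a probability distribution that sums to 1.
def softmax(values):
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["rain", "sun", "cloud", "snow"]
raw_outputs = [0.5, 2.3, 1.2, 0.5]           # hypothetical network outputs

probs = softmax(raw_outputs)
best = categories[probs.index(max(probs))]   # pick the most likely category
print(best, [round(p, 2) for p in probs])
```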
11403
09:29:33,200 --> 09:29:35,800
So this was sort of the idea of supervised machine learning.
11404
09:29:35,800 --> 09:29:38,860
I can give this neural network a whole bunch of data,
11405
09:29:38,860 --> 09:29:42,920
a whole bunch of input data corresponding to some label, some output data,
11406
09:29:42,920 --> 09:29:45,000
like we know that it was raining on this day,
11407
09:29:45,000 --> 09:29:46,960
we know that it was sunny on that day.
11408
09:29:46,960 --> 09:29:49,400
And using all of that data, the algorithm
11409
09:29:49,400 --> 09:29:52,400
can use gradient descent to figure out what all of the weights
11410
09:29:52,400 --> 09:29:55,840
should be in order to create some sort of model that hopefully allows us
11411
09:29:55,840 --> 09:29:59,280
a way to predict what we think the weather is going to be.
11412
09:29:59,280 --> 09:30:02,160
But neural networks have a lot of other applications as well.
11413
09:30:02,160 --> 09:30:06,520
You could imagine applying the same sort of idea to a reinforcement learning
11414
09:30:06,520 --> 09:30:09,280
sort of example as well, where you remember that in reinforcement
11415
09:30:09,280 --> 09:30:13,080
learning, what we wanted to do is train some sort of agent
11416
09:30:13,080 --> 09:30:16,080
to learn what action to take, depending on what state
11417
09:30:16,080 --> 09:30:17,400
they currently happen to be in.
11418
09:30:17,400 --> 09:30:19,640
So depending on the current state of the world,
11419
09:30:19,640 --> 09:30:23,040
we wanted the agent to pick from one of the available actions
11420
09:30:23,040 --> 09:30:24,800
that is available to them.
11421
09:30:24,800 --> 09:30:28,280
And you might model that by having each of these input variables
11422
09:30:28,280 --> 09:30:33,240
represent some information about the state, some data about what state
11423
09:30:33,240 --> 09:30:34,920
our agent is currently in.
11424
09:30:34,920 --> 09:30:37,320
And then the output, for example, could be each
11425
09:30:37,320 --> 09:30:40,160
of the various different actions that our agent could take,
11426
09:30:40,160 --> 09:30:42,560
action 1, 2, 3, and 4.
11427
09:30:42,560 --> 09:30:45,560
And you might imagine that this network would work in the same way,
11428
09:30:45,560 --> 09:30:48,840
but based on these particular inputs, we go ahead and calculate values
11429
09:30:48,840 --> 09:30:50,080
for each of these outputs.
11430
09:30:50,080 --> 09:30:53,960
And those outputs could model which action is better than other actions.
11431
09:30:53,960 --> 09:30:56,440
And we could just choose, based on looking at those outputs,
11432
09:30:56,440 --> 09:30:59,120
which action we should take.
11433
09:30:59,120 --> 09:31:01,840
And so these neural networks are very broadly applicable,
11434
09:31:01,840 --> 09:31:05,240
that all they're really doing is modeling some mathematical function.
11435
09:31:05,240 --> 09:31:07,600
So anything that we can frame as a mathematical function,
11436
09:31:07,600 --> 09:31:11,320
something like classifying inputs into various different categories
11437
09:31:11,320 --> 09:31:15,120
or figuring out based on some input state what action we should take,
11438
09:31:15,120 --> 09:31:18,680
these are all mathematical functions that we could attempt to model
11439
09:31:18,680 --> 09:31:21,360
by taking advantage of this neural network structure,
11440
09:31:21,360 --> 09:31:25,000
and in particular, taking advantage of this technique, gradient descent,
11441
09:31:25,000 --> 09:31:27,600
that we can use in order to figure out what the weights should
11442
09:31:27,600 --> 09:31:31,280
be in order to do this sort of calculation.
11443
09:31:31,280 --> 09:31:33,960
Now, how is it that you would go about training a neural network that
11444
09:31:33,960 --> 09:31:36,800
has multiple outputs instead of just one?
11445
09:31:36,800 --> 09:31:40,320
Well, with just a single output, we could see what the output for that value
11446
09:31:40,320 --> 09:31:44,360
should be, and then you update all of the weights that corresponded to it.
11447
09:31:44,360 --> 09:31:47,920
And when we have multiple outputs, at least in this particular case,
11448
09:31:47,920 --> 09:31:51,520
we can really think of this as four separate neural networks,
11449
09:31:51,520 --> 09:31:55,520
that really we just have one network here that has these three inputs
11450
09:31:55,520 --> 09:32:00,040
and these three weights corresponding to this one output value.
11451
09:32:00,040 --> 09:32:02,440
And the same thing is true for this output value.
11452
09:32:02,440 --> 09:32:06,040
This output value effectively defines yet another neural network
11453
09:32:06,040 --> 09:32:09,600
that has these same three inputs, but a different set of weights
11454
09:32:09,600 --> 09:32:11,160
that correspond to this output.
11455
09:32:11,160 --> 09:32:14,200
And likewise, this output has its own set of weights as well,
11456
09:32:14,200 --> 09:32:17,080
and same thing for the fourth output too.
11457
09:32:17,080 --> 09:32:20,760
And so if you wanted to train a neural network that had four outputs instead
11458
09:32:20,760 --> 09:32:23,720
of just one, in this case where the inputs are directly
11459
09:32:23,720 --> 09:32:25,640
connected to the outputs, you could really
11460
09:32:25,640 --> 09:32:28,840
think of this as just training four independent neural networks.
11461
09:32:28,840 --> 09:32:31,040
We know what the outputs for each of these four
11462
09:32:31,040 --> 09:32:34,280
should be based on our input data, and using that data,
11463
09:32:34,280 --> 09:32:37,680
we can begin to figure out what all of these individual weights should be.
11464
09:32:37,680 --> 09:32:39,560
And maybe there's an additional step at the end
11465
09:32:39,560 --> 09:32:43,680
to make sure that we turn these values into a probability distribution such
11466
09:32:43,680 --> 09:32:46,160
that we can interpret which one is better than another
11467
09:32:46,160 --> 09:32:50,360
or more likely than another as a category or something like that.
11468
09:32:50,360 --> 09:32:53,560
So this then seems like it does a pretty good job of taking inputs
11469
09:32:53,560 --> 09:32:55,480
and trying to predict what outputs should be.
11470
09:32:55,480 --> 09:32:58,800
And we'll see some real examples of this in just a moment as well.
11471
09:32:58,800 --> 09:33:01,360
But it's important then to think about what the limitations
11472
09:33:01,360 --> 09:33:05,520
of this sort of approach is, of just taking some linear combination
11473
09:33:05,520 --> 09:33:09,200
of inputs and passing it into some sort of activation function.
11474
09:33:09,200 --> 09:33:12,440
And it turns out that when we do this in the case of binary classification,
11475
09:33:12,440 --> 09:33:16,720
trying to predict does it belong to one category or another,
11476
09:33:16,720 --> 09:33:20,240
we can only predict things that are linearly separable.
11477
09:33:20,240 --> 09:33:22,800
Because we're taking a linear combination of inputs
11478
09:33:22,800 --> 09:33:26,520
and using that to define some decision boundary or threshold,
11479
09:33:26,520 --> 09:33:29,960
then what we get is a situation where if we have this set of data,
11480
09:33:29,960 --> 09:33:35,080
we can predict a line that linearly separates the red points from the blue
11481
09:33:35,080 --> 09:33:39,840
points, but a single unit that is making a binary classification, otherwise
11482
09:33:39,840 --> 09:33:44,880
known as a perceptron, can't deal with a situation like this. We've
11483
09:33:44,880 --> 09:33:48,400
seen this type of situation before, where there is no straight line
11484
09:33:48,400 --> 09:33:51,540
through the data that will divide the red points away
11485
09:33:51,540 --> 09:33:52,680
from the blue points.
11486
09:33:52,680 --> 09:33:55,000
It's a more complex decision boundary.
11487
09:33:55,000 --> 09:33:58,280
The decision boundary somehow needs to capture the things inside of this
11488
09:33:58,280 --> 09:33:59,280
circle.
11489
09:33:59,280 --> 09:34:03,160
And there isn't really a line that will allow us to deal with that.
11490
09:34:03,160 --> 09:34:05,640
So this is the limitation of the perceptron,
11491
09:34:05,640 --> 09:34:08,800
these units that just make these binary decisions based on their inputs,
11492
09:34:08,800 --> 09:34:12,520
that a single perceptron is only capable of learning
11493
09:34:12,520 --> 09:34:15,280
a linearly separable decision boundary.
11494
09:34:15,280 --> 09:34:17,480
All it can do is define a line.
11495
09:34:17,480 --> 09:34:19,440
And sure, it can give us probabilities based
11496
09:34:19,440 --> 09:34:21,880
on how close to that decision boundary we are,
11497
09:34:21,880 --> 09:34:26,760
but it can only really decide based on a linear decision boundary.
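The perceptron's limitation can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: a single unit with a step activation learns OR, which is linearly separable, but no setting of its weights can ever fit XOR.

```python
def perceptron_output(weights, bias, x):
    # A single unit: linear combination of inputs passed through a step function.
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else 0

def train_perceptron(data, epochs=20, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - perceptron_output(weights, bias, x)
            # Perceptron update rule: nudge weights toward the correct side.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# OR is linearly separable: a single line divides the 1s from the 0s.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_perceptron(or_data)

# XOR is not linearly separable, so the same unit can never fit it exactly.
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w2, b2 = train_perceptron(xor_data)
```

After training, the OR perceptron classifies every point correctly, while the XOR perceptron always gets at least one point wrong, no matter how long it trains.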
11498
09:34:26,760 --> 09:34:29,600
And so this doesn't seem like it's going to generalize well
11499
09:34:29,600 --> 09:34:32,160
to situations where real world data is involved,
11500
09:34:32,160 --> 09:34:34,880
because real world data often isn't linearly separable.
11501
09:34:34,880 --> 09:34:38,240
It often isn't the case that we can just draw a line through the data
11502
09:34:38,240 --> 09:34:41,280
and be able to divide it up into multiple groups.
11503
09:34:41,280 --> 09:34:43,320
So what then is the solution to this?
11504
09:34:43,320 --> 09:34:47,640
Well, what was proposed was the idea of a multilayer neural network,
11505
09:34:47,640 --> 09:34:49,840
that so far all of the neural networks we've seen
11506
09:34:49,840 --> 09:34:52,480
have had a set of inputs and a set of outputs,
11507
09:34:52,480 --> 09:34:55,440
and the inputs are connected to those outputs.
11508
09:34:55,440 --> 09:34:57,840
But in a multilayer neural network, this is going
11509
09:34:57,840 --> 09:35:00,800
to be an artificial neural network that has an input layer still.
11510
09:35:00,800 --> 09:35:06,200
It has an output layer, but also has one or more hidden layers in between.
11511
09:35:06,200 --> 09:35:09,320
These are other layers of artificial neurons, or units,
11512
09:35:09,320 --> 09:35:12,160
that are going to calculate their own values as well.
11513
09:35:12,160 --> 09:35:15,280
So instead of a neural network that looks like this with three inputs
11514
09:35:15,280 --> 09:35:17,800
and one output, you might imagine in the middle
11515
09:35:17,800 --> 09:35:21,520
here injecting a hidden layer, something like this.
11516
09:35:21,520 --> 09:35:23,520
This is a hidden layer that has four nodes.
11517
09:35:23,520 --> 09:35:26,760
You could choose how many nodes or units end up going into the hidden layer.
11518
09:35:26,760 --> 09:35:29,480
You can have multiple hidden layers as well.
11519
09:35:29,480 --> 09:35:33,680
And so now each of these inputs isn't directly connected to the output.
11520
09:35:33,680 --> 09:35:36,440
Each of the inputs is connected to this hidden layer.
11521
09:35:36,440 --> 09:35:38,520
And then all of the nodes in the hidden layer, those
11522
09:35:38,520 --> 09:35:41,200
are connected to the one output.
11523
09:35:41,200 --> 09:35:43,920
And so this is just another step that we can
11524
09:35:43,920 --> 09:35:46,480
take towards calculating more complex functions.
11525
09:35:46,480 --> 09:35:49,920
Each of these hidden units will calculate its output value,
11526
09:35:49,920 --> 09:35:53,920
otherwise known as its activation, based on a linear combination
11527
09:35:53,920 --> 09:35:55,320
of all the inputs.
11528
09:35:55,320 --> 09:35:57,600
And once we have values for all of these nodes,
11529
09:35:57,600 --> 09:36:00,720
as opposed to this just being the output, we do the same thing again.
11530
09:36:00,720 --> 09:36:04,240
Calculate the output for this node based on multiplying
11531
09:36:04,240 --> 09:36:07,960
each of the values for these units by their weights as well.
11532
09:36:07,960 --> 09:36:10,520
So in effect, the way this works is that we start with inputs.
11533
09:36:10,520 --> 09:36:14,120
They get multiplied by weights in order to calculate values for the hidden nodes.
11534
09:36:14,120 --> 09:36:16,880
Those get multiplied by weights in order to figure out
11535
09:36:16,880 --> 09:36:19,840
what the ultimate output is going to be.
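In code, that forward pass, inputs multiplied by weights to get hidden values, which are multiplied by weights again to get the output, might look like this. The weight values are made-up numbers, and sigmoid is one possible choice of activation function:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each unit takes a linear combination of all inputs, then an activation.
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

# Hypothetical weights for a 3-input, 4-hidden-unit, 1-output network.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2],
            [-0.1, 0.8, 0.4], [0.3, -0.6, 0.7]]
hidden_b = [0.0, 0.1, -0.1, 0.2]
output_w = [[0.6, -0.3, 0.5, 0.2]]
output_b = [0.0]

inputs = [1.0, 0.5, -1.0]
hidden = layer(inputs, hidden_w, hidden_b)   # values for the four hidden units
output = layer(hidden, output_w, output_b)   # output computed from the hidden values
```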
11536
09:36:19,840 --> 09:36:22,360
And the advantage of layering things like this
11537
09:36:22,360 --> 09:36:25,640
is it gives us an ability to model more complex functions,
11538
09:36:25,640 --> 09:36:29,560
that instead of just having a single decision boundary, a single line
11539
09:36:29,560 --> 09:36:33,600
dividing the red points from the blue points, each of these hidden nodes
11540
09:36:33,600 --> 09:36:35,960
can learn a different decision boundary.
11541
09:36:35,960 --> 09:36:37,840
And we can combine those decision boundaries
11542
09:36:37,840 --> 09:36:41,000
to figure out what the ultimate output is going to be.
11543
09:36:41,000 --> 09:36:43,480
And as we begin to imagine more complex situations,
11544
09:36:43,480 --> 09:36:47,200
you could imagine each of these nodes learning some useful property
11545
09:36:47,200 --> 09:36:50,560
or learning some useful feature of all of the inputs
11546
09:36:50,560 --> 09:36:53,440
and us somehow learning how to combine those features together
11547
09:36:53,440 --> 09:36:56,320
in order to get the output that we actually want.
11548
09:36:56,320 --> 09:36:59,120
Now, the natural question when we begin to look at this now
11549
09:36:59,120 --> 09:37:02,160
is to ask the question of, how do we train a neural network that
11550
09:37:02,160 --> 09:37:04,440
has hidden layers inside of it?
11551
09:37:04,440 --> 09:37:07,120
And this turns out to initially be a bit of a tricky question,
11552
09:37:07,120 --> 09:37:10,440
because in the input data that we are given, we
11553
09:37:10,440 --> 09:37:13,520
are given values for all of the inputs, and we're
11554
09:37:13,520 --> 09:37:16,960
given what the value of the output should be, what the category is,
11555
09:37:16,960 --> 09:37:18,120
for example.
11556
09:37:18,120 --> 09:37:22,160
But the input data doesn't tell us what the values for all of these nodes
11557
09:37:22,160 --> 09:37:22,880
should be.
11558
09:37:22,880 --> 09:37:26,520
So we don't know how far off each of these nodes actually
11559
09:37:26,520 --> 09:37:29,760
is because we're only given data for the inputs and the outputs.
11560
09:37:29,760 --> 09:37:31,640
The reason this is called the hidden layer
11561
09:37:31,640 --> 09:37:34,040
is because the data that is made available to us
11562
09:37:34,040 --> 09:37:38,200
doesn't tell us what the values for all of these intermediate nodes
11563
09:37:38,200 --> 09:37:39,760
should actually be.
11564
09:37:39,760 --> 09:37:42,160
And so the strategy people came up with was
11565
09:37:42,160 --> 09:37:48,120
to say that if you know what the error or the loss is on the output node,
11566
09:37:48,120 --> 09:37:50,280
well, then based on what these weights are,
11567
09:37:50,280 --> 09:37:52,280
if one of these weights is higher than another,
11568
09:37:52,280 --> 09:37:55,120
you can calculate an estimate for how much
11569
09:37:55,120 --> 09:38:00,840
the error from this node was due to this part of the hidden layer,
11570
09:38:00,840 --> 09:38:03,280
or this part of the hidden layer, or that part,
11571
09:38:03,280 --> 09:38:05,680
based on the values of these weights, in effect saying
11572
09:38:05,680 --> 09:38:10,120
that based on the error from the output, I can backpropagate the error
11573
09:38:10,120 --> 09:38:14,240
and figure out an estimate for what the error is for each of these nodes
11574
09:38:14,240 --> 09:38:15,400
in the hidden layer as well.
11575
09:38:15,400 --> 09:38:18,480
And there's some more calculus here that we won't get into the details of,
11576
09:38:18,480 --> 09:38:21,840
but the idea of this algorithm is known as backpropagation.
11577
09:38:21,840 --> 09:38:24,040
It's an algorithm for training a neural network
11578
09:38:24,040 --> 09:38:26,200
with multiple different hidden layers.
11579
09:38:26,200 --> 09:38:28,240
And the idea for this, the pseudocode for it,
11580
09:38:28,240 --> 09:38:31,960
will be as follows, if we want to run gradient descent with backpropagation.
11581
09:38:31,960 --> 09:38:35,200
We'll start with a random choice of weights, as we did before.
11582
09:38:35,200 --> 09:38:38,680
And now we'll go ahead and repeat the training process again and again.
11583
09:38:38,680 --> 09:38:41,080
But what we're going to do each time is now
11584
09:38:41,080 --> 09:38:43,960
we're going to calculate the error for the output layer first.
11585
09:38:43,960 --> 09:38:45,920
We know the output and what it should be,
11586
09:38:45,920 --> 09:38:49,680
and we know what we calculated so we can figure out what the error there is.
11587
09:38:49,680 --> 09:38:52,340
But then we're going to repeat for every layer,
11588
09:38:52,340 --> 09:38:55,360
starting with the output layer, moving back into the hidden layer,
11589
09:38:55,360 --> 09:38:58,160
then the hidden layer before that if there are multiple hidden layers,
11590
09:38:58,160 --> 09:39:00,480
going back all the way to the very first hidden layer,
11591
09:39:00,480 --> 09:39:05,000
assuming there are multiple, we're going to propagate the error back one layer.
11592
09:39:05,000 --> 09:39:07,280
Whatever the error was from the output, figure out
11593
09:39:07,280 --> 09:39:09,200
what the error should be a layer before that
11594
09:39:09,200 --> 09:39:11,800
based on what the values of those weights are.
11595
09:39:11,800 --> 09:39:14,900
And then we can update those weights.
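That pseudocode can be sketched concretely for a network with one hidden layer, trained on XOR, the kind of function a single perceptron can't learn. The layer size, learning rate, epoch count, squared-error loss, and sigmoid activations are illustrative choices, not details from the lecture:

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# XOR is not linearly separable, so a hidden layer is required.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

random.seed(0)
n_hidden = 4
# Start with a random choice of weights, as in the pseudocode.
wh = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
bh = [random.uniform(-1, 1) for _ in range(n_hidden)]
wo = [random.uniform(-1, 1) for _ in range(n_hidden)]
bo = random.uniform(-1, 1)

def forward(x):
    # Inputs -> hidden layer -> output.
    h = [sigmoid(bh[j] + sum(wh[j][i] * x[i] for i in range(2)))
         for j in range(n_hidden)]
    return h, sigmoid(bo + sum(wo[j] * h[j] for j in range(n_hidden)))

lr = 0.5
for _ in range(10000):
    for x, t in data:
        h, o = forward(x)
        # Error at the output layer (squared-error loss, sigmoid derivative).
        d_o = (o - t) * o * (1 - o)
        # Propagate the error back one layer, weighted by the output weights.
        d_h = [d_o * wo[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
        # Gradient descent update for every weight in the network.
        for j in range(n_hidden):
            wo[j] -= lr * d_o * h[j]
            for i in range(2):
                wh[j][i] -= lr * d_h[j] * x[i]
            bh[j] -= lr * d_h[j]
        bo -= lr * d_o

def predict(x):
    return forward(x)[1]
```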
11596
09:39:14,900 --> 09:39:17,000
So graphically, the way you might think about this
11597
09:39:17,000 --> 09:39:18,720
is that we first start with the output.
11598
09:39:18,720 --> 09:39:20,360
We know what the output should be.
11599
09:39:20,360 --> 09:39:22,240
We know what output we calculated.
11600
09:39:22,240 --> 09:39:23,720
And based on that, we can figure out, all right,
11601
09:39:23,720 --> 09:39:25,280
how do we need to update those weights?
11602
09:39:25,280 --> 09:39:28,600
Backpropagating the error to these nodes.
11603
09:39:28,600 --> 09:39:31,560
And using that, we can figure out how we should update these weights.
11604
09:39:31,560 --> 09:39:33,480
And you might imagine if there are multiple layers,
11605
09:39:33,480 --> 09:39:35,760
we could repeat this process again and again
11606
09:39:35,760 --> 09:39:39,600
to begin to figure out how all of these weights should be updated.
11607
09:39:39,600 --> 09:39:41,520
And this backpropagation algorithm is really
11608
09:39:41,520 --> 09:39:44,360
the key algorithm that makes neural networks possible.
11609
09:39:44,360 --> 09:39:47,840
It makes it possible to take these multi-level structures
11610
09:39:47,840 --> 09:39:50,240
and be able to train those structures depending
11611
09:39:50,240 --> 09:39:52,800
on what the values of these weights are in order
11612
09:39:52,800 --> 09:39:56,360
to figure out how it is that we should go about updating those weights in
11613
09:39:56,360 --> 09:39:59,520
order to create some function that is able to minimize
11614
09:39:59,520 --> 09:40:02,800
the total amount of loss, to figure out some good setting of the weights
11615
09:40:02,800 --> 09:40:07,600
that will take the inputs and translate it into the output that we expect.
11616
09:40:07,600 --> 09:40:10,640
And this works, as we said, not just for a single hidden layer.
11617
09:40:10,640 --> 09:40:13,800
But you can imagine multiple hidden layers, where each hidden layer we just
11618
09:40:13,800 --> 09:40:17,440
define however many nodes we want, where each of the nodes in one layer,
11619
09:40:17,440 --> 09:40:19,680
we can connect to the nodes in the next layer,
11620
09:40:19,680 --> 09:40:22,160
defining more and more complex networks that
11621
09:40:22,160 --> 09:40:26,320
are able to model more and more complex types of functions.
11622
09:40:26,320 --> 09:40:30,160
And so this type of network is what we might call a deep neural network,
11623
09:40:30,160 --> 09:40:33,480
part of a larger family of deep learning algorithms,
11624
09:40:33,480 --> 09:40:34,760
if you've ever heard that term.
11625
09:40:34,760 --> 09:40:38,520
And all deep learning is about is using multiple layers
11626
09:40:38,520 --> 09:40:41,560
to be able to predict and be able to model higher level
11627
09:40:41,560 --> 09:40:44,120
features inside of the input, to be able to figure out
11628
09:40:44,120 --> 09:40:45,280
what the output should be.
11629
09:40:45,280 --> 09:40:47,520
And so a deep neural network is just a neural network
11630
09:40:47,520 --> 09:40:49,240
that has multiple of these hidden layers,
11631
09:40:49,240 --> 09:40:52,200
where we start at the input, calculate values for this layer,
11632
09:40:52,200 --> 09:40:55,640
then this layer, then this layer, and then ultimately get an output.
11633
09:40:55,640 --> 09:40:59,280
And this allows us to be able to model more and more sophisticated types
11634
09:40:59,280 --> 09:41:02,600
of functions, that each of these layers can calculate something
11635
09:41:02,600 --> 09:41:05,800
a little bit different, and we can combine that information
11636
09:41:05,800 --> 09:41:08,560
to figure out what the output should be.
11637
09:41:08,560 --> 09:41:11,120
Of course, as with any situation of machine learning,
11638
09:41:11,120 --> 09:41:13,640
as we begin to make our models more and more complex,
11639
09:41:13,640 --> 09:41:17,200
to model more and more complex functions, the risk we run
11640
09:41:17,200 --> 09:41:18,960
is something like overfitting.
11641
09:41:18,960 --> 09:41:22,840
And we talked about overfitting last time in the context
11642
09:41:22,840 --> 09:41:25,480
of training our models to be
11643
09:41:25,480 --> 09:41:27,720
able to learn some sort of decision boundary,
11644
09:41:27,720 --> 09:41:31,800
where overfitting happens when we fit too closely to the training data.
11645
09:41:31,800 --> 09:41:36,160
And as a result, we don't generalize well to other situations.
11646
09:41:36,160 --> 09:41:40,280
And one of the risks we run with a far more complex neural network that
11647
09:41:40,280 --> 09:41:44,160
has many, many different nodes is that we might overfit based on the input
11648
09:41:44,160 --> 09:41:44,660
data.
11649
09:41:44,660 --> 09:41:46,800
We might grow over-reliant on certain nodes
11650
09:41:46,800 --> 09:41:49,880
to calculate things just purely based on the input data that
11651
09:41:49,880 --> 09:41:53,520
doesn't allow us to generalize very well to the output.
11652
09:41:53,520 --> 09:41:56,440
And there are a number of strategies for dealing with overfitting.
11653
09:41:56,440 --> 09:41:59,280
But one of the most popular in the context of neural networks
11654
09:41:59,280 --> 09:42:01,160
is a technique known as dropout.
11655
09:42:01,160 --> 09:42:04,440
And what dropout does is, when we're training the neural network,
11656
09:42:04,440 --> 09:42:08,000
temporarily remove units,
11657
09:42:08,000 --> 09:42:11,560
temporarily remove these artificial neurons from our network chosen at
11658
09:42:11,560 --> 09:42:12,640
random.
11659
09:42:12,640 --> 09:42:16,360
And the goal here is to prevent over-reliance on certain units.
11660
09:42:16,360 --> 09:42:18,480
What generally happens in overfitting is that we
11661
09:42:18,480 --> 09:42:21,920
begin to over-rely on certain units inside the neural network
11662
09:42:21,920 --> 09:42:24,880
to be able to tell us how to interpret the input data.
11663
09:42:24,880 --> 09:42:28,160
What dropout will do is randomly remove some of these units
11664
09:42:28,160 --> 09:42:31,520
in order to reduce the chance that we over-rely on certain units
11665
09:42:31,520 --> 09:42:35,480
to make our neural network more robust, to be able to handle the situations
11666
09:42:35,480 --> 09:42:39,360
even when we just drop out particular neurons entirely.
11667
09:42:39,360 --> 09:42:42,120
So the way that might work is we have a network like this.
11668
09:42:42,120 --> 09:42:44,240
And as we're training it, when we go about trying
11669
09:42:44,240 --> 09:42:47,120
to update the weights the first time, we'll just randomly pick
11670
09:42:47,120 --> 09:42:49,600
some percentage of the nodes to drop out of the network.
11671
09:42:49,600 --> 09:42:51,440
It's as if those nodes aren't there at all.
11672
09:42:51,440 --> 09:42:54,640
It's as if the weights associated with those nodes aren't there at all.
11673
09:42:54,640 --> 09:42:56,080
And we'll train it this way.
11674
09:42:56,080 --> 09:42:58,440
Then the next time we update the weights, we'll pick a different set
11675
09:42:58,440 --> 09:42:59,960
and just go ahead and train that way.
11676
09:42:59,960 --> 09:43:02,720
And then again, randomly choose and train with other nodes
11677
09:43:02,720 --> 09:43:04,600
that have been dropped out as well.
11678
09:43:04,600 --> 09:43:07,240
And the goal of that is that after the training process,
11679
09:43:07,240 --> 09:43:10,640
if you train by dropping out random nodes inside of this neural network,
11680
09:43:10,640 --> 09:43:13,480
you hopefully end up with a network that's a little bit more robust,
11681
09:43:13,480 --> 09:43:16,880
that doesn't rely too heavily on any one particular node,
11682
09:43:16,880 --> 09:43:21,800
but more generally learns how to approximate a function in general.
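A sketch of how dropout might be applied to one layer's activations. This uses the common "inverted dropout" scaling, which the lecture doesn't describe, and the drop probability is an illustrative choice:

```python
import random

def dropout(activations, p_drop, training):
    """Randomly zero units while training; keep everything at inference time."""
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    # "Inverted" dropout: surviving activations are scaled up by 1/keep so the
    # expected value of each unit stays the same as without dropout.
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(42)  # fixed seed so the example is reproducible
hidden = [0.8, 0.4, 0.9, 0.2, 0.7]        # activations of one hidden layer
trained = dropout(hidden, p_drop=0.5, training=True)    # some units removed
inference = dropout(hidden, p_drop=0.5, training=False) # full network used
```

Each training update would call this with a fresh random mask, so a different subset of units is dropped every time, which is exactly the "pick a different set each time" idea described above.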
11683
09:43:21,800 --> 09:43:24,040
So that then is a look at some of these techniques
11684
09:43:24,040 --> 09:43:27,400
that we can use in order to implement a neural network,
11685
09:43:27,400 --> 09:43:30,320
to get at the idea of taking this input, passing it
11686
09:43:30,320 --> 09:43:34,120
through these various different layers in order to produce some sort of output.
11687
09:43:34,120 --> 09:43:37,460
And what we'd like to do now is take those ideas and put them into code.
11688
09:43:37,460 --> 09:43:40,300
And to do that, there are a number of different machine learning libraries,
11689
09:43:40,300 --> 09:43:44,160
neural network libraries that we can use that allow us to get access
11690
09:43:44,160 --> 09:43:47,840
to someone's implementation of backpropagation and all of these hidden
11691
09:43:47,840 --> 09:43:48,440
layers.
11692
09:43:48,440 --> 09:43:52,060
And one of the most popular, developed by Google, is known as TensorFlow,
11693
09:43:52,060 --> 09:43:55,640
a library that we can use for quickly creating neural networks and modeling
11694
09:43:55,640 --> 09:43:59,520
them and running them on some sample data to see what the output is going
11695
09:43:59,520 --> 09:44:00,040
to be.
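As a preview of what that can look like, here is a minimal, hypothetical Keras model with one hidden layer. The input size, layer sizes, activations, and loss function are assumptions for illustration, not the course's actual code:

```python
import tensorflow as tf

# Four input features, one hidden layer of 8 units, one sigmoid output unit
# producing a value between 0 and 1 for binary classification.
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# TensorFlow handles backpropagation internally; we just pick an optimizer
# and a loss function, and training would then call model.fit on labeled data.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```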
11696
09:44:00,040 --> 09:44:01,840
And before we actually start writing code,
11697
09:44:01,840 --> 09:44:04,640
we'll go ahead and take a look at TensorFlow's playground, which
11698
09:44:04,640 --> 09:44:08,000
will be an opportunity for us just to play around with this idea of neural
11699
09:44:08,000 --> 09:44:10,880
networks with different layers, just to get a sense for what
11700
09:44:10,880 --> 09:44:15,200
it is that we can do by taking advantage of neural networks.
11701
09:44:15,200 --> 09:44:18,440
So let's go ahead and go into TensorFlow's playground, which
11702
09:44:18,440 --> 09:44:20,920
you can go to by visiting that URL from before.
11703
09:44:20,920 --> 09:44:24,720
And what we're going to do now is we're going to try and learn the decision
11704
09:44:24,720 --> 09:44:27,500
boundary for this particular output.
11705
09:44:27,500 --> 09:44:30,960
I want to learn to separate the orange points from the blue points.
11706
09:44:30,960 --> 09:44:34,000
And I'd like to learn some sort of setting of weights inside of a neural
11707
09:44:34,000 --> 09:44:37,840
network that will be able to separate those from each other.
11708
09:44:37,840 --> 09:44:40,200
The features we have access to, our input data,
11709
09:44:40,200 --> 09:44:44,960
are the x value and the y value, so the two values along each of the two axes.
11710
09:44:44,960 --> 09:44:47,320
And what I'll do now is I can set particular parameters,
11711
09:44:47,320 --> 09:44:50,080
like what activation function I would like to use.
11712
09:44:50,080 --> 09:44:53,960
And I'll just go ahead and press play and see what happens.
11713
09:44:53,960 --> 09:44:56,360
And what happens here is that you'll see that just
11714
09:44:56,360 --> 09:45:00,640
by using these two input features, the x value and the y value,
11715
09:45:00,640 --> 09:45:04,120
with no hidden layers in between,
11715
09:45:04,120 --> 09:45:06,240
we can try to figure out what the decision boundary is.
11717
09:45:06,240 --> 09:45:08,840
Our neural network learns pretty quickly that in order
11718
09:45:08,840 --> 09:45:11,400
to divide these two points, we should just use this line.
11719
09:45:11,400 --> 09:45:13,820
This line acts as a decision boundary that
11720
09:45:13,820 --> 09:45:16,720
separates this group of points from that group of points,
11721
09:45:16,720 --> 09:45:17,720
and it does it very well.
11722
09:45:17,720 --> 09:45:19,420
You can see up here what the loss is.
11723
09:45:19,420 --> 09:45:24,000
The training loss is 0, meaning we were able to perfectly model separating
11724
09:45:24,000 --> 09:45:27,720
these two points from each other inside of our training data.
11725
09:45:27,720 --> 09:45:30,420
So this was a fairly simple case of trying
11726
09:45:30,420 --> 09:45:33,840
to apply a neural network because the data is very clean.
11727
09:45:33,840 --> 09:45:35,880
It's very nicely linearly separable.
11728
09:45:35,880 --> 09:45:39,960
We could just draw a line that separates all of those points from each other.
11729
09:45:39,960 --> 09:45:42,160
Let's now consider a more complex case.
11730
09:45:42,160 --> 09:45:44,640
So I'll go ahead and pause the simulation,
11731
09:45:44,640 --> 09:45:47,840
and we'll go ahead and look at this data set here.
11732
09:45:47,840 --> 09:45:50,320
This data set is a little bit more complex now.
11733
09:45:50,320 --> 09:45:52,520
In this data set, we still have blue and orange points
11734
09:45:52,520 --> 09:45:54,440
that we'd like to separate from each other.
11735
09:45:54,440 --> 09:45:56,680
But there's no single line that we can draw
11736
09:45:56,680 --> 09:45:59,640
that is going to be able to figure out how to separate the blue from the orange,
11737
09:45:59,640 --> 09:46:02,720
because the blue is located in these two quadrants,
11738
09:46:02,720 --> 09:46:04,900
and the orange is located here and here.
11739
09:46:04,900 --> 09:46:07,920
It's a more complex function to be able to learn.
11740
09:46:07,920 --> 09:46:09,000
So let's see what happens.
11741
09:46:09,000 --> 09:46:13,440
If we just try and predict based on those inputs, the x and y coordinates,
11742
09:46:13,440 --> 09:46:16,800
what the output should be, I'll press Play.
11743
09:46:16,800 --> 09:46:18,640
And what you'll notice is that we're not really
11744
09:46:18,640 --> 09:46:21,800
able to draw much of a conclusion, that we're not
11745
09:46:21,800 --> 09:46:25,780
able to very cleanly see how we should divide the orange points from the blue
11746
09:46:25,780 --> 09:46:30,040
points, and you don't see a very clean separation there.
11747
09:46:30,040 --> 09:46:34,320
So it seems like we don't have enough sophistication inside of our network
11748
09:46:34,320 --> 09:46:37,080
to be able to model something that is that complex.
11749
09:46:37,080 --> 09:46:39,800
We need a better model for this neural network.
11750
09:46:39,800 --> 09:46:42,960
And I'll do that by adding a hidden layer.
11751
09:46:42,960 --> 09:46:45,960
So now I have a hidden layer that has two neurons inside of it.
11752
09:46:45,960 --> 09:46:49,080
So I have two inputs that then go to two neurons
11753
09:46:49,080 --> 09:46:52,240
inside of a hidden layer that then go to our output.
11754
09:46:52,240 --> 09:46:54,040
And now I'll press Play.
11755
09:46:54,040 --> 09:46:57,800
And what you'll notice here is that we're able to do slightly better.
11756
09:46:57,800 --> 09:47:00,680
We're able to now say, all right, these points are definitely blue.
11757
09:47:00,680 --> 09:47:02,620
These points are definitely orange.
11758
09:47:02,620 --> 09:47:05,760
We're still struggling a little bit with these points up here, though.
11759
09:47:05,760 --> 09:47:08,920
And what we can do is we can see for each of these hidden neurons,
11760
09:47:08,920 --> 09:47:11,720
what is it exactly that these hidden neurons are doing?
11761
09:47:11,720 --> 09:47:15,120
Each hidden neuron is learning its own decision boundary.
11762
09:47:15,120 --> 09:47:16,840
And we can see what that boundary is.
11763
09:47:16,840 --> 09:47:19,600
This first neuron is learning, all right,
11764
09:47:19,600 --> 09:47:22,680
this line that seems to separate some of the blue points
11765
09:47:22,680 --> 09:47:24,760
from the rest of the points.
11766
09:47:24,760 --> 09:47:27,360
This other hidden neuron is learning another line
11767
09:47:27,360 --> 09:47:29,960
that seems to be separating the orange points in the lower right
11768
09:47:29,960 --> 09:47:31,680
from the rest of the points.
11769
09:47:31,680 --> 09:47:36,480
So that's why we're able to figure out these two areas in the bottom region.
11770
09:47:36,480 --> 09:47:40,360
But we're still not able to perfectly classify all of the points.
11771
09:47:40,360 --> 09:47:42,920
So let's go ahead and add another neuron.
11772
09:47:42,920 --> 09:47:46,160
Now we've got three neurons inside of our hidden layer
11773
09:47:46,160 --> 09:47:48,520
and see what we're able to learn now.
11774
09:47:48,520 --> 09:47:50,680
All right, well, now we seem to be doing a better job.
11775
09:47:50,680 --> 09:47:53,240
By learning three different decision boundaries, one with
11776
09:47:53,240 --> 09:47:55,800
each of the three neurons inside of our hidden layer,
11777
09:47:55,800 --> 09:47:59,820
we're able to much better figure out how to separate these blue points
11778
09:47:59,820 --> 09:48:00,820
from the orange points.
11779
09:48:00,820 --> 09:48:03,600
And we can see what each of these hidden neurons is learning.
11780
09:48:03,600 --> 09:48:06,480
Each one is learning a slightly different decision boundary.
11781
09:48:06,480 --> 09:48:09,120
And then we're combining those decision boundaries together
11782
09:48:09,120 --> 09:48:11,960
to figure out what the overall output should be.
11783
09:48:11,960 --> 09:48:15,640
And then we can try it one more time by adding a fourth neuron there
11784
09:48:15,640 --> 09:48:17,120
and try learning that.
11785
09:48:17,120 --> 09:48:19,400
And it seems like now we can do even better at trying
11786
09:48:19,400 --> 09:48:21,600
to separate the blue points from the orange points.
11787
09:48:21,600 --> 09:48:24,520
But we were only able to do this by adding a hidden layer,
11788
09:48:24,520 --> 09:48:27,400
by adding some layer that is learning some other boundaries
11789
09:48:27,400 --> 09:48:30,340
and combining those boundaries to determine the output.
11790
09:48:30,340 --> 09:48:33,680
And the strength, the size and thickness, of these lines
11791
09:48:33,680 --> 09:48:37,040
indicate how high these weights are, how important each of these inputs
11792
09:48:37,040 --> 09:48:40,320
is for making this sort of calculation.
11793
09:48:40,320 --> 09:48:42,880
And we can do maybe one more simulation.
11794
09:48:42,880 --> 09:48:46,200
Let's go ahead and try this on a data set that looks like this.
11795
09:48:46,200 --> 09:48:47,960
Go ahead and get rid of the hidden layer.
11796
09:48:47,960 --> 09:48:51,040
Here now we're trying to separate the blue points from the orange points
11797
09:48:51,040 --> 09:48:53,080
where all the blue points are located, again,
11798
09:48:53,080 --> 09:48:54,960
inside of a circle effectively.
11799
09:48:54,960 --> 09:48:57,280
So we're not going to be able to learn a line.
11800
09:48:57,280 --> 09:48:58,480
Notice I press Play.
11801
09:48:58,480 --> 09:49:01,400
And we're really not able to draw any sort of classification at all
11802
09:49:01,400 --> 09:49:04,800
because there is no line that cleanly separates the blue points
11803
09:49:04,800 --> 09:49:06,920
from the orange points.
11804
09:49:06,920 --> 09:49:10,600
So let's try to solve this by introducing a hidden layer.
11805
09:49:10,600 --> 09:49:12,760
I'll go ahead and press Play.
11806
09:49:12,760 --> 09:49:14,920
And all right, with two neurons in a hidden layer,
11807
09:49:14,920 --> 09:49:17,080
we're able to do a little better because we effectively
11808
09:49:17,080 --> 09:49:18,880
learned two different decision boundaries.
11809
09:49:18,880 --> 09:49:20,640
We learned this line here.
11810
09:49:20,640 --> 09:49:23,020
And we learned this line on the right-hand side.
11811
09:49:23,020 --> 09:49:25,160
And right now we're just saying, all right, well, if it's in between,
11812
09:49:25,160 --> 09:49:25,960
we'll call it blue.
11813
09:49:25,960 --> 09:49:27,760
And if it's outside, we'll call it orange.
11814
09:49:27,760 --> 09:49:30,240
So not great, but certainly better than before,
11815
09:49:30,240 --> 09:49:33,000
that we're learning one decision boundary and another.
11816
09:49:33,000 --> 09:49:36,920
And based on those, we can figure out what the output should be.
11817
09:49:36,920 --> 09:49:42,200
But let's now go ahead and add a third neuron and see what happens now.
11818
09:49:42,200 --> 09:49:43,400
I go ahead and train it.
11819
09:49:43,400 --> 09:49:46,240
And now, using three different decision boundaries
11820
09:49:46,240 --> 09:49:48,160
that are learned by each of these hidden neurons,
11821
09:49:48,160 --> 09:49:51,040
we're able to much more accurately model this distinction
11822
09:49:51,040 --> 09:49:53,080
between blue points and orange points.
11823
09:49:53,080 --> 09:49:56,000
We're able to figure out maybe with these three decision boundaries,
11824
09:49:56,000 --> 09:49:58,800
combining them together, you can imagine figuring out
11825
09:49:58,800 --> 09:50:02,360
what the output should be and how to make that sort of classification.
11826
09:50:02,360 --> 09:50:05,720
And so the goal here is just to get a sense for having more neurons
11827
09:50:05,720 --> 09:50:09,720
in these hidden layers allows us to learn more structure in the data,
11828
09:50:09,720 --> 09:50:12,640
allows us to figure out what the relevant and important decision
11829
09:50:12,640 --> 09:50:13,600
boundaries are.
11830
09:50:13,600 --> 09:50:15,840
And then using this backpropagation algorithm,
11831
09:50:15,840 --> 09:50:18,840
we're able to figure out what the values of these weights should be
11832
09:50:18,840 --> 09:50:23,640
in order to train this network to be able to classify one category of points
11833
09:50:23,640 --> 09:50:26,360
away from another category of points instead.
11834
09:50:26,360 --> 09:50:28,280
And this is ultimately what we're going to be trying
11835
09:50:28,280 --> 09:50:32,120
to do whenever we're training a neural network.
11836
09:50:32,120 --> 09:50:34,600
So let's go ahead and actually see an example of this.
11837
09:50:34,600 --> 09:50:38,160
You'll recall from last time that we had this banknotes file
11838
09:50:38,160 --> 09:50:41,360
that included information about counterfeit banknotes as opposed
11839
09:50:41,360 --> 09:50:45,960
to authentic banknotes, where I had four different values for each banknote
11840
09:50:45,960 --> 09:50:48,920
and then a categorization of whether that banknote is considered
11841
09:50:48,920 --> 09:50:51,560
to be authentic or a counterfeit note.
11842
09:50:51,560 --> 09:50:55,160
And what I wanted to do was, based on that input information,
11843
09:50:55,160 --> 09:50:57,120
figure out some function that could calculate
11844
09:50:57,120 --> 09:51:00,480
based on the input information what category it belonged to.
11845
09:51:00,480 --> 09:51:02,840
And what I've written here in banknotes.py
11846
09:51:02,840 --> 09:51:05,760
is a neural network that will learn just that, a network that
11847
09:51:05,760 --> 09:51:08,560
learns based on all of the input whether or not
11848
09:51:08,560 --> 09:51:13,040
we should categorize a banknote as authentic or as counterfeit.
11849
09:51:13,040 --> 09:51:15,520
The first step is the same as what we saw from last time.
11850
09:51:15,520 --> 09:51:17,840
I'm really just reading the data in and getting it
11851
09:51:17,840 --> 09:51:19,320
into an appropriate format.
11852
09:51:19,320 --> 09:51:22,960
And so this is where more of the writing Python code on your own
11853
09:51:22,960 --> 09:51:25,080
comes in, in terms of manipulating this data,
11854
09:51:25,080 --> 09:51:28,320
massaging the data into a format that will be understood
11855
09:51:28,320 --> 09:51:32,200
by a machine learning library like scikit-learn or like TensorFlow.
11856
09:51:32,200 --> 09:51:35,960
And so here I separate it into a training and a testing set.
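The splitting step described here can be sketched as follows. Since the actual banknotes file isn't shown, this uses synthetic rows in its place; the 60/40 split proportion and variable names are illustrative assumptions.

```python
import csv
import random

# Synthetic stand-in for the banknotes data: 4 numeric features per note,
# plus a 0/1 label (0 = authentic, 1 = counterfeit). The lecture's code
# reads these rows from a CSV file instead.
random.seed(0)
data = [([random.random() for _ in range(4)], random.randint(0, 1))
        for _ in range(1000)]

# Hold out 40% of the rows as a testing set (an assumed proportion).
random.shuffle(data)
split = int(0.6 * len(data))
training, testing = data[:split], data[split:]

X_training = [row[0] for row in training]
y_training = [row[1] for row in training]
X_testing = [row[0] for row in testing]
y_testing = [row[1] for row in testing]

print(len(X_training), len(X_testing))
```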
11857
09:51:35,960 --> 09:51:40,280
And now what I'm doing down below is I'm creating a neural network.
11858
09:51:40,280 --> 09:51:42,760
Here I'm using TF, which stands for TensorFlow.
11859
09:51:42,760 --> 09:51:47,520
Up above, I wrote import tensorflow as tf, tf just an abbreviation that we'll
11860
09:51:47,520 --> 09:51:49,560
often use so we don't need to write out TensorFlow
11861
09:51:49,560 --> 09:51:52,840
every time we want to use anything inside of the library.
11862
09:51:52,840 --> 09:51:55,160
I'm using tf.keras.
11863
09:51:55,160 --> 09:51:57,600
Keras is an API, a set of functions that we
11864
09:51:57,600 --> 09:52:02,000
can use in order to manipulate neural networks inside of TensorFlow.
11865
09:52:02,000 --> 09:52:04,440
And it turns out there are other machine learning libraries
11866
09:52:04,440 --> 09:52:06,720
that also use the Keras API.
11867
09:52:06,720 --> 09:52:08,920
But here I'm saying, all right, go ahead and give me
11868
09:52:08,920 --> 09:52:12,480
a model that is a sequential model, a sequential neural network,
11869
09:52:12,480 --> 09:52:14,920
meaning one layer after another.
11870
09:52:14,920 --> 09:52:17,960
And now I'm going to add to that model what layers
11871
09:52:17,960 --> 09:52:20,200
I want inside of my neural network.
11872
09:52:20,200 --> 09:52:22,080
So here I'm saying model.add.
11873
09:52:22,080 --> 09:52:24,400
Go ahead and add a dense layer.
11874
09:52:24,400 --> 09:52:28,040
And when we say a dense layer, we mean a layer where each
11875
09:52:28,040 --> 09:52:30,400
of the nodes inside of the layer is going to be connected
11876
09:52:30,400 --> 09:52:32,240
to each of the nodes from the previous layer.
11877
09:52:32,240 --> 09:52:35,600
So we have a densely connected layer.
11878
09:52:35,600 --> 09:52:38,280
This layer is going to have eight units inside of it.
11879
09:52:38,280 --> 09:52:40,840
So it's going to be a hidden layer inside of a neural network
11880
09:52:40,840 --> 09:52:43,720
with eight different units, eight artificial neurons, each of which
11881
09:52:43,720 --> 09:52:45,040
might learn something different.
11882
09:52:45,040 --> 09:52:47,000
And I just sort of chose eight arbitrarily.
11883
09:52:47,000 --> 09:52:50,760
You could choose a different number of hidden nodes inside of the layer.
11884
09:52:50,760 --> 09:52:53,520
And as we saw before, depending on the number of units
11885
09:52:53,520 --> 09:52:56,480
there are inside of your hidden layer, more units
11886
09:52:56,480 --> 09:52:58,400
means you can learn more complex functions.
11887
09:52:58,400 --> 09:53:01,600
So maybe you can more accurately model the training data.
11888
09:53:01,600 --> 09:53:02,720
But it comes at a cost.
11889
09:53:02,720 --> 09:53:05,760
More units means more weights that you need to figure out how to update.
11890
09:53:05,760 --> 09:53:08,280
So it might be more expensive to do that calculation.
11891
09:53:08,280 --> 09:53:10,640
And you also run the risk of overfitting on the data.
11892
09:53:10,640 --> 09:53:13,040
If you have too many units and you learn to just
11893
09:53:13,040 --> 09:53:15,600
overfit on the training data, that's not good either.
11894
09:53:15,600 --> 09:53:16,520
So there is a balance.
11895
09:53:16,520 --> 09:53:20,160
And there's often a testing process where you'll train on some data
11896
09:53:20,160 --> 09:53:23,240
and maybe validate how well you're doing on a separate set of data,
11897
09:53:23,240 --> 09:53:26,840
often called a validation set, to see, all right, which setting of parameters.
11898
09:53:26,840 --> 09:53:28,080
How many layers should I have?
11899
09:53:28,080 --> 09:53:29,800
How many units should be in each layer?
11900
09:53:29,800 --> 09:53:32,600
Which one of those performs the best on the validation set?
11901
09:53:32,600 --> 09:53:36,680
So you can do some testing to figure out what these so-called hyperparameters
11902
09:53:36,680 --> 09:53:38,840
should be equal to.
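The validation process just described can be sketched like this. It uses scikit-learn's MLPClassifier as a stand-in for the lecture's network, with synthetic data and arbitrary candidate unit counts, since none of those specifics are given here.

```python
# Sketch of hyperparameter selection on a validation set.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple learnable rule

# Split off a validation set, separate from any final testing data.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Try several settings of one hyperparameter: units per hidden layer.
scores = {}
for units in (2, 8, 32):
    model = MLPClassifier(hidden_layer_sizes=(units,), max_iter=2000,
                          random_state=0)
    model.fit(X_train, y_train)
    scores[units] = model.score(X_val, y_val)

# Keep whichever setting performed best on the validation set.
best_units = max(scores, key=scores.get)
print(scores, best_units)
```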
11903
09:53:38,840 --> 09:53:41,480
Next, I specify what the input shape is.
11904
09:53:41,480 --> 09:53:43,480
Meaning, all right, what does my input look like?
11905
09:53:43,480 --> 09:53:44,840
My input has four values.
11906
09:53:44,840 --> 09:53:48,920
And so the input shape is just four, because we have four inputs.
11907
09:53:48,920 --> 09:53:51,240
And then I specify what the activation function is.
11908
09:53:51,240 --> 09:53:53,040
And the activation function, again, we can choose.
11909
09:53:53,040 --> 09:53:55,440
There are a number of different activation functions.
11910
09:53:55,440 --> 09:53:59,200
Here I'm using ReLU, which you might recall from earlier.
11911
09:53:59,200 --> 09:54:01,560
And then I'll add an output layer.
11912
09:54:01,560 --> 09:54:02,920
So I have my hidden layer.
11913
09:54:02,920 --> 09:54:05,960
Now I'm adding one more layer that will just have one unit,
11914
09:54:05,960 --> 09:54:07,960
because all I want to do is predict something
11915
09:54:07,960 --> 09:54:10,520
like counterfeit bill or authentic bill.
11916
09:54:10,520 --> 09:54:12,280
So I just need a single unit.
11917
09:54:12,280 --> 09:54:14,480
And the activation function I'm going to use here
11918
09:54:14,480 --> 09:54:16,840
is that sigmoid activation function, which, again,
11919
09:54:16,840 --> 09:54:20,800
was that S-shaped curve that just gave us a probability of what
11920
09:54:20,800 --> 09:54:24,160
is the probability that this is a counterfeit bill,
11921
09:54:24,160 --> 09:54:26,400
as opposed to an authentic bill.
11922
09:54:26,400 --> 09:54:29,240
So that, then, is the structure of my neural network,
11923
09:54:29,240 --> 09:54:32,880
a sequential neural network that has one hidden layer with eight units inside
11924
09:54:32,880 --> 09:54:37,040
of it, and then one output layer that just has a single unit inside of it.
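The structure just described can be sketched in Keras. The layer sizes and activations match the description above; the variable names are just illustrative.

```python
import tensorflow as tf

# A sequential network: one layer after another.
model = tf.keras.models.Sequential()

# Hidden layer: 8 densely connected units, 4 inputs, ReLU activation.
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))

# Output layer: a single unit with a sigmoid activation,
# giving a probability that the bill is counterfeit.
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()
```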
11925
09:54:37,040 --> 09:54:38,760
And I can choose how many units there are.
11926
09:54:38,760 --> 09:54:40,960
I can choose the activation function.
11927
09:54:40,960 --> 09:54:44,240
Then I'm going to compile this model.
11928
09:54:44,240 --> 09:54:48,040
TensorFlow gives you a choice of how you would like to optimize the weights.
11929
09:54:48,040 --> 09:54:50,160
There are various different algorithms for doing that.
11930
09:54:50,160 --> 09:54:52,040
You can also choose what type of loss function you want to use.
11931
09:54:52,040 --> 09:54:54,120
Again, many different options for doing that.
11932
09:54:54,120 --> 09:54:57,300
And then how I want to evaluate my model, well, I care about accuracy.
11933
09:54:57,300 --> 09:55:01,920
I care about how many of my points am I able to classify correctly
11934
09:55:01,920 --> 09:55:04,600
versus not correctly as counterfeit or not counterfeit.
11935
09:55:04,600 --> 09:55:09,920
And I would like it to report to me how accurate my model is performing.
11936
09:55:09,920 --> 09:55:12,360
Then, now that I've defined that model, I
11937
09:55:12,360 --> 09:55:15,520
call model.fit to say go ahead and train the model.
11938
09:55:15,520 --> 09:55:19,480
Train it on all the training data plus all of the training labels.
11939
09:55:19,480 --> 09:55:22,360
So labels for each of those pieces of training data.
11940
09:55:22,360 --> 09:55:25,440
And I'm saying run it for 20 epochs, meaning go ahead and go
11941
09:55:25,440 --> 09:55:28,080
through each of these training points 20 times, effectively.
11942
09:55:28,080 --> 09:55:31,480
Go through the data 20 times and keep trying to update the weights.
11943
09:55:31,480 --> 09:55:33,720
If I did it for more, I could train for even longer
11944
09:55:33,720 --> 09:55:36,040
and maybe get a more accurate result.
11945
09:55:36,040 --> 09:55:39,640
But then after I fit it on all the data, I'll go ahead and just test it.
11946
09:55:39,640 --> 09:55:43,720
I'll evaluate my model using model.evaluate built into TensorFlow
11947
09:55:43,720 --> 09:55:47,380
that is just going to tell me how well do I perform on the testing data.
11948
09:55:47,380 --> 09:55:50,420
So ultimately, this is just going to give me some numbers that tell me
11949
09:55:50,420 --> 09:55:54,320
how well we did in this particular case.
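The compile / fit / evaluate steps just described can be sketched as below, run on small synthetic data in place of the actual banknotes file. The optimizer, loss, and metric here are common choices for a binary classifier, not necessarily the exact ones in banknotes.py.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the training and testing data.
rng = np.random.default_rng(0)
X_training = rng.normal(size=(200, 4)).astype("float32")
y_training = (X_training[:, 0] > 0).astype("float32")
X_testing = rng.normal(size=(50, 4)).astype("float32")
y_testing = (X_testing[:, 0] > 0).astype("float32")

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Choose how to optimize the weights, which loss to use,
# and that we want accuracy reported.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Train: go through the training data 20 times (20 epochs).
model.fit(X_training, y_training, epochs=20, verbose=0)

# Evaluate on the held-out testing data; returns [loss, accuracy].
loss, accuracy = model.evaluate(X_testing, y_testing, verbose=0)
print(accuracy)
```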
11950
09:55:54,320 --> 09:55:57,840
So now what I'm going to do is go into banknotes and go ahead and run
11951
09:55:57,840 --> 09:55:59,240
banknotes.py.
11952
09:55:59,240 --> 09:56:02,280
And what's going to happen now is it's going to read in all of that training
11953
09:56:02,280 --> 09:56:02,880
data.
11954
09:56:02,880 --> 09:56:05,880
It's going to generate a neural network with all my inputs,
11955
09:56:05,880 --> 09:56:10,240
my eight hidden units inside my layer, and then an output unit.
11956
09:56:10,240 --> 09:56:11,880
And now what it's doing is it's training.
11957
09:56:11,880 --> 09:56:13,600
It's training 20 times.
11958
09:56:13,600 --> 09:56:17,200
And each time you can see how my accuracy is increasing on my training data.
11959
09:56:17,200 --> 09:56:20,200
It starts off the very first time not very accurate,
11960
09:56:20,200 --> 09:56:23,920
though better than random, something like 79% of the time.
11961
09:56:23,920 --> 09:56:26,880
It's able to accurately classify one bill from another.
11962
09:56:26,880 --> 09:56:29,600
But as I keep training, notice this accuracy value
11963
09:56:29,600 --> 09:56:33,320
improves and improves and improves until after I've trained through all
11964
09:56:33,320 --> 09:56:39,600
the data points 20 times, it looks like my accuracy is above 99% on the training
11965
09:56:39,600 --> 09:56:40,480
data.
11966
09:56:40,480 --> 09:56:43,840
And here's where I tested it on a whole bunch of testing data.
11967
09:56:43,840 --> 09:56:48,440
And it looks like in this case, I was also like 99.8% accurate.
11968
09:56:48,440 --> 09:56:51,200
So just using that, I was able to generate a neural network that
11969
09:56:51,200 --> 09:56:54,480
can detect counterfeit bills from authentic bills based on this input
11970
09:56:54,480 --> 09:56:59,280
data 99.8% of the time, at least based on this particular testing data.
11971
09:56:59,280 --> 09:57:01,480
And I might want to test it with more data as well,
11972
09:57:01,480 --> 09:57:03,040
just to be confident about that.
11973
09:57:03,040 --> 09:57:06,960
But this is really the value of using a machine learning library like TensorFlow.
11974
09:57:06,960 --> 09:57:10,000
And there are others available for Python and other languages as well.
11975
09:57:10,000 --> 09:57:13,520
But all I have to do is define the structure of the network
11976
09:57:13,520 --> 09:57:16,560
and define the data that I'm going to pass into the network.
11977
09:57:16,560 --> 09:57:19,840
And then TensorFlow runs the backpropagation algorithm
11978
09:57:19,840 --> 09:57:22,040
for learning what all of those weights should be,
11979
09:57:22,040 --> 09:57:24,640
for figuring out how to train this neural network to be
11980
09:57:24,640 --> 09:57:27,240
able to accurately, as accurately as possible,
11981
09:57:27,240 --> 09:57:31,920
figure out what the output values should be there as well.
11982
09:57:31,920 --> 09:57:36,400
And so this then was a look at what it is that neural networks can do just
11983
09:57:36,400 --> 09:57:39,520
using these sequences of layer after layer after layer.
11984
09:57:39,520 --> 09:57:43,240
And you can begin to imagine applying these to much more general problems.
11985
09:57:43,240 --> 09:57:45,920
And one big problem in computing and artificial intelligence
11986
09:57:45,920 --> 09:57:49,280
more generally is the problem of computer vision.
11987
09:57:49,280 --> 09:57:51,840
Computer vision is all about computational methods
11988
09:57:51,840 --> 09:57:54,600
for analyzing and understanding images.
11989
09:57:54,600 --> 09:57:57,400
You might have pictures that you want the computer to figure out
11990
09:57:57,400 --> 09:57:59,480
how to deal with, how to process those images
11991
09:57:59,480 --> 09:58:02,960
and figure out how to produce some sort of useful result out of this.
11992
09:58:02,960 --> 09:58:05,360
You've seen this in the context of social media websites
11993
09:58:05,360 --> 09:58:08,360
that are able to look at a photo that contains a whole bunch of faces.
11994
09:58:08,360 --> 09:58:10,520
And it's able to figure out what's a picture of whom
11995
09:58:10,520 --> 09:58:13,320
and label those and tag them with appropriate people.
11996
09:58:13,320 --> 09:58:15,360
This is becoming increasingly relevant as we
11997
09:58:15,360 --> 09:58:19,280
begin to discuss self-driving cars, that these cars now have cameras.
11998
09:58:19,280 --> 09:58:22,080
And we would like for the computer to have some sort of algorithm
11999
09:58:22,080 --> 09:58:26,760
that looks at the image and figures out what color is the light, what cars
12000
09:58:26,760 --> 09:58:29,200
are around us and in what direction, for example.
12001
09:58:29,200 --> 09:58:33,160
And so computer vision is all about taking an image and figuring out
12002
09:58:33,160 --> 09:58:35,600
what sort of computation, what sort of calculation
12003
09:58:35,600 --> 09:58:36,880
we can do with that image.
12004
09:58:36,880 --> 09:58:40,720
It's also relevant in the context of something like handwriting recognition.
12005
09:58:40,720 --> 09:58:43,800
This, what you're looking at, is an example of the MNIST data set.
12006
09:58:43,800 --> 09:58:46,240
It's a big data set just of handwritten digits
12007
09:58:46,240 --> 09:58:48,800
that we could use to ideally try and figure out
12008
09:58:48,800 --> 09:58:52,480
how to predict, given someone's handwriting, given a photo of a digit
12009
09:58:52,480 --> 09:58:57,120
that they have drawn, can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8,
12010
09:58:57,120 --> 09:58:58,320
or 9, for example.
12011
09:58:58,320 --> 09:59:01,080
So this sort of handwriting recognition is yet another task
12012
09:59:01,080 --> 09:59:04,280
that we might want to use computer vision tools
12013
09:59:04,280 --> 09:59:05,720
to be able to solve.
12014
09:59:05,720 --> 09:59:08,840
This might be a task that we might care about.
12015
09:59:08,840 --> 09:59:11,360
So how, then, can we use neural networks to be
12016
09:59:11,360 --> 09:59:13,080
able to solve a problem like this?
12017
09:59:13,080 --> 09:59:15,600
Well, neural networks rely upon some sort of input
12018
09:59:15,600 --> 09:59:17,600
where that input is just numerical data.
12019
09:59:17,600 --> 09:59:19,880
We have a whole bunch of units where each one of them
12020
09:59:19,880 --> 09:59:22,080
just represents some sort of number.
12021
09:59:22,080 --> 09:59:24,920
And so in the context of something like handwriting recognition
12022
09:59:24,920 --> 09:59:29,200
or in the context of just an image, you might imagine that an image is really
12023
09:59:29,200 --> 09:59:34,160
just a grid of pixels, a grid of dots where each dot has some sort of color.
12024
09:59:34,160 --> 09:59:36,800
And in the context of something like handwriting recognition,
12025
09:59:36,800 --> 09:59:39,880
you might imagine that if you just fill in each of these dots in a particular
12026
09:59:39,880 --> 09:59:42,640
way, you can generate a 2 or an 8, for example,
12027
09:59:42,640 --> 09:59:46,680
based on which dots happen to be shaded in and which dots are not.
12028
09:59:46,680 --> 09:59:50,400
And we can represent each of these pixel values just using numbers.
12029
09:59:50,400 --> 09:59:55,360
So for a particular pixel, for example, 0 might represent entirely black.
12030
09:59:55,360 --> 09:59:57,300
Depending on how you're representing color,
12031
09:59:57,300 --> 10:00:02,000
it's often common to represent color values on a 0 to 255 range
12032
10:00:02,000 --> 10:00:06,160
so that you can represent a color using 8 bits for a particular value,
12033
10:00:06,160 --> 10:00:08,400
like how much white is in the image.
12034
10:00:08,400 --> 10:00:10,920
So 0 might represent all black.
12035
10:00:10,920 --> 10:00:14,200
255 might represent entirely white as a pixel.
12036
10:00:14,200 --> 10:00:18,400
And somewhere in between might represent some shade of gray, for example.
12037
10:00:18,400 --> 10:00:20,760
But you might imagine not just having a single slider that
12038
10:00:20,760 --> 10:00:22,640
determines how much white is in the image,
12039
10:00:22,640 --> 10:00:24,760
but if you had a color image, you might imagine
12040
10:00:24,760 --> 10:00:28,080
three different numerical values, a red, green, and blue value,
12041
10:00:28,080 --> 10:00:30,760
where the red value controls how much red is in the image.
12042
10:00:30,760 --> 10:00:33,800
We have one value for controlling how much green is in the pixel
12043
10:00:33,800 --> 10:00:36,440
and one value for how much blue is in the pixel as well.
12044
10:00:36,440 --> 10:00:40,240
And depending on how it is that you set these values of red, green, and blue,
12045
10:00:40,240 --> 10:00:42,000
you can get a different color.
12046
10:00:42,000 --> 10:00:45,720
And so any pixel can really be represented, in this case,
12047
10:00:45,720 --> 10:00:50,640
by three numerical values, a red value, a green value, and a blue value.
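The pixel representations just described can be illustrated concretely. The particular values here are arbitrary examples; the point is that grayscale pixels are single 0-255 numbers and color pixels are (red, green, blue) triples.

```python
import numpy as np

# A 2x2 grayscale image: 0 is entirely black, 255 is entirely white.
grayscale = np.array([[0, 128],
                      [200, 255]], dtype=np.uint8)

# A 2x2 color image: each pixel holds three values, one per channel.
color = np.array([[[255, 0, 0], [0, 255, 0]],       # red, green
                  [[0, 0, 255], [128, 128, 128]]],  # blue, gray
                 dtype=np.uint8)

# Every color pixel is described by 3 numbers, so the grid as a whole
# is just 2 * 2 * 3 = 12 numerical values we could feed to a network.
print(color.shape, color.size)
```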
12048
10:00:50,640 --> 10:00:54,160
And if you take a whole bunch of these pixels, assemble them together
12049
10:00:54,160 --> 10:00:56,840
inside of a grid of pixels, then you really
12050
10:00:56,840 --> 10:00:59,040
just have a whole bunch of numerical values
12051
10:00:59,040 --> 10:01:03,120
that you can use in order to perform some sort of prediction task.
12052
10:01:03,120 --> 10:01:05,800
And so what you might imagine doing is using the same techniques
12053
10:01:05,800 --> 10:01:08,680
we talked about before, just design a neural network
12054
10:01:08,680 --> 10:01:12,120
with a lot of inputs, that for each of the pixels,
12055
10:01:12,120 --> 10:01:13,960
we might have one or three different inputs
12056
10:01:13,960 --> 10:01:16,840
in the case of a color image, a different input that
12057
10:01:16,840 --> 10:01:20,080
is just connected to a deep neural network, for example.
12058
10:01:20,080 --> 10:01:22,920
And this deep neural network might take all of the pixels
12059
10:01:22,920 --> 10:01:27,040
inside of the image of what digit a person drew.
12060
10:01:27,040 --> 10:01:29,160
And the output might be like 10 neurons that
12061
10:01:29,160 --> 10:01:32,360
classify it as a 0, or a 1, or a 2, or a 3,
12062
10:01:32,360 --> 10:01:36,760
or just tells us in some way what that digit happens to be.
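A deep network of this flat, pixels-in / ten-outputs-out shape might be sketched as below. The 28x28 image size (MNIST's), the hidden-layer width, and the softmax output are all illustrative assumptions, not details given here.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Turn the 28x28 grid of pixels into a flat list of 784 inputs.
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # An assumed hidden layer.
    tf.keras.layers.Dense(128, activation="relu"),
    # 10 output units: one score per digit 0 through 9.
    tf.keras.layers.Dense(10, activation="softmax"),
])

print(model.output_shape)
```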
12063
10:01:36,760 --> 10:01:39,080
Now, there are a couple of drawbacks to this approach.
12064
10:01:39,080 --> 10:01:42,680
The first drawback to the approach is just the size of this input array,
12065
10:01:42,680 --> 10:01:44,600
that we have a whole bunch of inputs.
12066
10:01:44,600 --> 10:01:47,160
If we have a big image that has a lot of different channels,
12067
10:01:47,160 --> 10:01:50,040
we're looking at a lot of inputs, and therefore a lot of weights
12068
10:01:50,040 --> 10:01:51,960
that we have to calculate.
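A quick back-of-the-envelope count shows how fast the weights add up. The image size and hidden-layer width here are assumed for illustration (a 28x28 image densely connected to 100 hidden units).

```python
inputs = 28 * 28   # one input per pixel: 784
hidden = 100       # an arbitrary hidden-layer size

# A dense layer connects every input to every hidden unit,
# plus one bias weight per hidden unit.
weights = inputs * hidden + hidden
print(weights)  # 78500

# With a color image there are three channels, tripling the inputs.
color_weights = inputs * 3 * hidden + hidden
print(color_weights)  # 235300
```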
12069
10:01:51,960 --> 10:01:55,680
And a second problem is the fact that by flattening everything
12070
10:01:55,680 --> 10:01:58,040
into just this structure of all the pixels,
12071
10:01:58,040 --> 10:02:00,760
we've lost access to a lot of the information
12072
10:02:00,760 --> 10:02:03,280
about the structure of the image that's relevant,
12073
10:02:03,280 --> 10:02:05,800
that really, when a person looks at an image,
12074
10:02:05,800 --> 10:02:08,040
they're looking at particular features of the image.
12075
10:02:08,040 --> 10:02:09,000
They're looking at curves.
12076
10:02:09,000 --> 10:02:09,880
They're looking at shapes.
12077
10:02:09,880 --> 10:02:11,720
They're looking at what things can you identify
12078
10:02:11,720 --> 10:02:14,640
in different regions of the image, and maybe put those things together
12079
10:02:14,640 --> 10:02:18,200
in order to get a better picture of what the overall image is about.
12080
10:02:18,200 --> 10:02:22,200
And by just turning it into pixel values for each of the pixels,
12081
10:02:22,200 --> 10:02:24,600
sure, you might be able to learn that structure,
12082
10:02:24,600 --> 10:02:26,520
but it might be challenging in order to do so.
12083
10:02:26,520 --> 10:02:28,880
It might be helpful to take advantage of the fact
12084
10:02:28,880 --> 10:02:31,400
that you can use properties of the image itself, the fact
12085
10:02:31,400 --> 10:02:33,660
that it's structured in a particular way, to be
12086
10:02:33,660 --> 10:02:37,400
able to improve the way that we learn based on that image too.
12087
10:02:37,400 --> 10:02:40,480
So in order to figure out how we can train our neural networks to better
12088
10:02:40,480 --> 10:02:43,760
be able to deal with images, we'll introduce a couple of ideas,
12089
10:02:43,760 --> 10:02:45,960
a couple of algorithms that we can apply that
12090
10:02:45,960 --> 10:02:50,080
allow us to take the image and extract some useful information out
12091
10:02:50,080 --> 10:02:50,880
of that image.
12092
10:02:50,880 --> 10:02:54,720
And the first idea we'll introduce is the notion of image convolution.
12093
10:02:54,720 --> 10:02:58,240
And what image convolution is all about is it's about filtering an image,
12094
10:02:58,240 --> 10:03:01,600
sort of extracting useful or relevant features out of the image.
12095
10:03:01,600 --> 10:03:05,680
And the way we do that is by applying a particular filter that
12096
10:03:05,680 --> 10:03:09,040
basically combines the value of every pixel with the values
12097
10:03:09,040 --> 10:03:11,480
of all of its neighboring pixels, according
12098
10:03:11,480 --> 10:03:14,080
to some sort of kernel matrix, which we'll see in a moment,
12099
10:03:14,080 --> 10:03:17,560
is going to allow us to weight these pixels in various different ways.
12100
10:03:17,560 --> 10:03:19,560
And the goal of image convolution, then, is
12101
10:03:19,560 --> 10:03:22,960
to extract some sort of interesting or useful features out of an image,
12102
10:03:22,960 --> 10:03:26,320
to be able to take a pixel and, based on its neighboring pixels,
12103
10:03:26,320 --> 10:03:29,200
maybe predict some sort of valuable information.
12104
10:03:29,200 --> 10:03:32,120
Something like taking a pixel and looking at its neighboring pixels,
12105
10:03:32,120 --> 10:03:33,560
you might be able to predict whether or not
12106
10:03:33,560 --> 10:03:35,360
there's some sort of curve inside the image,
12107
10:03:35,360 --> 10:03:38,440
or whether it's forming the outline of a particular line or a shape,
12108
10:03:38,440 --> 10:03:39,280
for example.
12109
10:03:39,280 --> 10:03:42,040
And that might be useful if you're trying to use
12110
10:03:42,040 --> 10:03:44,680
all of these various different features to combine them
12111
10:03:44,680 --> 10:03:48,120
to say something meaningful about an image as a whole.
12112
10:03:48,120 --> 10:03:50,080
So how, then, does image convolution work?
12113
10:03:50,080 --> 10:03:52,280
Well, we start with a kernel matrix.
12114
10:03:52,280 --> 10:03:54,440
And the kernel matrix looks something like this.
12115
10:03:54,440 --> 10:03:58,080
And the idea of this is that, given a pixel that will be the middle pixel,
12116
10:03:58,080 --> 10:04:00,960
we're going to multiply each of the neighboring pixels
12117
10:04:00,960 --> 10:04:04,440
by these values in order to get some sort of result
12118
10:04:04,440 --> 10:04:06,820
by summing up all the numbers together.
12119
10:04:06,820 --> 10:04:09,320
So if I take this kernel, which you can think of as a filter
12120
10:04:09,320 --> 10:04:13,440
that I'm going to apply to the image, and let's say that I take this image.
12121
10:04:13,440 --> 10:04:14,760
This is a 4 by 4 image.
12122
10:04:14,760 --> 10:04:16,680
We'll think of it as just a black and white image,
12123
10:04:16,680 --> 10:04:19,920
where each one is just a single pixel value.
12124
10:04:19,920 --> 10:04:22,840
So somewhere between 0 and 255, for example.
12125
10:04:22,840 --> 10:04:25,720
So we have a whole bunch of individual pixel values like this.
12126
10:04:25,720 --> 10:04:30,560
And what I'd like to do is apply this kernel, this filter, so to speak,
12127
10:04:30,560 --> 10:04:32,440
to this image.
12128
10:04:32,440 --> 10:04:35,040
And the way I'll do that is, all right, the kernel is 3 by 3.
12129
10:04:35,040 --> 10:04:38,200
You can imagine a 5 by 5 kernel or a larger kernel, too.
12130
10:04:38,200 --> 10:04:41,280
And I'll take it and just first apply it to the first 3
12131
10:04:41,280 --> 10:04:43,720
by 3 section of the image.
12132
10:04:43,720 --> 10:04:46,960
And what I'll do is I'll take each of these pixel values,
12133
10:04:46,960 --> 10:04:50,200
multiply it by its corresponding value in the filter matrix,
12134
10:04:50,200 --> 10:04:53,200
and add all of the results together.
12135
10:04:53,200 --> 10:04:59,320
So here, for example, I'll say 10 times 0, plus 20 times negative 1,
12136
10:04:59,320 --> 10:05:03,720
plus 30 times 0, so on and so forth, doing all of this calculation.
12137
10:05:03,720 --> 10:05:05,480
And at the end, if I take all these values,
12138
10:05:05,480 --> 10:05:08,240
multiply them by their corresponding value in the kernel,
12139
10:05:08,240 --> 10:05:11,680
add the results together, for this particular set of 9 pixels,
12140
10:05:11,680 --> 10:05:14,800
I get the value of 10, for example.
12141
10:05:14,800 --> 10:05:19,880
And then what I'll do is I'll slide this 3 by 3 grid, effectively, over.
12142
10:05:19,880 --> 10:05:24,520
I'll slide the kernel by 1 to look at the next 3 by 3 section.
12143
10:05:24,520 --> 10:05:26,440
Here, I'm just sliding it over by 1 pixel.
12144
10:05:26,440 --> 10:05:28,240
But you might imagine a different stride length,
12145
10:05:28,240 --> 10:05:31,040
or maybe I jump by multiple pixels at a time if you really wanted to.
12146
10:05:31,040 --> 10:05:32,400
You have different options here.
12147
10:05:32,400 --> 10:05:35,920
But here, I'm just sliding over, looking at the next 3 by 3 section.
12148
10:05:35,920 --> 10:05:40,240
And I'll do the same math, 20 times 0, plus 30 times negative 1,
12149
10:05:40,240 --> 10:05:45,240
plus 40 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5.
12150
10:05:45,240 --> 10:05:47,240
And what I end up getting is the number 20.
12151
10:05:47,240 --> 10:05:50,520
Then you can imagine shifting over to this one, doing the same thing,
12152
10:05:50,520 --> 10:05:54,320
calculating the number 40, for example, and then doing the same thing here,
12153
10:05:54,320 --> 10:05:56,920
and calculating a value there as well.
12154
10:05:56,920 --> 10:06:00,640
And so what we have now is what we'll call a feature map.
12155
10:06:00,640 --> 10:06:03,600
We have taken this kernel, applied it to each
12156
10:06:03,600 --> 10:06:06,320
of these various different regions, and what we get
12157
10:06:06,320 --> 10:06:11,240
is some representation of a filtered version of that image.
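The sliding computation just walked through can be sketched directly. The kernel (0s and -1s around a 5) and image values below are assumptions chosen to be consistent with the arithmetic in the walkthrough, which produced 10, 20, and 40 for the first three windows.

```python
import numpy as np

image = np.array([[10, 20, 30, 40],
                  [10, 20, 30, 40],
                  [20, 30, 40, 50],
                  [20, 30, 40, 50]])

kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]])

# Slide the 3x3 kernel over the image one pixel at a time (stride 1),
# multiplying each overlapping pair of values and summing the results.
h, w = image.shape
k = kernel.shape[0]
feature_map = np.zeros((h - k + 1, w - k + 1), dtype=int)
for i in range(h - k + 1):
    for j in range(w - k + 1):
        feature_map[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)

print(feature_map)  # [[10 20]
                    #  [40 50]]
```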
12158
10:06:11,240 --> 10:06:13,040
And so to give a more concrete example of why
12159
10:06:13,040 --> 10:06:14,920
it is that this kind of thing could be useful,
12160
10:06:14,920 --> 10:06:18,480
let's take this kernel matrix, for example, which is quite a famous one,
12161
10:06:18,480 --> 10:06:22,240
that has an 8 in the middle, and then all of the neighboring pixels
12162
10:06:22,240 --> 10:06:23,680
get a negative 1.
12163
10:06:23,680 --> 10:06:26,920
And let's imagine we wanted to apply that to a 3
12164
10:06:26,920 --> 10:06:31,320
by 3 part of an image that looks like this, where all the values are the same.
12165
10:06:31,320 --> 10:06:33,560
They're all 20, for instance.
12166
10:06:33,560 --> 10:06:38,160
Well, in this case, if you do 20 times 8, and then subtract 20, subtract 20,
12167
10:06:38,160 --> 10:06:40,920
subtract 20 for each of the eight neighbors, well, the result of that
12168
10:06:40,920 --> 10:06:44,680
is you just get that expression, which comes out to be 0.
12169
10:06:44,680 --> 10:06:47,200
You multiplied 20 by 8, but then you subtracted
12170
10:06:47,200 --> 10:06:50,200
20 eight times, according to that particular kernel.
12171
10:06:50,200 --> 10:06:52,400
The result of all that is just 0.
12172
10:06:52,400 --> 10:06:56,400
So the takeaway here is that when a lot of the pixels are the same value,
12173
10:06:56,400 --> 10:06:59,320
we end up getting a value close to 0.
12174
10:06:59,320 --> 10:07:02,720
If, though, we had something like this, 20 is along this first row,
12175
10:07:02,720 --> 10:07:05,720
then 50 is in the second row, and 50 is in the third row, well,
12176
10:07:05,720 --> 10:07:08,920
then when you do this, because it's the same kind of math, 20 times negative 1,
12177
10:07:08,920 --> 10:07:12,680
20 times negative 1, so on and so forth, then I get a higher value,
12178
10:07:12,680 --> 10:07:15,680
a value like 90 in this particular case.
12179
10:07:15,680 --> 10:07:21,040
And so the more general idea here is that by applying this kernel, negative 1s,
12180
10:07:21,040 --> 10:07:23,800
8 in the middle, and then negative 1s, what I get
12181
10:07:23,800 --> 10:07:29,240
is when this middle value is very different from the neighboring values,
12182
10:07:29,240 --> 10:07:31,880
like 50 is greater than these 20s, then you'll
12183
10:07:31,880 --> 10:07:34,640
end up with a value higher than 0.
12184
10:07:34,640 --> 10:07:36,760
If this number is higher than its neighbors,
12185
10:07:36,760 --> 10:07:38,280
you end up getting a bigger output.
12186
10:07:38,280 --> 10:07:41,360
But if this value is the same as all of its neighbors,
12187
10:07:41,360 --> 10:07:43,920
then you get a lower output, something like 0.
12188
10:07:43,920 --> 10:07:46,440
And it turns out that this sort of filter can therefore
12189
10:07:46,440 --> 10:07:49,720
be used in something like detecting edges in an image.
12190
10:07:49,720 --> 10:07:53,120
Or if I want to detect the boundaries between various different objects
12191
10:07:53,120 --> 10:07:54,160
inside of an image.
12192
10:07:54,160 --> 10:07:57,200
I might use a filter like this, which is able to tell
12193
10:07:57,200 --> 10:08:00,000
whether the value of this pixel is different
12194
10:08:00,000 --> 10:08:02,080
from the values of the neighboring pixels,
12195
10:08:02,080 --> 10:08:06,800
if it's greater than the values of the pixels that happen to surround it.
12196
10:08:06,800 --> 10:08:09,480
And so we can use this in terms of image filtering.
12197
10:08:09,480 --> 10:08:11,520
And so I'll show you an example of that.
12198
10:08:11,520 --> 10:08:17,680
I have here in filter.py a file that uses the Python Imaging Library,
12199
10:08:17,680 --> 10:08:21,400
or PIL, to do some image filtering.
12200
10:08:21,400 --> 10:08:23,080
I go ahead and open an image.
12201
10:08:23,080 --> 10:08:26,840
And then all I'm going to do is apply a kernel to that image.
12202
10:08:26,840 --> 10:08:30,520
It's going to be a 3 by 3 kernel, same kind of kernel we saw before.
12203
10:08:30,520 --> 10:08:31,960
And here is the kernel.
12204
10:08:31,960 --> 10:08:34,880
This is just a list representation of the same matrix
12205
10:08:34,880 --> 10:08:36,160
that I showed you a moment ago.
12206
10:08:36,160 --> 10:08:38,160
It's negative 1, negative 1, negative 1.
12207
10:08:38,160 --> 10:08:40,920
The second row is negative 1, 8, negative 1.
12208
10:08:40,920 --> 10:08:43,200
And the third row is all negative 1s.
12209
10:08:43,200 --> 10:08:47,840
And then at the end, I'm going to go ahead and show the filtered image.
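A minimal sketch of what filter.py might contain, assuming the Pillow fork of PIL; apply_edge_filter is my own name for the step, and the real file may be organized differently.

```python
from PIL import Image, ImageFilter

def apply_edge_filter(image):
    """Apply the 3x3 edge-detection kernel to an image via Pillow."""
    return image.filter(ImageFilter.Kernel(
        size=(3, 3),
        kernel=[-1, -1, -1,
                -1,  8, -1,
                -1, -1, -1],
        scale=1  # the weights sum to 0, so supply an explicit scale
    ))

# A uniform grayscale image comes out all zeros, as in the example.
demo = Image.new("L", (5, 5), color=20)
print(apply_edge_filter(demo).getpixel((2, 2)))  # 0
```

For a real image like bridge.png, you would open it with Image.open, apply the filter, and call .show() on the result.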
12210
10:08:47,840 --> 10:08:53,640
So if, for example, I go into convolution directory
12211
10:08:53,640 --> 10:08:56,600
and I open up an image, like bridge.png, this
12212
10:08:56,600 --> 10:09:02,560
is what an input image might look like, just an image of a bridge over a river.
12213
10:09:02,560 --> 10:09:07,640
Now I'm going to go ahead and run this filter program on the bridge.
12214
10:09:07,640 --> 10:09:10,080
And what I get is this image here.
12215
10:09:10,080 --> 10:09:13,280
Just by taking the original image and applying that filter
12216
10:09:13,280 --> 10:09:17,480
to each 3 by 3 grid, I've extracted all of the boundaries,
12217
10:09:17,480 --> 10:09:20,800
all of the edges inside the image that separate one part of the image
12218
10:09:20,800 --> 10:09:21,360
from another.
12219
10:09:21,360 --> 10:09:24,000
So here I've got a representation of boundaries
12220
10:09:24,000 --> 10:09:26,320
between particular parts of the image.
12221
10:09:26,320 --> 10:09:28,880
And you might imagine that if a machine learning algorithm is
12222
10:09:28,880 --> 10:09:33,120
trying to learn what an image is of, a filter like this could be pretty useful.
12223
10:09:33,120 --> 10:09:35,920
Maybe the machine learning algorithm doesn't
12224
10:09:35,920 --> 10:09:38,440
care about all of the details of the image.
12225
10:09:38,440 --> 10:09:40,480
It just cares about certain useful features.
12226
10:09:40,480 --> 10:09:42,640
It cares about particular shapes that are
12227
10:09:42,640 --> 10:09:45,160
able to help it determine that based on the image,
12228
10:09:45,160 --> 10:09:47,680
this is going to be a bridge, for example.
12229
10:09:47,680 --> 10:09:50,080
And so this type of idea of image convolution
12230
10:09:50,080 --> 10:09:55,480
can allow us to apply filters to images that allow us to extract useful results
12231
10:09:55,480 --> 10:09:59,680
out of those images, taking an image and extracting its edges, for example.
12232
10:09:59,680 --> 10:10:01,760
And you might imagine many other filters that
12233
10:10:01,760 --> 10:10:05,200
could be applied to an image that are able to extract particular values as
12234
10:10:05,200 --> 10:10:05,700
well.
12235
10:10:05,700 --> 10:10:08,880
And a filter might have separate kernels for the red values, the green values,
12236
10:10:08,880 --> 10:10:11,400
and the blue values that are all summed together at the end,
12237
10:10:11,400 --> 10:10:14,000
such that you could have particular filters looking for,
12238
10:10:14,000 --> 10:10:15,720
is there red in this part of the image?
12239
10:10:15,720 --> 10:10:17,560
Is there green in other parts of the image?
12240
10:10:17,560 --> 10:10:20,880
You can begin to assemble these relevant and useful filters
12241
10:10:20,880 --> 10:10:24,400
that are able to do these calculations as well.
12242
10:10:24,400 --> 10:10:26,840
So that then was the idea of image convolution,
12243
10:10:26,840 --> 10:10:29,400
applying some sort of filter to an image to be
12244
10:10:29,400 --> 10:10:32,760
able to extract some useful features out of that image.
12245
10:10:32,760 --> 10:10:35,840
But all the while, these images are still pretty big.
12246
10:10:35,840 --> 10:10:38,000
There's a lot of pixels involved in the image.
12247
10:10:38,000 --> 10:10:40,560
And realistically speaking, if you've got a really big image,
12248
10:10:40,560 --> 10:10:42,200
that poses a couple of problems.
12249
10:10:42,200 --> 10:10:45,080
One, it means a lot of input going into the neural network.
12250
10:10:45,080 --> 10:10:48,280
But two, it also means that we really have
12251
10:10:48,280 --> 10:10:50,600
to care about what's in each particular pixel.
12252
10:10:50,600 --> 10:10:54,200
Whereas realistically, we often, if you're looking at an image,
12253
10:10:54,200 --> 10:10:58,120
you don't care whether something is in one particular pixel versus the pixel
12254
10:10:58,120 --> 10:10:59,400
immediately to the right of it.
12255
10:10:59,400 --> 10:11:01,000
They're pretty close together.
12256
10:11:01,000 --> 10:11:03,920
You really just care about whether there's a particular feature
12257
10:11:03,920 --> 10:11:05,720
in some region of the image.
12258
10:11:05,720 --> 10:11:09,480
And maybe you don't care about exactly which pixel it happens to be in.
12259
10:11:09,480 --> 10:11:11,960
And so there's a technique we can use known as pooling.
12260
10:11:11,960 --> 10:11:15,920
And what pooling is, is it means reducing the size of an input
12261
10:11:15,920 --> 10:11:18,600
by sampling from regions inside of the input.
12262
10:11:18,600 --> 10:11:22,160
So we're going to take a big image and turn it into a smaller image
12263
10:11:22,160 --> 10:11:23,160
by using pooling.
12264
10:11:23,160 --> 10:11:25,800
And in particular, one of the most popular types of pooling
12265
10:11:25,800 --> 10:11:27,160
is called max pooling.
12266
10:11:27,160 --> 10:11:29,760
And what max pooling does is it pools just
12267
10:11:29,760 --> 10:11:33,640
by choosing the maximum value in a particular region.
12268
10:11:33,640 --> 10:11:36,800
So for example, let's imagine I had this 4 by 4 image.
12269
10:11:36,800 --> 10:11:38,640
But I wanted to reduce its dimensions.
12270
10:11:38,640 --> 10:11:42,640
I wanted to make it a smaller image so that I have fewer inputs to work with.
12271
10:11:42,640 --> 10:11:47,120
Well, what I could do is I could apply a 2 by 2 max pool,
12272
10:11:47,120 --> 10:11:50,880
where the idea would be that I'm going to first look at this 2 by 2 region
12273
10:11:50,880 --> 10:11:53,240
and say, what is the maximum value in that region?
12274
10:11:53,240 --> 10:11:54,600
Well, it's the number 50.
12275
10:11:54,600 --> 10:11:57,080
So we'll go ahead and just use the number 50.
12276
10:11:57,080 --> 10:11:58,600
And then we'll look at this 2 by 2 region.
12277
10:11:58,600 --> 10:12:00,160
What is the maximum value here?
12278
10:12:00,160 --> 10:12:02,360
It's 110, so that's going to be my value.
12279
10:12:02,360 --> 10:12:04,680
Likewise here, the maximum value looks like 20.
12280
10:12:04,680 --> 10:12:05,960
Go ahead and put that there.
12281
10:12:05,960 --> 10:12:09,200
Then for this last region, the maximum value was 40.
12282
10:12:09,200 --> 10:12:10,800
So we'll go ahead and use that.
12283
10:12:10,800 --> 10:12:14,520
And what I have now is a smaller representation
12284
10:12:14,520 --> 10:12:17,520
of this same original image that I obtained just
12285
10:12:17,520 --> 10:12:21,960
by picking the maximum value from each of these regions.
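The 2 by 2 max pooling step can be sketched in plain Python; the grid values here are hypothetical, chosen so that the four regional maxima are the 50, 110, 20, and 40 from the example.

```python
def max_pool(grid, size=2):
    """Replace each size-by-size region of the grid with its maximum value."""
    return [[max(grid[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(grid[0]), size)]
            for i in range(0, len(grid), size)]

# A 4x4 image reduced to a 2x2 image by taking regional maxima.
image = [[10,  50,  30, 110],
         [20,  40,  60, 100],
         [ 5,  20,  15,  30],
         [10,  15,  40,  35]]
print(max_pool(image))  # [[50, 110], [20, 40]]
```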
12286
10:12:21,960 --> 10:12:25,160
So again, the advantages here are now I only
12287
10:12:25,160 --> 10:12:27,960
have to deal with a 2 by 2 input instead of a 4 by 4.
12288
10:12:27,960 --> 10:12:31,160
And you can imagine shrinking the size of an image even more.
12289
10:12:31,160 --> 10:12:36,120
But in addition to that, I'm now able to make my analysis
12290
10:12:36,120 --> 10:12:40,280
independent of whether a particular value was in this pixel or this pixel.
12291
10:12:40,280 --> 10:12:42,720
I don't care if the 50 was here or here.
12292
10:12:42,720 --> 10:12:45,200
As long as it was generally in this region,
12293
10:12:45,200 --> 10:12:47,240
I'll still get access to that value.
12294
10:12:47,240 --> 10:12:51,480
So it makes our algorithms a little bit more robust as well.
12295
10:12:51,480 --> 10:12:54,520
So that then is pooling, taking the size of the image,
12296
10:12:54,520 --> 10:12:58,040
reducing it a little bit by just sampling from particular regions
12297
10:12:58,040 --> 10:12:59,520
inside of the image.
12298
10:12:59,520 --> 10:13:03,320
And now we can put all of these ideas together, pooling, image convolution,
12299
10:13:03,320 --> 10:13:06,960
and neural networks all together into another type of neural network
12300
10:13:06,960 --> 10:13:10,920
called a convolutional neural network, or a CNN, which
12301
10:13:10,920 --> 10:13:14,400
is a neural network that uses this convolution step usually
12302
10:13:14,400 --> 10:13:18,080
in the context of analyzing an image, for example.
12303
10:13:18,080 --> 10:13:20,600
And so the way that a convolutional neural network works
12304
10:13:20,600 --> 10:13:24,440
is that we start with some sort of input image, some grid of pixels.
12305
10:13:24,440 --> 10:13:27,840
But rather than immediately put that into the neural network layers
12306
10:13:27,840 --> 10:13:31,160
that we've seen before, we'll start by applying a convolution step,
12307
10:13:31,160 --> 10:13:33,440
where the convolution step involves applying
12308
10:13:33,440 --> 10:13:36,680
some number of different image filters to our original image
12309
10:13:36,680 --> 10:13:40,160
in order to get what we call a feature map, the result of applying
12310
10:13:40,160 --> 10:13:41,920
some filter to an image.
12311
10:13:41,920 --> 10:13:45,120
And we could do this once, but in general, we'll do this multiple times,
12312
10:13:45,120 --> 10:13:48,480
getting a whole bunch of different feature maps, each of which
12313
10:13:48,480 --> 10:13:51,600
might extract some different relevant feature out of the image,
12314
10:13:51,600 --> 10:13:53,920
some different important characteristic of the image
12315
10:13:53,920 --> 10:13:56,600
that we might care about using in order to calculate
12316
10:13:56,600 --> 10:13:58,160
what the result should be.
12317
10:13:58,160 --> 10:14:01,040
And in the same way that when we train neural networks,
12318
10:14:01,040 --> 10:14:04,520
we can train neural networks to learn the weights between particular units
12319
10:14:04,520 --> 10:14:07,240
inside of the neural networks, we can also train neural networks
12320
10:14:07,240 --> 10:14:09,560
to learn what those filters should be, what
12321
10:14:09,560 --> 10:14:11,840
the values of the filters should be in order
12322
10:14:11,840 --> 10:14:15,840
to get the most useful, most relevant information out of the original image
12323
10:14:15,840 --> 10:14:18,800
just by figuring out what setting of those filter values,
12324
10:14:18,800 --> 10:14:23,000
the values inside of that kernel, results in minimizing the loss function,
12325
10:14:23,000 --> 10:14:26,520
minimizing how poorly our hypothesis actually
12326
10:14:26,520 --> 10:14:30,920
performs in figuring out the classification of a particular image,
12327
10:14:30,920 --> 10:14:32,080
for example.
12328
10:14:32,080 --> 10:14:34,800
So we first apply this convolution step, get a whole bunch
12329
10:14:34,800 --> 10:14:36,760
of these various different feature maps.
12330
10:14:36,760 --> 10:14:38,720
But these feature maps are quite large.
12331
10:14:38,720 --> 10:14:41,480
There's a lot of pixel values that happen to be here.
12332
10:14:41,480 --> 10:14:44,720
And so a logical next step to take is a pooling step,
12333
10:14:44,720 --> 10:14:48,040
where we reduce the size of these images by using max pooling,
12334
10:14:48,040 --> 10:14:51,840
for example, extracting the maximum value from any particular region.
12335
10:14:51,840 --> 10:14:53,720
There are other pooling methods that exist as well,
12336
10:14:53,720 --> 10:14:54,880
depending on the situation.
12337
10:14:54,880 --> 10:14:57,040
You could use something like average pooling,
12338
10:14:57,040 --> 10:14:59,480
where instead of taking the maximum value from a region,
12339
10:14:59,480 --> 10:15:03,240
you take the average value from a region, which has its uses as well.
12340
10:15:03,240 --> 10:15:07,280
But in effect, what pooling will do is it will take these feature maps
12341
10:15:07,280 --> 10:15:09,460
and reduce their dimensions so that we end up
12342
10:15:09,460 --> 10:15:12,080
with smaller grids with fewer pixels.
12343
10:15:12,080 --> 10:15:14,320
And this then is going to be easier for us to deal with.
12344
10:15:14,320 --> 10:15:16,960
It's going to mean fewer inputs that we have to worry about.
12345
10:15:16,960 --> 10:15:19,480
And it's also going to mean we're more resilient,
12346
10:15:19,480 --> 10:15:22,560
more robust against potential movements of particular values,
12347
10:15:22,560 --> 10:15:24,680
just by one pixel, when ultimately we really
12348
10:15:24,680 --> 10:15:27,520
don't care about those one-pixel differences that
12349
10:15:27,520 --> 10:15:30,160
might arise in the original image.
12350
10:15:30,160 --> 10:15:32,120
And now, after we've done this pooling step,
12351
10:15:32,120 --> 10:15:36,500
now we have a whole bunch of values that we can then flatten out and just put
12352
10:15:36,500 --> 10:15:38,560
into a more traditional neural network.
12353
10:15:38,560 --> 10:15:40,320
So we go ahead and flatten it, and then we
12354
10:15:40,320 --> 10:15:42,240
end up with a traditional neural network that
12355
10:15:42,240 --> 10:15:46,480
has one input for each of these values in each of these resulting feature
12356
10:15:46,480 --> 10:15:51,400
maps after we do the convolution and after we do the pooling step.
12357
10:15:51,400 --> 10:15:54,720
And so this then is the general structure of a convolutional network.
12358
10:15:54,720 --> 10:15:58,200
We begin with the image, apply convolution, apply pooling,
12359
10:15:58,200 --> 10:16:01,200
flatten the results, and then put that into a more traditional neural
12360
10:16:01,200 --> 10:16:03,440
network that might itself have hidden layers.
12361
10:16:03,440 --> 10:16:05,540
You can have deep convolutional networks that
12362
10:16:05,540 --> 10:16:09,760
have hidden layers in between this flattened layer and the eventual output
12363
10:16:09,760 --> 10:16:13,360
to be able to calculate various different features of those values.
12364
10:16:13,360 --> 10:16:17,360
But this then can help us to be able to use convolution and pooling
12365
10:16:17,360 --> 10:16:19,760
to use our knowledge about the structure of an image
12366
10:16:19,760 --> 10:16:23,280
to be able to get better results, to be able to train our networks faster
12367
10:16:23,280 --> 10:16:27,320
in order to better capture particular parts of the image.
12368
10:16:27,320 --> 10:16:30,640
And there's no reason necessarily why you can only use these steps once.
12369
10:16:30,640 --> 10:16:33,520
In fact, in practice, you'll often use convolution and pooling
12370
10:16:33,520 --> 10:16:36,440
multiple times in multiple different steps.
12371
10:16:36,440 --> 10:16:39,560
So what you might imagine doing is starting with an image,
12372
10:16:39,560 --> 10:16:42,240
first applying convolution to get a whole bunch of maps,
12373
10:16:42,240 --> 10:16:45,360
then applying pooling, then applying convolution again,
12374
10:16:45,360 --> 10:16:48,000
because these maps are still pretty big.
12375
10:16:48,000 --> 10:16:51,760
You can apply convolution to try and extract relevant features out
12376
10:16:51,760 --> 10:16:55,240
of this result. Then take those results, apply pooling
12377
10:16:55,240 --> 10:16:57,820
in order to reduce their dimensions, and then take that
12378
10:16:57,820 --> 10:17:01,280
and feed it into a neural network that maybe has fewer inputs.
12379
10:17:01,280 --> 10:17:04,040
So here I have two different convolution and pooling steps.
12380
10:17:04,040 --> 10:17:08,280
I do convolution and pooling once, and then I do convolution and pooling
12381
10:17:08,280 --> 10:17:11,400
a second time, each time extracting useful features
12382
10:17:11,400 --> 10:17:14,200
from the layer before it, each time using pooling
12383
10:17:14,200 --> 10:17:17,280
to reduce the dimensions of what you're ultimately looking at.
12384
10:17:17,280 --> 10:17:21,160
And the goal now of this sort of model is that in each of these steps,
12385
10:17:21,160 --> 10:17:25,400
you can begin to learn different types of features of the original image.
12386
10:17:25,400 --> 10:17:28,400
That maybe in the first step, you learn very low level features.
12387
10:17:28,400 --> 10:17:31,940
just looking for features like edges and curves and shapes,
12388
10:17:31,940 --> 10:17:36,000
because based on pixels and their neighboring values, you can figure out,
12389
10:17:36,000 --> 10:17:37,320
all right, what are the edges?
12390
10:17:37,320 --> 10:17:38,040
What are the curves?
12391
10:17:38,040 --> 10:17:41,000
What are the various different shapes that might be present there?
12392
10:17:41,000 --> 10:17:43,760
But then once you have a mapping that just represents
12393
10:17:43,760 --> 10:17:46,520
where the edges and curves and shapes happen to be,
12394
10:17:46,520 --> 10:17:49,160
you can imagine applying the same sort of process again
12395
10:17:49,160 --> 10:17:51,760
to begin to look for higher level features, look for objects,
12396
10:17:51,760 --> 10:17:55,320
maybe look for people's eyes in facial recognition, for example.
12397
10:17:55,320 --> 10:17:59,200
Maybe look for more complex shapes like the curves on a particular number
12398
10:17:59,200 --> 10:18:02,440
if you're trying to recognize a digit in a handwriting recognition sort
12399
10:18:02,440 --> 10:18:03,620
of scenario.
12400
10:18:03,620 --> 10:18:06,680
And then after all of that, now that you have these results that
12401
10:18:06,680 --> 10:18:08,760
represent these higher level features, you
12402
10:18:08,760 --> 10:18:12,240
can pass them into a neural network, which is really just a deep neural
12403
10:18:12,240 --> 10:18:14,680
network that looks like this, where you might imagine
12404
10:18:14,680 --> 10:18:18,360
making a binary classification or classifying into multiple categories
12405
10:18:18,360 --> 10:18:23,400
or performing various different tasks on this sort of model.
12406
10:18:23,400 --> 10:18:26,600
So convolutional neural networks can be quite powerful and quite popular
12407
10:18:26,600 --> 10:18:28,800
when it comes to trying to analyze images.
12408
10:18:28,800 --> 10:18:29,920
We don't strictly need them.
12409
10:18:29,920 --> 10:18:32,320
We could have just used a vanilla neural network
12410
10:18:32,320 --> 10:18:35,640
that just operates with layer after layer, as we've seen before.
12411
10:18:35,640 --> 10:18:38,240
But these convolutional neural networks can be quite helpful,
12412
10:18:38,240 --> 10:18:40,400
in particular, because of the way they model
12413
10:18:40,400 --> 10:18:43,280
the way a human might look at an image, that instead of a human looking
12414
10:18:43,280 --> 10:18:46,440
at every single pixel simultaneously and trying to convolve all of them
12415
10:18:46,440 --> 10:18:48,560
by multiplying them together, you might imagine
12416
10:18:48,560 --> 10:18:50,920
that what convolution is really doing is looking
12417
10:18:50,920 --> 10:18:53,120
at various different regions of the image
12418
10:18:53,120 --> 10:18:56,040
and extracting relevant information and features out
12419
10:18:56,040 --> 10:18:57,600
of those parts of the image, the same way
12420
10:18:57,600 --> 10:18:59,920
that a human might have visual receptors that
12421
10:18:59,920 --> 10:19:02,240
are looking at particular parts of what they see
12422
10:19:02,240 --> 10:19:04,720
and combining them to figure out
12423
10:19:04,720 --> 10:19:09,320
what meaning they can draw from all of those various different inputs.
12424
10:19:09,320 --> 10:19:11,840
And so you might imagine applying this to a situation
12425
10:19:11,840 --> 10:19:13,760
like handwriting recognition.
12426
10:19:13,760 --> 10:19:16,200
So we'll go ahead and see an example of that now,
12427
10:19:16,200 --> 10:19:19,160
where I'll go ahead and open up handwriting.py.
12428
10:19:19,160 --> 10:19:23,040
Again, what we do here is we first import TensorFlow.
12429
10:19:23,040 --> 10:19:26,680
And then TensorFlow, it turns out, has a few data sets
12430
10:19:26,680 --> 10:19:30,360
that are built into the library that you can just immediately access.
12431
10:19:30,360 --> 10:19:33,160
And one of the most famous data sets in machine learning
12432
10:19:33,160 --> 10:19:35,360
is the MNIST data set, which is just a data
12433
10:19:35,360 --> 10:19:38,560
set of a whole bunch of samples of people's handwritten digits.
12434
10:19:38,560 --> 10:19:41,200
I showed you a slide of that a little while ago.
12435
10:19:41,200 --> 10:19:43,720
And what we can do is just immediately access
12436
10:19:43,720 --> 10:19:45,880
that data set which is built into the library
12437
10:19:45,880 --> 10:19:47,760
so that if I want to do something like train
12438
10:19:47,760 --> 10:19:50,960
on a whole bunch of handwritten digits, I can just use the data set
12439
10:19:50,960 --> 10:19:52,040
that is provided to me.
12440
10:19:52,040 --> 10:19:55,400
Of course, if I had my own data set of handwritten images,
12441
10:19:55,400 --> 10:19:56,920
I can apply the same idea.
12442
10:19:56,920 --> 10:19:59,700
I'd first just need to take those images and turn them
12443
10:19:59,700 --> 10:20:02,640
into an array of pixels, because that's the way that these
12444
10:20:02,640 --> 10:20:03,380
are going to be formatted.
12445
10:20:03,380 --> 10:20:05,240
They're going to be formatted as, effectively,
12446
10:20:05,240 --> 10:20:08,320
an array of individual pixels.
12447
10:20:08,320 --> 10:20:10,560
Now there's a bit of reshaping I need to do,
12448
10:20:10,560 --> 10:20:12,520
just turning the data into a format that I
12449
10:20:12,520 --> 10:20:14,480
can put into my convolutional neural network.
12450
10:20:14,480 --> 10:20:17,400
So this is doing things like taking all the values
12451
10:20:17,400 --> 10:20:19,200
and dividing them by 255.
12452
10:20:19,200 --> 10:20:22,840
If you remember, these color values tend to range from 0 to 255.
12453
10:20:22,840 --> 10:20:25,200
So I can divide them by 255 just to put them
12454
10:20:25,200 --> 10:20:29,560
into 0 to 1 range, which might be a little bit easier to train on.
12455
10:20:29,560 --> 10:20:32,200
And then doing various other modifications to the data
12456
10:20:32,200 --> 10:20:34,560
just to get it into a nice usable format.
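The reshaping and scaling described here might look like the following, assuming NumPy; the random array is a stand-in for the real MNIST pixel data.

```python
import numpy as np

# Pretend we have 10 grayscale images with pixel values from 0 to 255.
x = np.random.randint(0, 256, size=(10, 28, 28))

# Scale into the 0-to-1 range and add the single channel dimension
# that a convolutional layer expects: (samples, 28, 28, 1).
x = x.reshape(-1, 28, 28, 1).astype("float32") / 255
print(x.shape)  # (10, 28, 28, 1)
```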
12457
10:20:34,560 --> 10:20:37,000
But here's the interesting and important part.
12458
10:20:37,000 --> 10:20:41,200
Here is where I create the convolutional neural network, the CNN,
12459
10:20:41,200 --> 10:20:44,160
where here I'm saying, go ahead and use a sequential model.
12460
10:20:44,160 --> 10:20:47,880
And whereas before I could use model.add to say add a layer, add a layer, add a layer,
12461
10:20:47,880 --> 10:20:50,840
another way I could define it is just by passing as input
12462
10:20:50,840 --> 10:20:55,920
to this sequential neural network a list of all of the layers that I want.
12463
10:20:55,920 --> 10:21:00,120
And so here, the very first layer in my model is a convolution layer,
12464
10:21:00,120 --> 10:21:03,360
where I'm first going to apply convolution to my image.
12465
10:21:03,360 --> 10:21:05,640
I'm going to use 32 different filters.
12466
10:21:05,640 --> 10:21:09,680
So my model is going to learn 32 different filters
12467
10:21:09,680 --> 10:21:13,360
that I would like to learn on the input image, where each filter is going
12468
10:21:13,360 --> 10:21:15,120
to be a 3 by 3 kernel.
12469
10:21:15,120 --> 10:21:17,400
So we saw those 3 by 3 kernels before, where
12470
10:21:17,400 --> 10:21:20,560
we could multiply each value in a 3 by 3 grid by a value,
12471
10:21:20,560 --> 10:21:22,800
multiply it, and add all the results together.
12472
10:21:22,800 --> 10:21:27,600
So here, I'm going to learn 32 different of these 3 by 3 filters.
12473
10:21:27,600 --> 10:21:29,920
I can, again, specify my activation function.
12474
10:21:29,920 --> 10:21:32,560
And I specify what my input shape is.
12475
10:21:32,560 --> 10:21:34,880
My input shape in the banknotes case was just 4.
12476
10:21:34,880 --> 10:21:36,400
I had 4 inputs.
12477
10:21:36,400 --> 10:21:40,280
My input shape here is going to be 28, 28, 1,
12478
10:21:40,280 --> 10:21:42,920
because for each of these handwritten digits,
12479
10:21:42,920 --> 10:21:46,320
it turns out that's how the MNIST data set organizes its data.
12480
10:21:46,320 --> 10:21:49,080
Each image is a 28 by 28 pixel grid.
12481
10:21:49,080 --> 10:21:51,360
So we're going to have a 28 by 28 pixel grid.
12482
10:21:51,360 --> 10:21:54,640
And each one of those images only has one channel value.
12483
10:21:54,640 --> 10:21:56,720
These handwritten digits are just black and white.
12484
10:21:56,720 --> 10:21:59,220
So there's just a single color value representing
12485
10:21:59,220 --> 10:22:00,800
how much black or how much white.
12486
10:22:00,800 --> 10:22:02,800
You might imagine that in a color image, if you
12487
10:22:02,800 --> 10:22:05,560
were doing this sort of thing, you might have three different channels,
12488
10:22:05,560 --> 10:22:07,840
a red, a green, and a blue channel, for example.
12489
10:22:07,840 --> 10:22:09,960
But in the case of just handwriting recognition,
12490
10:22:09,960 --> 10:22:12,960
recognizing a digit, we're just going to use a single value for,
12491
10:22:12,960 --> 10:22:14,880
like, shaded in or not shaded in.
12492
10:22:14,880 --> 10:22:18,440
And it might range, but it's just a single color value.
12493
10:22:18,440 --> 10:22:22,040
And that, then, is the very first layer of our neural network,
12494
10:22:22,040 --> 10:22:24,560
a convolutional layer that will take the input
12495
10:22:24,560 --> 10:22:26,400
and learn a whole bunch of different filters
12496
10:22:26,400 --> 10:22:30,920
that we can apply to the input to extract meaningful features.
12497
10:22:30,920 --> 10:22:34,360
Next step is going to be a max pooling layer, also built right
12498
10:22:34,360 --> 10:22:37,640
into TensorFlow, where this is going to be a layer that
12499
10:22:37,640 --> 10:22:40,400
is going to use a pool size of 2 by 2, meaning
12500
10:22:40,400 --> 10:22:43,080
we're going to look at 2 by 2 regions inside of the image
12501
10:22:43,080 --> 10:22:45,160
and just extract the maximum value.
12502
10:22:45,160 --> 10:22:47,320
Again, we've seen why this can be helpful.
12503
10:22:47,320 --> 10:22:49,920
It'll help to reduce the size of our input.
12504
10:22:49,920 --> 10:22:53,120
And once we've done that, we'll go ahead and flatten all of the units
12505
10:22:53,120 --> 10:22:55,480
just into a single layer that we can then
12506
10:22:55,480 --> 10:22:57,560
pass into the rest of the neural network.
12507
10:22:57,560 --> 10:23:00,200
And now, here's the rest of the neural network.
12508
10:23:00,200 --> 10:23:02,880
Here, I'm saying, let's add a hidden layer to my neural network
12509
10:23:02,880 --> 10:23:06,160
with 128 units, so a whole bunch of hidden units
12510
10:23:06,160 --> 10:23:07,840
inside of the hidden layer.
12511
10:23:07,840 --> 10:23:11,400
And just to prevent overfitting, I can add a dropout to that.
12512
10:23:11,400 --> 10:23:14,200
Say, you know what, when you're training, randomly drop out half
12513
10:23:14,200 --> 10:23:16,520
of the nodes from this hidden layer just to make sure
12514
10:23:16,520 --> 10:23:19,440
we don't become over-reliant on any particular node,
12515
10:23:19,440 --> 10:23:22,820
we begin to really generalize and stop ourselves from overfitting.
12516
10:23:22,820 --> 10:23:25,640
So TensorFlow allows us, just by adding a single line,
12517
10:23:25,640 --> 10:23:28,920
to add dropout into our model as well, such that when it's training,
12518
10:23:28,920 --> 10:23:31,360
it will perform this dropout step in order
12519
10:23:31,360 --> 10:23:36,000
to help make sure that we don't overfit on this particular data.
12520
10:23:36,000 --> 10:23:38,760
And then finally, I add an output layer.
12521
10:23:38,760 --> 10:23:42,840
The output layer is going to have 10 units, one for each category
12522
10:23:42,840 --> 10:23:45,640
that I would like to classify digits into, so 0 through 9,
12523
10:23:45,640 --> 10:23:47,560
10 different categories.
12524
10:23:47,560 --> 10:23:49,960
And the activation function I'm going to use here
12525
10:23:49,960 --> 10:23:52,880
is called the softmax activation function.
12526
10:23:52,880 --> 10:23:55,760
And in short, what the softmax activation function is going to do
12527
10:23:55,760 --> 10:23:57,760
is it's going to take the output and turn it
12528
10:23:57,760 --> 10:23:59,600
into a probability distribution.
12529
10:23:59,600 --> 10:24:01,600
So ultimately, it's going to tell me, what
12530
10:24:01,600 --> 10:24:03,620
did we estimate the probability is that this
12531
10:24:03,620 --> 10:24:06,180
is a 2 versus a 3 versus a 4.
12532
10:24:06,180 --> 10:24:10,320
And so it will turn it into that probability distribution for me.
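Softmax itself is simple enough to sketch directly; the three scores here are made-up raw outputs, not values from a real network.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalize so the values sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0, 0.5])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # sums to 1, up to floating-point rounding
```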
12533
10:24:10,320 --> 10:24:12,680
Next up, I'll go ahead and compile my model
12534
10:24:12,680 --> 10:24:15,680
and fit it on all of my training data.
12535
10:24:15,680 --> 10:24:19,760
And then I can evaluate how well the neural network performs.
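Assembling the layers just described, a sketch of the model using TensorFlow's Keras API might look like this; the compile settings are typical choices for this kind of task, not necessarily the exact ones in handwriting.py.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Convolution: learn 32 different 3x3 filters over the input image
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(28, 28, 1)),
    # Max pooling: shrink each feature map by taking 2x2 maxima
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the feature maps into a single layer of inputs
    tf.keras.layers.Flatten(),
    # Hidden layer, with dropout to guard against overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    # Output layer: one unit per digit, softmax for probabilities
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

From here, model.fit on the training data and model.evaluate on the test data complete the pipeline.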
12536
10:24:19,760 --> 10:24:21,800
And then I've added to my Python program,
12537
10:24:21,800 --> 10:24:24,560
if I've provided a command line argument like the name of a file,
12538
10:24:24,560 --> 10:24:27,440
I'm going to go ahead and save the model to a file.
12539
10:24:27,440 --> 10:24:29,040
And so this can be quite useful too.
12540
10:24:29,040 --> 10:24:31,960
Once you've done the training step, which could take some time,
12541
10:24:31,960 --> 10:24:34,400
going through the data,
12542
10:24:34,400 --> 10:24:38,240
running back propagation with gradient descent to be able to say, all right,
12543
10:24:38,240 --> 10:24:40,720
how should we adjust the weights of this particular model?
12544
10:24:40,720 --> 10:24:42,840
You end up calculating values for these weights,
12545
10:24:42,840 --> 10:24:44,880
calculating values for these filters.
12546
10:24:44,880 --> 10:24:47,720
You'd like to remember that information so you can use it later.
12547
10:24:47,720 --> 10:24:51,480
And so TensorFlow allows us to just save a model to a file,
12548
10:24:51,480 --> 10:24:53,880
such that later, if we want to use the model we've learned,
12549
10:24:53,880 --> 10:24:57,280
use the weights that we've learned to make some sort of new prediction,
12550
10:24:57,280 --> 10:25:00,800
we can just use the model that already exists.
12551
10:25:00,800 --> 10:25:03,800
So what we're doing here is after we've done all the calculation,
12552
10:25:03,800 --> 10:25:07,320
we go ahead and save the model to a file, such
12553
10:25:07,320 --> 10:25:09,480
that we can use it a little bit later.
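In TensorFlow this is a single call, model.save(filename), with tf.keras.models.load_model(filename) to restore it later. As a minimal sketch of the underlying idea, using Python's pickle instead of TensorFlow's format (the weight and filter values are hypothetical):

```python
import os
import pickle
import tempfile

# Suppose training produced these weights and filters (made-up values).
model = {
    "weights": [[0.2, -0.5], [0.7, 0.1]],
    "filters": [[[1, 0], [0, -1]]],
}

# Save the learned values after training...
path = os.path.join(tempfile.gettempdir(), "handwriting_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and later, reload them instead of retraining from scratch.
with open(path, "rb") as f:
    restored = pickle.load(f)
```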
12554
10:25:09,480 --> 10:25:17,240
So for example, if I go into digits, I'm going to run handwriting.py.
12555
10:25:17,240 --> 10:25:18,200
I won't save it this time.
12556
10:25:18,200 --> 10:25:20,440
We'll just run it and go ahead and see what happens.
12557
10:25:20,440 --> 10:25:22,880
What will happen is we need to go through the model in order
12558
10:25:22,880 --> 10:25:26,120
to train on all of these samples of handwritten digits.
12559
10:25:26,120 --> 10:25:28,760
The MNIST data set gives us thousands and thousands
12560
10:25:28,760 --> 10:25:31,320
of sample handwritten digits in the same format
12561
10:25:31,320 --> 10:25:33,080
that we can use in order to train.
12562
10:25:33,080 --> 10:25:35,640
And so now what you're seeing is this training process.
12563
10:25:35,640 --> 10:25:39,320
And unlike the banknotes case, where there were far fewer data points
12564
10:25:39,320 --> 10:25:42,280
and the data was much simpler, here the data is more complex
12565
10:25:42,280 --> 10:25:44,280
and this training process takes time.
12566
10:25:44,280 --> 10:25:48,920
And so this is another case that shows why, when training neural networks,
12567
10:25:48,920 --> 10:25:52,280
computational power is so important: oftentimes you
12568
10:25:52,280 --> 10:25:55,440
see people wanting to use sophisticated GPUs in order
12569
10:25:55,440 --> 10:25:59,320
to do this sort of neural network training more efficiently.
12570
10:25:59,320 --> 10:26:02,120
It also speaks to the reason why more data can be helpful.
12571
10:26:02,120 --> 10:26:04,560
The more sample data points you have, the better
12572
10:26:04,560 --> 10:26:06,280
you can begin to do this training.
12573
10:26:06,280 --> 10:26:10,680
So here we're going through 60,000 different samples of handwritten digits.
12574
10:26:10,680 --> 10:26:13,120
And I said we're going to go through them 10 times.
12575
10:26:13,120 --> 10:26:16,040
We're going to go through the data set 10 times, training each time,
12576
10:26:16,040 --> 10:26:18,640
hopefully improving upon our weights with every time
12577
10:26:18,640 --> 10:26:20,080
we run through this data set.
12578
10:26:20,080 --> 10:26:23,480
And we can see over here on the right what the accuracy is each time
12579
10:26:23,480 --> 10:26:26,200
we go ahead and run this model, that the first time it
12580
10:26:26,200 --> 10:26:29,600
looks like we got an accuracy of about 92% of the digits
12581
10:26:29,600 --> 10:26:31,600
correct based on this training set.
12582
10:26:31,600 --> 10:26:34,840
We increased that to 96% or 97%.
12583
10:26:34,840 --> 10:26:38,400
And every time we run this, we're going to see hopefully the accuracy
12584
10:26:38,400 --> 10:26:41,520
improve as we continue to try and use that gradient descent,
12585
10:26:41,520 --> 10:26:43,720
that process of trying to run the algorithm,
12586
10:26:43,720 --> 10:26:46,960
to minimize the loss that we get in order to more accurately
12587
10:26:46,960 --> 10:26:49,120
predict what the output should be.
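A toy sketch of that repeated gradient descent process, fitting a single weight over 10 passes through the data (the data set, learning rate, and one-weight model are made up, just to show the loss falling each epoch the way the accuracy climbs here):

```python
# Toy version of the training loop: fit y = w * x by gradient descent,
# making 10 passes (epochs) over the data, like model.fit(..., epochs=10).
data = [(x, 3.0 * x) for x in range(1, 6)]   # true weight is 3.0
w, lr = 0.0, 0.01
losses = []
for epoch in range(10):
    total_loss = 0.0
    for x, y in data:
        pred = w * x
        total_loss += (pred - y) ** 2
        w -= lr * 2 * (pred - y) * x         # gradient of the squared error
    losses.append(total_loss)
# Each pass through the data set shrinks the loss, the same way the
# accuracy in the lecture improved with every run through the data.
```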
12588
10:26:49,120 --> 10:26:52,360
And what this process is doing is it's learning not only the weights,
12589
10:26:52,360 --> 10:26:55,320
but it's learning the features to use, the kernel matrix
12590
10:26:55,320 --> 10:26:57,560
to use when performing that convolution step.
12591
10:26:57,560 --> 10:26:59,800
Because this is a convolutional neural network,
12592
10:26:59,800 --> 10:27:02,080
where I'm first performing those convolutions
12593
10:27:02,080 --> 10:27:05,400
and then doing the more traditional neural network structure,
12594
10:27:05,400 --> 10:27:09,280
this is going to learn all of those individual steps as well.
12595
10:27:09,280 --> 10:27:12,720
And so here we see that TensorFlow provides me with some very nice output,
12596
10:27:12,720 --> 10:27:15,560
telling me how many seconds are left in each of these training
12597
10:27:15,560 --> 10:27:18,880
runs, which allows me to see just how well we're doing.
12598
10:27:18,880 --> 10:27:21,240
So we'll go ahead and see how this network performs.
12599
10:27:21,240 --> 10:27:23,760
It looks like we've gone through the data set seven times.
12600
10:27:23,760 --> 10:27:26,560
We're going through it an eighth time now.
12601
10:27:26,560 --> 10:27:28,560
And at this point, the accuracy is pretty high.
12602
10:27:28,560 --> 10:27:32,200
We saw we went from 92% up to 97%.
12603
10:27:32,200 --> 10:27:33,760
Now it looks like 98%.
12604
10:27:33,760 --> 10:27:36,440
And at this point, it seems like things are starting to level out.
12605
10:27:36,440 --> 10:27:39,360
There's probably a limit to how accurate we can ultimately be
12606
10:27:39,360 --> 10:27:41,280
without running the risk of overfitting.
12607
10:27:41,280 --> 10:27:42,600
Of course, with enough nodes, you would just
12608
10:27:42,600 --> 10:27:44,880
memorize the inputs and overfit on them.
12609
10:27:44,880 --> 10:27:46,160
But we'd like to avoid doing that.
12610
10:27:46,160 --> 10:27:48,560
And Dropout will help us with this.
12611
10:27:48,560 --> 10:27:53,920
But now we see we're almost done finishing our training step.
12612
10:27:53,920 --> 10:27:55,520
We're at 55,000.
12613
10:27:55,520 --> 10:27:56,920
All right, we finished training.
12614
10:27:56,920 --> 10:28:00,200
And now it's going to go ahead and test for us on 10,000 samples.
12615
10:28:00,200 --> 10:28:04,880
And it looks like on the testing set, we were at 98.8% accurate.
12616
10:28:04,880 --> 10:28:06,880
So we ended up doing pretty well, it seems,
12617
10:28:06,880 --> 10:28:10,280
on this testing set to see how accurately can we
12618
10:28:10,280 --> 10:28:13,320
predict these handwritten digits.
12619
10:28:13,320 --> 10:28:15,840
And so what we could do then is actually test it out.
12620
10:28:15,840 --> 10:28:19,720
I've written a program called Recognition.py using PyGame.
12621
10:28:19,720 --> 10:28:21,560
If you pass it a model that's been trained,
12622
10:28:21,560 --> 10:28:26,120
and I pre-trained an example model using this input data, what we can do
12623
10:28:26,120 --> 10:28:27,960
is see whether or not we've been able to train
12624
10:28:27,960 --> 10:28:31,720
this convolutional neural network to be able to predict handwriting,
12625
10:28:31,720 --> 10:28:32,360
for example.
12626
10:28:32,360 --> 10:28:35,320
So I can try, just like drawing a handwritten digit.
12627
10:28:35,320 --> 10:28:39,400
I'll go ahead and draw the number 2, for example.
12628
10:28:39,400 --> 10:28:40,560
So there's my number 2.
12629
10:28:40,560 --> 10:28:41,440
Again, this is messy.
12630
10:28:41,440 --> 10:28:44,320
If you tried to imagine, how would you write a program with just ifs
12631
10:28:44,320 --> 10:28:46,640
and thens to be able to do this sort of calculation,
12632
10:28:46,640 --> 10:28:48,120
it would be tricky to do so.
12633
10:28:48,120 --> 10:28:50,080
But here I'll press Classify, and all right,
12634
10:28:50,080 --> 10:28:53,600
it seems I was able to correctly classify that what I drew was the number 2.
12635
10:28:53,600 --> 10:28:55,320
I'll go ahead and reset it, try it again.
12636
10:28:55,320 --> 10:28:57,880
We'll draw an 8, for example.
12637
10:28:57,880 --> 10:29:00,480
So here is an 8.
12638
10:29:00,480 --> 10:29:01,640
Press Classify.
12639
10:29:01,640 --> 10:29:05,080
And all right, it predicts that the digit that I drew was an 8.
12640
10:29:05,080 --> 10:29:08,080
And the key here is this really begins to show the power of what
12641
10:29:08,080 --> 10:29:09,920
the neural network is doing, somehow looking
12642
10:29:09,920 --> 10:29:12,440
at various different features of these different pixels,
12643
10:29:12,440 --> 10:29:14,840
figuring out what the relevant features are,
12644
10:29:14,840 --> 10:29:17,600
and figuring out how to combine them to get a classification.
12645
10:29:17,600 --> 10:29:21,600
And this would be a difficult task to provide explicit instructions
12646
10:29:21,600 --> 10:29:24,840
to the computer on how to do, to use a whole bunch of ifs and thens
12647
10:29:24,840 --> 10:29:27,480
to process all these pixel values to figure out
12648
10:29:27,480 --> 10:29:28,920
what the handwritten digit is.
12649
10:29:28,920 --> 10:29:31,360
Everyone's going to draw their 8s a little bit differently.
12650
10:29:31,360 --> 10:29:33,920
If I drew the 8 again, it would look a little bit different.
12651
10:29:33,920 --> 10:29:37,800
And yet, ideally, we want to train a network to be robust enough
12652
10:29:37,800 --> 10:29:40,600
so that it begins to learn these patterns on its own.
12653
10:29:40,600 --> 10:29:43,200
All I said was, here is the structure of the network,
12654
10:29:43,200 --> 10:29:45,880
and here is the data on which to train the network.
12655
10:29:45,880 --> 10:29:47,880
And the network learning algorithm just tries
12656
10:29:47,880 --> 10:29:50,320
to figure out what is the optimal set of weights, what
12657
10:29:50,320 --> 10:29:52,800
is the optimal set of filters to use in order
12658
10:29:52,800 --> 10:29:57,280
to be able to accurately classify a digit into one category or another.
12659
10:29:57,280 --> 10:30:00,680
This just goes to show the power of these sorts of convolutional neural
12660
10:30:00,680 --> 10:30:02,280
networks.
12661
10:30:02,280 --> 10:30:06,560
And so that then was a look at how we can use convolutional neural networks
12662
10:30:06,560 --> 10:30:10,640
to begin to solve problems with regards to computer vision,
12663
10:30:10,640 --> 10:30:13,600
the ability to take an image and begin to analyze it.
12664
10:30:13,600 --> 10:30:15,920
So this is the type of analysis you might imagine
12665
10:30:15,920 --> 10:30:18,000
that's happening in self-driving cars that
12666
10:30:18,000 --> 10:30:21,000
are able to figure out what filters to apply to an image
12667
10:30:21,000 --> 10:30:24,040
to understand what it is that the computer is looking at,
12668
10:30:24,040 --> 10:30:26,160
or the same type of idea that might be applied
12669
10:30:26,160 --> 10:30:28,240
to facial recognition and social media to be
12670
10:30:28,240 --> 10:30:31,840
able to determine how to recognize faces in an image as well.
12671
10:30:31,840 --> 10:30:34,440
You can imagine a neural network that instead of classifying
12672
10:30:34,440 --> 10:30:38,280
into one of 10 different digits could instead classify like,
12673
10:30:38,280 --> 10:30:40,880
is this person A or is this person B, trying
12674
10:30:40,880 --> 10:30:45,000
to tell those people apart just based on convolution.
12675
10:30:45,000 --> 10:30:48,160
And so now what we'll take a look at is yet another type of neural network
12676
10:30:48,160 --> 10:30:50,520
that can be quite popular for certain types of tasks.
12677
10:30:50,520 --> 10:30:54,400
But to do so, we'll try to generalize and think about our neural network
12678
10:30:54,400 --> 10:30:55,760
a little bit more abstractly.
12679
10:30:55,760 --> 10:30:58,200
That here we have a sample deep neural network
12680
10:30:58,200 --> 10:31:01,400
where we have this input layer, a whole bunch of different hidden layers
12681
10:31:01,400 --> 10:31:04,080
that are performing certain types of calculations,
12682
10:31:04,080 --> 10:31:07,360
and then an output layer here that just generates some sort of output
12683
10:31:07,360 --> 10:31:09,600
that we care about calculating.
12684
10:31:09,600 --> 10:31:14,000
But we could imagine representing this a little more simply like this.
12685
10:31:14,000 --> 10:31:17,360
Here is just a more abstract representation of our neural network.
12686
10:31:17,360 --> 10:31:20,040
We have some input that might be like a vector
12687
10:31:20,040 --> 10:31:22,360
of a whole bunch of different values as our input.
12688
10:31:22,360 --> 10:31:25,640
That gets passed into a network that performs some sort of calculation
12689
10:31:25,640 --> 10:31:29,600
or computation, and that network produces some sort of output.
12690
10:31:29,600 --> 10:31:31,280
That output might be a single value.
12691
10:31:31,280 --> 10:31:33,120
It might be a whole bunch of different values.
12692
10:31:33,120 --> 10:31:36,040
But this is the general structure of the neural network that we've seen.
12693
10:31:36,040 --> 10:31:39,520
There is some sort of input that gets fed into the network.
12694
10:31:39,520 --> 10:31:43,440
And using that input, the network calculates what the output should be.
12695
10:31:43,440 --> 10:31:46,000
And this sort of model for a neural network
12696
10:31:46,000 --> 10:31:49,040
is what we might call a feed-forward neural network.
12697
10:31:49,040 --> 10:31:52,920
Feed-forward neural networks have connections only in one direction.
12698
10:31:52,920 --> 10:31:56,480
They move from one layer to the next layer to the layer after that,
12699
10:31:56,480 --> 10:31:59,800
such that the inputs pass through various different hidden layers
12700
10:31:59,800 --> 10:32:02,840
and then ultimately produce some sort of output.
12701
10:32:02,840 --> 10:32:05,760
So feed-forward neural networks were very helpful
12702
10:32:05,760 --> 10:32:08,640
for solving these types of classification problems that we saw before.
12703
10:32:08,640 --> 10:32:10,040
We have a whole bunch of input.
12704
10:32:10,040 --> 10:32:12,120
We want to learn what setting of weights will allow us
12705
10:32:12,120 --> 10:32:14,040
to calculate the output effectively.
12706
10:32:14,040 --> 10:32:16,560
But there are some limitations on feed-forward neural networks
12707
10:32:16,560 --> 10:32:17,680
that we'll see in a moment.
12708
10:32:17,680 --> 10:32:20,640
In particular, the input needs to be of a fixed shape,
12709
10:32:20,640 --> 10:32:23,200
like a fixed number of neurons are in the input layer.
12710
10:32:23,200 --> 10:32:24,920
And there's a fixed shape for the output,
12711
10:32:24,920 --> 10:32:28,040
like a fixed number of neurons in the output layer.
12712
10:32:28,040 --> 10:32:30,640
And that has some limitations of its own.
12713
10:32:30,640 --> 10:32:33,360
And a possible solution to this, and we'll
12714
10:32:33,360 --> 10:32:36,440
see examples of the types of problems we can solve for this in just a second,
12715
10:32:36,440 --> 10:32:38,480
is instead of just a feed-forward neural network,
12716
10:32:38,480 --> 10:32:41,840
where there are only connections in one direction from left to right
12717
10:32:41,840 --> 10:32:46,000
effectively across the network, we could also imagine a recurrent neural
12718
10:32:46,000 --> 10:32:48,720
network, one that generates
12719
10:32:48,720 --> 10:32:54,840
output that gets fed back into itself as input for future runs of that network.
12720
10:32:54,840 --> 10:32:57,080
So whereas in a traditional neural network,
12721
10:32:57,080 --> 10:33:00,920
we have inputs that get fed into the network, which then produces the output.
12722
10:33:00,920 --> 10:33:02,840
And the only thing that determines the output
12723
10:33:02,840 --> 10:33:05,560
is the original input and the calculation
12724
10:33:05,560 --> 10:33:08,040
we do inside of the network itself.
12725
10:33:08,040 --> 10:33:11,040
This goes in contrast with a recurrent neural network,
12726
10:33:11,040 --> 10:33:14,680
where in a recurrent neural network, you can imagine output from the network
12727
10:33:14,680 --> 10:33:18,320
feeding back to itself into the network again as input
12728
10:33:18,320 --> 10:33:22,360
for the next time you do the calculations inside of the network.
12729
10:33:22,360 --> 10:33:27,160
What this allows is for the network to maintain some sort of state,
12730
10:33:27,160 --> 10:33:33,080
to store some sort of information that can be used on future runs of the network.
12731
10:33:33,080 --> 10:33:35,440
Previously, the network just defined some weights,
12732
10:33:35,440 --> 10:33:38,280
and we passed inputs through the network, and it generated outputs.
12733
10:33:38,280 --> 10:33:42,000
But the network wasn't saving any information based on those inputs
12734
10:33:42,000 --> 10:33:45,400
to be able to remember for future iterations or for future runs.
12735
10:33:45,400 --> 10:33:47,320
What a recurrent neural network will let us do
12736
10:33:47,320 --> 10:33:51,040
is let the network store information that gets passed back in as input
12737
10:33:51,040 --> 10:33:55,560
to the network again the next time we try and perform some sort of action.
12738
10:33:55,560 --> 10:34:00,160
And this is particularly helpful when dealing with sequences of data.
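A minimal sketch of that idea, with a single-number state and hypothetical scalar weights standing in for a real network:

```python
import math

def rnn_step(x, hidden, w_x=0.5, w_h=0.8, bias=0.0):
    """One run of a tiny recurrent 'network': combine the new input
    with the state remembered from the previous run, squash with tanh,
    and return the new state (the scalar weights are hypothetical)."""
    return math.tanh(w_x * x + w_h * hidden + bias)

# Feed a sequence in one element at a time; the state persists
# between runs of the same network.
hidden = 0.0
for x in [1.0, 0.5, -1.0]:
    hidden = rnn_step(x, hidden)
# `hidden` now summarizes the whole sequence, not just the last input.
```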
12739
10:34:00,160 --> 10:34:02,880
So we'll see a real world example of this right now, actually.
12740
10:34:02,880 --> 10:34:07,160
Microsoft has developed an AI known as the caption bot.
12741
10:34:07,160 --> 10:34:09,440
And what the caption bot does is it says,
12742
10:34:09,440 --> 10:34:11,760
I can understand the content of any photograph,
12743
10:34:11,760 --> 10:34:13,960
and I'll try to describe it as well as any human.
12744
10:34:13,960 --> 10:34:16,280
I'll analyze your photo, but I won't store it or share it.
12745
10:34:16,280 --> 10:34:19,360
And so what Microsoft's caption bot seems to be claiming to do
12746
10:34:19,360 --> 10:34:22,880
is it can take an image and figure out what's in the image
12747
10:34:22,880 --> 10:34:25,600
and just give us a caption to describe it.
12748
10:34:25,600 --> 10:34:26,760
So let's try it out.
12749
10:34:26,760 --> 10:34:29,640
Here, for example, is an image of Harvard Square.
12750
10:34:29,640 --> 10:34:32,640
It's some people walking in front of one of the buildings at Harvard Square.
12751
10:34:32,640 --> 10:34:34,960
I'll go ahead and take the URL for that image,
12752
10:34:34,960 --> 10:34:39,000
and I'll paste it into caption bot and just press Go.
12753
10:34:39,000 --> 10:34:41,560
So caption bot is analyzing the image, and then it
12754
10:34:41,560 --> 10:34:44,760
says, I think it's a group of people walking
12755
10:34:44,760 --> 10:34:46,800
in front of a building, which seems amazing.
12756
10:34:46,800 --> 10:34:50,720
The AI is able to look at this image and figure out what's in the image.
12757
10:34:50,720 --> 10:34:52,680
And the important thing to recognize here
12758
10:34:52,680 --> 10:34:55,160
is that this is no longer just a classification task.
12759
10:34:55,160 --> 10:34:58,600
We saw being able to classify images with a convolutional neural network
12760
10:34:58,600 --> 10:35:01,800
where the job was take the image and then figure out,
12761
10:35:01,800 --> 10:35:05,920
is it a 0 or a 1 or a 2, or is it this person's face or that person's face?
12762
10:35:05,920 --> 10:35:09,320
What seems to be happening here is the input is an image,
12763
10:35:09,320 --> 10:35:12,440
and we know how to get networks to take input of images,
12764
10:35:12,440 --> 10:35:14,520
but the output is text.
12765
10:35:14,520 --> 10:35:15,240
It's a sentence.
12766
10:35:15,240 --> 10:35:19,640
It's a phrase, like a group of people walking in front of a building.
12767
10:35:19,640 --> 10:35:23,320
And this would seem to pose a challenge for our more traditional feed-forward
12768
10:35:23,320 --> 10:35:28,360
neural networks, for the reason being that in traditional neural networks,
12769
10:35:28,360 --> 10:35:31,840
we just have a fixed-size input and a fixed-size output.
12770
10:35:31,840 --> 10:35:35,160
There are a certain number of neurons in the input to our neural network
12771
10:35:35,160 --> 10:35:37,720
and a certain number of outputs for our neural network,
12772
10:35:37,720 --> 10:35:39,920
and then some calculation that goes on in between.
12773
10:35:39,920 --> 10:35:42,560
But the size of the inputs and the number of values in the input
12774
10:35:42,560 --> 10:35:44,440
and the number of values in the output, those
12775
10:35:44,440 --> 10:35:49,120
are always going to be fixed based on the structure of the neural network.
12776
10:35:49,120 --> 10:35:52,200
And that makes it difficult to imagine how a neural network could take an image
12777
10:35:52,200 --> 10:35:56,080
like this and say it's a group of people walking in front of the building
12778
10:35:56,080 --> 10:36:00,760
because the output is text, like it's a sequence of words.
12779
10:36:00,760 --> 10:36:02,800
Now, it might be possible for a neural network
12780
10:36:02,800 --> 10:36:06,880
to output one word, one word you could represent as a vector of values,
12781
10:36:06,880 --> 10:36:08,600
and you can imagine ways of doing that.
12782
10:36:08,600 --> 10:36:10,520
Next time, we'll talk a little bit more about AI
12783
10:36:10,520 --> 10:36:13,160
as it relates to language and language processing.
12784
10:36:13,160 --> 10:36:15,440
But a sequence of words is much more challenging
12785
10:36:15,440 --> 10:36:18,320
because depending on the image, you might imagine the output
12786
10:36:18,320 --> 10:36:19,800
is a different number of words.
12787
10:36:19,800 --> 10:36:22,400
We could have sequences of different lengths,
12788
10:36:22,400 --> 10:36:26,640
and somehow we still want to be able to generate the appropriate output.
12789
10:36:26,640 --> 10:36:30,560
And so the strategy here is to use a recurrent neural network,
12790
10:36:30,560 --> 10:36:34,080
a neural network that can feed its own output back into itself
12791
10:36:34,080 --> 10:36:36,200
as input for the next time.
12792
10:36:36,200 --> 10:36:40,400
And this allows us to do what we call a one-to-many relationship
12793
10:36:40,400 --> 10:36:43,960
for inputs to outputs, that in vanilla, more traditional neural networks,
12794
10:36:43,960 --> 10:36:47,080
these are what we might consider to be one-to-one neural networks.
12795
10:36:47,080 --> 10:36:49,800
You pass in one set of values as input.
12796
10:36:49,800 --> 10:36:53,240
You get one vector of values as the output.
12797
10:36:53,240 --> 10:36:56,960
But in this case, we want to pass in one value as input, the image,
12798
10:36:56,960 --> 10:36:59,560
and we want to get a sequence, many values as output,
12799
10:36:59,560 --> 10:37:02,400
where each value is like one of these words that
12800
10:37:02,400 --> 10:37:05,640
gets produced by this particular algorithm.
12801
10:37:05,640 --> 10:37:08,200
And so the way we might do this is we might imagine starting
12802
10:37:08,200 --> 10:37:11,400
by providing input, the image, into our neural network.
12803
10:37:11,400 --> 10:37:13,560
And the neural network is going to generate output,
12804
10:37:13,560 --> 10:37:16,000
but the output is not going to be the whole sequence of words,
12805
10:37:16,000 --> 10:37:18,320
because we can't represent the whole sequence of words
12806
10:37:18,320 --> 10:37:20,920
using just a fixed set of neurons.
12807
10:37:20,920 --> 10:37:24,360
Instead, the output is just going to be the first word.
12808
10:37:24,360 --> 10:37:27,800
We're going to train the network to output what the first word of the caption
12809
10:37:27,800 --> 10:37:28,080
should be.
12810
10:37:28,080 --> 10:37:30,320
And you could imagine that Microsoft has trained this
12811
10:37:30,320 --> 10:37:33,320
by running a whole bunch of training samples through the AI,
12812
10:37:33,320 --> 10:37:36,680
giving it a whole bunch of pictures and what the appropriate caption was,
12813
10:37:36,680 --> 10:37:39,800
and having the AI begin to learn from that.
12814
10:37:39,800 --> 10:37:42,080
But now, because the network generates output
12815
10:37:42,080 --> 10:37:44,280
that can be fed back into itself, you could
12816
10:37:44,280 --> 10:37:47,800
imagine the output of the network being fed back into the same network.
12817
10:37:47,800 --> 10:37:50,160
This here looks like a separate network, but it's really
12818
10:37:50,160 --> 10:37:53,440
the same network that's just getting different input,
12819
10:37:53,440 --> 10:37:57,640
that this network's output gets fed back into itself,
12820
10:37:57,640 --> 10:37:59,680
but it's going to generate another output.
12821
10:37:59,680 --> 10:38:04,160
And that other output is going to be the second word in the caption.
12822
10:38:04,160 --> 10:38:06,520
And this recurrent neural network then, this network
12823
10:38:06,520 --> 10:38:09,720
is going to generate other output that can be fed back into itself
12824
10:38:09,720 --> 10:38:12,200
to generate yet another word, fed back into itself
12825
10:38:12,200 --> 10:38:13,680
to generate another word.
12826
10:38:13,680 --> 10:38:18,200
And so recurrent neural networks allow us to represent this one-to-many
12827
10:38:18,200 --> 10:38:18,880
structure.
12828
10:38:18,880 --> 10:38:21,680
You provide one image as input, and the neural network
12829
10:38:21,680 --> 10:38:25,800
can pass data into the next run of the network, and then again and again,
12830
10:38:25,800 --> 10:38:28,240
such that you could run the network multiple times,
12831
10:38:28,240 --> 10:38:33,960
each time generating a different output still based on that original input.
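A rough sketch of that one-to-many loop; a lookup table stands in for the trained network, and the vocabulary and stop token are entirely hypothetical:

```python
# One input (the image encoding) in, a sequence of words out, produced
# by calling the SAME function again and again, each time feeding it
# the output it handed back last time.
next_word = {
    "<image>": "a",
    "a": "group",
    "group": "of",
    "of": "people",
    "people": "<stop>",
}

def caption(image_encoding, max_words=10):
    words, state = [], image_encoding
    for _ in range(max_words):
        output = next_word[state]   # one run of the "network"
        if output == "<stop>":
            break
        words.append(output)
        state = output              # output fed back in as input
    return " ".join(words)
```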
12832
10:38:33,960 --> 10:38:37,320
And this is where recurrent neural networks become particularly useful
12833
10:38:37,320 --> 10:38:40,040
when dealing with sequences of inputs or outputs.
12834
10:38:40,040 --> 10:38:43,360
Here, my output is a sequence of words, and since I can't very easily
12835
10:38:43,360 --> 10:38:45,960
represent outputting an entire sequence of words,
12836
10:38:45,960 --> 10:38:49,160
I'll instead output that sequence one word at a time
12837
10:38:49,160 --> 10:38:52,680
by allowing my network to pass information about what still
12838
10:38:52,680 --> 10:38:56,840
needs to be said about the photo into the next stage of running the network.
12839
10:38:56,840 --> 10:38:59,480
So you could run the network multiple times, the same network
12840
10:38:59,480 --> 10:39:02,960
with the same weights, just getting different input each time.
12841
10:39:02,960 --> 10:39:06,440
First, getting input from the image, and then getting input from the network
12842
10:39:06,440 --> 10:39:09,880
itself as additional information about what additionally
12843
10:39:09,880 --> 10:39:13,920
needs to be given in a particular caption, for example.
12844
10:39:13,920 --> 10:39:17,400
So this then is a one-to-many relationship inside of a recurrent neural
12845
10:39:17,400 --> 10:39:20,440
network, but it turns out there are other models that we can use,
12846
10:39:20,440 --> 10:39:23,320
other ways we can try and use recurrent neural networks
12847
10:39:23,320 --> 10:39:26,760
to be able to represent data that might be stored in other forms as well.
12848
10:39:26,760 --> 10:39:29,880
We saw how we could use neural networks in order to analyze images
12849
10:39:29,880 --> 10:39:33,200
in the context of convolutional neural networks that take an image,
12850
10:39:33,200 --> 10:39:35,280
figure out various different properties of the image,
12851
10:39:35,280 --> 10:39:38,760
and are able to draw some sort of conclusion based on that.
12852
10:39:38,760 --> 10:39:40,960
But you might imagine that something like YouTube,
12853
10:39:40,960 --> 10:39:44,080
they need to be able to do a lot of learning based on video.
12854
10:39:44,080 --> 10:39:46,920
They need to look through videos to detect if they're like copyright
12855
10:39:46,920 --> 10:39:50,160
violations, or they need to be able to look through videos to maybe identify
12856
10:39:50,160 --> 10:39:53,680
what particular items are inside of the video, for example.
12857
10:39:53,680 --> 10:39:56,680
And video, you might imagine, is much more difficult to put in
12858
10:39:56,680 --> 10:40:00,200
as input to a neural network, because whereas with an image, you could just
12859
10:40:00,200 --> 10:40:03,680
treat each pixel as a different value, videos are sequences.
12860
10:40:03,680 --> 10:40:07,760
They're sequences of images, and each sequence might be of different length.
12861
10:40:07,760 --> 10:40:10,720
And so it might be challenging to represent that entire video
12862
10:40:10,720 --> 10:40:15,320
as a single vector of values that you could pass in to a neural network.
12863
10:40:15,320 --> 10:40:17,600
And so here, too, recurrent neural networks
12864
10:40:17,600 --> 10:40:21,320
can be a valuable solution for trying to solve this type of problem.
12865
10:40:21,320 --> 10:40:25,320
Then instead of just passing in a single input into our neural network,
12866
10:40:25,320 --> 10:40:28,440
we could pass in the input one frame at a time, you might imagine.
12867
10:40:28,440 --> 10:40:32,720
First, taking the first frame of the video, passing it into the network,
12868
10:40:32,720 --> 10:40:35,520
and then maybe not having the network output anything at all yet.
12869
10:40:35,520 --> 10:40:40,120
Let it take in another input, and this time, pass it into the network.
12870
10:40:40,120 --> 10:40:43,000
But the network gets information from the last time
12871
10:40:43,000 --> 10:40:45,000
we provided an input into the network.
12872
10:40:45,000 --> 10:40:47,480
Then we pass in a third input, and then a fourth input,
12873
10:40:47,480 --> 10:40:51,200
where each time, what the network gets is it gets the most recent input,
12874
10:40:51,200 --> 10:40:53,600
like each frame of the video.
12875
10:40:53,600 --> 10:40:56,280
But it also gets information the network processed
12876
10:40:56,280 --> 10:40:58,080
from all of the previous iterations.
12877
10:40:58,080 --> 10:41:02,400
So on frame number four, you end up getting the input for frame number four
12878
10:41:02,400 --> 10:41:06,880
plus information the network has calculated from the first three frames.
12879
10:41:06,880 --> 10:41:10,000
And using all of that data combined, this recurrent neural network
12880
10:41:10,000 --> 10:41:14,160
can begin to learn how to extract patterns from a sequence of data
12881
10:41:14,160 --> 10:41:14,960
as well.
12882
10:41:14,960 --> 10:41:17,280
And so you might imagine, if you want to classify a video
12883
10:41:17,280 --> 10:41:20,040
into a number of different genres, like an educational video,
12884
10:41:20,040 --> 10:41:22,220
or a music video, or different types of videos,
12885
10:41:22,220 --> 10:41:24,400
that's a classification task, where you want
12886
10:41:24,400 --> 10:41:27,020
to take as input each of the frames of the video,
12887
10:41:27,020 --> 10:41:31,560
and you want to output something like what it is, what category
12888
10:41:31,560 --> 10:41:33,320
that it happens to belong to.
12889
10:41:33,320 --> 10:41:35,040
And you can imagine doing this sort of thing,
12890
10:41:35,040 --> 10:41:39,840
this sort of many-to-one learning, any time your input is a sequence.
12891
10:41:39,840 --> 10:41:43,240
And so the input is a sequence in the context of video.
12892
10:41:43,240 --> 10:41:45,740
It could be in the context of, like, if someone has typed a message
12893
10:41:45,740 --> 10:41:47,840
and you want to be able to categorize that message,
12894
10:41:47,840 --> 10:41:51,560
like if you're trying to take a movie review and classify it
12895
10:41:51,560 --> 10:41:54,080
as, is it a positive review or a negative review?
12896
10:41:54,080 --> 10:41:56,720
That input is a sequence of words, and the output
12897
10:41:56,720 --> 10:41:59,360
is a classification, positive or negative.
12898
10:41:59,360 --> 10:42:01,440
There, too, a recurrent neural network might
12899
10:42:01,440 --> 10:42:04,040
be helpful for analyzing sequences of words.
12900
10:42:04,040 --> 10:42:07,600
And they're quite popular when it comes to dealing with language.
12901
10:42:07,600 --> 10:42:09,880
They could even be used for spoken language as well,
12902
10:42:09,880 --> 10:42:12,480
since spoken language is an audio waveform that
12903
10:42:12,480 --> 10:42:14,800
can be segmented into distinct chunks.
12904
10:42:14,800 --> 10:42:17,440
And each of those could be passed in as an input
12905
10:42:17,440 --> 10:42:21,000
into a recurrent neural network to be able to classify someone's voice,
12906
10:42:21,000 --> 10:42:21,560
for instance.
12907
10:42:21,560 --> 10:42:24,880
If you want to do voice recognition to say, is this one person or is this
12908
10:42:24,880 --> 10:42:27,360
another, these are also cases where you might
12909
10:42:27,360 --> 10:42:32,240
want this many-to-one architecture for a recurrent neural network.
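As a rough illustration of this many-to-one idea, here is a toy recurrent step in plain Python. The weights are made-up constants standing in for what a real network would learn; this is a sketch of the control flow, not any actual library or course code.

```python
import math

def step(state, x, w_in=0.5, w_rec=0.9):
    # The new state mixes the current input with the state carried over
    # from the previous step; tanh keeps the value bounded.
    # w_in and w_rec are made-up constants, not learned parameters.
    return math.tanh(w_in * x + w_rec * state)

def many_to_one(sequence):
    state = 0.0
    for x in sequence:        # one frame, word, or audio chunk at a time
        state = step(state, x)
    return state              # the final state summarizes the whole sequence
```

A classifier would feed that final state into an output layer. Note that order matters: reversing a sequence generally changes the summary, which is exactly what lets the network pick up on sequential patterns.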
12910
10:42:32,240 --> 10:42:34,040
And then as one final problem, just to take
12911
10:42:34,040 --> 10:42:37,040
a look at in terms of what we can do with these sorts of networks,
12912
10:42:37,040 --> 10:42:39,080
imagine what Google Translate is doing.
12913
10:42:39,080 --> 10:42:42,560
So what Google Translate is doing is it's taking some text written
12914
10:42:42,560 --> 10:42:47,200
in one language and converting it into text written in some other language,
12915
10:42:47,200 --> 10:42:50,440
for example, where now this input is a sequence of data.
12916
10:42:50,440 --> 10:42:52,000
It's a sequence of words.
12917
10:42:52,000 --> 10:42:54,320
And the output is a sequence of words as well.
12918
10:42:54,320 --> 10:42:55,560
It's also a sequence.
12919
10:42:55,560 --> 10:42:58,560
So here we want effectively a many-to-many relationship.
12920
10:42:58,560 --> 10:43:02,560
Our input is a sequence and our output is a sequence as well.
12921
10:43:02,560 --> 10:43:05,000
And it's not quite going to work to just say,
12922
10:43:05,000 --> 10:43:09,840
take each word in the input and translate it into a word in the output.
12923
10:43:09,840 --> 10:43:13,040
Because ultimately, different languages put their words in different orders.
12924
10:43:13,040 --> 10:43:15,200
And maybe one language uses two words for something,
12925
10:43:15,200 --> 10:43:17,240
whereas another language only uses one.
12926
10:43:17,240 --> 10:43:22,240
So we really want some way to take this information, this input,
12927
10:43:22,240 --> 10:43:25,840
encode it somehow, and use that encoding to generate
12928
10:43:25,840 --> 10:43:27,440
what the output ultimately should be.
12929
10:43:27,440 --> 10:43:30,720
And this has been one of the big advancements in automated translation
12930
10:43:30,720 --> 10:43:34,080
technology, is the ability to use neural networks to do this instead
12931
10:43:34,080 --> 10:43:35,800
of older, more traditional methods.
12932
10:43:35,800 --> 10:43:37,920
And this has improved accuracy dramatically.
12933
10:43:37,920 --> 10:43:40,240
And the way you might imagine doing this is, again,
12934
10:43:40,240 --> 10:43:44,200
using a recurrent neural network with multiple inputs and multiple outputs.
12935
10:43:44,200 --> 10:43:45,800
We start by passing in all the input.
12936
10:43:45,800 --> 10:43:47,320
Input goes into the network.
12937
10:43:47,320 --> 10:43:49,560
Another input, like another word, goes into the network.
12938
10:43:49,560 --> 10:43:53,280
And we do this multiple times, like once for each word in the input
12939
10:43:53,280 --> 10:43:54,680
that I'm trying to translate.
12940
10:43:54,680 --> 10:43:58,000
And only after all of that is done does the network now
12941
10:43:58,000 --> 10:44:01,200
start to generate output, like the first word of the translated sentence,
12942
10:44:01,200 --> 10:44:04,240
and the next word of the translated sentence, so on and so forth,
12943
10:44:04,240 --> 10:44:08,640
where each time the network passes information to itself
12944
10:44:08,640 --> 10:44:12,480
by carrying some sort of state
12945
10:44:12,480 --> 10:44:15,120
from one run of the network to the next run,
12946
10:44:15,120 --> 10:44:17,280
assembling information about all the inputs,
12947
10:44:17,280 --> 10:44:20,600
and then passing along information about which part of the output
12948
10:44:20,600 --> 10:44:22,440
to generate next.
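A minimal sketch of that encode-then-decode control flow, in plain Python. The lookup table here is a made-up stand-in for everything a trained translation network would actually learn; only the shape of the computation matches the description above.

```python
# Consume the entire input first, threading a state through each step;
# only then emit the output one piece at a time. TOY_TABLE is a
# hypothetical stand-in for the learned behavior of a real network.
TOY_TABLE = {("hello", "world"): ["bonjour", "monde"]}

def encode(tokens):
    state = ()
    for token in tokens:          # inputs go in one at a time, no output yet
        state = state + (token,)  # the state accumulates the whole input
    return state

def decode(state):
    output = []
    for word in TOY_TABLE.get(state, []):  # outputs come out one at a time
        output.append(word)
    return output

def translate(tokens):
    return decode(encode(tokens))
```

Because the output is only generated after the whole input has been encoded, the two sequences are free to differ in length and word order, which is what word-by-word translation could not handle.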
12949
10:44:22,440 --> 10:44:25,640
And there are a number of different types of these sorts of recurrent neural
12950
10:44:25,640 --> 10:44:26,140
networks.
12951
10:44:26,140 --> 10:44:29,640
One of the most popular is known as the long short-term memory neural network,
12952
10:44:29,640 --> 10:44:31,400
otherwise known as LSTM.
12953
10:44:31,400 --> 10:44:35,160
But in general, these types of networks can be very, very powerful whenever
12954
10:44:35,160 --> 10:44:38,120
we're dealing with sequences, whether those are sequences of images
12955
10:44:38,120 --> 10:44:40,600
or especially sequences of words when it comes
12956
10:44:40,600 --> 10:44:43,600
to dealing with natural language.
12957
10:44:43,600 --> 10:44:46,160
And so those were just some of the different types
12958
10:44:46,160 --> 10:44:49,840
of neural networks that can be used to do all sorts of different computations.
12959
10:44:49,840 --> 10:44:52,080
And these are incredibly versatile tools that
12960
10:44:52,080 --> 10:44:54,200
can be applied to a number of different domains.
12961
10:44:54,200 --> 10:44:57,600
We only looked at a couple of the most popular types of neural networks
12962
10:44:57,600 --> 10:45:00,840
from more traditional feed-forward neural networks to convolutional neural
12963
10:45:00,840 --> 10:45:02,920
networks, and recurrent neural networks.
12964
10:45:02,920 --> 10:45:04,240
But there are other types as well.
12965
10:45:04,240 --> 10:45:07,120
There are adversarial networks where networks compete with each other
12966
10:45:07,120 --> 10:45:10,160
to try and be able to generate new types of data,
12967
10:45:10,160 --> 10:45:13,000
as well as other networks that can solve other tasks based
12968
10:45:13,000 --> 10:45:15,680
on what they happen to be structured and adapted for.
12969
10:45:15,680 --> 10:45:18,080
And these are very powerful tools in machine learning
12970
10:45:18,080 --> 10:45:21,880
from being able to easily learn based on some set of input data
12971
10:45:21,880 --> 10:45:25,040
and, therefore, to figure out how to calculate some function
12972
10:45:25,040 --> 10:45:28,720
from inputs to outputs, whether it's input to some sort of classification
12973
10:45:28,720 --> 10:45:32,040
like analyzing an image and getting a digit or machine translation
12974
10:45:32,040 --> 10:45:34,920
where the input is in one language and the output is in another.
12975
10:45:34,920 --> 10:45:39,320
These tools have a lot of applications for machine learning more generally.
12976
10:45:39,320 --> 10:45:42,400
Next time, we'll look at machine learning and AI in particular
12977
10:45:42,400 --> 10:45:44,120
in the context of natural language.
12978
10:45:44,120 --> 10:45:47,520
We talked a little bit about this today, but looking at how it is that our AI
12979
10:45:47,520 --> 10:45:50,000
can begin to understand natural language and can
12980
10:45:50,000 --> 10:45:53,360
begin to be able to analyze and do useful tasks with regards
12981
10:45:53,360 --> 10:45:57,040
to human language, which turns out to be a challenging and interesting task.
12982
10:45:57,040 --> 10:46:00,000
So we'll see you next time.
12983
10:46:00,000 --> 10:46:21,360
And welcome back, everybody, to our final class
12984
10:46:21,360 --> 10:46:24,320
in an introduction to artificial intelligence with Python.
12985
10:46:24,320 --> 10:46:26,720
Now, so far in this class, we've been taking problems
12986
10:46:26,720 --> 10:46:29,040
that we want to solve intelligently and framing them
12987
10:46:29,040 --> 10:46:31,720
in ways that computers are going to be able to make sense of.
12988
10:46:31,720 --> 10:46:34,840
We've been taking problems and framing them as search problems
12989
10:46:34,840 --> 10:46:38,920
or constraint satisfaction problems or optimization problems, for example.
12990
10:46:38,920 --> 10:46:40,840
In essence, we have been trying to communicate
12991
10:46:40,840 --> 10:46:45,120
about problems in ways that our computer is going to be able to understand.
12992
10:46:45,120 --> 10:46:47,560
Today, the goal is going to be to get computers
12993
10:46:47,560 --> 10:46:50,280
to understand the way you and I communicate naturally
12994
10:46:50,280 --> 10:46:53,800
via our own natural languages, languages like English.
12995
10:46:53,800 --> 10:46:57,400
But natural language contains a lot of nuance and complexity
12996
10:46:57,400 --> 10:47:00,600
that's going to make it challenging for computers to be able to understand.
12997
10:47:00,600 --> 10:47:04,080
So we'll need to explore some new tools and some new techniques
12998
10:47:04,080 --> 10:47:07,800
to allow computers to make sense of natural language.
12999
10:47:07,800 --> 10:47:10,640
So what is it exactly that we're trying to get computers to do?
13000
10:47:10,640 --> 10:47:14,520
Well, they all fall under this general heading of natural language processing,
13001
10:47:14,520 --> 10:47:17,360
getting computers to work with natural language.
13002
10:47:17,360 --> 10:47:20,840
And these include tasks like automatic summarization.
13003
10:47:20,840 --> 10:47:23,600
Given a long text, can we train the computer
13004
10:47:23,600 --> 10:47:26,240
to be able to come up with a shorter representation of it?
13005
10:47:26,240 --> 10:47:28,280
Information extraction, getting the computer
13006
10:47:28,280 --> 10:47:31,120
to pull out relevant facts or details out of some text.
13007
10:47:31,120 --> 10:47:33,400
Machine translation, like Google Translate,
13008
10:47:33,400 --> 10:47:36,680
translating some text from one language into another language.
13009
10:47:36,680 --> 10:47:39,880
Question answering, if you've ever asked a question to your phone
13010
10:47:39,880 --> 10:47:43,800
or had a conversation with an AI chatbot where you provide some text
13011
10:47:43,800 --> 10:47:47,400
to the computer, the computer is able to understand that text
13012
10:47:47,400 --> 10:47:50,360
and then generate some text in response.
13013
10:47:50,360 --> 10:47:53,680
Text classification, where we provide some text to the computer
13014
10:47:53,680 --> 10:47:56,720
and the computer assigns it a label, positive or negative,
13015
10:47:56,720 --> 10:47:58,600
inbox or spam, for example.
13016
10:47:58,600 --> 10:48:00,360
And there are several other kinds of tasks
13017
10:48:00,360 --> 10:48:03,800
that all fall under this heading of natural language processing.
13018
10:48:03,800 --> 10:48:06,240
But before we take a look at how the computer might
13019
10:48:06,240 --> 10:48:09,240
try to solve these kinds of tasks, it might be useful for us
13020
10:48:09,240 --> 10:48:11,540
to think about language in general.
13021
10:48:11,540 --> 10:48:14,360
What are the kinds of challenges that we might need to deal with
13022
10:48:14,360 --> 10:48:17,320
as we start to think about language and getting a computer
13023
10:48:17,320 --> 10:48:18,880
to be able to understand it?
13024
10:48:18,880 --> 10:48:21,080
So one part of language that we'll need to consider
13025
10:48:21,080 --> 10:48:22,760
is the syntax of language.
13026
10:48:22,760 --> 10:48:25,040
Syntax is all about the structure of language.
13027
10:48:25,040 --> 10:48:27,400
Language is composed of individual words.
13028
10:48:27,400 --> 10:48:31,280
And those words are composed together in some kind of structured whole.
13029
10:48:31,280 --> 10:48:33,960
And if our computer is going to be able to understand language,
13030
10:48:33,960 --> 10:48:37,440
it's going to need to understand something about that structure.
13031
10:48:37,440 --> 10:48:39,160
So let's take a couple of examples.
13032
10:48:39,160 --> 10:48:40,920
Here, for instance, is a sentence.
13033
10:48:40,920 --> 10:48:44,740
Just before 9 o'clock, Sherlock Holmes stepped briskly into the room.
13034
10:48:44,740 --> 10:48:46,680
That sentence is made up of words.
13035
10:48:46,680 --> 10:48:49,640
And those words together form a structured whole.
13036
10:48:49,640 --> 10:48:52,520
This is syntactically valid as a sentence.
13037
10:48:52,520 --> 10:48:55,120
But we could take some of those same words,
13038
10:48:55,120 --> 10:48:59,640
rearrange them, and come up with a sentence that is not syntactically valid.
13039
10:48:59,640 --> 10:49:03,640
Here, for example, "just before Sherlock Holmes 9 o'clock stepped briskly
13040
10:49:03,640 --> 10:49:06,640
the room" is still composed of valid words.
13041
10:49:06,640 --> 10:49:08,960
But they're not in any kind of logical whole.
13042
10:49:08,960 --> 10:49:12,800
This is not a syntactically well-formed sentence.
13043
10:49:12,800 --> 10:49:15,800
Another interesting challenge is that some sentences will
13044
10:49:15,800 --> 10:49:18,920
have multiple possible valid structures.
13045
10:49:18,920 --> 10:49:20,440
Here's a sentence, for example.
13046
10:49:20,440 --> 10:49:23,480
I saw the man on the mountain with a telescope.
13047
10:49:23,480 --> 10:49:25,200
And here, this is a valid sentence.
13048
10:49:25,200 --> 10:49:28,680
But it actually has two different possible structures
13049
10:49:28,680 --> 10:49:31,360
that lend themselves to two different interpretations
13050
10:49:31,360 --> 10:49:32,520
and two different meanings.
13051
10:49:32,520 --> 10:49:36,040
Maybe I, the one doing the seeing, am the one with the telescope.
13052
10:49:36,040 --> 10:49:39,280
Or maybe the man on the mountain is the one with the telescope.
13053
10:49:39,280 --> 10:49:41,440
And so natural language is ambiguous.
13054
10:49:41,440 --> 10:49:44,800
Sometimes the same sentence can be interpreted in multiple ways.
13055
10:49:44,800 --> 10:49:47,520
And that's something that we'll need to think about as well.
13056
10:49:47,520 --> 10:49:50,000
And this lends itself to another problem within language
13057
10:49:50,000 --> 10:49:52,480
that we'll need to think about, which is semantics.
13058
10:49:52,480 --> 10:49:55,080
While syntax is all about the structure of language,
13059
10:49:55,080 --> 10:49:57,360
semantics is about the meaning of language.
13060
10:49:57,360 --> 10:49:59,880
It's not enough for a computer just to know
13061
10:49:59,880 --> 10:50:02,040
that a sentence is well-structured if it doesn't
13062
10:50:02,040 --> 10:50:04,200
know what that sentence means.
13063
10:50:04,200 --> 10:50:06,240
And so semantics is going to concern itself
13064
10:50:06,240 --> 10:50:09,440
with the meaning of words and the meaning of sentences.
13065
10:50:09,440 --> 10:50:11,680
So if we go back to that same sentence as before,
13066
10:50:11,680 --> 10:50:16,000
just before 9 o'clock, Sherlock Holmes stepped briskly into the room,
13067
10:50:16,000 --> 10:50:19,360
I could come up with another sentence, say the sentence,
13068
10:50:19,360 --> 10:50:23,600
a few minutes before 9, Sherlock Holmes walked quickly into the room.
13069
10:50:23,600 --> 10:50:26,480
And those are two different sentences with some of the words the same
13070
10:50:26,480 --> 10:50:28,000
and some of the words different.
13071
10:50:28,000 --> 10:50:31,280
But the two sentences have essentially the same meaning.
13072
10:50:31,280 --> 10:50:33,440
And so ideally, whatever model we build, we'll
13073
10:50:33,440 --> 10:50:36,560
be able to understand that these two sentences, while different,
13074
10:50:36,560 --> 10:50:38,800
mean something very similar.
13075
10:50:38,800 --> 10:50:42,440
Some syntactically well-formed sentences don't mean anything at all.
13076
10:50:42,440 --> 10:50:44,920
A famous example from linguist Noam Chomsky
13077
10:50:44,920 --> 10:50:48,920
is the sentence, colorless green ideas sleep furiously.
13078
10:50:48,920 --> 10:50:52,120
This is a syntactically, structurally well-formed sentence.
13079
10:50:52,120 --> 10:50:55,280
We've got adjectives modifying a noun, ideas.
13080
10:50:55,280 --> 10:50:58,040
We've got a verb and an adverb in the correct positions.
13081
10:50:58,040 --> 10:51:01,880
But when taken as a whole, the sentence doesn't really mean anything.
13082
10:51:01,880 --> 10:51:05,080
And so if our computers are going to be able to work with natural language
13083
10:51:05,080 --> 10:51:07,520
and perform tasks in natural language processing,
13084
10:51:07,520 --> 10:51:09,520
these are some concerns we'll need to think about.
13085
10:51:09,520 --> 10:51:11,760
We'll need to be thinking about syntax.
13086
10:51:11,760 --> 10:51:14,520
And we'll need to be thinking about semantics.
13087
10:51:14,520 --> 10:51:17,480
So how could we go about trying to teach a computer how
13088
10:51:17,480 --> 10:51:20,280
to understand the structure of natural language?
13089
10:51:20,280 --> 10:51:22,680
Well, one approach we might take is by starting
13090
10:51:22,680 --> 10:51:25,400
by thinking about the rules of natural language.
13091
10:51:25,400 --> 10:51:27,160
Our natural languages have rules.
13092
10:51:27,160 --> 10:51:30,360
In English, for example, nouns tend to come before verbs.
13093
10:51:30,360 --> 10:51:33,240
Nouns can be modified by adjectives, for example.
13094
10:51:33,240 --> 10:51:36,040
And so if only we could formalize those rules,
13095
10:51:36,040 --> 10:51:38,280
then we could give those rules to a computer,
13096
10:51:38,280 --> 10:51:41,880
and the computer would be able to make sense of them and understand them.
13097
10:51:41,880 --> 10:51:43,720
And so let's try to do exactly that.
13098
10:51:43,720 --> 10:51:46,360
We're going to try to define a formal grammar.
13099
10:51:46,360 --> 10:51:49,400
Where a formal grammar is some system of rules
13100
10:51:49,400 --> 10:51:52,040
for generating sentences in a language.
13101
10:51:52,040 --> 10:51:56,000
This is going to be a rule-based approach to natural language processing.
13102
10:51:56,000 --> 10:51:59,400
We're going to give the computer some rules that we know about language
13103
10:51:59,400 --> 10:52:01,840
and have the computer use those rules to make
13104
10:52:01,840 --> 10:52:04,280
sense of the structure of language.
13105
10:52:04,280 --> 10:52:06,600
And there are a number of different types of formal grammars.
13106
10:52:06,600 --> 10:52:09,080
Each one of them has slightly different use cases.
13107
10:52:09,080 --> 10:52:11,080
But today, we're going to focus specifically
13108
10:52:11,080 --> 10:52:14,560
on one kind of grammar known as a context-free grammar.
13109
10:52:14,560 --> 10:52:16,480
So how does the context-free grammar work?
13110
10:52:16,480 --> 10:52:19,720
Well, here is a sentence that we might want a computer to generate.
13111
10:52:19,720 --> 10:52:21,520
She saw the city.
13112
10:52:21,520 --> 10:52:24,760
And we're going to call each of these words a terminal symbol.
13113
10:52:24,760 --> 10:52:27,920
A terminal symbol, because once our computer has generated the word,
13114
10:52:27,920 --> 10:52:29,500
there's nothing else for it to generate.
13115
10:52:29,500 --> 10:52:32,800
Once it's generated the sentence, the computer is done.
13116
10:52:32,800 --> 10:52:35,520
We're going to associate each of these terminal symbols
13117
10:52:35,520 --> 10:52:39,320
with a non-terminal symbol that generates it.
13118
10:52:39,320 --> 10:52:43,200
So here we've got N, which stands for noun, like she or city.
13119
10:42:43,200 --> 10:42:46,600
We've got V as a non-terminal symbol, which stands for a verb.
13120
10:42:46,600 --> 10:42:48,720
And then we have D, which stands for determiner.
13121
10:52:48,720 --> 10:52:52,880
A determiner is a word like the or a or an in English, for example.
13122
10:52:52,880 --> 10:52:57,040
So each of these non-terminal symbols can generate the terminal symbols
13123
10:52:57,040 --> 10:52:59,600
that we ultimately care about generating.
13124
10:52:59,600 --> 10:53:01,720
But how do we know, or how does the computer
13125
10:53:01,720 --> 10:53:05,720
know which non-terminal symbols are associated with which terminal symbols?
13126
10:53:05,720 --> 10:53:08,280
Well, to do that, we need some kind of rule.
13127
10:53:08,280 --> 10:53:11,040
Here are some what we call rewriting rules that
13128
10:53:11,040 --> 10:53:14,320
have a non-terminal symbol on the left-hand side of an arrow.
13129
10:53:14,320 --> 10:53:18,800
And on the right side is what that non-terminal symbol can be replaced with.
13130
10:53:18,800 --> 10:53:21,560
So here we're saying the non-terminal symbol N, again,
13131
10:53:21,560 --> 10:53:25,520
which stands for noun, could be replaced by any of these options separated
13132
10:53:25,520 --> 10:53:26,800
by vertical bars.
13133
10:53:26,800 --> 10:53:30,760
N could be replaced by she or city or car or Harry.
13134
10:53:30,760 --> 10:53:34,800
D for determiner could be replaced by the, a, or an, and so forth.
13135
10:53:34,800 --> 10:53:40,240
Each of these non-terminal symbols could be replaced by any of these words.
13136
10:53:40,240 --> 10:53:42,720
We can also have non-terminal symbols that
13137
10:53:42,720 --> 10:53:45,680
are replaced by other non-terminal symbols.
13138
10:53:45,680 --> 10:53:50,840
Here is an interesting rule: NP -> N | D N.
13139
10:53:50,840 --> 10:53:52,000
So what does that mean?
13140
10:53:52,000 --> 10:53:55,080
Well, NP stands for a noun phrase.
13141
10:53:55,080 --> 10:53:57,400
Sometimes when we have a noun phrase in a sentence,
13142
10:53:57,400 --> 10:54:00,200
it's not just a single word; it could be multiple words.
13143
10:54:00,200 --> 10:54:04,400
And so here we're saying a noun phrase could be just a noun,
13144
10:54:04,400 --> 10:54:07,920
or it could be a determiner followed by a noun.
13145
10:54:07,920 --> 10:54:11,200
So we might have a noun phrase that's just a noun, like she,
13146
10:54:11,200 --> 10:54:12,680
that's a noun phrase.
13147
10:54:12,680 --> 10:54:15,360
Or we could have a noun phrase that's multiple words, something
13148
10:54:15,360 --> 10:54:18,440
like the city, which also acts as a noun phrase.
13149
10:54:18,440 --> 10:54:22,440
But in this case, it's composed of two words, a determiner, the,
13150
10:54:22,440 --> 10:54:24,520
and a noun city.
13151
10:54:24,520 --> 10:54:26,480
We could do the same for verb phrases.
13152
10:54:26,480 --> 10:54:30,040
A verb phrase, or VP, might be just a verb,
13153
10:54:30,040 --> 10:54:33,160
or it might be a verb followed by a noun phrase.
13154
10:54:33,160 --> 10:54:35,920
So we could have a verb phrase that's just a single word,
13155
10:54:35,920 --> 10:54:38,760
like the word walked, or we could have a verb phrase
13156
10:54:38,760 --> 10:54:42,600
that is an entire phrase, something like saw the city,
13157
10:54:42,600 --> 10:54:45,040
as an entire verb phrase.
13158
10:54:45,040 --> 10:54:48,680
A sentence, meanwhile, we might then define as a noun phrase
13159
10:54:48,680 --> 10:54:50,840
followed by a verb phrase.
13160
10:54:50,840 --> 10:54:54,600
And so this would allow us to generate a sentence like she saw the city,
13161
10:54:54,600 --> 10:54:59,000
an entire sentence made up of a noun phrase, which is just the word she,
13162
10:54:59,000 --> 10:55:03,120
and then a verb phrase, which is saw the city, saw which is a verb,
13163
10:55:03,120 --> 10:55:07,880
and then the city, which itself is also a noun phrase.
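Since a formal grammar is a system of rules for generating sentences, the rewriting rules just described can be sketched directly in Python. The rule set below mirrors the examples above; it is illustrative, not the course's exact file.

```python
import random

# Each non-terminal maps to its list of options (the vertical bars);
# anything not in the dictionary is a terminal word.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"], ["an"]],
    "N":  [["she"], ["city"], ["car"], ["Harry"]],
    "V":  [["saw"], ["walked"]],
}

def generate(symbol="S"):
    if symbol not in RULES:          # terminal symbol: nothing left to rewrite
        return [symbol]
    words = []
    for part in random.choice(RULES[symbol]):  # pick one option, expand each part
        words.extend(generate(part))
    return words
```

Calling generate() might yield something like ['she', 'saw', 'the', 'city'], and every result is syntactically valid under these rules, even if, as with Chomsky's famous sentence, it need not mean anything.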
13164
10:55:07,880 --> 10:55:11,200
And so if we could give these rules to a computer explaining to it
13165
10:55:11,200 --> 10:55:15,080
what non-terminal symbols could be replaced by what other symbols,
13166
10:55:15,080 --> 10:55:17,400
then a computer could take a sentence and begin
13167
10:55:17,400 --> 10:55:20,520
to understand the structure of that sentence.
13168
10:55:20,520 --> 10:55:23,320
And so let's take a look at an example of how we might do that.
13169
10:55:23,320 --> 10:55:26,960
And to do that, we're going to use a Python library called NLTK,
13170
10:55:26,960 --> 10:55:30,160
or the Natural Language Toolkit, which we'll see a couple of times today.
13171
10:55:30,160 --> 10:55:33,280
It contains a lot of helpful features and functions that we can use
13172
10:55:33,280 --> 10:55:36,440
for trying to deal with and process natural language.
13173
10:55:36,440 --> 10:55:39,540
So here we'll take a look at how we can use NLTK in order
13174
10:55:39,540 --> 10:55:42,280
to parse a context-free grammar.
13175
10:55:42,280 --> 10:55:47,840
So let's go ahead and open up cfg0.py, cfg standing for context-free grammar.
13176
10:55:47,840 --> 10:55:51,680
And what you'll see in this file is that I first import NLTK, the Natural
13177
10:55:51,680 --> 10:55:53,160
Language Toolkit.
13178
10:55:53,160 --> 10:55:57,000
And the first thing I do is define a context-free grammar,
13179
10:55:57,000 --> 10:56:00,400
saying that a sentence is a noun phrase followed by a verb phrase.
13180
10:56:00,400 --> 10:56:03,840
I'm defining what a noun phrase is, defining what a verb phrase is,
13181
10:56:03,840 --> 10:56:05,800
and then giving some examples of what I can
13182
10:56:05,800 --> 10:56:10,400
do with these non-terminal symbols, D for determiner, N for noun,
13183
10:56:10,400 --> 10:56:12,280
and V for verb.
13184
10:56:12,280 --> 10:56:15,400
We're going to use NLTK to parse that grammar.
13185
10:56:15,400 --> 10:56:18,280
Then we'll ask the user for some input in the form of a sentence
13186
10:56:18,280 --> 10:56:20,360
and split it into words.
13187
10:56:20,360 --> 10:56:23,560
And then we'll use this context-free grammar parser
13188
10:56:23,560 --> 10:56:28,400
to try to parse that sentence and print out the resulting syntax tree.
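A sketch of what a file like cfg0.py could contain, using NLTK's CFG and chart parser as described; the grammar and word lists here are illustrative guesses, not the course's exact source.

```python
import nltk

# A sentence is a noun phrase followed by a verb phrase; the word
# lists are illustrative, not necessarily the ones in the course file.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> D N | N
    VP -> V | V NP
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "Harry"
    V -> "saw" | "walked"
""")
parser = nltk.ChartParser(grammar)

def parse_sentence(sentence):
    # Split the sentence into words and collect every valid syntax tree.
    return list(parser.parse(sentence.split()))

for tree in parse_sentence("she saw the city"):
    tree.pretty_print()
```

Any sentence the grammar cannot derive simply yields no trees, which is how the parser distinguishes syntactically valid input from invalid input.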
13189
10:56:28,400 --> 10:56:30,760
So let's take a look at an example.
13190
10:56:30,760 --> 10:56:35,560
We'll go ahead and go into my cfg directory, and we'll run cfg0.py.
13191
10:56:35,560 --> 10:56:37,160
And here I'm asked to type in a sentence.
13192
10:56:37,160 --> 10:56:40,600
Let's say I type in she walked.
13193
10:56:40,600 --> 10:56:43,960
And when I do that, I see that she walked is a valid sentence,
13194
10:56:43,960 --> 10:56:49,680
where she is a noun phrase, and walked is the corresponding verb phrase.
13195
10:56:49,680 --> 10:56:52,600
I could try to do this with a more complex sentence too.
13196
10:56:52,600 --> 10:56:55,920
I could do something like she saw the city.
13197
10:56:55,920 --> 10:56:58,920
And here we see that she is the noun phrase,
13198
10:56:58,920 --> 10:57:04,560
and then saw the city is the entire verb phrase that makes up this sentence.
13199
10:57:04,560 --> 10:57:06,200
So that was a very simple grammar.
13200
10:57:06,200 --> 10:57:08,840
Let's take a look at a slightly more complex grammar.
13201
10:57:08,840 --> 10:57:13,440
Here is cfg1.py, where a sentence is still a noun phrase followed
13202
10:57:13,440 --> 10:57:17,680
by a verb phrase, but I've added some other possible non-terminal symbols too.
13203
10:57:17,680 --> 10:57:22,760
I have AP for adjective phrase and PP for prepositional phrase.
13204
10:57:22,760 --> 10:57:25,480
And we specified that we could have an adjective phrase
13205
10:57:25,480 --> 10:57:30,440
before a noun phrase or a prepositional phrase after a noun, for example.
13206
10:57:30,440 --> 10:57:34,320
So lots of additional ways that we might try to structure a sentence
13207
10:57:34,320 --> 10:57:37,880
and interpret and parse one of those resulting sentences.
13208
10:57:37,880 --> 10:57:39,280
So let's see that one in action.
13209
10:57:39,280 --> 10:57:43,600
We'll go ahead and run cfg1.py with this new grammar.
13210
10:57:43,600 --> 10:57:48,400
And we'll try a sentence like she saw the wide street.
13211
10:57:48,400 --> 10:57:51,680
Here, Python's NLTK is able to parse that sentence
13212
10:57:51,680 --> 10:57:55,840
and identify that she saw the wide street has this particular structure,
13213
10:57:55,840 --> 10:57:58,400
a sentence with a noun phrase and a verb phrase,
13214
10:57:58,400 --> 10:58:00,600
where that verb phrase contains a noun phrase that itself
13215
10:58:00,600 --> 10:58:02,080
contains an adjective.
13216
10:58:02,080 --> 10:58:06,120
And so it's able to get some sense for what the structure of this language
13217
10:58:06,120 --> 10:58:07,840
actually is.
13218
10:58:07,840 --> 10:58:09,280
Let's try another example.
13219
10:58:09,280 --> 10:58:14,680
Let's say she saw the dog with the binoculars.
13220
10:58:14,680 --> 10:58:16,680
And we'll try that sentence.
13221
10:58:16,680 --> 10:58:19,840
And here, we get one possible syntax tree,
13222
10:58:19,840 --> 10:58:21,840
she saw the dog with the binoculars.
13223
10:58:21,840 --> 10:58:24,120
But notice that this sentence is actually a little bit
13224
10:58:24,120 --> 10:58:26,320
ambiguous in our own natural language.
13225
10:58:26,320 --> 10:58:27,400
Who has the binoculars?
13226
10:58:27,400 --> 10:58:31,320
Is it she who has the binoculars or the dog who has the binoculars?
13227
10:58:31,320 --> 10:58:35,880
And NLTK is able to identify both possible structures for the sentence.
13228
10:58:35,880 --> 10:58:38,720
In this case, the dog with the binoculars
13229
10:58:38,720 --> 10:58:40,440
is an entire noun phrase.
13230
10:58:40,440 --> 10:58:42,720
It's all underneath this NP here.
13231
10:58:42,720 --> 10:58:45,280
So it's the dog that has the binoculars.
13232
10:58:45,280 --> 10:58:48,720
But we also got an alternative parse tree,
13233
10:58:48,720 --> 10:58:52,440
where the dog is just the noun phrase.
13234
10:58:52,440 --> 10:58:57,080
And with the binoculars is a prepositional phrase modifying saw.
13235
10:58:57,080 --> 10:59:01,080
So she saw the dog and she used the binoculars in order
13236
10:59:01,080 --> 10:59:03,120
to see the dog as well.
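That ambiguity can be reproduced with a grammar that lets a prepositional phrase attach either to a noun phrase or to a verb phrase. This small grammar is an illustrative stand-in for cfg1.py, not the course's exact file.

```python
import nltk

# Allowing a PP to attach to either an NP or a VP makes the
# attachment of "with the binoculars" genuinely ambiguous.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N | NP PP
    VP -> V | V NP | VP PP
    PP -> P NP
    D -> "the"
    N -> "she" | "dog" | "binoculars"
    P -> "with"
    V -> "saw"
""")
parser = nltk.ChartParser(grammar)

trees = list(parser.parse("she saw the dog with the binoculars".split()))
# One parse attaches "with the binoculars" to "the dog" (NP PP);
# the other attaches it to "saw" (VP PP).
for tree in trees:
    tree.pretty_print()
```

The parser returns both syntax trees, one per interpretation, just as described above.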
13237
10:59:03,120 --> 10:59:06,120
So this allows us to get a sense for the structure of natural language.
13238
10:59:06,120 --> 10:59:08,840
But it relies on us writing all of these rules.
13239
10:59:08,840 --> 10:59:12,120
And it would take a lot of effort to write all of the rules for any possible
13240
10:59:12,120 --> 10:59:15,320
sentence that someone might write or say in the English language.
13241
10:59:15,320 --> 10:59:16,520
Language is complicated.
13242
10:59:16,520 --> 10:59:20,080
And as a result, there are going to be some very complex rules.
13243
10:59:20,080 --> 10:59:21,680
So what else might we try?
13244
10:59:21,680 --> 10:59:24,840
We might try to take a statistical lens towards approaching
13245
10:59:24,840 --> 10:59:27,320
this problem of natural language processing.
13246
10:59:27,320 --> 10:59:31,160
If we were able to give the computer a lot of existing data of sentences
13247
10:59:31,160 --> 10:59:35,160
written in the English language, what could we try to learn from that data?
13248
10:59:35,160 --> 10:59:38,480
Well, it might be difficult to try and interpret long pieces of text all
13249
10:59:38,480 --> 10:59:39,200
at once.
13250
10:59:39,200 --> 10:59:42,680
So instead, what we might want to do is break up that longer text
13251
10:59:42,680 --> 10:59:45,120
into smaller pieces of information instead.
13252
10:59:45,120 --> 10:59:50,360
In particular, we might try to create n-grams out of a longer sequence of text.
13253
10:59:50,360 --> 10:59:55,560
An n-gram is just some contiguous sequence of n items from a sample of text.
13254
10:59:55,560 --> 10:59:59,800
It might be n characters in a row or n words in a row, for example.
13255
10:59:59,800 --> 11:00:02,320
So let's take a passage from Sherlock Holmes.
13256
11:00:02,320 --> 11:00:04,560
And let's look for all of the trigrams.
13257
11:00:04,560 --> 11:00:07,640
A trigram is an n-gram where n is equal to 3.
13258
11:00:07,640 --> 11:00:11,480
So in this case, we're looking for sequences of three words in a row.
13259
11:00:11,480 --> 11:00:15,240
So the trigrams here would be phrases like "how often have."
13260
11:00:15,240 --> 11:00:16,680
That's three words in a row.
13261
11:00:16,680 --> 11:00:18,640
"Often have I" is another trigram.
13262
11:00:18,640 --> 11:00:22,080
"Have I said," "I said to," "said to you," "to you that."
13263
11:00:22,080 --> 11:00:27,040
These are all trigrams, sequences of three words that appear in sequence.
13264
11:00:27,040 --> 11:00:30,140
And if we could give the computer a large corpus of text
13265
11:00:30,140 --> 11:00:33,320
and have it pull out all of the trigrams in this case,
13266
11:00:33,320 --> 11:00:36,800
it could get a sense for what sequences of three words
13267
11:00:36,800 --> 11:00:40,720
tend to appear next to each other in our own natural language
13268
11:00:40,720 --> 11:00:45,240
and, as a result, get some sense for what the structure of the language
13269
11:00:45,240 --> 11:00:46,840
actually is.
13270
11:00:46,840 --> 11:00:48,560
So let's take a look at an example of that.
13271
11:00:48,560 --> 11:00:55,280
How can we use NLTK to try to get access to information about n-grams?
13272
11:00:55,280 --> 11:00:58,440
So here, we're going to open up ngrams.py.
13273
11:00:58,440 --> 11:01:02,440
And this is a Python program that's going to load a corpus of data, just
13274
11:01:02,440 --> 11:01:05,240
some text files, into our computer's memory.
13275
11:01:05,240 --> 11:01:08,760
And then we're going to use NLTK's ngrams function, which
13276
11:01:08,760 --> 11:01:12,520
is going to go through the corpus of text, pulling out all of the ngrams
13277
11:01:12,520 --> 11:01:14,480
for a particular value of n.
13278
11:01:14,480 --> 11:01:17,720
And then, by using Python's Counter class,
13279
11:01:17,720 --> 11:01:21,640
we're going to figure out what are the most common ngrams inside
13280
11:01:21,640 --> 11:01:24,280
of this entire corpus of text.
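As a rough sketch of what a script like ngrams.py does; the corpus here is one hard-coded sentence rather than the Sherlock Holmes files, and nltk.ngrams(tokens, n) would produce the same tuples as the plain zip used here:

```python
from collections import Counter

# Tiny stand-in corpus; the real script loads a directory of text files.
text = "how often have i said to you that how often have i said"
tokens = text.split()

# Pull out all trigrams: contiguous sequences of n = 3 words.
# nltk.ngrams(tokens, 3) yields the same tuples.
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Print the most common trigrams in the corpus.
for gram, count in trigrams.most_common(3):
    print(count, " ".join(gram))
```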
13281
11:01:24,280 --> 11:01:26,480
And we're going to need a data set in order to do this.
13282
11:01:26,480 --> 11:01:29,960
And I've prepared a data set of some of the stories of Sherlock Holmes.
13283
11:01:29,960 --> 11:01:32,000
So it's just a bunch of text files.
13284
11:01:32,000 --> 11:01:33,680
A lot of words for it to analyze.
13285
11:01:33,680 --> 11:01:38,040
And as a result, we'll get a sense for which sequences of two or three
13286
11:01:38,040 --> 11:01:42,440
words tend to be most common in natural language.
13287
11:01:42,440 --> 11:01:43,440
So let's give this a try.
13288
11:01:43,440 --> 11:01:45,360
We'll go into my ngrams directory.
13289
11:01:45,360 --> 11:01:47,440
And we'll run ngrams.py.
13290
11:01:47,440 --> 11:01:49,200
We'll try an n value of 2.
13291
11:01:49,200 --> 11:01:51,960
So we're looking for sequences of two words in a row.
13292
11:01:51,960 --> 11:01:55,760
And we'll use our corpus of stories from Sherlock Holmes.
13293
11:01:55,760 --> 11:01:59,680
And when we run this program, we get a list of the most common ngrams
13294
11:01:59,680 --> 11:02:02,440
where n is equal to 2, otherwise known as a bigram.
13295
11:02:02,440 --> 11:02:04,720
So the most common one is "of the."
13296
11:02:04,720 --> 11:02:07,440
That's a sequence of two words that appears quite frequently
13297
11:02:07,440 --> 11:02:08,720
in natural language.
13298
11:02:08,720 --> 11:02:09,720
Then "in the."
13299
11:02:09,720 --> 11:02:10,720
And "it was."
13300
11:02:10,720 --> 11:02:14,800
These are all common sequences of two words that appear in a row.
13301
11:02:14,800 --> 11:02:18,980
Let's instead now try running ngrams with n equal to 3.
13302
11:02:18,980 --> 11:02:21,760
Let's get all of the trigrams and see what we get.
13303
11:02:21,760 --> 11:02:25,360
And now we see the most common trigrams are "it was a."
13304
11:02:25,360 --> 11:02:26,520
"One of the."
13305
11:02:26,520 --> 11:02:27,760
"I think that."
13306
11:02:27,760 --> 11:02:32,040
These are all sequences of three words that appear quite frequently.
13307
11:02:32,040 --> 11:02:36,040
And we were able to do this essentially via a process known as tokenization.
13308
11:02:36,040 --> 11:02:39,440
Tokenization is the process of splitting a sequence of characters
13309
11:02:39,440 --> 11:02:40,280
into pieces.
13310
11:02:40,280 --> 11:02:44,400
In this case, we're splitting a long sequence of text into individual words
13311
11:02:44,400 --> 11:02:46,640
and then looking at sequences of those words
13312
11:02:46,640 --> 11:02:49,840
to get a sense for the structure of natural language.
13313
11:02:49,840 --> 11:02:52,400
So once we've done this, once we've done the tokenization,
13314
11:02:52,400 --> 11:02:55,520
once we've built up our corpus of ngrams, what
13315
11:02:55,520 --> 11:02:57,160
can we do with that information?
13316
11:02:57,160 --> 11:03:00,040
So the one thing that we might try is we could build a Markov chain,
13317
11:03:00,040 --> 11:03:02,680
which you might recall from when we talked about probability.
13318
11:03:02,680 --> 11:03:05,800
Recall that a Markov chain is some sequence of values
13319
11:03:05,800 --> 11:03:10,160
where we can predict one value based on the values that came before it.
13320
11:03:10,160 --> 11:03:14,760
And as a result, if we know all of the common ngrams in the English language,
13321
11:03:14,760 --> 11:03:18,480
what words tend to be associated with what other words in sequence,
13322
11:03:18,480 --> 11:03:23,520
we can use that to predict what word might come next in a sequence of words.
13323
11:03:23,520 --> 11:03:26,180
And so we could build a Markov chain for language
13324
11:03:26,180 --> 11:03:28,640
in order to try to generate natural language that
13325
11:03:28,640 --> 11:03:33,280
follows the same statistical patterns as some input data.
13326
11:03:33,280 --> 11:03:37,520
So let's take a look at that and build a Markov chain for natural language.
13327
11:03:37,520 --> 11:03:41,960
And as input, I'm going to use the works of William Shakespeare.
13328
11:03:41,960 --> 11:03:45,120
So here I have a file Shakespeare.txt, which
13329
11:03:45,120 --> 11:03:48,120
is just a bunch of the works of William Shakespeare.
13330
11:03:48,120 --> 11:03:51,440
It's a long text file, so plenty of data to analyze.
13331
11:03:51,440 --> 11:03:55,480
And here in generator.py, I'm using a third party Python library
13332
11:03:55,480 --> 11:03:57,520
in order to do this analysis.
13333
11:03:57,520 --> 11:04:00,240
We're going to read in the sample of text,
13334
11:04:00,240 --> 11:04:03,960
and then we're going to train a Markov model based on that text.
13335
11:04:03,960 --> 11:04:07,840
And then we're going to have the Markov chain generate some sentences.
13336
11:04:07,840 --> 11:04:11,520
We're going to generate a sentence that doesn't appear in the original text,
13337
11:04:11,520 --> 11:04:14,920
but that follows the same statistical patterns. It's generated
13338
11:04:14,920 --> 11:04:19,360
based on the n-grams, trying to predict what word is likely to come next
13339
11:04:19,360 --> 11:04:23,120
given those statistical patterns.
13340
11:04:23,120 --> 11:04:27,280
So we'll go ahead and go into our Markov directory,
13341
11:04:27,280 --> 11:04:31,200
run this generator with the works of William Shakespeare as input.
13342
11:04:31,200 --> 11:04:34,760
And what we're going to get are five new sentences, where
13343
11:04:34,760 --> 11:04:37,280
these sentences are not necessarily sentences
13344
11:04:37,280 --> 11:04:39,800
from the original input text itself, but sentences that
13345
11:04:39,800 --> 11:04:41,920
follow the same statistical patterns.
13346
11:04:41,920 --> 11:04:45,720
It's predicting what word is likely to come next based on the input data
13347
11:04:45,720 --> 11:04:47,720
that we've seen and the types of words that
13348
11:04:47,720 --> 11:04:50,200
tend to appear in sequence there too.
13349
11:04:50,200 --> 11:04:53,000
And so we're able to generate these sentences.
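The lecture's generator.py leans on a third-party library for this, but the underlying idea can be sketched in a few lines of plain Python; the training text below is a tiny stand-in for shakespeare.txt:

```python
import random
from collections import defaultdict

# Stand-in training text; the real demo reads shakespeare.txt.
text = "the cat sat on the mat and the cat ran"
words = text.split()

# For each word, record every word observed to follow it; sampling
# from this list reproduces the bigram frequencies of the input.
transitions = defaultdict(list)
for current, following in zip(words, words[1:]):
    transitions[current].append(following)

# Walk the chain: repeatedly pick a plausible next word at random.
random.seed(0)
word = "the"
sentence = [word]
for _ in range(8):
    if word not in transitions:
        break
    word = random.choice(transitions[word])
    sentence.append(word)
print(" ".join(sentence))
```

The output follows the statistical patterns of the input without necessarily being a sentence that appears in it.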
13350
11:04:53,000 --> 11:04:56,360
Of course, so far, there's no guarantee that any of the sentences that
13351
11:04:56,360 --> 11:04:59,040
are generated actually mean anything or make any sense.
13352
11:04:59,040 --> 11:05:01,880
They just happen to follow the statistical patterns
13353
11:05:01,880 --> 11:05:04,040
that our computer is already aware of.
13354
11:05:04,040 --> 11:05:06,520
So we'll return to this issue of how to generate text
13355
11:05:06,520 --> 11:05:09,840
in perhaps a more accurate or more meaningful way a little bit later.
13356
11:05:09,840 --> 11:05:12,800
So let's now turn our attention to a slightly different problem,
13357
11:05:12,800 --> 11:05:15,280
and that's the problem of text classification.
13358
11:05:15,280 --> 11:05:18,360
Text classification is the problem where we have some text
13359
11:05:18,360 --> 11:05:21,320
and we want to put that text into some kind of category.
13360
11:05:21,320 --> 11:05:24,240
We want to apply some sort of label to that text.
13361
11:05:24,240 --> 11:05:27,280
And this kind of problem shows up in a wide variety of places.
13362
11:05:27,280 --> 11:05:29,800
A commonplace might be your email inbox, for example.
13363
11:05:29,800 --> 11:05:31,920
You get an email and you want your computer
13364
11:05:31,920 --> 11:05:35,080
to be able to identify whether the email belongs in your inbox
13365
11:05:35,080 --> 11:05:37,320
or whether it should be filtered out into spam.
13366
11:05:37,320 --> 11:05:39,360
So we need to classify the text.
13367
11:05:39,360 --> 11:05:42,040
Is it a good email or is it spam?
13368
11:05:42,040 --> 11:05:44,760
Another common use case is sentiment analysis.
13369
11:05:44,760 --> 11:05:47,640
We might want to know whether the sentiment of some text
13370
11:05:47,640 --> 11:05:50,080
is positive or negative.
13371
11:05:50,080 --> 11:05:51,280
And so how might we do that?
13372
11:05:51,280 --> 11:05:53,920
This comes up in situations like product reviews,
13373
11:05:53,920 --> 11:05:57,120
where we might have a bunch of reviews for a product on some website.
13374
11:05:57,120 --> 11:05:58,840
"My grandson loved it. So much fun."
13375
11:05:58,840 --> 11:06:00,600
"Product broke after a few days."
13376
11:06:00,600 --> 11:06:03,800
"One of the best games I've played in a long time." And "kind of cheap
13377
11:06:03,800 --> 11:06:05,040
and flimsy, not worth it."
13378
11:06:05,040 --> 11:06:09,600
Here's some example sentences that you might see on a product review website.
13379
11:06:09,600 --> 11:06:12,680
And you and I could pretty easily look at this list of product reviews
13380
11:06:12,680 --> 11:06:15,960
and decide which ones are positive and which ones are negative.
13381
11:06:15,960 --> 11:06:17,880
We might say the first one and the third one,
13382
11:06:17,880 --> 11:06:20,160
those seem like positive sentiment messages.
13383
11:06:20,160 --> 11:06:24,160
But the second one and the fourth one seem like negative sentiment messages.
13384
11:06:24,160 --> 11:06:25,320
But how did we know that?
13385
11:06:25,320 --> 11:06:29,160
And how could we train a computer to be able to figure that out as well?
13386
11:06:29,160 --> 11:06:32,360
Well, you might have keyed in on particular words,
13387
11:06:32,360 --> 11:06:36,520
where those particular words tend to mean something positive or negative.
13388
11:06:36,520 --> 11:06:40,160
So you might have identified that words like loved and fun and best
13389
11:06:40,160 --> 11:06:42,880
tend to be associated with positive messages.
13390
11:06:42,880 --> 11:06:45,360
And words like broke and cheap and flimsy
13391
11:06:45,360 --> 11:06:48,000
tend to be associated with negative messages.
13392
11:06:48,000 --> 11:06:51,000
So if only we could train a computer to be able to learn
13393
11:06:51,000 --> 11:06:55,120
what words tend to be associated with positive versus negative messages,
13394
11:06:55,120 --> 11:06:59,000
then maybe we could train a computer to do this kind of sentiment analysis
13395
11:06:59,000 --> 11:07:00,160
as well.
13396
11:07:00,160 --> 11:07:01,760
So we're going to try to do just that.
13397
11:07:01,760 --> 11:07:05,120
We're going to use a model known as the bag of words model, which
13398
11:07:05,120 --> 11:07:09,720
is a model that represents text as just an unordered collection of words.
13399
11:07:09,720 --> 11:07:11,220
For the purpose of this model, we're not
13400
11:07:11,220 --> 11:07:13,760
going to worry about the sequence and the ordering of the words,
13401
11:07:13,760 --> 11:07:15,600
which word came first, second, or third.
13402
11:07:15,600 --> 11:07:18,440
We're just going to treat the text as a collection of words
13403
11:07:18,440 --> 11:07:19,680
in no particular order.
13404
11:07:19,680 --> 11:07:21,360
And we're losing information there, right?
13405
11:07:21,360 --> 11:07:22,880
The order of words is important.
13406
11:07:22,880 --> 11:07:24,880
And we'll come back to that a little bit later.
13407
11:07:24,880 --> 11:07:26,680
But for now, to simplify our model, it'll
13408
11:07:26,680 --> 11:07:29,440
help us tremendously just to think about text
13409
11:07:29,440 --> 11:07:32,320
as some unordered collection of words.
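A quick illustration of the bag-of-words idea, using Python's Counter as the "bag": any reordering of the same words produces the identical representation.

```python
from collections import Counter

# Represent text as an unordered collection (multiset) of words.
review = "my grandson loved it"
bag = Counter(review.split())

# Word order is discarded, so a shuffled version of the same words
# yields exactly the same bag.
shuffled = Counter("loved it my grandson".split())
print(bag == shuffled)  # True
```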
13410
11:07:32,320 --> 11:07:35,120
And in particular, we're going to use the bag of words model
13411
11:07:35,120 --> 11:07:38,240
to build something known as a naive Bayes classifier.
13412
11:07:38,240 --> 11:07:40,240
So what is a naive Bayes classifier?
13413
11:07:40,240 --> 11:07:43,960
Well, it's a tool that's going to allow us to classify text based on Bayes
13414
11:07:43,960 --> 11:07:47,200
rule, again, which you might remember from when we talked about probability.
13415
11:07:47,200 --> 11:07:51,520
Bayes rule says that the probability of B given A
13416
11:07:51,520 --> 11:07:54,920
is equal to the probability of A given B multiplied
13417
11:07:54,920 --> 11:07:59,480
by the probability of B divided by the probability of A.
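Written symbolically, the rule just stated is:

```latex
P(b \mid a) = \frac{P(a \mid b)\, P(b)}{P(a)}
```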
13418
11:07:59,480 --> 11:08:03,560
So how are we going to use this rule to be able to analyze text?
13419
11:08:03,560 --> 11:08:04,920
Well, what are we interested in?
13420
11:08:04,920 --> 11:08:07,480
We're interested in the probability that a message has
13421
11:08:07,480 --> 11:08:10,360
a positive sentiment and the probability that a message has
13422
11:08:10,360 --> 11:08:12,920
a negative sentiment, which I'm here for simplicity
13423
11:08:12,920 --> 11:08:16,120
going to represent just with these emoji, happy face and frown face,
13424
11:08:16,120 --> 11:08:18,480
as positive and negative sentiment.
13425
11:08:18,480 --> 11:08:22,320
And so if I had a review, something like "my grandson loved it,"
13426
11:08:22,320 --> 11:08:25,460
then what I'm interested in is not just the probability
13427
11:08:25,460 --> 11:08:29,600
that a message has positive sentiment, but the conditional probability
13428
11:08:29,600 --> 11:08:32,120
that a message has positive sentiment given
13429
11:08:32,120 --> 11:08:35,120
that this is the message "my grandson loved it."
13430
11:08:35,120 --> 11:08:38,360
But how do I go about calculating this value, the probability
13431
11:08:38,360 --> 11:08:42,880
that the message is positive given that the review is this sequence of words?
13432
11:08:42,880 --> 11:08:45,280
Well, here's where the bag of words model comes in.
13433
11:08:45,280 --> 11:08:49,680
Rather than treat this review as a sequence of words in order,
13434
11:08:49,680 --> 11:08:52,840
we're just going to treat it as an unordered collection of words.
13435
11:08:52,840 --> 11:08:56,600
We're going to try to calculate the probability that the review is positive
13436
11:08:56,600 --> 11:08:59,800
given that all of these words, my grandson loved it,
13437
11:08:59,800 --> 11:09:02,400
are in the review in no particular order, just
13438
11:09:02,400 --> 11:09:05,120
this unordered collection of words.
13439
11:09:05,120 --> 11:09:09,920
And this is a conditional probability, which we can then apply Bayes rule
13440
11:09:09,920 --> 11:09:11,680
to try to make sense of.
13441
11:09:11,680 --> 11:09:16,080
And so according to Bayes rule, this conditional probability is equal to what?
13442
11:09:16,080 --> 11:09:19,480
It's equal to the probability that all of these four words
13443
11:09:19,480 --> 11:09:23,180
are in the review given that the review is positive multiplied
13444
11:09:23,180 --> 11:09:27,280
by the probability that the review is positive divided by the probability
13445
11:09:27,280 --> 11:09:30,680
that all of these words happen to be in the review.
13446
11:09:30,680 --> 11:09:33,880
So this is the value now that we're going to try to calculate.
13447
11:09:33,880 --> 11:09:36,440
Now, one thing you might notice is that the denominator here,
13448
11:09:36,440 --> 11:09:40,000
the probability that all of these words appear in the review,
13449
11:09:40,000 --> 11:09:42,280
doesn't actually depend on whether or not
13450
11:09:42,280 --> 11:09:45,680
we're looking at the positive sentiment or negative sentiment case.
13451
11:09:45,680 --> 11:09:47,640
So we can actually get rid of this denominator.
13452
11:09:47,640 --> 11:09:48,880
We don't need to calculate it.
13453
11:09:48,880 --> 11:09:53,280
We can just say that this probability is proportional to the numerator.
13454
11:09:53,280 --> 11:09:56,140
And then at the end, we're going to need to normalize the probability
13455
11:09:56,140 --> 11:10:00,840
distribution to make sure that all of the values sum up to the value 1.
13456
11:10:00,840 --> 11:10:03,480
So now, how do we calculate this value?
13457
11:10:03,480 --> 11:10:08,120
Well, this is the probability of all of these words given positive times
13458
11:10:08,120 --> 11:10:09,920
the probability of positive.
13459
11:10:09,920 --> 11:10:12,640
And that, by the definition of joint probability,
13460
11:10:12,640 --> 11:10:15,680
is just one big joint probability, the probability
13461
11:10:15,680 --> 11:10:18,840
that all of these things are the case, that it's a positive review,
13462
11:10:18,840 --> 11:10:22,760
and that all four of these words are in the review.
13463
11:10:22,760 --> 11:10:26,720
But still, it's not entirely obvious how we calculate that value.
13464
11:10:26,720 --> 11:10:28,960
And here is where we need to make one more assumption.
13465
11:10:28,960 --> 11:10:32,240
And this is where the naive part of naive Bayes comes in.
13466
11:10:32,240 --> 11:10:34,880
We're going to make the assumption that all of the words
13467
11:10:34,880 --> 11:10:36,920
are independent of each other.
13468
11:10:36,920 --> 11:10:40,920
And by that, I mean that if the word grandson is in the review,
13469
11:10:40,920 --> 11:10:43,880
that doesn't change the probability that the word loved is in the review
13470
11:10:43,880 --> 11:10:46,320
or that the word it is in the review, for example.
13471
11:10:46,320 --> 11:10:48,840
And in practice, this assumption might not be true.
13472
11:10:48,840 --> 11:10:51,320
It's almost certainly the case that the probabilities of words
13473
11:10:51,320 --> 11:10:52,840
do depend on each other.
13474
11:10:52,840 --> 11:10:56,040
But it's going to simplify our analysis and still give us reasonably good
13475
11:10:56,040 --> 11:10:59,760
results just to assume that the words are independent of each other
13476
11:10:59,760 --> 11:11:03,880
and they only depend on whether it's positive or negative.
13477
11:11:03,880 --> 11:11:06,400
You might, for example, expect the word loved
13478
11:11:06,400 --> 11:11:10,480
to appear more often in a positive review than in a negative review.
13479
11:11:10,480 --> 11:11:11,640
So what does that mean?
13480
11:11:11,640 --> 11:11:13,600
Well, if we make this assumption, then we
13481
11:11:13,600 --> 11:11:16,840
can say that this value, the probability we're interested in,
13482
11:11:16,840 --> 11:11:22,200
is not directly proportional to, but it's naively proportional to this value.
13483
11:11:22,200 --> 11:11:26,280
The probability that the review is positive times the probability
13484
11:11:26,280 --> 11:11:29,120
that my is in the review, given that it's positive,
13485
11:11:29,120 --> 11:11:31,640
times the probability that grandson is in the review,
13486
11:11:31,640 --> 11:11:34,640
given that it's positive, and so on for the other two words that
13487
11:11:34,640 --> 11:11:36,320
happen to be in this review.
13488
11:11:36,320 --> 11:11:39,080
And now this value, which looks a little more complex,
13489
11:11:39,080 --> 11:11:42,720
is actually a value that we can calculate pretty easily.
13490
11:11:42,720 --> 11:11:46,320
So how are we going to estimate the probability that the review is positive?
13491
11:11:46,320 --> 11:11:50,360
Well, if we have some training data, some example data of example reviews
13492
11:11:50,360 --> 11:11:53,240
where each one has already been labeled as positive or negative,
13493
11:11:53,240 --> 11:11:56,280
then we can estimate the probability that a review is positive
13494
11:11:56,280 --> 11:11:58,760
just by counting the number of positive samples
13495
11:11:58,760 --> 11:12:02,520
and dividing by the total number of samples that we have in our training
13496
11:12:02,520 --> 11:12:03,600
data.
13497
11:12:03,600 --> 11:12:06,800
And for the conditional probabilities, the probability of loved,
13498
11:12:06,800 --> 11:12:08,760
given that it's positive, well, that's going
13499
11:12:08,760 --> 11:12:11,760
to be the number of positive samples with loved in it
13500
11:12:11,760 --> 11:12:15,360
divided by the total number of positive samples.
13501
11:12:15,360 --> 11:12:17,880
So let's take a look at an actual example to see how
13502
11:12:17,880 --> 11:12:19,760
we could try to calculate these values.
13503
11:12:19,760 --> 11:12:21,840
Here I've put together some sample data.
13504
11:12:21,840 --> 11:12:24,960
The way to interpret the sample data is that based on the training data,
13505
11:12:24,960 --> 11:12:29,200
49% of the reviews are positive, 51% are negative.
13506
11:12:29,200 --> 11:12:33,480
And then over here in this table, we have some conditional probabilities.
13507
11:12:33,480 --> 11:12:35,800
For example, if the review is positive,
13508
11:06:35,800 --> 11:06:38,720
then there is a 30% chance that "my" appears in it.
13509
11:06:38,720 --> 11:06:42,880
And if the review is negative, there is a 20% chance that "my" appears in it.
13510
11:12:42,880 --> 11:12:45,840
And based on our training data among the positive reviews,
13511
11:12:45,840 --> 11:12:48,520
1% of them contain the word grandson.
13512
11:12:48,520 --> 11:12:52,360
And among the negative reviews, 2% contain the word grandson.
13513
11:12:52,360 --> 11:12:56,400
So using this data, let's try to calculate this value,
13514
11:12:56,400 --> 11:12:57,880
the value we're interested in.
13515
11:12:57,880 --> 11:13:02,040
And to do that, we'll need to multiply all of these values together.
13516
11:13:02,040 --> 11:13:04,280
The probability of positive, and then all
13517
11:13:04,280 --> 11:13:06,960
of these positive conditional probabilities.
13518
11:13:06,960 --> 11:13:09,400
And when we do that, we get some value.
13519
11:13:09,400 --> 11:13:12,160
And then we can do the same thing for the negative case.
13520
11:13:12,160 --> 11:13:15,520
We're going to do the same thing, take the probability that it's negative,
13521
11:13:15,520 --> 11:13:18,480
multiply it by all of these conditional probabilities,
13522
11:13:18,480 --> 11:13:20,680
and we're going to get some other value.
13523
11:13:20,680 --> 11:13:22,400
And now these values don't sum to one.
13524
11:13:22,400 --> 11:13:24,520
They're not a probability distribution yet.
13525
11:13:24,520 --> 11:13:27,320
But I can normalize them and get some values.
13526
11:13:27,320 --> 11:13:31,320
And that tells me what we're going to predict for "my grandson loved it."
13527
11:13:31,320 --> 11:13:35,400
We think there's a 68% chance, probability 0.68,
13528
11:13:35,400 --> 11:13:40,080
that that is a positive sentiment review, and 0.32 probability
13529
11:13:40,080 --> 11:13:42,160
that it's a negative review.
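The arithmetic can be reproduced directly. P(positive) = 0.49, P("my" | positive) = 0.30, and P("grandson" | positive) = 0.01 come from the figures above; the conditional probabilities for "loved" and "it" are assumed illustrative values chosen to be consistent with the 0.68 result:

```python
# Numerator for the positive case: P(positive) times each word's
# conditional probability given positive. The values for "loved"
# (0.32) and "it" (0.30) are assumed for illustration.
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30

# Same product for the negative case ("loved": 0.08, "it": 0.40 assumed).
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40

# Normalize so the two values form a probability distribution.
total = p_positive + p_negative
print(round(p_positive / total, 2))  # 0.68
print(round(p_negative / total, 2))  # 0.32
```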
13530
11:13:42,160 --> 11:13:44,480
So what problems might we run into here?
13531
11:13:44,480 --> 11:13:47,920
What could potentially go wrong when doing this kind of analysis
13532
11:13:47,920 --> 11:13:51,720
in order to analyze whether text has a positive or negative sentiment?
13533
11:13:51,720 --> 11:13:53,800
Well, a couple of problems might arise.
13534
11:13:53,800 --> 11:13:57,480
One problem might be, what if the word grandson never
13535
11:13:57,480 --> 11:14:00,960
appears in any of the positive reviews?
13536
11:14:00,960 --> 11:14:03,720
If that were the case, then when we try to calculate the value,
13537
11:14:03,720 --> 11:14:06,360
the probability that we think the review is positive,
13538
11:14:06,360 --> 11:14:08,600
we're going to multiply all these values together,
13539
11:14:08,600 --> 11:14:11,120
and we're just going to get 0 for the positive case,
13540
11:14:11,120 --> 11:14:14,520
because we're ultimately going to multiply by that 0 value.
13541
11:14:14,520 --> 11:14:17,440
And so we're going to say that we think there is no chance
13542
11:14:17,440 --> 11:14:20,560
that the review is positive because it contains the word grandson.
13543
11:14:20,560 --> 11:14:23,040
And in our training data, we've never seen the word grandson
13544
11:14:23,040 --> 11:14:27,040
appear in a positive sentiment message before.
13545
11:14:27,040 --> 11:14:29,360
And that's probably not the right analysis,
13546
11:14:29,360 --> 11:14:32,080
because in cases of rare words, it might be the case
13547
11:14:32,080 --> 11:14:34,280
that in nowhere in our training data did we ever
13548
11:14:34,280 --> 11:14:38,360
see the word grandson appear in a message that has positive sentiment.
13549
11:14:38,360 --> 11:14:40,320
So what can we do to solve this problem?
13550
11:14:40,320 --> 11:14:43,160
Well, one thing we'll often do is some kind of additive smoothing,
13551
11:14:43,160 --> 11:14:46,640
where we add some value alpha to each value in our distribution
13552
11:14:46,640 --> 11:14:48,480
just to smooth out the data a little bit.
13553
11:14:48,480 --> 11:14:50,920
And a common form of this is Laplace smoothing,
13554
11:14:50,920 --> 11:14:53,680
where we add 1 to each value in our distribution.
13555
11:14:53,680 --> 11:14:56,880
In essence, we pretend we've seen each value one more time
13556
11:14:56,880 --> 11:14:58,000
than we actually have.
13557
11:14:58,000 --> 11:15:01,160
So if we've never seen the word grandson for a positive review,
13558
11:15:01,160 --> 11:15:02,400
we pretend we've seen it once.
13559
11:15:02,400 --> 11:15:04,880
If we've seen it once, we pretend we've seen it twice,
13560
11:15:04,880 --> 11:15:09,600
just to avoid the possibility that we might multiply by 0 and as a result,
13561
11:15:09,600 --> 11:15:12,560
get some results we don't want in our analysis.
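A sketch of the smoothing step, with hypothetical counts: suppose there were 100 positive training samples and zero of them contained "grandson."

```python
# Hypothetical counts from training data.
positive_samples = 100
positive_with_grandson = 0  # "grandson" never seen in a positive review

# Without smoothing, the estimate is 0, which zeroes out the entire
# product when the conditional probabilities are multiplied together.
unsmoothed = positive_with_grandson / positive_samples

# Laplace smoothing: pretend the word was seen once more than it was.
# The denominator grows by 2, one pretend sample for each outcome
# (word present, word absent).
smoothed = (positive_with_grandson + 1) / (positive_samples + 2)

print(unsmoothed)  # 0.0
print(smoothed)    # small but nonzero
```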
13562
11:15:12,560 --> 11:15:14,520
So let's see what this looks like in practice.
13563
11:15:14,520 --> 11:15:18,360
Let's try to do some naive Bayes classification in order
13564
11:15:18,360 --> 11:15:22,360
to classify text as either positive or negative.
13565
11:15:22,360 --> 11:15:25,480
We'll take a look at sentiment.py.
13566
11:15:25,480 --> 11:15:28,960
And what this is going to do is load some sample data into memory,
13567
11:15:28,960 --> 11:15:32,440
some examples of positive reviews and negative reviews.
13568
11:15:32,440 --> 11:15:35,980
And then we're going to train a naive Bayes classifier
13569
11:15:35,980 --> 11:15:39,080
on all of this training data, training data that
13570
11:15:39,080 --> 11:15:42,260
includes all of the words we see in positive reviews
13571
11:15:42,260 --> 11:15:44,920
and all of the words we see in negative reviews.
13572
11:15:44,920 --> 11:15:48,160
And then we're going to try to classify some input.
13573
11:15:48,160 --> 11:15:50,840
And so we're going to do this based on a corpus of data.
13574
11:15:50,840 --> 11:15:52,520
I have some example positive reviews.
13575
11:15:52,520 --> 11:15:53,840
Here are some positive reviews.
13576
11:15:53,840 --> 11:15:56,080
"It was great," "so much fun," for example.
13577
11:15:56,080 --> 11:15:59,060
And then some negative reviews: "not worth it," "kind of cheap."
13578
11:15:59,060 --> 11:16:02,080
These are some examples of negative reviews.
13579
11:16:02,080 --> 11:16:04,640
So now let's try to run this classifier and see
13580
11:16:04,640 --> 11:16:09,400
how it would classify particular text as either positive or negative.
13581
11:16:09,400 --> 11:16:14,360
We'll go ahead and run our sentiment analysis on this corpus.
13582
11:16:14,360 --> 11:16:16,080
And we need to provide it with a review.
13583
11:16:16,080 --> 11:16:19,600
So I'll say something like, "I enjoyed it."
13584
11:16:19,600 --> 11:16:23,880
And we see that the classifier says there is about a 0.92 probability
13585
11:16:23,880 --> 11:16:27,120
that we think that this particular review is positive.
13586
11:16:27,120 --> 11:16:28,520
Let's try something negative.
13587
11:16:28,520 --> 11:16:31,720
We'll try "kind of overpriced."
13588
11:16:31,720 --> 11:16:34,400
And we see that there is a 0.96 probability
13589
11:16:34,400 --> 11:16:37,280
now that we think that this particular review is negative.
13590
11:16:37,280 --> 11:16:40,600
And so our naive Bayes classifier has learned what kinds of words
13591
11:16:40,600 --> 11:16:43,800
tend to appear in positive reviews and what kinds of words
13592
11:16:43,800 --> 11:16:45,480
tend to appear in negative reviews.
13593
11:16:45,480 --> 11:16:49,100
And as a result of that, we've been able to design a classifier that
13594
11:16:49,100 --> 11:16:54,240
can predict whether a particular review is positive or negative.
13595
11:16:54,240 --> 11:16:56,800
And so this definitely is a useful tool that we can use
13596
11:16:56,800 --> 11:16:58,400
to try and make some predictions.
13597
11:16:58,400 --> 11:17:01,000
But we had to make some assumptions in order to get there.
13598
11:17:01,000 --> 11:17:04,160
So what if we want to now try to build some more sophisticated models,
13599
11:17:04,160 --> 11:17:07,100
use some tools from machine learning to try and take
13600
11:17:07,100 --> 11:17:09,560
better advantage of language data to be able to draw
13601
11:17:09,560 --> 11:17:12,320
more accurate conclusions and solve new kinds of tasks
13602
11:17:12,320 --> 11:17:13,840
and new kinds of problems?
13603
11:17:13,840 --> 11:17:17,280
Well, we've seen a couple of times now that when we want to take some input
13604
11:17:17,280 --> 11:17:19,480
and put it in a form that the computer is
13605
11:17:19,480 --> 11:17:22,760
going to be able to make sense of, it can be helpful to take that data
13606
11:17:22,760 --> 11:17:25,040
and turn it into numbers, ultimately.
13607
11:17:25,040 --> 11:17:27,200
And so what we might want to try to do is come up
13608
11:17:27,200 --> 11:17:30,860
with some word representation, some way to take a word
13609
11:17:30,860 --> 11:17:33,480
and translate its meaning into numbers.
13610
11:17:33,480 --> 11:17:35,940
Because, for example, if we wanted to use a neural network
13611
11:17:35,940 --> 11:17:39,080
to be able to process language, give our language to a neural network
13612
11:17:39,080 --> 11:17:42,400
and have it make some predictions or perform some analysis there,
13613
11:17:42,400 --> 11:17:45,920
a neural network takes as its input and produces as its output
13614
11:17:45,920 --> 11:17:48,520
a vector of values, a vector of numbers.
13615
11:17:48,520 --> 11:17:51,280
And so what we might want to do is take our data
13616
11:17:51,280 --> 11:17:54,800
and somehow take words and convert them into some kind
13617
11:17:54,800 --> 11:17:56,760
of numeric representation.
13618
11:17:56,760 --> 11:17:57,880
So how might we do that?
13619
11:17:57,880 --> 11:18:01,600
How might we take words and turn them into numbers?
13620
11:18:01,600 --> 11:18:03,440
Let's take a look at an example.
13621
11:18:03,440 --> 11:18:05,680
Here's a sentence, he wrote a book.
13622
11:18:05,680 --> 11:18:08,080
And let's say I wanted to take each of those words
13623
11:18:08,080 --> 11:18:10,200
and turn it into a vector of values.
13624
11:18:10,200 --> 11:18:11,640
Here's one way I might do that.
13625
11:18:11,640 --> 11:18:15,720
We'll say he is going to be a vector that has a 1 in the first position
13626
11:18:15,720 --> 11:18:17,720
and the rest of the values are 0.
13627
11:18:17,720 --> 11:18:20,680
Wrote will have a 1 in the second position and the rest of the values
13628
11:18:20,680 --> 11:18:21,560
are 0.
13629
11:18:21,560 --> 11:18:24,960
A has a 1 in the third position with the rest of the values 0.
13630
11:18:24,960 --> 11:18:28,760
And book has a 1 in the fourth position with the rest of the values 0.
13631
11:18:28,760 --> 11:18:33,360
So each of these words now has a distinct vector representation.
13632
11:18:33,360 --> 11:18:36,760
And this is what we often call a one-hot representation,
13633
11:18:36,760 --> 11:18:41,400
a representation of the meaning of a word as a vector with a single 1
13634
11:18:41,400 --> 11:18:43,920
and all of the rest of the values are 0.
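The one-hot scheme just described can be sketched in a few lines of Python. This is an illustration, not the course's own code:

```python
# One-hot encoding for the sentence "he wrote a book":
# each word gets a vector with a single 1 in its own position.
sentence = ["he", "wrote", "a", "book"]
vocabulary = list(dict.fromkeys(sentence))  # unique words, in order

def one_hot(word):
    return [1 if w == word else 0 for w in vocabulary]

print(one_hot("he"))    # [1, 0, 0, 0]
print(one_hot("book"))  # [0, 0, 0, 1]
```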
13635
11:18:43,920 --> 11:18:47,480
And so when doing this, we now have a numeric representation for every word
13636
11:18:47,480 --> 11:18:50,120
and we could pass in those vector representations
13637
11:18:50,120 --> 11:18:52,520
into a neural network or other models that
13638
11:18:52,520 --> 11:18:55,840
require some kind of numeric data as input.
13639
11:18:55,840 --> 11:18:59,080
But this one-hot representation actually has a couple of problems
13640
11:18:59,080 --> 11:19:01,360
and it's not ideal for a few reasons.
13641
11:19:01,360 --> 11:19:03,960
One reason is, here we're just looking at four words.
13642
11:19:03,960 --> 11:19:07,720
But if you imagine a vocabulary of thousands of words or more,
13643
11:19:07,720 --> 11:19:09,720
these vectors are going to get quite long in order
13644
11:19:09,720 --> 11:19:14,160
to have a distinct vector for every possible word in a vocabulary.
13645
11:19:14,160 --> 11:19:16,280
And as a result of that, these longer vectors
13646
11:19:16,280 --> 11:19:19,280
are going to be more difficult to deal with, more difficult to train,
13647
11:19:19,280 --> 11:19:19,760
and so forth.
13648
11:19:19,760 --> 11:19:21,720
And so that might be a problem.
13649
11:19:21,720 --> 11:19:24,280
Another problem is a little bit more subtle.
13650
11:19:24,280 --> 11:19:27,040
If we want to represent a word as a vector,
13651
11:19:27,040 --> 11:19:29,880
and in particular the meaning of a word as a vector,
13652
11:19:29,880 --> 11:19:33,960
then ideally it should be the case that words that have similar meanings
13653
11:19:33,960 --> 11:19:36,880
should also have similar vector representations,
13654
11:19:36,880 --> 11:19:40,800
so that they're close together inside a vector space.
13655
11:19:40,800 --> 11:19:44,400
But that's not really going to be the case with these one-hot representations,
13656
11:19:44,400 --> 11:19:46,840
because if we take some similar words, say the word
13657
11:19:46,840 --> 11:19:50,240
wrote and the word authored, which means similar things,
13658
11:19:50,240 --> 11:19:54,040
they have entirely different vector representations.
13659
11:19:54,040 --> 11:19:57,880
Likewise, book and novel, those two words mean somewhat similar things,
13660
11:19:57,880 --> 11:20:00,840
but they have entirely different vector representations
13661
11:20:00,840 --> 11:20:04,120
because they each have a one in some different position.
13662
11:20:04,120 --> 11:20:05,960
And so that's not ideal either.
13663
11:20:05,960 --> 11:20:08,080
So what we might be interested in instead
13664
11:20:08,080 --> 11:20:10,640
is some kind of distributed representation.
13665
11:20:10,640 --> 11:20:13,320
A distributed representation is the representation
13666
11:20:13,320 --> 11:20:17,200
of the meaning of a word distributed across multiple values,
13667
11:20:17,200 --> 11:20:20,720
instead of just being one-hot with a one in one position.
13668
11:20:20,720 --> 11:20:25,000
Here is what a distributed representation of words might be.
13669
11:20:25,000 --> 11:20:28,360
Each word is associated with some vector of values,
13670
11:20:28,360 --> 11:20:31,080
with the meaning distributed across multiple values,
13671
11:20:31,080 --> 11:20:34,320
ideally in such a way that similar words have
13672
11:20:34,320 --> 11:20:37,080
a similar vector representation.
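A distributed representation might look like the following sketch, where each word's meaning is spread across several values. The vectors here are hypothetical, chosen so that similar words come out numerically close:

```python
# Hypothetical distributed representations: similar words
# (wrote/authored, book/novel) get similar vectors.
embeddings = {
    "wrote":    [0.5, 0.2, 0.9],
    "authored": [0.6, 0.2, 0.8],
    "book":     [0.1, 0.9, 0.3],
    "novel":    [0.2, 0.8, 0.3],
}

def sq_dist(v1, v2):
    # Squared Euclidean distance between two vectors
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

# Unlike one-hot vectors, similar meanings are now nearby in space
print(sq_dist(embeddings["wrote"], embeddings["authored"]))  # small
print(sq_dist(embeddings["wrote"], embeddings["book"]))      # large
```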
13673
11:20:37,080 --> 11:20:39,080
But how are we going to come up with those values?
13674
11:20:39,080 --> 11:20:40,600
Where do those values come from?
13675
11:20:40,600 --> 11:20:45,800
How can we define the meaning of a word in this distributed sequence of numbers?
13676
11:20:45,800 --> 11:20:47,840
Well, to do that, we're going to draw inspiration
13677
11:20:47,840 --> 11:20:50,880
from a quote from British linguist J.R. Firth, who said,
13678
11:20:50,880 --> 11:20:54,200
you shall know a word by the company it keeps.
13679
11:20:54,200 --> 11:20:56,920
In other words, we're going to define the meaning of a word
13680
11:20:56,920 --> 11:21:01,160
based on the words that appear around it, the context words around it.
13681
11:21:01,160 --> 11:21:05,200
Take, for example, this context, for blank he ate.
13682
11:21:05,200 --> 11:21:08,760
You might wonder, what words could reasonably fill in that blank?
13683
11:21:08,760 --> 11:21:11,920
Well, it might be words like breakfast or lunch or dinner.
13684
11:21:11,920 --> 11:21:14,520
All of those could reasonably fill in that blank.
13685
11:21:14,520 --> 11:21:17,920
And so what we're going to say is because the words breakfast and lunch
13686
11:21:17,920 --> 11:21:23,240
and dinner appear in a similar context, that they must have a similar meaning.
13687
11:21:23,240 --> 11:21:26,400
And that's something our computer could understand and try to learn.
13688
11:21:26,400 --> 11:21:28,880
A computer could look at a big corpus of text,
13689
11:21:28,880 --> 11:21:32,360
look at what words tend to appear in similar context to each other,
13690
11:21:32,360 --> 11:21:35,880
and use that to identify which words have a similar meaning
13691
11:21:35,880 --> 11:21:40,240
and should therefore appear close to each other inside a vector space.
13692
11:21:40,240 --> 11:21:44,200
And so one common model for doing this is known as the Word2Vec model.
13693
11:21:44,200 --> 11:21:48,640
It's a model for generating word vectors, a vector representation for every word
13694
11:21:48,640 --> 11:21:52,960
by looking at data and looking at the context in which a word appears.
13695
11:21:52,960 --> 11:21:54,240
The idea is going to be this.
13696
11:21:54,240 --> 11:21:58,680
If you start out with all of the words just in some random position in space
13697
11:21:58,680 --> 11:22:02,640
and train it on some training data, what the Word2Vec model will do
13698
11:22:02,640 --> 11:22:05,880
is start to learn what words appear in similar contexts.
13699
11:22:05,880 --> 11:22:08,720
And it will move these vectors around in such a way
13700
11:22:08,720 --> 11:22:12,600
that hopefully words with similar meanings, breakfast, lunch, and dinner,
13701
11:22:12,600 --> 11:22:17,040
book, memoir, novel, will appear near to each other
13702
11:22:17,040 --> 11:22:19,080
as vectors as well.
13703
11:22:19,080 --> 11:22:21,280
So let's now take a look at what Word2Vec
13704
11:22:21,280 --> 11:22:24,880
might look like in practice when implemented in code.
13705
11:22:24,880 --> 11:22:29,560
What I have here inside of words.txt is a pre-trained model
13706
11:22:29,560 --> 11:22:32,640
where each of these words has some vector representation
13707
11:22:32,640 --> 11:22:33,960
trained by Word2Vec.
13708
11:22:33,960 --> 11:22:38,600
Each of these words has some sequence of values representing its meaning,
13709
11:22:38,600 --> 11:22:43,600
hopefully in such a way that similar words are represented by similar vectors.
13710
11:22:43,600 --> 11:22:47,280
I also have this file vectors.py, which is going to open up the words
13711
11:22:47,280 --> 11:22:48,800
and form them into a dictionary.
13712
11:22:48,800 --> 11:22:51,400
And we also define some useful functions like distance
13713
11:22:51,400 --> 11:22:55,160
to get the distance between two word vectors and closest words
13714
11:22:55,160 --> 11:23:00,200
to find which words are nearby in terms of having close vectors to each other.
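The course's vectors.py is not shown here, but the two helpers it describes could plausibly look like this sketch. Cosine distance is a common choice (0 means pointing the same way, larger values mean further apart), and the toy embeddings are invented stand-ins for the real pre-trained vectors:

```python
import math

def distance(v1, v2):
    # Cosine distance between two word vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    return 1 - dot / (math.sqrt(sum(a * a for a in v1)) *
                      math.sqrt(sum(b * b for b in v2)))

def closest_words(words, target, n=10):
    # Sort the whole vocabulary by distance to the target word's vector
    return sorted(words, key=lambda w: distance(words[w], words[target]))[:n]

# Toy embeddings standing in for the real pre-trained vectors
words = {
    "book":      [0.9, 0.1, 0.1],
    "novel":     [0.8, 0.2, 0.1],
    "breakfast": [0.1, 0.9, 0.4],
    "lunch":     [0.2, 0.8, 0.5],
}
print(closest_words(words, "book", n=2))  # ['book', 'novel']
```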
13715
11:23:00,200 --> 11:23:02,360
And so let's give this a try.
13716
11:23:02,360 --> 11:23:05,760
We'll go ahead and open a Python interpreter.
13717
11:23:05,760 --> 11:23:10,160
And I'm going to import these vectors.
13718
11:23:10,160 --> 11:23:13,360
And we might say, all right, what is the vector representation
13719
11:23:13,360 --> 11:23:15,680
of the word book?
13720
11:23:15,680 --> 11:23:19,520
And we get this big long vector that represents the word book
13721
11:23:19,520 --> 11:23:21,120
as a sequence of values.
13722
11:23:21,120 --> 11:23:24,320
And this sequence of values by itself is not all that meaningful.
13723
11:23:24,320 --> 11:23:27,440
But it is meaningful in the context of comparing it
13724
11:23:27,440 --> 11:23:30,400
to other vectors for other words.
13725
11:23:30,400 --> 11:23:32,280
So we could use this distance function, which
13726
11:23:32,280 --> 11:23:35,520
is going to get us the distance between two word vectors.
13727
11:23:35,520 --> 11:23:37,880
And we might say, what is the distance between the vector
13728
11:23:37,880 --> 11:23:42,200
representation for the word book and the vector representation
13729
11:23:42,200 --> 11:23:44,320
for the word novel?
13730
11:23:44,320 --> 11:23:46,280
And we see that it's 0.34.
13731
11:23:46,280 --> 11:23:49,360
You can kind of interpret 0 as being really close together and 1
13732
11:23:49,360 --> 11:23:51,040
being very far apart.
13733
11:23:51,040 --> 11:23:55,840
And so now, what is the distance between book and, let's say, breakfast?
13734
11:23:55,840 --> 11:23:58,560
Well, book and breakfast are more different from each other
13735
11:23:58,560 --> 11:23:59,840
than book and novel are.
13736
11:23:59,840 --> 11:24:02,600
So I would hopefully expect the distance to be larger.
13737
11:24:02,600 --> 11:24:05,600
And in fact, it is 0.64 approximately.
13738
11:24:05,600 --> 11:24:08,440
These two words are further away from each other.
13739
11:24:08,440 --> 11:24:13,600
And what about now the distance between, let's say, lunch and breakfast?
13740
11:24:13,600 --> 11:24:15,040
Well, that's about 0.2.
13741
11:24:15,040 --> 11:24:16,400
Those are even closer together.
13742
11:24:16,400 --> 11:24:19,920
They have a meaning that is closer to each other.
13743
11:24:19,920 --> 11:24:24,400
Another interesting thing we might do is calculate the closest words.
13744
11:24:24,400 --> 11:24:28,200
We might say, what are the closest words, according to Word2Vec,
13745
11:24:28,200 --> 11:24:29,840
to the word book?
13746
11:24:29,840 --> 11:24:32,120
And let's say, let's get the 10 closest words.
13747
11:24:32,120 --> 11:24:35,960
What are the 10 closest vectors to the vector representation
13748
11:24:35,960 --> 11:24:37,680
for the word book?
13749
11:24:37,680 --> 11:24:40,920
And when we perform that analysis, we get this list of words.
13750
11:24:40,920 --> 11:24:44,760
The closest one is book itself, but we also have books plural,
13751
11:24:44,760 --> 11:24:48,760
and then essay, memoir, essays, novella, anthology, and so on.
13752
11:24:48,760 --> 11:24:52,240
All of these words mean something similar to the word book,
13753
11:24:52,240 --> 11:24:54,320
according to Word2Vec, at least, because they
13754
11:24:54,320 --> 11:24:56,560
have a similar vector representation.
13755
11:24:56,560 --> 11:24:59,240
So it seems like we've done a pretty good job of trying
13756
11:24:59,240 --> 11:25:03,920
to capture this kind of vector representation of word meaning.
13757
11:25:03,920 --> 11:25:06,720
One other interesting side effect of Word2Vec
13758
11:25:06,720 --> 11:25:10,160
is that it's also able to capture something about the relationships
13759
11:25:10,160 --> 11:25:12,080
between words as well.
13760
11:25:12,080 --> 11:25:13,480
Let's take a look at an example.
13761
11:25:13,480 --> 11:25:16,880
Here, for instance, are two words, man and king.
13762
11:25:16,880 --> 11:25:20,480
And these are each represented by Word2Vec as vectors.
13763
11:25:20,480 --> 11:25:23,960
So what might happen if I subtracted one from the other,
13764
11:25:23,960 --> 11:25:27,360
calculated the value king minus man?
13765
11:25:27,360 --> 11:25:31,040
Well, that will be the vector that will take us from man to king,
13766
11:25:31,040 --> 11:25:35,000
somehow represent this relationship between the vector representation
13767
11:25:35,000 --> 11:25:38,960
of the word man and the vector representation of the word king.
13768
11:25:38,960 --> 11:25:42,520
And that's what this value, king minus man, represents.
13769
11:25:42,520 --> 11:25:45,920
So what would happen if I took the vector representation of the word
13770
11:25:45,920 --> 11:25:51,200
woman and added that same value, king minus man, to it?
13771
11:25:51,200 --> 11:25:54,720
What would we get as the closest word to that, for example?
13772
11:25:54,720 --> 11:25:55,680
Well, we could try it.
13773
11:25:55,680 --> 11:25:59,880
Let's go ahead and go back to our Python interpreter and give this a try.
13774
11:25:59,880 --> 11:26:03,680
I could say, what is the closest word to the vector representation
13775
11:26:03,680 --> 11:26:07,440
of the word king minus the representation of the word man
13776
11:26:07,440 --> 11:26:11,440
plus the representation of the word woman?
13777
11:26:11,440 --> 11:26:14,320
And we see that the closest word is the word queen.
13778
11:26:14,320 --> 11:26:17,760
We've somehow been able to capture the relationship between king and man.
13779
11:26:17,760 --> 11:26:19,920
And then when we apply it to the word woman,
13780
11:26:19,920 --> 11:26:24,120
we get, as the result, the word queen.
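The king minus man plus woman arithmetic can be sketched with hypothetical two-dimensional toy embeddings, deliberately laid out so the analogy works; real Word2Vec vectors have hundreds of dimensions and learn this structure from data:

```python
# Hypothetical 2-D embeddings laid out so that the
# gender and royalty directions are consistent.
words = {
    "man":   [1.0, 1.0],
    "woman": [1.0, 2.0],
    "king":  [3.0, 1.0],
    "queen": [3.0, 2.0],
}

def add(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def subtract(v1, v2):
    return [a - b for a, b in zip(v1, v2)]

def closest_word(target):
    # Nearest word by squared Euclidean distance
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min(words, key=lambda w: dist(words[w]))

result = closest_word(add(subtract(words["king"], words["man"]), words["woman"]))
print(result)  # queen
```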
13781
11:26:24,120 --> 11:26:27,320
So Word2Vec has been able to capture not just the words
13782
11:26:27,320 --> 11:26:29,720
and how they're similar to each other, but also something
13783
11:26:29,720 --> 11:26:33,400
about the relationships between words and how those words are connected
13784
11:26:33,400 --> 11:26:34,760
to each other.
13785
11:26:34,760 --> 11:26:37,280
So now that we have this vector representation of words,
13786
11:26:37,280 --> 11:26:38,600
what can we now do with it?
13787
11:26:38,600 --> 11:26:40,680
Now we can represent words as numbers.
13788
11:26:40,680 --> 11:26:43,480
And so we might try to pass those words as input
13789
11:26:43,480 --> 11:26:45,080
to, say, a neural network.
13790
11:26:45,080 --> 11:26:47,200
Neural networks we've seen are very powerful tools
13791
11:26:47,200 --> 11:26:50,640
for identifying patterns and making predictions.
13792
11:26:50,640 --> 11:26:53,800
Recall that a neural network you can think of as all of these units.
13793
11:26:53,800 --> 11:26:55,720
But really what the neural network is doing
13794
11:26:55,720 --> 11:26:58,720
is taking some input, passing it into the network,
13795
11:26:58,720 --> 11:27:00,360
and then producing some output.
13796
11:27:00,360 --> 11:27:02,800
And by providing the neural network with training data,
13797
11:27:02,800 --> 11:27:05,600
we're able to update the weights inside of the network
13798
11:27:05,600 --> 11:27:09,160
so that the neural network can do a more accurate job of translating
13799
11:27:09,160 --> 11:27:11,760
those inputs into those outputs.
13800
11:27:11,760 --> 11:27:14,560
And now that we can represent words as numbers that
13801
11:27:14,560 --> 11:27:18,280
could be the input or output, you could imagine passing a word in
13802
11:27:18,280 --> 11:27:21,720
as input to a neural network and getting a word as output.
13803
11:27:21,720 --> 11:27:23,320
And so when might that be useful?
13804
11:27:23,320 --> 11:27:26,840
One common use for neural networks is in machine translation,
13805
11:27:26,840 --> 11:27:29,960
when we want to translate text from one language into another,
13806
11:27:29,960 --> 11:27:33,760
say translate English into French by passing English into the neural
13807
11:27:33,760 --> 11:27:36,000
network and getting some French output.
13808
11:27:36,000 --> 11:27:39,720
You might imagine, for instance, that we could take the English word for lamp,
13809
11:27:39,720 --> 11:27:43,760
pass it into the neural network, get the French word for lamp as output.
13810
11:27:43,760 --> 11:27:48,000
But in practice, when we're translating text from one language to another,
13811
11:27:48,000 --> 11:27:50,200
we're usually not just interested in translating
13812
11:27:50,200 --> 11:27:53,800
a single word from one language to another, but a sequence,
13813
11:27:53,800 --> 11:27:56,240
say a sentence or a paragraph of words.
13814
11:27:56,240 --> 11:27:58,440
Here, for example, is another paragraph, again taken
13815
11:27:58,440 --> 11:28:00,640
from Sherlock Holmes, written in English.
13816
11:28:00,640 --> 11:28:03,960
And what I might want to do is take that entire sentence,
13817
11:28:03,960 --> 11:28:08,300
pass it into the neural network, and get as output a French translation
13818
11:28:08,300 --> 11:28:10,120
of the same sentence.
13819
11:28:10,120 --> 11:28:12,680
But recall that a neural network's input and output
13820
11:28:12,680 --> 11:28:14,880
needs to be of some fixed size.
13821
11:28:14,880 --> 11:28:16,480
And a sentence is not a fixed size.
13822
11:28:16,480 --> 11:28:17,080
It's variable.
13823
11:28:17,080 --> 11:28:20,640
You might have shorter sentences, and you might have longer sentences.
13824
11:28:20,640 --> 11:28:23,480
So somehow, we need to solve the problem of translating
13825
11:28:23,480 --> 11:28:27,680
a sequence into another sequence by means of a neural network.
13826
11:28:27,680 --> 11:28:30,520
And that's going to be true not only for machine translation,
13827
11:28:30,520 --> 11:28:33,960
but also for other problems, problems like question answering.
13828
11:28:33,960 --> 11:28:36,360
If I want to pass as input a question, something
13829
11:28:36,360 --> 11:28:38,960
like what is the capital of Massachusetts,
13830
11:28:38,960 --> 11:28:41,280
feed that as input into the neural network,
13831
11:28:41,280 --> 11:28:43,160
I would hope that what I would get as output
13832
11:28:43,160 --> 11:28:46,360
is a sentence like the capital is Boston, again,
13833
11:28:46,360 --> 11:28:50,080
translating some sequence into some other sequence.
13834
11:28:50,080 --> 11:28:53,480
And if you've ever had a conversation with an AI chatbot,
13835
11:28:53,480 --> 11:28:55,960
or have ever asked your phone a question,
13836
11:28:55,960 --> 11:28:57,400
it needs to do something like this.
13837
11:28:57,400 --> 11:29:00,680
It needs to understand the sequence of words that you, the human,
13838
11:29:00,680 --> 11:29:02,000
provided as input.
13839
11:29:02,000 --> 11:29:06,160
And then the computer needs to generate some sequence of words as output.
13840
11:29:06,160 --> 11:29:07,520
So how can we do this?
13841
11:29:07,520 --> 11:29:10,880
Well, one tool that we can use is the recurrent neural network, which
13842
11:29:10,880 --> 11:29:13,280
we took a look at last time, which is a way for us
13843
11:29:13,280 --> 11:29:16,120
to provide a sequence of values to a neural network
13844
11:29:16,120 --> 11:29:18,640
by running the neural network multiple times.
13845
11:29:18,640 --> 11:29:22,280
And each time we run the neural network, what we're going to do
13846
11:29:22,280 --> 11:29:25,040
is we're going to keep track of some hidden state.
13847
11:29:25,040 --> 11:29:26,880
And that hidden state is going to be passed
13848
11:29:26,880 --> 11:29:30,200
from one run of the neural network to the next run of the neural network,
13849
11:29:30,200 --> 11:29:33,240
keeping track of all of the relevant information.
13850
11:29:33,240 --> 11:29:35,320
And so let's take a look at how we can apply that
13851
11:29:35,320 --> 11:29:36,440
to something like this.
13852
11:29:36,440 --> 11:29:39,280
And in particular, we're going to look at an architecture known
13853
11:29:39,280 --> 11:29:41,960
as an encoder-decoder architecture, where
13854
11:29:41,960 --> 11:29:46,320
we're going to encode this question into some kind of hidden state,
13855
11:29:46,320 --> 11:29:50,320
and then use a decoder to decode that hidden state into the output
13856
11:29:50,320 --> 11:29:52,080
that we're interested in.
13857
11:29:52,080 --> 11:29:53,560
So what's that going to look like?
13858
11:29:53,560 --> 11:29:55,760
We'll start with the first word, the word what.
13859
11:29:55,760 --> 11:29:58,040
That goes into our neural network, and it's
13860
11:29:58,040 --> 11:30:00,720
going to produce some hidden state.
13861
11:30:00,720 --> 11:30:04,760
This is some information about the word what that our neural network is
13862
11:30:04,760 --> 11:30:06,720
going to need to keep track of.
13863
11:30:06,720 --> 11:30:09,280
Then when the second word comes along, we're
13864
11:30:09,280 --> 11:30:12,360
going to feed it into that same encoder neural network,
13865
11:30:12,360 --> 11:30:15,920
but it's going to get as input that hidden state as well.
13866
11:30:15,920 --> 11:30:17,440
So we pass in the second word.
13867
11:30:17,440 --> 11:30:19,960
We also get the information about the hidden state,
13868
11:30:19,960 --> 11:30:23,360
and that's going to continue for the other words in the input.
13869
11:30:23,360 --> 11:30:25,520
This is going to produce a new hidden state.
13870
11:30:25,520 --> 11:30:30,200
And so then when we get to the third word, the, that goes into the encoder.
13871
11:30:30,200 --> 11:30:32,840
It also gets access to the hidden state, and then it
13872
11:30:32,840 --> 11:30:35,720
produces a new hidden state that gets passed into the next run
13873
11:30:35,720 --> 11:30:37,160
when we use the word capital.
13874
11:30:37,160 --> 11:30:39,720
And the same thing is going to repeat for the other words
13875
11:30:39,720 --> 11:30:41,520
that appear in the input.
13876
11:30:41,520 --> 11:30:47,320
So of Massachusetts, that produces one final piece of hidden state.
13877
11:30:47,320 --> 11:30:50,040
Now somehow, we need to signal the fact that we're done.
13878
11:30:50,040 --> 11:30:51,640
There's nothing left in the input.
13879
11:30:51,640 --> 11:30:54,440
And we typically do this by passing some kind of special token,
13880
11:30:54,440 --> 11:30:57,400
say an end token, into the neural network.
13881
11:30:57,400 --> 11:31:00,480
And now the decoding process is going to start.
13882
11:31:00,480 --> 11:31:03,320
We're going to generate the word the.
13883
11:31:03,320 --> 11:31:06,120
But in addition to generating the word the,
13884
11:31:06,120 --> 11:31:11,160
this decoder network is also going to generate some kind of hidden state.
13885
11:31:11,160 --> 11:31:13,160
And so what happens the next time?
13886
11:31:13,160 --> 11:31:15,200
Well, to generate the next word, it might
13887
11:31:15,200 --> 11:31:18,520
be helpful to know what the first word was.
13888
11:31:18,520 --> 11:31:22,840
So we might pass the first word the back into the decoder network.
13889
11:31:22,840 --> 11:31:24,880
It's going to get as input this hidden state,
13890
11:31:24,880 --> 11:31:27,640
and it's going to generate the next word capital.
13891
11:31:27,640 --> 11:31:30,040
And that's also going to generate some hidden state.
13892
11:31:30,040 --> 11:31:32,280
And we'll repeat that, passing capital into the network
13893
11:31:32,280 --> 11:31:35,400
to generate the third word is, and then one more time
13894
11:31:35,400 --> 11:31:38,040
in order to get the fourth word Boston.
13895
11:31:38,040 --> 11:31:39,400
And at that point, we're done.
13896
11:31:39,400 --> 11:31:40,840
But how do we know we're done?
13897
11:31:40,840 --> 11:31:45,560
Usually, we'll do this one more time, pass Boston into the decoder network,
13898
11:31:45,560 --> 11:31:50,720
and get as output some end token to indicate that that is the end of our output.
13899
11:31:50,720 --> 11:31:53,640
And so this then is how we could use a recurrent neural network
13900
11:31:53,640 --> 11:31:57,140
to take some input, encode it into some hidden state,
13901
11:31:57,140 --> 11:32:01,160
and then use that hidden state to decode it into the output we're interested in.
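The control flow of that encoder-decoder loop can be sketched as below. The "networks" here are hypothetical stand-ins (the encoder just accumulates words, and the decoder is a fixed lookup) so that the data flow, not the learning, is what's illustrated:

```python
def encoder_step(word, hidden):
    # A real encoder updates its hidden state with learned weights;
    # here we just accumulate words to illustrate the data flow.
    return hidden + [word]

def decode(hidden, next_token):
    # The decoder emits one token at a time until it produces <end>
    output, token = [], None
    while token != "<end>":
        token = next_token(hidden, output)
        output.append(token)
    return output[:-1]  # drop the <end> token

question = ["what", "is", "the", "capital", "of", "massachusetts"]
hidden = []
for word in question + ["<end>"]:  # <end> signals the input is done
    hidden = encoder_step(word, hidden)

def toy_decoder(hidden, generated):
    # Hypothetical lookup standing in for a trained decoder network
    answer = ["the", "capital", "is", "boston", "<end>"]
    return answer[len(generated)]

print(decode(hidden, toy_decoder))  # ['the', 'capital', 'is', 'boston']
```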
13902
11:32:01,160 --> 11:32:04,560
To visualize it in a slightly different way, we have some input sequence.
13903
11:32:04,560 --> 11:32:06,740
This is just some sequence of words.
13904
11:32:06,740 --> 11:32:10,280
That input sequence goes into the encoder, which in this case
13905
11:32:10,280 --> 11:32:14,160
is a recurrent neural network generating these hidden states along the way
13906
11:32:14,160 --> 11:32:17,320
until we generate some final hidden state, at which point
13907
11:32:17,320 --> 11:32:19,120
we start the decoding process.
13908
11:32:19,120 --> 11:32:21,360
Again, using a recurrent neural network, that's
13909
11:32:21,360 --> 11:32:23,960
going to generate the output sequence as well.
13910
11:32:23,960 --> 11:32:26,960
So we've got the encoder, which is encoding the information
13911
11:32:26,960 --> 11:32:29,560
about the input sequence into this hidden state,
13912
11:32:29,560 --> 11:32:32,360
and then the decoder, which takes that hidden state
13913
11:32:32,360 --> 11:32:36,320
and uses it in order to generate the output sequence.
13914
11:32:36,320 --> 11:32:37,640
But there are some problems.
13915
11:32:37,640 --> 11:32:39,840
And for many years, this was the state of the art.
13916
11:32:39,840 --> 11:32:42,360
The recurrent neural network and variants on this approach
13917
11:32:42,360 --> 11:32:44,480
were some of the best ways we knew in order
13918
11:32:44,480 --> 11:32:46,620
to perform tasks in natural language processing.
13919
11:32:46,620 --> 11:32:49,280
But there are some problems that we might want to try to deal with
13920
11:32:49,280 --> 11:32:51,320
and that have been dealt with over the years
13921
11:32:51,320 --> 11:32:54,460
to try and improve upon this kind of model.
13922
11:32:54,460 --> 11:32:58,240
And one problem you might notice happens in this encoder stage.
13923
11:32:58,240 --> 11:33:01,040
We've taken this input sequence, the sequence of words,
13924
11:33:01,040 --> 11:33:05,480
and encoded it all into this final piece of hidden state.
13925
11:33:05,480 --> 11:33:07,440
And that final piece of hidden state needs
13926
11:33:07,440 --> 11:33:10,560
to contain all of the information from the input sequence
13927
11:33:10,560 --> 11:33:14,800
that we need in order to generate the output sequence.
13928
11:33:14,800 --> 11:33:18,080
And while that's possible, it becomes increasingly difficult
13929
11:33:18,080 --> 11:33:20,260
as the sequence gets larger and larger.
13930
11:33:20,260 --> 11:33:22,720
For larger and larger input sequences, it's
13931
11:33:22,720 --> 11:33:24,800
going to become more and more difficult to store
13932
11:33:24,800 --> 11:33:27,180
all of the information we need about the input
13933
11:33:27,180 --> 11:33:30,600
inside this single hidden state piece of context.
13934
11:33:30,600 --> 11:33:33,720
That's a lot of information to pack into just a single value.
13935
11:33:33,720 --> 11:33:36,840
It might be useful for us, when generating output,
13936
11:33:36,840 --> 11:33:40,460
to not just refer to this one value, but to all
13937
11:33:40,460 --> 11:33:44,620
of the previous hidden values that have been generated by the encoder.
13938
11:33:44,620 --> 11:33:46,880
And so that might be useful, but how could we do that?
13939
11:33:46,880 --> 11:33:48,380
We've got a lot of different values.
13940
11:33:48,380 --> 11:33:50,080
We need to combine them somehow.
13941
11:33:50,080 --> 11:33:52,320
So you could imagine adding them together,
13942
11:33:52,320 --> 11:33:54,440
taking the average of them, for example.
13943
11:33:54,440 --> 11:33:57,960
But doing that would assume that all of these pieces of hidden state
13944
11:33:57,960 --> 11:33:59,680
are equally important.
13945
11:33:59,680 --> 11:34:01,280
But that's not necessarily true either.
13946
11:34:01,280 --> 11:34:03,480
Some of these pieces of hidden state are going
13947
11:34:03,480 --> 11:34:05,680
to be more important than others, depending
13948
11:34:05,680 --> 11:34:08,520
on what word they most closely correspond to.
13949
11:34:08,520 --> 11:34:11,040
This piece of hidden state very closely corresponds
13950
11:34:11,040 --> 11:34:13,040
to the first word of the input sequence.
13951
11:34:13,040 --> 11:34:16,600
This one very closely corresponds to the second word of the input sequence,
13952
11:34:16,600 --> 11:34:17,800
for example.
13953
11:34:17,800 --> 11:34:21,200
And some of those are going to be more important than others.
13954
11:34:21,200 --> 11:34:23,400
To make matters more complicated, depending
13955
11:34:23,400 --> 11:34:26,520
on which word of the output sequence we're generating,
13956
11:34:26,520 --> 11:34:30,000
different input words might be more or less important.
13957
11:34:30,000 --> 11:34:33,520
And so what we really want is some way to decide for ourselves
13958
11:34:33,520 --> 11:34:37,040
which of the input values are worth paying attention to,
13959
11:34:37,040 --> 11:34:38,640
at what point in time.
13960
11:34:38,640 --> 11:34:42,160
And this is the key idea behind a mechanism known as attention.
13961
11:34:42,160 --> 11:34:45,760
Attention is all about letting us decide which values
13962
11:34:45,760 --> 11:34:49,120
are important to pay attention to, when generating, in this case,
13963
11:34:49,120 --> 11:34:51,880
the next word in our sequence.
13964
11:34:51,880 --> 11:34:54,160
So let's take a look at an example of that.
13965
11:34:54,160 --> 11:34:55,200
Here's a sentence.
13966
11:34:55,200 --> 11:34:57,520
What is the capital of Massachusetts?
13967
11:34:57,520 --> 11:34:59,080
Same sentence as before.
13968
11:34:59,080 --> 11:35:02,120
And let's imagine that we were trying to answer that question
13969
11:35:02,120 --> 11:35:04,200
by generating tokens of output.
13970
11:35:04,200 --> 11:35:05,800
So what would the output look like?
13971
11:35:05,800 --> 11:35:09,080
Well, it's going to look like something like the capital is.
13972
11:35:09,080 --> 11:35:12,520
And let's say we're now trying to generate this last word here.
13973
11:35:12,520 --> 11:35:13,800
What is that last word?
13974
11:35:13,800 --> 11:35:16,680
How is the computer going to figure it out?
13975
11:35:16,680 --> 11:35:19,440
Well, what it's going to need to do is decide
13976
11:35:19,440 --> 11:35:22,320
which values it's going to pay attention to.
13977
11:35:22,320 --> 11:35:24,480
And so the attention mechanism will allow
13978
11:35:24,480 --> 11:35:28,120
us to calculate some attention scores for each word,
13979
11:35:28,120 --> 11:35:32,480
some value corresponding to each word, determining how relevant
13980
11:35:32,480 --> 11:35:36,320
it is for us to pay attention to that word right now.
13981
11:35:36,320 --> 11:35:39,880
And in this case, when generating the fourth word of the output sequence,
13982
11:35:39,880 --> 11:35:42,240
the most important words to pay attention to
13983
11:35:42,240 --> 11:35:46,240
might be capital and Massachusetts, for example.
13984
11:35:46,240 --> 11:35:49,000
That those words are going to be particularly relevant.
13985
11:35:49,000 --> 11:35:50,920
And there are a number of different mechanisms
13986
11:35:50,920 --> 11:35:53,760
that have been used in order to calculate these attention scores.
13987
11:35:53,760 --> 11:35:56,400
It could be something as simple as a dot product
13988
11:35:56,400 --> 11:35:58,600
to see how similar two vectors are, or we
13989
11:35:58,600 --> 11:36:02,000
could train an entire neural network to calculate these attention scores.
13990
11:36:02,000 --> 11:36:06,000
But the key idea is that during the training process for our neural network,
13991
11:36:06,000 --> 11:36:09,400
we're going to learn how to calculate these attention scores.
13992
11:36:09,400 --> 11:36:12,640
Our model is going to learn what is important to pay attention
13993
11:36:12,640 --> 11:36:17,120
to in order to decide what the next word should be.
13994
11:36:17,120 --> 11:36:20,360
So the result of all of this, calculating these attention scores,
13995
11:36:20,360 --> 11:36:24,520
is that we can calculate some value, some value for each input word,
13996
11:36:24,520 --> 11:36:28,080
determining how important it is for us to pay attention
13997
11:36:28,080 --> 11:36:29,880
to that particular value.
13998
11:36:29,880 --> 11:36:32,000
And recall that each of these input words
13999
11:36:32,000 --> 11:36:36,400
is also associated with one of these hidden state context vectors,
14000
11:36:36,400 --> 11:36:39,600
capturing information about the sentence up to that point,
14001
11:36:39,600 --> 11:36:43,560
but primarily focused on that word in particular.
14002
11:36:43,560 --> 11:36:46,440
And so what we can now do is if we have all of these vectors
14003
11:36:46,440 --> 11:36:49,560
and we have values representing how important it is for us
14004
11:36:49,560 --> 11:36:52,320
to pay attention to those particular vectors,
14005
11:36:52,320 --> 11:36:54,320
we can take a weighted average.
14006
11:36:54,320 --> 11:36:58,560
We can take all of these vectors, multiply them by their attention scores,
14007
11:36:58,560 --> 11:37:01,600
and add them up to get some new vector value, which
14008
11:37:01,600 --> 11:37:04,160
is going to represent the context from the input,
14009
11:37:04,160 --> 11:37:07,000
but specifically paying attention to the words
14010
11:37:07,000 --> 11:37:09,520
that we think are most important.
14011
11:37:09,520 --> 11:37:12,400
And once we've done that, that context vector
14012
11:37:12,400 --> 11:37:14,840
can be fed into our decoder in order to say
14013
11:37:14,840 --> 11:37:18,640
that the word should be, in this case, Boston.
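The weighted-average idea described above can be sketched in a few lines of Python. This is a minimal illustration only: the hidden-state vectors and the query are made-up toy numbers, not values from any trained model, and the scoring here is the simple dot product the lecture mentions as one option.

```python
import math

def softmax(scores):
    # Normalize raw attention scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, hidden_states):
    # Score each hidden state by its dot product with the query,
    # then return the attention-weighted average of the hidden states.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context

# Toy hidden-state vectors for "What", "is", "the", "capital", "of",
# "Massachusetts" -- made-up numbers purely for illustration.
states = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0],
          [0.9, 0.2], [0.1, 0.1], [0.8, 0.9]]
query = [1.0, 1.0]  # stand-in for the decoder's state while picking the next word
weights, context = attend(query, states)
```

With these toy numbers, "capital" and "Massachusetts" receive the largest weights, so the resulting context vector is dominated by their hidden states.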
14014
11:37:18,640 --> 11:37:21,600
So attention is this very powerful tool that
14015
11:37:21,600 --> 11:37:24,280
allows any word when we're trying to decode it
14016
11:37:24,280 --> 11:37:28,400
to decide which words from the input should we pay attention to in order
14017
11:37:28,400 --> 11:37:33,440
to determine what's important for generating the next word of the output.
14018
11:37:33,440 --> 11:37:35,640
And one of the first places this was really used
14019
11:37:35,640 --> 11:37:37,920
was in the field of machine translation.
14020
11:37:37,920 --> 11:37:39,960
Here's an example of a diagram from the paper
14021
11:37:39,960 --> 11:37:42,160
that introduced this idea, which was focused
14022
11:37:42,160 --> 11:37:45,760
on trying to translate English sentences into French sentences.
14023
11:37:45,760 --> 11:37:48,560
So we have an input English sentence up along the top,
14024
11:37:48,560 --> 11:37:51,120
and then along the left side, the output French equivalent
14025
11:37:51,120 --> 11:37:52,680
of that same sentence.
14026
11:37:52,680 --> 11:37:56,280
And what you see in all of these squares are the attention scores
14027
11:37:56,280 --> 11:38:01,280
visualized, where a lighter square indicates a higher attention score.
14028
11:38:01,280 --> 11:38:04,200
And what you'll notice is that there's a strong correspondence
14029
11:38:04,200 --> 11:38:07,360
between the French word and the equivalent English word,
14030
11:38:07,360 --> 11:38:10,040
that the French word for agreement is really
14031
11:38:10,040 --> 11:38:12,600
paying attention to the English word for agreement
14032
11:38:12,600 --> 11:38:16,320
in order to decide what French word should be generated at that point
14033
11:38:16,320 --> 11:38:17,080
in time.
14034
11:38:17,080 --> 11:38:19,280
And sometimes you might pay attention to multiple words
14035
11:38:19,280 --> 11:38:22,280
if you look at the French word for economic.
14036
11:38:22,280 --> 11:38:25,800
That's primarily paying attention to the English word for economic,
14037
11:38:25,800 --> 11:38:30,440
but also paying attention to the English word for European in this case too.
14038
11:38:30,440 --> 11:38:33,460
And so attention scores are very easy to visualize
14039
11:38:33,460 --> 11:38:37,040
to get a sense for what our machine learning model is really
14040
11:38:37,040 --> 11:38:40,200
paying attention to, what information is it using in order
14041
11:38:40,200 --> 11:38:42,960
to determine what's important and what's not in order
14042
11:38:42,960 --> 11:38:46,800
to determine what the ultimate output token should be.
14043
11:38:46,800 --> 11:38:49,160
And so when we combine the attention mechanism
14044
11:38:49,160 --> 11:38:52,880
with a recurrent neural network, we can get very powerful and useful results
14045
11:38:52,880 --> 11:38:56,400
where we're able to generate an output sequence by paying attention
14046
11:38:56,400 --> 11:38:58,080
to the input sequence too.
14047
11:38:58,080 --> 11:39:00,080
But there are other problems with this approach
14048
11:39:00,080 --> 11:39:02,400
of using a recurrent neural network as well.
14049
11:39:02,400 --> 11:39:05,440
In particular, notice that every run of the neural network
14050
11:39:05,440 --> 11:39:07,760
depends on the output of the previous step.
14051
11:39:07,760 --> 11:39:09,520
And that was important for getting a sense
14052
11:39:09,520 --> 11:39:12,800
for the sequence of words and the ordering of those particular words.
14053
11:39:12,800 --> 11:39:15,880
But we can't run this unit of the neural network
14054
11:39:15,880 --> 11:39:19,680
until after we've calculated the hidden state from the run before it
14055
11:39:19,680 --> 11:39:21,600
from the previous input token.
14056
11:39:21,600 --> 11:39:25,800
And what that means is that it's very difficult to parallelize this process.
14057
11:39:25,800 --> 11:39:28,480
That as the input sequence gets longer and longer,
14058
11:39:28,480 --> 11:39:31,280
we might want to use parallelism to try and speed up
14059
11:39:31,280 --> 11:39:33,400
this process of training the neural network
14060
11:39:33,400 --> 11:39:35,600
and making sense of all of this language data.
14061
11:39:35,600 --> 11:39:36,840
But it's difficult to do that.
14062
11:39:36,840 --> 11:39:39,320
And it's slow to do that with a recurrent neural network
14063
11:39:39,320 --> 11:39:42,480
because all of it needs to be performed in sequence.
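That sequential bottleneck is visible even in a tiny sketch. This toy uses a scalar hidden state and made-up weights; a real recurrent network uses vectors and learned weight matrices, but the dependency structure is the same: step t cannot run until step t-1 has finished.

```python
import math

def rnn_step(prev_hidden, x, w_h=0.5, w_x=1.0):
    # One recurrent step: the new hidden state is a function of the
    # previous hidden state, which is what forces sequential execution.
    return math.tanh(w_h * prev_hidden + w_x * x)

def encode(sequence):
    hidden = 0.0
    for x in sequence:  # strictly sequential -- nothing here can run in parallel
        hidden = rnn_step(hidden, x)
    return hidden
```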
14064
11:39:42,480 --> 11:39:45,040
And that's become an increasing challenge as we've
14065
11:39:45,040 --> 11:39:47,840
started to get larger and larger language models.
14066
11:39:47,840 --> 11:39:50,120
The more language data that we have available to us
14067
11:39:50,120 --> 11:39:52,480
to use to train our machine learning models,
14068
11:39:52,480 --> 11:39:55,640
the more accurate it can be, the better representation of language
14069
11:39:55,640 --> 11:39:58,000
it can have, the better understanding it can have,
14070
11:39:58,000 --> 11:40:00,160
and the better results that we can see.
14071
11:40:00,160 --> 11:40:02,880
And so we've seen this growth of large language models
14072
11:40:02,880 --> 11:40:05,120
that are using larger and larger data sets.
14073
11:40:05,120 --> 11:40:08,080
But as a result, they take longer and longer to train.
14074
11:40:08,080 --> 11:40:10,680
And so the fact that recurrent neural networks
14075
11:40:10,680 --> 11:40:15,120
are not easy to parallelize has become an increasing problem.
14076
11:40:15,120 --> 11:40:18,000
And as a result of that, that was one of the main motivations
14077
11:40:18,000 --> 11:40:20,640
for a different architecture, for thinking about how
14078
11:40:20,640 --> 11:40:22,600
to deal with natural language.
14079
11:40:22,600 --> 11:40:25,200
And that's known as the transformer architecture.
14080
11:40:25,200 --> 11:40:28,480
And this has been a significant milestone in the world of natural language
14081
11:40:28,480 --> 11:40:32,000
processing for really increasing how well we can perform
14082
11:40:32,000 --> 11:40:34,400
these kinds of natural language processing tasks,
14083
11:40:34,400 --> 11:40:37,760
as well as how quickly we can train a machine learning model to be
14084
11:40:37,760 --> 11:40:39,880
able to produce effective results.
14085
11:40:39,880 --> 11:40:42,080
There are a number of different types of transformers
14086
11:40:42,080 --> 11:40:43,280
in terms of how they work.
14087
11:40:43,280 --> 11:40:45,000
But what we're going to take a look at here
14088
11:40:45,000 --> 11:40:48,760
is the basic architecture for how one might work with a transformer
14089
11:40:48,760 --> 11:40:52,080
to get a sense for what's involved and what we're doing.
14090
11:40:52,080 --> 11:40:54,820
So let's start with the model we were looking at before,
14091
11:40:54,820 --> 11:40:59,040
specifically at this encoder part of our encoder-decoder architecture,
14092
11:40:59,040 --> 11:41:01,880
where we used a recurrent neural network to take this input
14093
11:41:01,880 --> 11:41:06,160
sequence and capture all of this information about the hidden state
14094
11:41:06,160 --> 11:41:09,520
and the information we need to know about that input sequence.
14095
11:41:09,520 --> 11:41:13,200
Right now, it all needs to happen in this linear progression.
14096
11:41:13,200 --> 11:41:15,600
But what the transformer is going to allow us to do
14097
11:41:15,600 --> 11:41:18,920
is process each of the words independently in a way that's
14098
11:41:18,920 --> 11:41:22,640
easy to parallelize, rather than have each word wait for some other word.
14099
11:41:22,640 --> 11:41:26,000
Each word is going to go through this same neural network
14100
11:41:26,000 --> 11:41:29,440
and produce some kind of encoded representation
14101
11:41:29,440 --> 11:41:31,160
of that particular input word.
14102
11:41:31,160 --> 11:41:33,800
And all of this is going to happen in parallel.
14103
11:41:33,800 --> 11:41:35,800
Now, it's happening for all of the words at once,
14104
11:41:35,800 --> 11:41:37,160
but we're really just going to focus on what's
14105
11:41:37,160 --> 11:41:39,240
happening for one word to make it clear.
14106
11:41:39,240 --> 11:41:41,880
But know that whatever you're seeing happen for this one word
14107
11:41:41,880 --> 11:41:45,680
is going to happen for all of the other input words, too.
14108
11:41:45,680 --> 11:41:47,280
So what's going on here?
14109
11:41:47,280 --> 11:41:49,800
Well, we start with some input word.
14110
11:41:49,800 --> 11:41:52,160
That input word goes into the neural network.
14111
11:41:52,160 --> 11:41:57,100
And the output is hopefully some encoded representation of the input word,
14112
11:41:57,100 --> 11:41:59,840
the information we need to know about the input word that's
14113
11:41:59,840 --> 11:42:03,320
going to be relevant to us as we're generating the output.
14114
11:42:03,320 --> 11:42:06,040
And because we're doing this each word independently,
14115
11:42:06,040 --> 11:42:07,200
it's easy to parallelize.
14116
11:42:07,200 --> 11:42:09,360
We don't have to wait for the previous word
14117
11:42:09,360 --> 11:42:12,800
before we run this word through the neural network.
14118
11:42:12,800 --> 11:42:16,800
But what did we lose in this process by trying to parallelize this whole thing?
14119
11:42:16,800 --> 11:42:19,640
Well, we've lost all notion of word ordering.
14120
11:42:19,640 --> 11:42:21,400
The order of words is important.
14121
11:42:21,400 --> 11:42:24,280
The sentence, Sherlock Holmes gave the book to Watson,
14122
11:42:24,280 --> 11:42:27,520
has a different meaning than Watson gave the book to Sherlock Holmes.
14123
11:42:27,520 --> 11:42:31,360
And so we want to keep track of that information about word position.
14124
11:42:31,360 --> 11:42:34,120
In the recurrent neural network, that happened for us automatically
14125
11:42:34,120 --> 11:42:37,640
because we could run each word one at a time through the neural network,
14126
11:42:37,640 --> 11:42:41,600
get the hidden state, pass it on to the next run of the neural network.
14127
11:42:41,600 --> 11:42:44,040
But that's not the case here with the transformer,
14128
11:42:44,040 --> 11:42:49,080
where each word is being processed independent of all of the other ones.
14129
11:42:49,080 --> 11:42:51,520
So what are we going to do to try to solve that problem?
14130
11:42:51,520 --> 11:42:57,040
One thing we can do is add some kind of positional encoding to the input word.
14131
11:42:57,040 --> 11:42:59,440
The positional encoding is some vector that
14132
11:42:59,440 --> 11:43:02,280
represents the position of the word in the sentence.
14133
11:43:02,280 --> 11:43:05,240
This is the first word, the second word, the third word, and so forth.
14134
11:43:05,240 --> 11:43:08,080
We're going to add that to the input word.
14135
11:43:08,080 --> 11:43:10,400
And the result of that is going to be a vector
14136
11:43:10,400 --> 11:43:12,840
that captures multiple pieces of information.
14137
11:43:12,840 --> 11:43:17,400
It captures the input word itself as well as where in the sentence it appears.
14138
11:43:17,400 --> 11:43:20,440
The result of that is we can pass the output of that addition,
14139
11:43:20,440 --> 11:43:23,760
the addition of the input word and the positional encoding
14140
11:43:23,760 --> 11:43:24,920
into the neural network.
14141
11:43:24,920 --> 11:43:27,440
That way, the neural network knows the word and where
14142
11:43:27,440 --> 11:43:31,320
it appears in the sentence and can use both of those pieces of information
14143
11:43:31,320 --> 11:43:34,720
to determine how best to represent the meaning of that word
14144
11:43:34,720 --> 11:43:38,240
in the encoded representation at the end of it.
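The lecture only says the positional encoding is "some vector that represents the position"; one widely used concrete choice, taken from the original transformer paper rather than specified here, is the sinusoidal encoding sketched below, added element-wise to the word's embedding.

```python
import math

def positional_encoding(position, dim):
    # Sinusoidal positional encoding: even indices use sine, odd indices
    # use cosine, at wavelengths that vary across the vector's dimensions.
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def add_position(word_embedding, position):
    # Element-wise sum: the result carries both the word's meaning
    # and where in the sentence it appears.
    pos = positional_encoding(position, len(word_embedding))
    return [e + p for e, p in zip(word_embedding, pos)]
```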
14145
11:43:38,240 --> 11:43:40,160
In addition to what we have here, in addition
14146
11:43:40,160 --> 11:43:43,880
to the positional encoding and this feed forward neural network,
14147
11:43:43,880 --> 11:43:47,200
we're also going to add one additional component, which
14148
11:43:47,200 --> 11:43:49,920
is going to be a self-attention step.
14149
11:43:49,920 --> 11:43:52,440
This is going to be attention where we're paying attention
14150
11:43:52,440 --> 11:43:54,560
to the other input words.
14151
11:43:54,560 --> 11:43:57,240
Because the meaning or interpretation of an input word
14152
11:43:57,240 --> 11:44:00,880
might vary depending on the other words in the input as well.
14153
11:44:00,880 --> 11:44:03,520
And so we're going to allow each word in the input
14154
11:44:03,520 --> 11:44:06,800
to decide what other words in the input it should pay attention
14155
11:44:06,800 --> 11:44:10,800
to in order to decide on its encoded representation.
14156
11:44:10,800 --> 11:44:13,960
And that's going to allow us to get a better encoded representation
14157
11:44:13,960 --> 11:44:16,920
for each word because words are defined by their context,
14158
11:44:16,920 --> 11:44:21,400
by the words around them and how they're used in that particular context.
14159
11:44:21,400 --> 11:44:24,280
This kind of self-attention is so valuable, in fact,
14160
11:44:24,280 --> 11:44:28,560
that oftentimes the transformer will use multiple different self-attention
14161
11:44:28,560 --> 11:44:31,800
layers at the same time to allow for this model
14162
11:44:31,800 --> 11:44:36,400
to be able to pay attention to multiple facets of the input at the same time.
14163
11:44:36,400 --> 11:44:40,360
And we call this multi-headed attention, where each attention head can pay
14164
11:44:40,360 --> 11:44:41,880
attention to something different.
14165
11:44:41,880 --> 11:44:45,000
And as a result, this network can learn to pay attention
14166
11:44:45,000 --> 11:44:49,600
to many different parts of the input for this input word all at the same time.
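A minimal sketch of self-attention with multiple heads follows. One simplifying assumption to flag: real transformers compute queries, keys, and values with learned linear projections per head; here each head simply works on its own slice of the embedding, which keeps the idea visible in a few lines.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Every word attends to every word, itself included: its own vector
    # acts as the query, and all vectors act as the keys and the values.
    outputs = []
    scale = math.sqrt(len(vectors[0]))
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(query))])
    return outputs

def multi_head_attention(vectors, num_heads=2):
    # Toy multi-headed attention: each head runs self-attention over its
    # own slice of the embedding, and the heads' outputs are concatenated.
    dim = len(vectors[0]) // num_heads
    heads = [self_attention([v[h * dim:(h + 1) * dim] for v in vectors])
             for h in range(num_heads)]
    return [sum((head[i] for head in heads), []) for i in range(len(vectors))]
```

Because the slices differ, each head can end up weighting the words differently, which is the point of having multiple heads.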
14167
11:44:49,600 --> 11:44:52,160
And in the spirit of deep learning, these two steps,
14168
11:44:52,160 --> 11:44:56,120
this multi-headed self-attention layer and this neural network layer,
14169
11:44:56,120 --> 11:44:59,160
that itself can be repeated multiple times, too,
14170
11:44:59,160 --> 11:45:01,600
in order to get a deeper representation, in order
14171
11:45:01,600 --> 11:45:04,280
to learn deeper patterns within the input text
14172
11:45:04,280 --> 11:45:07,360
and ultimately get a better representation of language
14173
11:45:07,360 --> 11:45:11,620
in order to get useful encoded representations of all of the input
14174
11:45:11,620 --> 11:45:12,840
words.
14175
11:45:12,840 --> 11:45:15,520
And so this is the process that a transformer might
14176
11:45:15,520 --> 11:45:20,080
use in order to take an input word and get its encoded representation.
14177
11:45:20,080 --> 11:45:23,760
And the key idea is to really rely on this attention step
14178
11:45:23,760 --> 11:45:26,280
in order to get information that's useful in order
14179
11:45:26,280 --> 11:45:29,000
to determine how to encode that word.
14180
11:45:29,000 --> 11:45:32,640
And that process is going to repeat for all of the input words that
14181
11:45:32,640 --> 11:45:33,760
are in the input sequence.
14182
11:45:33,760 --> 11:45:35,760
We're going to take all of the input words,
14183
11:45:35,760 --> 11:45:38,840
encode them with some kind of positional encoding,
14184
11:45:38,840 --> 11:45:42,480
feed those into these self-attention and feed-forward neural networks
14185
11:45:42,480 --> 11:45:46,920
in order to ultimately get these encoded representations of the words.
14186
11:45:46,920 --> 11:45:48,600
That's the result of the encoder.
14187
11:45:48,600 --> 11:45:51,680
We get all of these encoded representations
14188
11:45:51,680 --> 11:45:53,860
that will be useful to us when it comes time
14189
11:45:53,860 --> 11:45:57,080
then to try to decode all of this information
14190
11:45:57,080 --> 11:45:59,560
into the output sequence we're interested in.
14191
11:45:59,560 --> 11:46:02,920
And again, this might take place in the context of machine translation,
14192
11:46:02,920 --> 11:46:06,560
where the output is going to be the same sentence in a different language,
14193
11:46:06,560 --> 11:46:10,160
or it might be an answer to a question in the case of an AI chatbot,
14194
11:46:10,160 --> 11:46:11,240
for example.
14195
11:46:11,240 --> 11:46:15,960
And so now let's take a look at how that decoder is going to work.
14196
11:46:15,960 --> 11:46:19,040
Ultimately, it's going to have a very similar structure.
14197
11:46:19,040 --> 11:46:21,960
Any time we're trying to generate the next output word,
14198
11:46:21,960 --> 11:46:25,120
we need to know what the previous output word is,
14199
11:46:25,120 --> 11:46:27,000
as well as its positional encoding.
14200
11:46:27,000 --> 11:46:29,360
Where in the output sequence are we?
14201
11:46:29,360 --> 11:46:32,760
And we're going to have these same steps, self-attention,
14202
11:46:32,760 --> 11:46:34,640
because we might want an output word to be
14203
11:46:34,640 --> 11:46:37,880
able to pay attention to other words in that same output,
14204
11:46:37,880 --> 11:46:39,560
as well as a neural network.
14205
11:46:39,560 --> 11:46:42,440
And that might itself repeat multiple times.
14206
11:46:42,440 --> 11:46:45,840
But in this decoder, we're going to add one additional step.
14207
11:46:45,840 --> 11:46:48,600
We're going to add an additional attention step, where
14208
11:46:48,600 --> 11:46:51,200
instead of self-attention, where the output word is going
14209
11:46:51,200 --> 11:46:55,000
to pay attention to other output words, in this step,
14210
11:46:55,000 --> 11:46:58,080
we're going to allow the output word to pay attention
14211
11:46:58,080 --> 11:47:00,360
to the encoded representations.
14212
11:47:00,360 --> 11:47:04,160
So recall that the encoder is taking all of the input words
14213
11:47:04,160 --> 11:47:07,280
and transforming them into these encoded representations
14214
11:47:07,280 --> 11:47:08,760
of all of the input words.
14215
11:47:08,760 --> 11:47:11,560
But it's going to be important for us to be able to decide which
14216
11:47:11,560 --> 11:47:14,120
of those encoded representations we want to pay attention
14217
11:47:14,120 --> 11:47:18,640
to when generating any particular token in the output sequence.
14218
11:47:18,640 --> 11:47:22,520
And that's what this additional attention step is going to allow us to do.
14219
11:47:22,520 --> 11:47:26,160
It's saying that every time we're generating a word of the output,
14220
11:47:26,160 --> 11:47:28,600
we can pay attention to the other words in the output,
14221
11:47:28,600 --> 11:47:32,080
because we might want to know, what are the words we've generated previously?
14222
11:47:32,080 --> 11:47:33,920
And we want to pay attention to some of them
14223
11:47:33,920 --> 11:47:37,520
to decide what word is going to be next in the sequence.
14224
11:47:37,520 --> 11:47:41,080
But we also care about paying attention to the input words, too.
14225
11:47:41,080 --> 11:47:44,920
And we want the ability to decide which of these encoded representations
14226
11:47:44,920 --> 11:47:47,280
of the input words are going to be relevant in order
14227
11:47:47,280 --> 11:47:49,760
for us to generate the next step.
14228
11:47:49,760 --> 11:47:51,680
And so these two pieces combine together.
14229
11:47:51,680 --> 11:47:55,080
We have this encoder that takes all of the input words
14230
11:47:55,080 --> 11:47:57,640
and produces this encoded representation.
14231
11:47:57,640 --> 11:48:01,480
And we have this decoder that is able to take the previous output word,
14232
11:48:01,480 --> 11:48:06,280
pay attention to that encoded input, and then generate the next output word.
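One decoder step can be sketched as two attention passes: self-attention over the output words generated so far, then cross-attention over the encoder's representations. This is a bare illustration under stated assumptions: it omits the positional encodings, feed-forward layers, and learned projections a real decoder would use.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Generic attention: score the query against every key, softmax the
    # scores into weights, and return the weighted average of the values.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def decoder_step(outputs_so_far, encoder_states):
    # Self-attention: the newest output word attends to all outputs so far.
    query = outputs_so_far[-1]
    self_context = attend(query, outputs_so_far, outputs_so_far)
    # Cross-attention: that result then attends to the encoder's
    # encoded representations of the input words.
    return attend(self_context, encoder_states, encoder_states)
```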
14233
11:48:06,280 --> 11:48:08,640
And this is one of the possible architectures
14234
11:48:08,640 --> 11:48:12,120
we could use for a transformer, with the key idea being
14235
11:48:12,120 --> 11:48:16,280
these attention steps that allow words to pay attention to each other.
14236
11:48:16,280 --> 11:48:20,240
During the training process here, we can now much more easily parallelize this,
14237
11:48:20,240 --> 11:48:23,440
because we don't have to wait for all of the words to happen in sequence.
14238
11:48:23,440 --> 11:48:26,960
And we can learn how we should perform these attention steps.
14239
11:48:26,960 --> 11:48:30,600
The model is able to learn what is important to pay attention to,
14240
11:48:30,600 --> 11:48:32,640
what things do I need to pay attention to,
14241
11:48:32,640 --> 11:48:37,240
in order to be more accurate at predicting what the output word is.
14242
11:48:37,240 --> 11:48:39,920
And this has proved to be a tremendously effective model
14243
11:48:39,920 --> 11:48:44,280
for conversational AI agents, for building machine translation systems.
14244
11:48:44,280 --> 11:48:47,000
And there have been many variants proposed on this model, too.
14245
11:48:47,000 --> 11:48:49,400
Some transformers only use an encoder.
14246
11:48:49,400 --> 11:48:51,080
Some only use a decoder.
14247
11:48:51,080 --> 11:48:54,720
Some use some other combination of these different particular features.
14248
11:48:54,720 --> 11:48:57,880
But the key ideas ultimately remain the same,
14249
11:48:57,880 --> 11:49:01,960
this real focus on trying to pay attention to what is most important.
14250
11:49:01,960 --> 11:49:04,080
And the world of natural language processing
14251
11:49:04,080 --> 11:49:06,080
is fast growing and fast evolving.
14252
11:49:06,080 --> 11:49:08,640
Year after year, we keep coming up with new models
14253
11:49:08,640 --> 11:49:11,760
that allow us to do an even better job of performing
14254
11:49:11,760 --> 11:49:14,600
these natural language-related tasks, all in the service
14255
11:49:14,600 --> 11:49:18,000
of solving the tricky problem, which is our own natural language.
14256
11:49:18,000 --> 11:49:21,800
We've seen how the syntax and semantics of our language are ambiguous,
14257
11:49:21,800 --> 11:49:24,000
and it introduces all of these new challenges
14258
11:49:24,000 --> 11:49:26,040
that we need to think about, if we're going
14259
11:49:26,040 --> 11:49:29,680
to be able to design AI agents that are able to work with language
14260
11:49:29,680 --> 11:49:30,800
effectively.
14261
11:49:30,800 --> 11:49:33,080
So as we think about where we've been in this class,
14262
11:49:33,080 --> 11:49:36,200
all of the different types of artificial intelligence we've considered,
14263
11:49:36,200 --> 11:49:38,960
we've looked at artificial intelligence in a wide variety
14264
11:49:38,960 --> 11:49:40,240
of different forms now.
14265
11:49:40,240 --> 11:49:42,880
We started by taking a look at search problems,
14266
11:49:42,880 --> 11:49:46,320
where we looked at how AI can search for solutions, play games,
14267
11:49:46,320 --> 11:49:48,680
and find the optimal decision to make.
14268
11:49:48,680 --> 11:49:53,080
We talked about knowledge, how AI can represent information that it knows
14269
11:49:53,080 --> 11:49:57,040
and use that information to generate new knowledge as well.
14270
11:49:57,040 --> 11:49:59,840
Then we looked at what AI can do when it's less certain,
14271
11:49:59,840 --> 11:50:01,760
when it doesn't know things for sure, and we
14272
11:50:01,760 --> 11:50:04,360
have to represent things in terms of probability.
14273
11:50:04,360 --> 11:50:06,360
We then took a look at optimization problems.
14274
11:50:06,360 --> 11:50:09,240
We saw how a lot of problems in AI can be boiled down
14275
11:50:09,240 --> 11:50:12,920
to trying to maximize or minimize some function.
14276
11:50:12,920 --> 11:50:15,040
And we looked at strategies that AI can use
14277
11:50:15,040 --> 11:50:18,240
in order to do that kind of maximizing and minimizing.
14278
11:50:18,240 --> 11:50:20,240
We then looked at the world of machine learning,
14279
11:50:20,240 --> 11:50:23,120
learning from data in order to figure out some patterns
14280
11:50:23,120 --> 11:50:26,600
and identify how to perform a task by looking at the training data
14281
11:50:26,600 --> 11:50:28,320
that we have available to it.
14282
11:50:28,320 --> 11:50:31,360
And one of the most powerful tools there was the neural network,
14283
11:50:31,360 --> 11:50:34,520
the sequence of units whose weights can be trained in order
14284
11:50:34,520 --> 11:50:37,680
to allow us to really effectively go from input to output
14285
11:50:37,680 --> 11:50:41,760
and predict how to get there by learning these underlying patterns.
14286
11:50:41,760 --> 11:50:44,240
And then today, we took a look at language itself,
14287
11:50:44,240 --> 11:50:47,080
trying to understand how can we train the computer to be
14288
11:50:47,080 --> 11:50:49,080
able to understand our natural language, to be
14289
11:50:49,080 --> 11:50:53,160
able to understand syntax and semantics, make sense of and generate
14290
11:50:53,160 --> 11:50:57,080
natural language, which introduces a number of interesting problems too.
14291
11:50:57,080 --> 11:51:00,120
And we've really just scratched the surface of artificial intelligence.
14292
11:51:00,120 --> 11:51:03,400
There is so much interesting research and interesting new techniques
14293
11:51:03,400 --> 11:51:05,480
and algorithms and ideas being introduced
14294
11:51:05,480 --> 11:51:07,800
to try to solve these types of problems.
14295
11:51:07,800 --> 11:51:10,160
So I hope you enjoyed this exploration into the world
14296
11:51:10,160 --> 11:51:11,480
of artificial intelligence.
14297
11:51:11,480 --> 11:51:14,520
A huge thanks to all of the course's teaching staff and production team
14298
11:51:14,520 --> 11:51:15,960
for making the class possible.
14299
11:51:15,960 --> 11:51:30,640
This was an introduction to artificial intelligence with Python.