1
00:00:06,006 --> 00:00:09,200
Hello, everybody, and welcome to chapter one of Python for Everybody.
2
00:00:09,200 --> 00:00:10,120
I'm Charles Severance.
3
00:00:10,120 --> 00:00:13,600
I'm your instructor, and I welcome you to this class.
4
00:00:13,600 --> 00:00:17,680
The basic goal of this class is to teach everybody how to program,
5
00:00:17,680 --> 00:00:19,000
regardless of your background.
6
00:00:19,000 --> 00:00:20,680
You don't have to be a math whiz.
7
00:00:20,680 --> 00:00:24,760
You don't have to be a computer expert.
8
00:00:24,760 --> 00:00:27,520
No matter how old you are or what your background is,
9
00:00:27,520 --> 00:00:28,840
we want to teach you how to program.
10
00:00:28,840 --> 00:00:30,160
So welcome to the course.
11
00:00:30,160 --> 00:00:32,439
Welcome to chapter one.
12
00:00:32,439 --> 00:00:39,040
So the first thing to understand is that the purpose to learn to program
13
00:00:39,040 --> 00:00:41,160
is because computers want to do things for us.
14
00:00:41,160 --> 00:00:46,800
They are built and created and designed, and their hardware is set up
15
00:00:46,800 --> 00:00:51,560
so that they basically ask us, what do you want to do next?
16
00:00:51,560 --> 00:01:00,560
If you grab your phone, your phone sort of does nothing until you tell it
17
00:01:00,560 --> 00:01:01,200
what to do.
18
00:01:01,200 --> 00:01:04,200
It waits for you, and it's just waiting for you.
19
00:01:04,200 --> 00:01:07,560
And all the hardware computer technology around you
20
00:01:07,560 --> 00:01:09,680
is generally waiting for you.
21
00:01:09,680 --> 00:01:12,400
And we can use this for useful things.
22
00:01:12,400 --> 00:01:14,080
We could play video games.
23
00:01:14,080 --> 00:01:18,200
We could have it help navigate for our cars.
24
00:01:18,200 --> 00:01:21,120
Someday we might even have self-driving cars.
25
00:01:21,120 --> 00:01:26,320
And it's really, in a sense, in my mind, silly
26
00:01:26,320 --> 00:01:30,400
if you spend your whole life not really understanding this technology.
27
00:01:30,400 --> 00:01:35,360
And I think it's important that we learn to tell these computers what to do,
28
00:01:35,360 --> 00:01:39,400
rather than just let them increasingly control our lives.
29
00:01:39,400 --> 00:01:43,440
And so as we'll see, computers aren't very smart on their own.
30
00:01:43,440 --> 00:01:45,920
We humans are the ones that imbue them with knowledge.
31
00:01:45,920 --> 00:01:48,840
And we need to learn to speak their language.
32
00:01:48,840 --> 00:01:52,160
It is much easier for us to learn to speak their language
33
00:01:52,160 --> 00:01:54,340
than it is for them to learn to speak our language.
34
00:01:54,340 --> 00:01:57,760
Although with these cell phones, we're starting to see little bits
35
00:01:57,760 --> 00:01:59,920
where they can begin to understand.
36
00:01:59,920 --> 00:02:03,520
But you would be amazed at the 40 or 50 years
37
00:02:03,520 --> 00:02:10,520
that it has taken us to build programs to begin to understand.
38
00:02:10,520 --> 00:02:14,400
So I'm bringing you into something where you are going
39
00:02:14,400 --> 00:02:18,560
to learn the ways of programming and the ways of the computer.
40
00:02:18,560 --> 00:02:22,560
Because it's easier to teach you how to program than it is to teach this,
41
00:02:22,560 --> 00:02:23,880
how to work in your world.
42
00:02:23,880 --> 00:02:29,880
Even though, ultimately, the goal is to get this to do work for you.
43
00:02:29,880 --> 00:02:33,360
So part of what I'm trying to do is move you from a user perspective,
44
00:02:33,360 --> 00:02:35,640
where you just look at the computer as something
45
00:02:35,640 --> 00:02:40,960
that someone else has constructed and you are the user of,
46
00:02:40,960 --> 00:02:42,800
to the point where you construct new things.
47
00:02:42,800 --> 00:02:44,840
Now the first kinds of things that you're going to construct
48
00:02:44,840 --> 00:02:47,360
are actually things to solve your own problems.
49
00:02:47,360 --> 00:02:50,680
And it's very popular now to work on data.
50
00:02:50,680 --> 00:02:54,600
And Python is an excellent programming language for data mining and data
51
00:02:54,600 --> 00:02:55,280
analysis.
52
00:02:55,280 --> 00:02:57,200
And that's a lot of what we're going to do in this course.
53
00:02:57,200 --> 00:02:59,600
Although really, it's a gateway to all kinds of things,
54
00:02:59,600 --> 00:03:03,600
like artificial intelligence, or gaming, or navigation,
55
00:03:03,600 --> 00:03:07,440
or mobile applications, or entertainment, all kinds of things.
56
00:03:07,440 --> 00:03:09,240
But first, we have to learn to program.
57
00:03:09,240 --> 00:03:12,520
We have to move from using the computer as a tool
58
00:03:12,520 --> 00:03:14,640
to using the tools within the computer that
59
00:03:14,640 --> 00:03:19,840
allow us to change how the computer sees the world.
60
00:03:19,840 --> 00:03:22,680
So there's a couple of reasons that you might want to be a programmer.
61
00:03:22,680 --> 00:03:26,200
Some of you are looking to improve your career,
62
00:03:26,200 --> 00:03:28,040
to be paid to work on programming.
63
00:03:28,040 --> 00:03:31,600
I've been a paid programmer most of my life, and I like it.
64
00:03:31,600 --> 00:03:33,080
It's a good job.
65
00:03:33,080 --> 00:03:36,600
You don't have to stand in the mud.
66
00:03:36,600 --> 00:03:38,600
You don't have to lift things.
67
00:03:38,600 --> 00:03:39,840
You have to use your brain.
68
00:03:39,840 --> 00:03:43,880
And I'll just say that it has been nice for my career
69
00:03:43,880 --> 00:03:47,040
to not be exposed to the elements, but to be able to work often
70
00:03:47,040 --> 00:03:49,400
wherever I want.
71
00:03:49,400 --> 00:03:51,160
But that's actually our secondary goal.
72
00:03:51,160 --> 00:03:54,520
Our first goal is to get you to write programs that solve problems
73
00:03:54,520 --> 00:03:55,440
that you have to solve.
74
00:03:55,440 --> 00:03:58,800
Maybe you have a job as an accountant, or a lawyer,
75
00:03:58,800 --> 00:04:00,000
or something else.
76
00:04:00,000 --> 00:04:02,560
And maybe you run across some data.
77
00:04:02,560 --> 00:04:05,600
Maybe there's some system that logs your time,
78
00:04:05,600 --> 00:04:07,960
and it's not quite giving the report that you want to give.
79
00:04:07,960 --> 00:04:10,520
And so you say, could I just grab the log data myself
80
00:04:10,520 --> 00:04:13,800
and write a program to do some analysis to say,
81
00:04:13,800 --> 00:04:16,399
well, what's the average this versus that,
82
00:04:16,399 --> 00:04:19,199
or the average of some other thing?
83
00:04:19,200 --> 00:04:23,760
And so that's the basic idea, that you'll initially
84
00:04:23,760 --> 00:04:26,920
use computers to serve your own ends.
85
00:04:26,920 --> 00:04:28,800
That makes it a lot easier to write programs,
86
00:04:28,800 --> 00:04:31,720
because you don't have to worry about a million users using
87
00:04:31,720 --> 00:04:32,400
your software.
88
00:04:32,400 --> 00:04:34,620
If it works for you, then we're happy.
89
00:04:34,620 --> 00:04:37,000
And so it takes a little more training
90
00:04:37,000 --> 00:04:38,640
to write software for other people,
91
00:04:38,640 --> 00:04:42,960
or for thousands and thousands of other people.
92
00:04:42,960 --> 00:04:44,440
And so part of what I want to do is
93
00:04:44,440 --> 00:04:47,480
I want to change your perspective.
94
00:04:47,480 --> 00:04:50,680
You look at this from the outside,
95
00:04:50,680 --> 00:04:54,640
and you see it from the outside, and you click on things.
96
00:04:54,640 --> 00:04:56,240
I want to turn this around, and I
97
00:04:56,240 --> 00:05:00,600
want you to be the person inside this looking out at the world.
98
00:05:00,600 --> 00:05:03,420
And as a programmer, we are making things
99
00:05:03,420 --> 00:05:06,480
inside these computers for the world.
100
00:05:06,480 --> 00:05:10,280
And so we want to pull you into being part of this.
101
00:05:10,280 --> 00:05:14,440
We want you inside this, or thinking inside this.
102
00:05:14,440 --> 00:05:19,520
And what you learn is that if you're inside this computer,
103
00:05:19,520 --> 00:05:22,560
and you are taking your instructions to build programs
104
00:05:22,560 --> 00:05:28,520
to be used by the human outside the computer,
105
00:05:28,520 --> 00:05:31,480
you have things that you need to take advantage of.
106
00:05:31,480 --> 00:05:33,440
There's things like the central processing unit,
107
00:05:33,440 --> 00:05:35,360
the memory of this system, the network connection
108
00:05:35,360 --> 00:05:38,440
of this system, the disk drive, or permanent storage
109
00:05:38,440 --> 00:05:39,400
on this system.
110
00:05:39,400 --> 00:05:41,560
And as a programmer, you are kind
111
00:05:41,560 --> 00:05:44,840
of mediating between all those internal resources
112
00:05:44,840 --> 00:05:48,600
that this has that are not very smart, but highly powerful,
113
00:05:48,600 --> 00:05:51,040
and mediating with what that user wants.
114
00:05:51,040 --> 00:05:54,180
And so we take the end user, and we programmers,
115
00:05:54,180 --> 00:05:57,640
we serve the end user, but the computer serves us.
116
00:05:57,640 --> 00:06:01,520
So together, between us and all the computer's resources,
117
00:06:01,520 --> 00:06:04,460
we can serve the needs of the end user.
118
00:06:04,460 --> 00:06:10,120
And we do this by writing code, or programming.
119
00:06:10,120 --> 00:06:11,040
And what is that?
120
00:06:11,040 --> 00:06:15,080
Well, programming is a sequence of instructions
121
00:06:15,080 --> 00:06:17,520
where we are giving instructions to the resources
122
00:06:17,520 --> 00:06:19,960
inside the computer in a way to accomplish
123
00:06:19,960 --> 00:06:21,080
the goals of the end user.
124
00:06:21,080 --> 00:06:24,600
And remember, sometimes we are our own end user.
125
00:06:24,600 --> 00:06:30,620
You're not always doing a startup.
126
00:06:30,620 --> 00:06:33,120
You're not always writing a mobile gaming system.
127
00:06:33,120 --> 00:06:35,000
So sometimes you're writing something for yourself.
128
00:06:35,000 --> 00:06:35,720
But that's OK.
129
00:06:38,880 --> 00:06:42,200
So sometimes you're writing something to solve a problem.
130
00:06:42,200 --> 00:06:43,400
You're like crafting.
131
00:06:43,400 --> 00:06:46,960
You're doing something that you could do by hand or manually.
132
00:06:46,960 --> 00:06:51,600
And you're making some clever little 25 or 100 line program.
133
00:06:51,600 --> 00:06:54,440
And you're putting that in.
134
00:06:54,440 --> 00:06:57,240
Other times, like when I work on the open source learning
135
00:06:57,240 --> 00:07:00,760
management system, Sakai, it is my creativity.
136
00:07:00,760 --> 00:07:01,840
I've got an idea.
137
00:07:01,840 --> 00:07:04,200
And I want to share it with a million users.
138
00:07:04,200 --> 00:07:07,520
And so I write my code for an external audience.
139
00:07:07,520 --> 00:07:10,160
And so code is that sequence of instructions
140
00:07:10,160 --> 00:07:14,520
that the computer itself doesn't know how to hand a roster out.
141
00:07:14,520 --> 00:07:17,000
But I can write code that will hand a roster out
142
00:07:17,000 --> 00:07:20,140
by looking at the data that's inside this computer,
143
00:07:20,140 --> 00:07:22,360
inside this application.
144
00:07:22,360 --> 00:07:24,240
And so if you think about programs,
145
00:07:24,240 --> 00:07:28,360
we have programs for computers and programs for humans.
146
00:07:28,360 --> 00:07:30,920
And a number of years ago, now I'm starting.
147
00:07:30,920 --> 00:07:34,480
Sooner or later, this will be me showing my age.
148
00:07:34,480 --> 00:07:36,560
This is an example of the Macarena.
149
00:07:36,560 --> 00:07:38,800
And the Macarena is a song that effectively
150
00:07:38,800 --> 00:07:40,720
is a sequence of instructions.
151
00:07:40,720 --> 00:07:42,000
You put your left hand out.
152
00:07:42,000 --> 00:07:43,160
You put your right hand out.
153
00:07:43,160 --> 00:07:44,400
You put it on the shoulder.
154
00:07:44,400 --> 00:07:45,520
You wiggle, wiggle, wiggle.
155
00:07:45,520 --> 00:07:46,440
And you spin around.
156
00:07:46,440 --> 00:07:47,240
And you do things.
157
00:07:47,240 --> 00:07:52,760
And this is a program for people.
158
00:07:52,760 --> 00:07:55,200
And so I want you to take a quick look at this
159
00:07:55,200 --> 00:07:58,500
and see if you can find anything wrong
160
00:07:58,500 --> 00:08:01,200
with this particular program.
161
00:08:01,200 --> 00:08:02,240
So look really closely.
162
00:08:09,520 --> 00:08:13,160
So I'll show you.
163
00:08:13,160 --> 00:08:15,040
It's got some typographical errors in it.
164
00:08:15,040 --> 00:08:20,920
And we as humans are really good at reading or hearing
165
00:08:20,920 --> 00:08:23,800
typographical errors and correcting them automatically
166
00:08:23,800 --> 00:08:26,120
and instantly.
167
00:08:26,120 --> 00:08:28,000
But computers are not.
168
00:08:28,000 --> 00:08:31,120
Computers are extremely literal.
169
00:08:31,120 --> 00:08:34,880
If it saw this ham instead of hand,
170
00:08:34,880 --> 00:08:36,919
it would think, what's a ham?
171
00:08:36,919 --> 00:08:39,339
And why am I going to hit someone in the back of the head
172
00:08:39,340 --> 00:08:40,580
with a ham?
173
00:08:40,580 --> 00:08:44,400
And why would I take my left hand and hit somebody?
174
00:08:44,400 --> 00:08:45,760
These are all bad things.
175
00:08:45,760 --> 00:08:49,340
But the computer is going to take us very literally.
176
00:08:49,340 --> 00:08:52,560
And so we have to be really precise.
177
00:08:52,560 --> 00:08:54,800
And the computer just doesn't know
178
00:08:54,800 --> 00:08:58,760
the difference between what we mean and what we say.
179
00:08:58,760 --> 00:09:00,600
So we have to be very precise.
180
00:09:00,600 --> 00:09:03,980
And this is one of the great frustrations
181
00:09:03,980 --> 00:09:08,040
that people have when they first start using computers.
182
00:09:08,040 --> 00:09:09,600
And so we have to get this right.
183
00:09:09,600 --> 00:09:11,840
We have to get these little bits of text
184
00:09:11,840 --> 00:09:13,520
exactly the way they are.
185
00:09:13,520 --> 00:09:15,400
Computers will blow up with syntax errors.
186
00:09:15,400 --> 00:09:17,480
And they seem to make quite a fuss
187
00:09:17,480 --> 00:09:19,600
when you make the tiniest of errors.
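To make the point about tiny errors concrete, here is a minimal Python sketch. It is an illustration only, not something shown in the lecture, and the helper name `check` is made up for this example. It hands three almost identical lines of text to Python and reports how Python reacts to each one.

```python
def check(source):
    """Run a small piece of Python text and report how Python reacts to it."""
    try:
        # compile() checks the grammar; exec() then actually runs the code.
        exec(compile(source, "<example>", "exec"), {})
        return "ok"
    except SyntaxError:
        # The text does not follow Python's grammar (here: a missing parenthesis).
        return "SyntaxError"
    except NameError:
        # The grammar is fine, but a name is misspelled (like "ham" for "hand").
        return "NameError"


print(check("print('hello')"))  # correct line: prints hello, then ok
print(check("print('hello'"))   # one missing character: SyntaxError
print(check("primt('hello')"))  # one wrong letter: NameError
```

A human reader would guess what the second and third lines meant. Python does not guess, and it stops at the first thing it cannot interpret.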
188
00:09:19,600 --> 00:09:20,760
But you'll get used to that.
189
00:09:20,760 --> 00:09:24,960
I mean, that's not because you're bad or you're
190
00:09:24,960 --> 00:09:26,400
less than awesome.
191
00:09:26,400 --> 00:09:29,440
It just means the computers can't compensate
192
00:09:29,440 --> 00:09:30,760
when you make small mistakes.
193
00:09:30,760 --> 00:09:32,740
And so you've got to get used to the fact
194
00:09:32,740 --> 00:09:35,520
that the computer is sort of intellectually
195
00:09:35,520 --> 00:09:37,000
not as strong as you.
196
00:09:37,000 --> 00:09:39,120
And so it gets confused really easily.
197
00:09:39,120 --> 00:09:40,880
Even though when it gets confused,
198
00:09:40,880 --> 00:09:42,800
it says seemingly mean things to you.
199
00:09:42,800 --> 00:09:45,840
So you'll get used to that.
200
00:09:45,840 --> 00:09:47,980
OK, so the first thing I want to do
201
00:09:47,980 --> 00:09:49,320
is I want to throw up some text.
202
00:09:49,320 --> 00:09:51,880
And I want you to, while this text is up,
203
00:09:51,880 --> 00:09:55,640
I want you to count the number of each word in this text
204
00:09:55,640 --> 00:10:00,240
and tell me what the most common word is in this text.
205
00:10:00,240 --> 00:10:01,520
OK, so here we go.
206
00:10:01,520 --> 00:10:23,600
OK, so I kind of made that hard on you on purpose
207
00:10:23,600 --> 00:10:26,440
by moving around and distracting you and confusing you.
208
00:10:26,440 --> 00:10:29,080
But even if it's not moving at all,
209
00:10:29,080 --> 00:10:31,720
it's a little bit tricky to do.
210
00:10:31,720 --> 00:10:34,040
You probably stare at it a couple of times.
211
00:10:34,040 --> 00:10:36,440
Your brain is going back and forth and back and forth.
212
00:10:36,440 --> 00:10:39,360
And so text analysis is one of the great things
213
00:10:39,360 --> 00:10:41,960
that computers are very, very good at.
214
00:10:41,960 --> 00:10:45,240
And one of the things they can do is translate text,
215
00:10:45,240 --> 00:10:46,620
and that's because they've looked
216
00:10:46,620 --> 00:10:47,620
at a lot of information.
217
00:10:47,620 --> 00:10:49,620
So looking at text is actually something computers
218
00:10:49,620 --> 00:10:51,080
are really good at.
219
00:10:51,080 --> 00:10:54,500
And so if we take a look at the kind of programs
220
00:10:54,500 --> 00:10:57,440
that we're going to write to do this kind of thing,
221
00:10:57,440 --> 00:11:00,000
this is something that humans are not naturally good at,
222
00:11:00,000 --> 00:11:01,440
but computers are super good at.
223
00:11:01,440 --> 00:11:05,240
Now, I'm not going to have you look at this code.
224
00:11:05,240 --> 00:11:07,040
This code you will understand
225
00:11:07,040 --> 00:11:08,120
in a few weeks.
226
00:11:08,120 --> 00:11:11,160
But basically, this is a set of instructions
227
00:11:11,160 --> 00:11:14,160
to open a file, read that file,
228
00:11:14,160 --> 00:11:16,160
read all the words in the file,
229
00:11:16,160 --> 00:11:19,120
create a histogram of all the words in the file,
230
00:11:19,120 --> 00:11:21,400
and then search through that histogram
231
00:11:21,400 --> 00:11:22,880
to find the most common word
232
00:11:22,880 --> 00:11:26,940
and tell us what the most common word is in the file.
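The steps just described (read the words, build a histogram, search it for the largest count) can be sketched in Python roughly as follows. This is a minimal sketch rather than the exact code on the slide, and the function name `most_common_word` and the sample sentences are assumptions made for illustration. It accepts any sequence of lines, so an open file can be passed in the same way as the small list used here.

```python
def most_common_word(lines):
    """Return (word, count) for the most frequent word in an iterable of lines."""
    counts = {}  # the histogram: word -> number of times seen
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1

    # Search through the histogram for the largest count.
    best_word, best_count = None, 0
    for word, count in counts.items():
        if count > best_count:
            best_word, best_count = word, count
    return best_word, best_count


# Made-up sample text; with a real file you could write, for example:
#     with open("some_file.txt") as handle:   # hypothetical file name
#         print(most_common_word(handle))
sample = ["the clown ran after the car", "and the car ran into the tent"]
print(most_common_word(sample))  # ('the', 4)
```

The same loop works whether the input has 20 words or millions of words; only the running time changes.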
233
00:11:26,940 --> 00:11:29,800
And in this clown file, the word "the" is the most common.
234
00:11:29,800 --> 00:11:31,040
It happened seven times.
235
00:11:31,040 --> 00:11:33,960
And here's another large file called words.txt,
236
00:11:33,960 --> 00:11:36,640
and the word "to" is the most common thing.
237
00:11:36,640 --> 00:11:38,520
And our goal is to get to the point
238
00:11:38,520 --> 00:11:40,560
where you can write this on your own.
239
00:11:40,560 --> 00:11:43,420
So you can say, you know what, I got a problem to solve.
240
00:11:43,420 --> 00:11:45,920
That is, what's the most common word in this file?
241
00:11:45,920 --> 00:11:49,240
I know how to start, and then I know how to finish.
242
00:11:49,240 --> 00:11:51,120
I know how to do the stuff in the middle.
243
00:11:51,120 --> 00:11:53,840
And we have to learn this kind of weird language.
244
00:11:53,840 --> 00:11:57,600
But when we do, we can count millions of words
245
00:11:57,600 --> 00:11:59,640
as easily as we count 20 words.
246
00:11:59,640 --> 00:12:02,200
So that's the fun of all of this,
247
00:12:02,200 --> 00:12:04,200
is to teach you this language
248
00:12:04,200 --> 00:12:06,420
so that you can solve that problem
249
00:12:06,420 --> 00:12:07,900
so that you don't have to solve it by hand.
250
00:12:07,900 --> 00:12:09,400
Because you could solve it,
251
00:12:09,400 --> 00:12:12,240
but it's not something that you're naturally good at,
252
00:12:12,240 --> 00:12:13,480
and it's hard work.
253
00:12:14,520 --> 00:12:16,020
So up next, we're gonna talk a little bit
254
00:12:16,020 --> 00:12:17,960
about the hardware architecture
255
00:12:17,960 --> 00:12:22,960
that you're gonna be experiencing as you write programs.
256
00:12:25,800 --> 00:12:28,240
Hello, and welcome back to hardware architecture.
257
00:12:28,240 --> 00:12:30,800
Now, you might ask, why do I tell you
258
00:12:30,800 --> 00:12:32,360
about hardware architecture?
259
00:12:34,360 --> 00:12:36,620
Probably you're not gonna build any hardware,
260
00:12:36,620 --> 00:12:38,120
although it's fun stuff to do.
261
00:12:38,120 --> 00:12:40,040
And if you're gonna become a computer scientist,
262
00:12:40,040 --> 00:12:42,560
which most of you won't want to be,
263
00:12:42,560 --> 00:12:43,760
it's a great thing to study.
264
00:12:43,760 --> 00:12:46,040
And those who build our hardware
265
00:12:46,040 --> 00:12:48,160
are amazingly talented individuals,
266
00:12:48,160 --> 00:12:49,760
and it's a really rewarding job.
267
00:12:51,400 --> 00:12:53,800
The reason I like talking to you about hardware
268
00:12:53,800 --> 00:12:57,000
is because I want to be able to use words at some point
269
00:12:57,000 --> 00:12:58,520
and say, oh, secondary storage,
270
00:12:58,520 --> 00:12:59,880
or central processing unit,
271
00:12:59,880 --> 00:13:02,360
or random access memory,
272
00:13:02,360 --> 00:13:05,760
or peripherals, input devices.
273
00:13:05,760 --> 00:13:07,000
And I wanna be able to say those words,
274
00:13:07,000 --> 00:13:08,840
and I want you to be able to understand them.
275
00:13:08,840 --> 00:13:11,120
And so I'll start with a little piece of hardware
276
00:13:11,120 --> 00:13:12,680
called the Raspberry Pi.
277
00:13:12,680 --> 00:13:16,360
And the Raspberry Pi is a cute little single board computer.
278
00:13:17,320 --> 00:13:19,640
As we go forward, these things are smaller,
279
00:13:19,640 --> 00:13:21,300
and smaller, and smaller.
280
00:13:21,300 --> 00:13:23,160
And the interesting thing is that
281
00:13:23,160 --> 00:13:25,060
the architecture of these stays the same,
282
00:13:25,060 --> 00:13:27,600
but the number of components drops.
283
00:13:27,600 --> 00:13:31,440
So I'm gonna start and give you a block diagram
284
00:13:31,440 --> 00:13:34,100
of sort of a generic computer
285
00:13:34,100 --> 00:13:36,460
and tell you the major parts of it.
286
00:13:36,460 --> 00:13:41,020
Now, I'm gonna show you some really old hardware,
287
00:13:41,020 --> 00:13:42,680
some really new hardware,
288
00:13:42,680 --> 00:13:46,560
and then some hardware that is of medium age.
289
00:13:46,560 --> 00:13:47,760
And the medium age hardware
290
00:13:47,760 --> 00:13:49,560
is probably the easiest one to see.
291
00:13:49,560 --> 00:13:52,520
The architecture is the same, okay?
292
00:13:52,520 --> 00:13:57,520
And so the basic block diagram is that the brains,
293
00:13:57,780 --> 00:13:59,720
if there are brains in computers,
294
00:13:59,720 --> 00:14:01,120
which there really aren't,
295
00:14:01,120 --> 00:14:03,760
the software is the closest thing computers have to brains,
296
00:14:03,760 --> 00:14:07,040
but in hardware, the closest brain a computer has is this,
297
00:14:07,040 --> 00:14:11,240
called a microprocessor, or a central processing unit.
298
00:14:11,240 --> 00:14:15,240
And this is designed three billion times a second
299
00:14:17,440 --> 00:14:20,320
to ask the question, what do you want me to do next?
300
00:14:20,320 --> 00:14:23,620
And these little pins on the back are instructions,
301
00:14:23,620 --> 00:14:26,660
like 32 or 64 of these pins,
302
00:14:26,660 --> 00:14:27,920
three billion times a second,
303
00:14:27,920 --> 00:14:30,200
we send an instruction into these things.
304
00:14:30,200 --> 00:14:35,200
Now, we can't sit there and talk to it, we can't.
305
00:14:35,200 --> 00:14:37,040
And so the instructions we store
306
00:14:37,040 --> 00:14:39,080
in what's called the main memory.
307
00:14:39,080 --> 00:14:40,960
And this memory is really fast,
308
00:14:40,960 --> 00:14:43,480
and the memory sort of feeds this.
309
00:14:43,480 --> 00:14:46,700
And so every time the CPU needs a new instruction,
310
00:14:46,700 --> 00:14:49,200
it asks the memory where that instruction is.
311
00:14:49,200 --> 00:14:51,480
And so the memory feeds the instruction to the CPU,
312
00:14:51,480 --> 00:14:53,800
the CPU does it, says give me another instruction,
313
00:14:53,800 --> 00:14:55,920
the CPU does it, says give me another instruction,
314
00:14:55,920 --> 00:14:59,800
and that is the basic essence of programming.
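The "what's next?" loop between the CPU and main memory can be imitated in a few lines of Python. This is a toy model under stated assumptions, not how real hardware is built: the instruction names `ADD` and `MUL`, the function `run`, and the single accumulator are all invented for this illustration.

```python
def run(memory):
    """Toy CPU loop: fetch the next instruction from memory, do it, repeat."""
    accumulator = 0  # the one value this toy CPU works on
    position = 0     # where the next instruction lives in memory
    while position < len(memory):
        # Fetch: the CPU asks memory, "what do you want me to do next?"
        operation, value = memory[position]
        # Execute: the CPU does exactly what the instruction says.
        if operation == "ADD":
            accumulator += value
        elif operation == "MUL":
            accumulator *= value
        else:
            raise ValueError(f"unknown instruction: {operation}")
        position += 1  # move on to the next instruction
    return accumulator


# The "program" is just a list of instructions sitting in memory.
program = [("ADD", 2), ("ADD", 3), ("MUL", 4)]
print(run(program))  # 20
```

The loop itself knows nothing about the problem being solved. All of the meaning comes from the instructions that the programmer placed in memory.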
315
00:14:59,800 --> 00:15:01,360
This asks what's next,
316
00:15:01,360 --> 00:15:03,920
and this is where your program is stored,
317
00:15:03,920 --> 00:15:06,820
or a program you purchased or came with your hardware,
318
00:15:06,820 --> 00:15:09,920
where that's all stored, and those are your places.
319
00:15:09,920 --> 00:15:13,000
And so you end up inside,
320
00:15:13,000 --> 00:15:16,320
your programs end up inside this memory.
321
00:15:16,320 --> 00:15:19,040
So then there's a, I mean,
322
00:15:19,040 --> 00:15:23,000
and so in software you tend to program the CPU,
323
00:15:23,000 --> 00:15:25,520
and if you had bought a desktop computer
324
00:15:25,520 --> 00:15:26,800
a number of years back,
325
00:15:26,800 --> 00:15:28,960
it would have this thing called the motherboard.
326
00:15:28,960 --> 00:15:31,100
And the motherboard is called this
327
00:15:31,100 --> 00:15:34,000
because it kind of connects all the components together.
328
00:15:34,000 --> 00:15:36,860
And so if you buy memory by itself, it does nothing,
329
00:15:36,860 --> 00:15:39,420
but it has a place to plug into the motherboard,
330
00:15:39,420 --> 00:15:41,320
and if you buy a microprocessor,
331
00:15:41,320 --> 00:15:44,080
it has a place to plug into the motherboard.
332
00:15:45,640 --> 00:15:49,520
And if you buy a hard drive,
333
00:15:51,000 --> 00:15:53,400
this is a really old hard drive,
334
00:15:53,400 --> 00:15:55,120
it has a place to plug in on the motherboard,
335
00:15:55,120 --> 00:15:57,880
and so the motherboard sort of connects everything together.
336
00:15:57,880 --> 00:16:00,120
The hard drive is secondary storage.
337
00:16:00,120 --> 00:16:04,280
Now, the way secondary storage is different
338
00:16:04,280 --> 00:16:08,400
than the main memory, which, there it is.
339
00:16:08,400 --> 00:16:11,600
I gotta unpile this stuff.
340
00:16:11,600 --> 00:16:14,460
So this main memory is really fast,
341
00:16:14,460 --> 00:16:17,640
but as soon as you turn the power off of this memory,
342
00:16:17,640 --> 00:16:19,120
it sort of vanishes.
343
00:16:19,120 --> 00:16:21,880
And so to store files like word processing files
344
00:16:21,880 --> 00:16:24,740
or text files or whatever,
345
00:16:24,740 --> 00:16:25,840
you gotta store it on something
346
00:16:25,840 --> 00:16:27,560
that lasts a little bit longer.
347
00:16:27,560 --> 00:16:31,000
And so that's the purpose of the secondary storage.
348
00:16:31,000 --> 00:16:32,220
It's permanent.
349
00:16:32,220 --> 00:16:34,280
When the power's off, it stores it.
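From a program's point of view, the difference between main memory and secondary storage looks roughly like the sketch below. It is an illustration only: the helper names `save_note` and `load_note`, the file name, and the use of a temporary folder are assumptions for this example. A value held in a variable lives in main memory, while a value written to a file lives on secondary storage and can be read back later.

```python
import os
import tempfile


def save_note(path, text):
    """Copy text from main memory to a file on secondary storage."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def load_note(path):
    """Read the text back from secondary storage into main memory."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


note = "remember the milk"  # held in a variable, so it lives in main memory
path = os.path.join(tempfile.mkdtemp(), "note.txt")  # hypothetical file location
save_note(path, note)       # now a copy also exists on secondary storage
del note                    # the in-memory copy is gone
print(load_note(path))      # the stored copy is still there
```

Deleting the variable only imitates what happens to main memory when the power goes off. The file is the part that survives.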
350
00:16:34,280 --> 00:16:36,680
Now this one here is in such bad shape
351
00:16:36,680 --> 00:16:38,800
that it probably isn't storing anything,
352
00:16:38,800 --> 00:16:40,580
but it's got these little heads,
353
00:16:40,580 --> 00:16:43,800
and it spins around and goes in and out.
354
00:16:43,800 --> 00:16:46,480
And we'll have a video later that shows you
355
00:16:46,480 --> 00:16:48,920
one of these things that's not quite in as bad a shape.
356
00:16:48,920 --> 00:16:51,760
If you look, this has four different platters
357
00:16:51,760 --> 00:16:53,560
that are all spinning around.
358
00:16:53,560 --> 00:16:56,460
And so this is just using magnetic material
359
00:16:56,460 --> 00:16:59,280
and electronics that sort of magnetize
360
00:16:59,280 --> 00:17:00,920
and demagnetize this stuff.
361
00:17:00,920 --> 00:17:02,520
And if you look at a disk,
362
00:17:02,520 --> 00:17:04,560
they're often rated, physical disks,
363
00:17:04,560 --> 00:17:06,440
are rated in revolutions per minute.
364
00:17:06,440 --> 00:17:08,520
And that's how many times this thing spins around.
365
00:17:08,520 --> 00:17:11,599
And if you got an old desktop and you hear it spin up,
366
00:17:11,599 --> 00:17:13,519
this is the thing that's spinning.
367
00:17:13,520 --> 00:17:16,280
And it's the place that your operating system lives,
368
00:17:16,280 --> 00:17:18,940
your files live, your applications live,
369
00:17:18,940 --> 00:17:21,319
while they're stored and while the computer's turned off.
370
00:17:21,319 --> 00:17:24,859
And then they're loaded into this while they're running.
371
00:17:24,859 --> 00:17:29,040
And then, this CPU takes the data from the main memory,
372
00:17:36,880 --> 00:17:41,680
and your program runs at three billion operations per second.
373
00:17:41,680 --> 00:17:45,320
So, let's talk a little bit about something
374
00:17:45,320 --> 00:17:50,400
that this is probably from the 1960s or 70s.
375
00:17:50,400 --> 00:17:54,120
This actually has, if you're an electrical person,
376
00:17:54,120 --> 00:17:59,120
it has capacitors, those little silver things are capacitors.
377
00:17:59,280 --> 00:18:02,400
These little colored things are resistors,
378
00:18:02,400 --> 00:18:03,480
and that's more capacitors.
379
00:18:03,480 --> 00:18:06,440
And then there's wires, and wires move everything.
380
00:18:06,440 --> 00:18:10,920
And so, when you say like this has millions of transistors,
381
00:18:10,920 --> 00:18:13,800
oh wait, that isn't a capacitor, that's a transistor.
382
00:18:13,800 --> 00:18:15,000
That's a transistor.
383
00:18:15,000 --> 00:18:17,960
When you say that this here has etched,
384
00:18:17,960 --> 00:18:19,160
and if you look closely at this,
385
00:18:19,160 --> 00:18:22,040
go look at a picture of a microprocessor online,
386
00:18:22,040 --> 00:18:24,120
you will see that it has millions of these.
387
00:18:24,120 --> 00:18:29,120
And so, the difference between 1960 and today
388
00:18:29,280 --> 00:18:34,120
is this circuitry of capacitors, resistors,
389
00:18:34,120 --> 00:18:39,120
and transistors has been miniaturized and put onto this.
390
00:18:39,880 --> 00:18:42,320
It's using a photographic process,
391
00:18:42,320 --> 00:18:45,560
and they're making them tinier and tinier and putting more and more on.
392
00:18:45,560 --> 00:18:48,820
And if you think going from millions of these
393
00:18:48,820 --> 00:18:53,640
to one of these is crazy, the thing that's happening now,
394
00:18:53,640 --> 00:18:56,560
and the reason we have whole computers inside our pocket,
395
00:18:56,560 --> 00:19:00,640
is that everything, all of this, this whole thing,
396
00:19:00,640 --> 00:19:04,520
CPU, memory, everything, all of it connected,
397
00:19:04,520 --> 00:19:07,760
and the storage is being made smaller and smaller.
398
00:19:07,760 --> 00:19:09,720
And so, this little single board computer
399
00:19:09,720 --> 00:19:12,480
called the Raspberry Pi has one thing in it,
400
00:19:12,480 --> 00:19:15,600
and it has the main memory, and it has the CPU,
401
00:19:15,600 --> 00:19:17,800
it has connections for things like peripherals,
402
00:19:17,800 --> 00:19:19,120
like keyboards and stuff.
403
00:19:19,120 --> 00:19:21,840
Now, it doesn't yet have secondary storage on it.
404
00:19:21,840 --> 00:19:26,120
The secondary storage gets plugged in right here via USB.
405
00:19:26,120 --> 00:19:29,720
And then if you take it one step farther to my phone,
406
00:19:29,720 --> 00:19:31,960
it's got the secondary storage built right in.
407
00:19:31,960 --> 00:19:36,880
And so, this picture goes from the size of cabinets
408
00:19:36,880 --> 00:19:40,680
in the old days all the way down to really tiny.
409
00:19:40,680 --> 00:19:44,280
But, at the end of the day, inside it is a highly
410
00:19:44,280 --> 00:19:47,840
sophisticated piece of circuitry that asks for instructions
411
00:19:47,840 --> 00:19:52,080
one at a time, and main memory that holds the instructions
412
00:19:52,080 --> 00:19:54,080
and feeds them, okay?
413
00:19:54,080 --> 00:19:56,000
Central processor does the thinking,
414
00:19:56,000 --> 00:19:58,760
take a look here, central processor does the thinking,
415
00:19:58,760 --> 00:20:01,520
it runs the program, it's asking what's next,
416
00:20:01,520 --> 00:20:04,520
it's not really smart, but it's really fast.
417
00:20:04,520 --> 00:20:08,480
And so, we compensate for the lack of intelligence
418
00:20:08,480 --> 00:20:11,440
of this thing by us writing really good software
419
00:20:11,440 --> 00:20:12,640
that runs really fast.
420
00:20:12,640 --> 00:20:16,920
And so, voice recognition on things like phones is possible
421
00:20:16,920 --> 00:20:20,200
because computers have so much storage and they run so fast
422
00:20:20,200 --> 00:20:23,120
and the algorithms that do voice recognition
423
00:20:23,120 --> 00:20:25,440
are finally starting to work.
424
00:20:25,440 --> 00:20:29,600
Input devices like keyboards and mice and pens and whatever,
425
00:20:29,600 --> 00:20:32,400
they come in, output devices are like the screens
426
00:20:32,400 --> 00:20:35,800
that we see, the main memory is the fast part
427
00:20:35,800 --> 00:20:37,920
of the computer that stores all the programs,
428
00:20:37,920 --> 00:20:41,160
and the secondary memory is the permanent storage.
429
00:20:41,160 --> 00:20:44,880
Increasingly, secondary memory,
430
00:20:44,880 --> 00:20:46,800
do I have any USB sticks in here?
431
00:20:47,720 --> 00:20:48,900
I don't.
432
00:20:48,900 --> 00:20:53,600
Well, increasingly secondary memory is flash RAM
433
00:20:53,600 --> 00:20:58,200
or static RAM with no moving parts.
434
00:20:58,200 --> 00:21:02,000
And so, in a few years you'll not even be able to see
435
00:21:02,000 --> 00:21:04,840
secondary memory with moving parts.
436
00:21:04,840 --> 00:21:07,160
But that's okay, it's still secondary memory,
437
00:21:07,160 --> 00:21:09,040
it's still memory that lasts.
438
00:21:09,040 --> 00:21:13,560
And so, you and where your place is in here
439
00:21:13,560 --> 00:21:15,120
is you live in the main memory.
440
00:21:15,120 --> 00:21:17,160
This is you, you are here.
441
00:21:17,160 --> 00:21:20,600
And so, in a sense, when the CPU asks the question
442
00:21:20,600 --> 00:21:23,160
what next, it is your job to answer that.
443
00:21:23,160 --> 00:21:25,880
And you answer that by writing Python code.
444
00:21:25,880 --> 00:21:28,760
And so, your Python code, you'll write a file in Python code.
445
00:21:28,760 --> 00:21:30,320
Blah, blah, blah, blah, blah, blah, blah.
446
00:21:30,320 --> 00:21:33,440
And then that Python code sort of gets loaded
447
00:21:33,440 --> 00:21:35,720
into main memory, there's a magic translation process
448
00:21:35,720 --> 00:21:36,560
that happens.
449
00:21:36,560 --> 00:21:40,120
And then your code is actually answering this question
450
00:21:40,120 --> 00:21:41,520
three billion times a second.
451
00:21:41,520 --> 00:21:43,960
Three billion times a second, you're sitting there.
452
00:21:43,960 --> 00:21:45,280
But this is you.
453
00:21:45,280 --> 00:21:48,220
You're really out here, but you then write a file
454
00:21:48,220 --> 00:21:50,500
and the file's loaded in and then the file runs.
455
00:21:50,500 --> 00:21:51,680
And that's where things are at.
456
00:21:51,680 --> 00:21:55,480
And that's your place in the world.
457
00:21:55,480 --> 00:21:58,600
Now, what's actually running is not Python code.
458
00:21:58,600 --> 00:22:01,280
There is, as I said, a translation process.
459
00:22:01,280 --> 00:22:05,440
You write a Python file and then Python itself
460
00:22:05,440 --> 00:22:08,120
translates this into the actual language
461
00:22:08,120 --> 00:22:12,120
known by the microprocessor, which is a series of zeros
462
00:22:12,120 --> 00:22:13,400
and ones called machine language.
463
00:22:13,400 --> 00:22:15,680
Someday I would love to teach you a class
464
00:22:15,680 --> 00:22:17,000
on machine language.
465
00:22:17,000 --> 00:22:18,760
But for now, we're gonna teach you Python
466
00:22:18,760 --> 00:22:20,400
and we're gonna use Python as a crutch.
467
00:22:20,400 --> 00:22:21,840
We don't have to talk machine language,
468
00:22:21,840 --> 00:22:24,600
but you could, if you really wanted to,
469
00:22:24,600 --> 00:22:26,160
you could know how to write machine language.
470
00:22:26,160 --> 00:22:30,000
But I assure you, Python is far easier to learn
471
00:22:30,000 --> 00:22:31,160
than machine language.
472
00:22:31,160 --> 00:22:33,540
So, Python acts as a translator,
473
00:22:33,540 --> 00:22:35,680
translates what you're doing into machine language,
474
00:22:35,680 --> 00:22:38,960
and then the machine language is what's sent back and forth.
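A small aside on the translation step, offered as a sketch rather than a full account: the standard CPython interpreter first turns your code into lower-level instructions called bytecode, which the interpreter then carries out on the CPU. The standard-library dis module lets you look at those instructions. The function name greet is made up for this example, and the exact instruction names vary between Python versions.

```python
import dis

def greet():
    # A tiny function whose translated form we want to inspect.
    return "hello world"

# Collect the names of the lower-level instructions CPython produced for greet().
# These are bytecode instructions, an intermediate step, not the CPU's own
# zeros and ones.
instructions = [ins.opname for ins in dis.get_instructions(greet)]
print(instructions)
```

The point is only that what runs is not the text you typed; something translated it first.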
475
00:22:38,960 --> 00:22:40,440
But still, even though it's translated
476
00:22:40,440 --> 00:22:42,520
to machine language, it's you.
477
00:22:42,520 --> 00:22:44,360
It is you answering those questions
478
00:22:44,360 --> 00:22:45,600
and that's what a program is,
479
00:22:45,600 --> 00:22:49,880
is you pre-storing your response to the what next question
480
00:22:49,880 --> 00:22:52,120
over and over again.
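The idea of a program as pre-stored answers to the "what next?" question can be sketched as a toy model. This is an illustration only: the instruction names LOAD, ADD and PRINT and the run function are made up here and have nothing to do with a real CPU's machine language.

```python
# Toy model only: a "program" is a stored list of answers to "what next?".
# The instruction names (LOAD, ADD, PRINT) are hypothetical.
def run(program):
    accumulator = 0
    output = []
    for op, arg in program:  # each pass through the loop is one "what next?"
        if op == "LOAD":
            accumulator = arg
        elif op == "ADD":
            accumulator += arg
        elif op == "PRINT":
            output.append(accumulator)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output

program = [("LOAD", 2), ("ADD", 3), ("PRINT", None)]
print(run(program))  # prints [5]
```

The loop never decides anything on its own; every step was written down ahead of time by whoever wrote the list.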
481
00:22:52,120 --> 00:22:54,200
So, here's a couple of videos that you can look at
482
00:22:54,200 --> 00:22:56,520
on YouTube about a CPU.
483
00:22:56,520 --> 00:22:59,200
These CPUs, and it looks very much like this CPU
484
00:22:59,200 --> 00:23:01,000
that I've got with me,
485
00:23:01,000 --> 00:23:05,920
these CPUs run at extremely high heat
486
00:23:05,920 --> 00:23:08,340
when you put this thing on your computer on your lap
487
00:23:08,340 --> 00:23:09,600
and it starts to heat up.
488
00:23:09,600 --> 00:23:12,160
That means it's thinking really, really hard.
489
00:23:12,160 --> 00:23:15,360
And so, this is a small little old video
490
00:23:15,360 --> 00:23:17,580
from a long time ago that shows what happens
491
00:23:17,580 --> 00:23:19,960
when you take out the cooling capability
492
00:23:19,960 --> 00:23:24,080
of microprocessors and just how hot they can be.
493
00:23:24,080 --> 00:23:28,160
And the other video that I have is a hard disk.
494
00:23:28,160 --> 00:23:31,280
Something like this hard disk that I have
495
00:23:31,280 --> 00:23:34,120
except that it works and they turn the power on.
496
00:23:34,120 --> 00:23:36,440
Some of them last for a few seconds,
497
00:23:36,440 --> 00:23:38,680
some of them last for a few minutes.
498
00:23:38,680 --> 00:23:40,160
It's never a,
499
00:23:40,160 --> 00:23:41,000
achoo!
500
00:23:43,520 --> 00:23:44,360
Achoo!
501
00:23:44,360 --> 00:23:46,440
I must be allergic to this hard drive.
502
00:23:47,280 --> 00:23:49,760
Or maybe it's because there's dust in this hard drive
503
00:23:49,760 --> 00:23:52,160
and I keep spinning it and I sneeze.
504
00:23:52,160 --> 00:23:56,960
But basically, some of them last for a few seconds,
505
00:23:56,960 --> 00:23:58,320
some of them last for a few minutes.
506
00:23:58,320 --> 00:24:00,120
It's not a good idea to open them up,
507
00:24:00,120 --> 00:24:01,640
but I'm glad somebody opened it up
508
00:24:01,640 --> 00:24:04,020
and then did what they did and then recorded it
509
00:24:04,020 --> 00:24:06,920
so we can all enjoy what it is
510
00:24:06,920 --> 00:24:09,000
that they're capable of doing, okay?
511
00:24:09,000 --> 00:24:12,120
So that's a quick introduction to hardware,
512
00:24:12,120 --> 00:24:15,480
mostly so that I can use those words going forward.
513
00:24:15,480 --> 00:24:16,920
Now, what we're gonna talk about next
514
00:24:16,920 --> 00:24:19,680
is communicating in the language Python.
515
00:24:19,680 --> 00:24:22,640
That is, writing code and putting it into the computer
516
00:24:22,640 --> 00:24:27,640
so that that can execute, okay?
517
00:24:30,040 --> 00:24:33,400
And welcome to my video that shows how to get started
518
00:24:33,400 --> 00:24:38,400
and install Python on Microsoft Windows, okay?
519
00:24:38,600 --> 00:24:40,440
So it's not too hard.
520
00:24:40,440 --> 00:24:42,920
We're gonna both install Python 3
521
00:24:42,920 --> 00:24:45,480
and we're going to install a text editor.
522
00:24:45,480 --> 00:24:48,240
And so I'm just gonna go into Google
523
00:24:48,240 --> 00:24:52,080
and I'm gonna install Python 3.
524
00:24:52,080 --> 00:24:54,760
And my top link is downloading Python.
525
00:24:55,680 --> 00:25:00,680
And there is my link for downloading Python 3.5.2.
526
00:25:00,800 --> 00:25:03,040
This version of my class uses Python 3.
527
00:25:03,040 --> 00:25:06,040
I have an earlier class that you may have seen
528
00:25:06,040 --> 00:25:08,400
that uses Python 2, but in this class,
529
00:25:08,400 --> 00:25:09,240
we're going to do this.
530
00:25:09,240 --> 00:25:10,880
Now, it might take you a while to download this.
531
00:25:10,880 --> 00:25:13,100
I've actually already downloaded it.
532
00:25:13,100 --> 00:25:16,960
Now, the other thing we need is a programmer text editor.
533
00:25:16,960 --> 00:25:20,240
And you can really use any programmer text editor.
534
00:25:20,240 --> 00:25:23,000
We've used Notepad++ in the past.
535
00:25:23,000 --> 00:25:25,120
We've used jEdit in the past.
536
00:25:25,120 --> 00:25:30,120
I like Atom, atom.io, A-T-O-M dot io,
537
00:25:30,200 --> 00:25:31,760
mostly because it works the same
538
00:25:31,760 --> 00:25:35,120
on Windows and Mac and Linux.
539
00:25:35,120 --> 00:25:39,200
But you can really use any text editor that you like.
540
00:25:39,200 --> 00:25:42,960
Just don't use Word or TextEdit
541
00:25:42,960 --> 00:25:44,400
that comes with the operating system.
542
00:25:44,400 --> 00:25:46,240
You need a programmer's editor
543
00:25:46,240 --> 00:25:50,060
that doesn't mess with weird characters or weird lines
544
00:25:50,060 --> 00:25:52,080
or strange formats.
545
00:25:52,080 --> 00:25:55,520
You must have a real programmer editor.
546
00:25:55,520 --> 00:25:58,200
And so I've already downloaded this as well.
547
00:25:59,040 --> 00:26:02,880
And so I won't waste the time waiting to download it,
548
00:26:02,880 --> 00:26:04,960
but let's go ahead and do the installation.
549
00:26:04,960 --> 00:26:09,960
So these things ended up in my downloads file.
550
00:26:15,960 --> 00:26:18,020
So I'm going to downloads.
551
00:26:18,020 --> 00:26:21,280
And I'll start installing Python 3.5.2.
552
00:26:22,760 --> 00:26:24,640
Now it's gonna ask me some things.
553
00:26:25,920 --> 00:26:28,120
Add Python 3.5 to the path.
554
00:26:28,120 --> 00:26:29,160
And that's a good idea.
555
00:26:29,160 --> 00:26:30,960
Install the launcher for all users.
556
00:26:30,960 --> 00:26:32,880
I'm going to add that.
557
00:26:32,880 --> 00:26:34,760
Maybe you will, maybe you won't do that.
558
00:26:34,760 --> 00:26:37,840
It's gonna tell me where it's going to install it.
559
00:26:42,000 --> 00:26:42,840
Install now.
560
00:26:44,280 --> 00:26:46,140
Of course, it's going to ask me
561
00:26:46,140 --> 00:26:48,120
for permission to do these things.
562
00:26:49,200 --> 00:26:51,440
And now it's running through the installation.
563
00:26:56,200 --> 00:26:57,980
Okay, so there we go.
564
00:26:57,980 --> 00:26:59,160
You could maybe click on this
565
00:26:59,160 --> 00:27:01,080
online tutorial and documentation.
566
00:27:02,520 --> 00:27:04,280
But we're just gonna close this.
567
00:27:06,400 --> 00:27:09,960
And I'm gonna start and run the Windows command line.
568
00:27:09,960 --> 00:27:14,960
Now, you may have all kinds of fancy ways to run Python,
569
00:27:14,960 --> 00:27:18,200
but I like running the command line,
570
00:27:19,200 --> 00:27:22,120
C-O-M-M-A-N-D.
571
00:27:22,120 --> 00:27:26,720
I like running the command line because after a while,
572
00:27:26,720 --> 00:27:29,800
it's important to know what folder things are being run in.
573
00:27:31,000 --> 00:27:33,040
And so here's this command line.
574
00:27:33,040 --> 00:27:36,080
And I should be able to type Python here.
575
00:27:36,080 --> 00:27:38,680
And so now I'm in Python 3.5.2.
576
00:27:38,680 --> 00:27:42,160
And this is, the chevron prompt here
577
00:27:42,160 --> 00:27:43,960
is the Python interpreter,
578
00:27:43,960 --> 00:27:46,040
where it's asking for Python commands.
579
00:27:46,040 --> 00:27:47,520
And I can say print.
580
00:27:51,320 --> 00:27:52,880
Hello world.
581
00:27:52,880 --> 00:27:56,840
Of course, this is what we tend to print all the time.
582
00:27:56,840 --> 00:27:57,980
I can make a mistake.
583
00:27:57,980 --> 00:27:59,280
I can say,
584
00:27:59,280 --> 00:28:00,120
lulululul.
585
00:28:06,720 --> 00:28:08,760
Right, and it'll complain to me.
586
00:28:08,760 --> 00:28:09,600
Now to get out of this,
587
00:28:09,600 --> 00:28:11,620
I can either type control-Z or quit().
588
00:28:11,620 --> 00:28:13,640
In this case, I'm gonna type control-Z.
589
00:28:13,640 --> 00:28:15,280
And I'm back to the prompt.
590
00:28:15,280 --> 00:28:16,700
A couple of things,
591
00:28:17,620 --> 00:28:20,920
I can do a dir to see what folders and files I have.
592
00:28:20,920 --> 00:28:22,520
And that is like my desktop.
593
00:28:23,420 --> 00:28:26,200
And then the cd command tells me
594
00:28:26,200 --> 00:28:28,200
where I'm at in the folder.
595
00:28:28,200 --> 00:28:31,520
That means I'm in the user's directory, Dr. Chuck.
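For comparison, the same two questions (where am I, and what is in this folder) can be asked from inside Python. This is a minimal sketch using the standard os module; the helper names where_am_i and list_folder are made up for the example.

```python
import os

def where_am_i():
    # Same information the Windows "cd" command prints when given no arguments.
    return os.getcwd()

def list_folder(path="."):
    # Roughly what "dir" shows: the names of files and folders, sorted here
    # so the output is easier to read.
    return sorted(os.listdir(path))

print(where_am_i())
print(list_folder())
```

The output depends on which folder you started Python in, which is exactly why it matters to know where you are.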
596
00:28:32,440 --> 00:28:35,420
So I have now installed Python.
597
00:28:35,420 --> 00:28:38,120
I ran the Python interpreter to verify it.
598
00:28:38,120 --> 00:28:40,480
I said, print hello world.
599
00:28:40,480 --> 00:28:41,640
And so now what I'm gonna do
600
00:28:41,640 --> 00:28:44,080
is I'm gonna actually install Atom.
601
00:28:44,080 --> 00:28:45,680
And I already had this downloaded.
602
00:28:45,680 --> 00:28:48,420
So let's go ahead and install Atom on my computer.
603
00:29:05,240 --> 00:29:07,400
Okay, so Atom is now installed
604
00:29:07,400 --> 00:29:09,920
and it's kind of telling us what to do.
605
00:29:09,920 --> 00:29:12,960
So I'm gonna actually just close all these windows,
606
00:29:12,960 --> 00:29:15,440
close this window, close everything.
607
00:29:15,440 --> 00:29:18,040
And I'm gonna create a file.
608
00:29:18,040 --> 00:29:20,280
I'm going to say print.
609
00:29:20,280 --> 00:29:21,400
In this case,
610
00:29:22,320 --> 00:29:26,600
let's see if I can make this bigger.
611
00:29:26,600 --> 00:29:28,420
I can make it bigger.
612
00:29:28,420 --> 00:29:33,260
So I'm gonna type print hello from a file.
613
00:29:34,120 --> 00:29:37,680
Okay, and I'm gonna save this.
614
00:29:37,680 --> 00:29:42,680
I'm gonna say file, save as.
615
00:29:43,960 --> 00:29:47,420
And what I'm gonna do is I'm gonna go to my desktop.
616
00:29:50,960 --> 00:29:55,520
And I'm gonna make a folder on the desktop.
617
00:29:55,520 --> 00:29:59,640
I'm gonna call this folder py4e.
618
00:29:59,640 --> 00:30:01,560
So I now have a folder on the desktop.
619
00:30:02,400 --> 00:30:04,240
Move this here, I'll move this here.
620
00:30:04,240 --> 00:30:06,000
Oops.
621
00:30:06,000 --> 00:30:08,380
And I'm gonna go into py4e.
622
00:30:09,520 --> 00:30:13,920
And then I'm gonna name this file first.py.
623
00:30:17,160 --> 00:30:19,340
And you'll notice that when I save this,
624
00:30:20,280 --> 00:30:22,240
when I save this,
625
00:30:22,240 --> 00:30:24,840
it syntax highlighted it.
626
00:30:24,840 --> 00:30:27,520
That's one of the nice things about a programmer editor.
627
00:30:27,520 --> 00:30:30,860
Okay, and so it says, oh, it's got a suffix of .py.
628
00:30:30,860 --> 00:30:34,120
So therefore it knows that it's supposed to look pretty
629
00:30:34,120 --> 00:30:35,800
with Python and make this one color,
630
00:30:35,800 --> 00:30:37,400
make this another color.
631
00:30:37,400 --> 00:30:38,600
The other thing that you'll notice
632
00:30:38,600 --> 00:30:42,120
is that I now have a folder called py4e.
633
00:30:42,120 --> 00:30:45,780
And if I am in this command line,
634
00:30:45,780 --> 00:30:47,200
let me just start that up again.
635
00:30:47,200 --> 00:30:49,600
I'll show you how to start the command line again.
636
00:30:52,560 --> 00:30:53,400
Command.
637
00:30:55,400 --> 00:30:58,880
Now, if I do a dir, I see the folders that I'm in.
638
00:30:58,880 --> 00:31:00,760
And one of the folders that you can see here
639
00:31:00,760 --> 00:31:02,160
is the desktop folder.
640
00:31:02,160 --> 00:31:04,080
So I'm gonna say cd desktop.
641
00:31:06,160 --> 00:31:07,600
And then I'm gonna type the dir command
642
00:31:07,600 --> 00:31:10,240
to see what folders are in the desktop.
643
00:31:10,240 --> 00:31:14,560
These folders are the same as these folders.
644
00:31:14,560 --> 00:31:16,520
These things are kind of virtual folders.
645
00:31:16,520 --> 00:31:19,680
Py4e is py4e.
646
00:31:19,680 --> 00:31:24,160
Now I can type cd, which stands for change directory, py4e.
647
00:31:25,360 --> 00:31:28,460
And I can do a dir, and I see first.py.
648
00:31:28,460 --> 00:31:32,920
And that's the same as if I'm diving into this folder.
649
00:31:32,920 --> 00:31:34,980
Here's this file, first.py.
650
00:31:34,980 --> 00:31:36,600
Windows hides the suffix,
651
00:31:36,600 --> 00:31:40,120
which is somewhat annoying and frustrating,
652
00:31:40,120 --> 00:31:44,440
but that suffix is there, that file is there.
653
00:31:44,440 --> 00:31:46,440
And so for me, one of the things
654
00:31:46,440 --> 00:31:47,560
you gotta figure out in Windows
655
00:31:47,560 --> 00:31:51,320
is how to make sure that you are in the same folder,
656
00:31:51,320 --> 00:31:54,780
users, Dr. Chuck, desktop, py4e,
657
00:31:54,780 --> 00:31:58,440
and that's the name of this file, and here as well.
658
00:31:58,440 --> 00:32:00,280
And now I'm gonna run this program.
659
00:32:00,280 --> 00:32:05,120
I'm gonna type python first.py.
660
00:32:06,900 --> 00:32:10,460
And you see that it ran the Python code, okay?
661
00:32:10,460 --> 00:32:15,460
Another way you can do this is you can type first.py.
662
00:32:15,960 --> 00:32:18,680
And that's because this file association
663
00:32:18,680 --> 00:32:19,640
has happened in Windows.
664
00:32:19,640 --> 00:32:21,160
This doesn't work in Macintosh.
665
00:32:21,160 --> 00:32:22,720
This only works in Windows.
666
00:32:22,720 --> 00:32:26,640
That all files with .py are expected to be Python,
667
00:32:26,640 --> 00:32:28,320
and it knows the Python interpreter
668
00:32:28,320 --> 00:32:30,160
where to run it, okay?
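As a sketch of what a file like first.py could contain: the video's version is a single print line, and the greeting function below is an addition made here so the message is easy to check.

```python
# first.py: a sketch of the program from the video. The greeting() wrapper
# is an assumption added for illustration; the original is one print line.
def greeting():
    return "hello from a file"

if __name__ == "__main__":
    print(greeting())
```

Saved in the py4e folder, it would be run from that same folder by typing python first.py at the command line.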
669
00:32:30,160 --> 00:32:33,120
And so I've got Python 3 installed,
670
00:32:33,120 --> 00:32:35,320
and that gets me started,
671
00:32:35,320 --> 00:32:40,200
and so I hope that this little introduction
672
00:32:40,200 --> 00:32:41,560
about getting things started
673
00:32:41,560 --> 00:32:43,640
and writing your first Python program
674
00:32:43,640 --> 00:32:44,900
has been helpful to you.
675
00:32:49,760 --> 00:32:51,160
We're going to actually download
676
00:32:51,160 --> 00:32:55,920
and install Python 3 from python.org on a Macintosh.
677
00:32:55,920 --> 00:32:58,640
Your Macintosh for years has wonderfully
678
00:32:58,640 --> 00:32:59,840
come with Python 2.
679
00:32:59,840 --> 00:33:03,220
So if I type python --version,
680
00:33:04,160 --> 00:33:07,640
then I type that.
681
00:33:07,640 --> 00:33:09,680
I see that I've got Python 2.
682
00:33:09,680 --> 00:33:12,560
What we wanna do is, in addition, install Python 3.
683
00:33:12,560 --> 00:33:16,400
One of these days, Macintosh might upgrade
684
00:33:16,400 --> 00:33:19,520
their distributed version to Python 3,
685
00:33:19,520 --> 00:33:21,240
but there's so many things inside Mac
686
00:33:21,240 --> 00:33:22,760
that depend on Python 2.
687
00:33:22,760 --> 00:33:25,680
I'm gonna expect that it will always be named Python 3,
688
00:33:25,680 --> 00:33:29,840
which is what we're gonna call it in a second.
689
00:33:29,840 --> 00:33:33,320
So here I am at the python.org downloads,
690
00:33:33,320 --> 00:33:36,080
and I'm gonna download Python 3.
691
00:33:36,080 --> 00:33:38,640
You click here, and I've actually got it sitting here
692
00:33:38,640 --> 00:33:42,120
in downloads already, because I always do that.
693
00:33:42,120 --> 00:33:45,360
And so I'm gonna install this.
694
00:33:46,560 --> 00:33:48,640
There is the installer.
695
00:33:48,640 --> 00:33:50,960
I'm gonna say continue, continue, continue.
696
00:33:50,960 --> 00:33:53,680
Of course I agree, I read all that really fast,
697
00:33:53,680 --> 00:33:55,320
and now I'm going to install it.
698
00:33:55,320 --> 00:34:00,320
Okay, so now that means if I run a terminal,
699
00:34:09,360 --> 00:34:12,040
so this of course is start run terminal,
700
00:34:12,040 --> 00:34:15,320
so Python 2 is still there,
701
00:34:15,320 --> 00:34:18,800
but Python 3 is also now there,
702
00:34:18,800 --> 00:34:20,800
so we should have Python 3 installed.
703
00:34:20,800 --> 00:34:24,000
So we installed Python 3.6, and so there we go,
704
00:34:24,000 --> 00:34:26,800
and that's all it takes to install Python 3
705
00:34:26,800 --> 00:34:28,719
on the Macintosh.
706
00:34:28,719 --> 00:34:31,419
So let's write our first little Python program.
707
00:34:32,840 --> 00:34:36,000
I'm going to, I like Atom,
708
00:34:38,159 --> 00:34:40,119
and so I've got this Atom editor.
709
00:34:40,120 --> 00:34:43,800
It's atom.io, right here, atom.io,
710
00:34:43,800 --> 00:34:45,980
download and install the Atom editor.
711
00:34:45,980 --> 00:34:49,440
I like it because Atom works the same
712
00:34:49,440 --> 00:34:52,679
on both Windows, Mac, and Linux,
713
00:34:52,679 --> 00:34:54,039
and it has syntax highlighting,
714
00:34:54,040 --> 00:34:55,840
and so I really like things like that.
715
00:34:55,840 --> 00:35:00,240
So I'm gonna make myself a simple Python program.
716
00:35:02,520 --> 00:35:04,360
Hello world, like we always do.
717
00:35:04,360 --> 00:35:06,720
Now you'll notice that it's not syntax highlighting yet,
718
00:35:06,720 --> 00:35:10,960
but I'm gonna do a file, save, oopsie daisy,
719
00:35:10,960 --> 00:35:15,560
file, save as, and I'm gonna go into my desktop,
720
00:35:15,560 --> 00:35:19,460
and I'm gonna make a folder called py4e.
721
00:35:19,460 --> 00:35:24,460
I'm just gonna call this hello.py.
722
00:35:27,280 --> 00:35:32,280
Oh crud, gotta rename it, rename it.
723
00:35:33,760 --> 00:35:37,120
I ended up with two dots, hello.py, there we are.
724
00:35:37,120 --> 00:35:40,160
And so now I'm here, and I'm in my home folder.
725
00:35:40,160 --> 00:35:42,240
I can go into my desktop, and I can go into
726
00:35:42,240 --> 00:35:45,000
that new folder I made, Python for Everybody,
727
00:35:45,000 --> 00:35:47,040
and I can see the files.
728
00:35:47,040 --> 00:35:50,200
Now there are ways to run this, and I really don't,
729
00:35:50,200 --> 00:35:52,680
I really want you to learn the terminal
730
00:35:53,560 --> 00:35:54,840
so that you really know what you're doing.
731
00:35:54,840 --> 00:35:56,940
And so here we are, we are in the folder
732
00:35:56,940 --> 00:35:59,660
that has the Python, and then all we do to run it
733
00:35:59,660 --> 00:36:04,660
is we say python3 hello.py, and there we go.
734
00:36:05,280 --> 00:36:06,600
And of course this is Python 3
735
00:36:06,600 --> 00:36:08,240
because I'm using parentheses there.
736
00:36:08,240 --> 00:36:11,320
So instead of just double quotes.
737
00:36:11,320 --> 00:36:13,440
But Python 2 is still there, and of course
738
00:36:13,440 --> 00:36:15,880
if you just run python hello.py,
739
00:36:15,880 --> 00:36:19,600
it'll be a syntax error, or not.
740
00:36:19,600 --> 00:36:21,000
Must be they added something.
741
00:36:21,000 --> 00:36:21,840
Ha ha ha.
742
00:36:22,760 --> 00:36:25,420
Yeah, because Python is still version,
743
00:36:26,940 --> 00:36:29,160
still version two, but apparently they allowed print
744
00:36:29,160 --> 00:36:31,600
in the latest version of Python 2.
745
00:36:31,600 --> 00:36:33,580
So away we go.
746
00:36:34,480 --> 00:36:37,680
Okay, so again, thanks for watching.
747
00:36:37,680 --> 00:36:41,160
I hope this was helpful to you to get Python 3
748
00:36:41,160 --> 00:36:46,160
installed on your Macintosh.
749
00:36:47,200 --> 00:36:50,160
Hello, and welcome back to Python as a Language.
750
00:36:50,160 --> 00:36:52,280
You'll notice that I'm wearing a hat.
751
00:36:53,880 --> 00:36:57,600
And part of the story of the hat is that
752
00:36:57,600 --> 00:36:59,880
where I work here at the University of Michigan
753
00:36:59,880 --> 00:37:03,840
School of Information, my office is in this building
754
00:37:03,840 --> 00:37:05,520
called North Quad.
755
00:37:05,520 --> 00:37:09,320
And we call it Quadwarts sometimes
756
00:37:09,320 --> 00:37:11,240
because it's sort of got a square,
757
00:37:11,240 --> 00:37:13,740
it sort of imitates an Oxford quad.
758
00:37:13,740 --> 00:37:17,880
And so it seemed to me to evoke notions of Harry Potter.
759
00:37:17,880 --> 00:37:19,680
And when we first moved into the building,
760
00:37:19,680 --> 00:37:22,680
I joked in one of my classes that
761
00:37:22,680 --> 00:37:25,160
we should have a sorting ceremony for all the students
762
00:37:25,160 --> 00:37:28,360
as they come into North Quad for the first time.
763
00:37:28,360 --> 00:37:32,480
And so that was cool, and I thought that I would belong
764
00:37:32,480 --> 00:37:36,880
in Gryffindor, like everyone wants to be in Gryffindor,
765
00:37:36,880 --> 00:37:38,040
right, they're the good guys.
766
00:37:38,040 --> 00:37:41,860
And my students told me that I couldn't be in Gryffindor,
767
00:37:43,240 --> 00:37:45,340
that I had to be in Slytherin.
768
00:37:45,340 --> 00:37:47,900
So you'll see me drinking tea throughout the course
769
00:37:47,900 --> 00:37:49,120
out of this teacup.
770
00:37:49,120 --> 00:37:51,800
It's my Slytherin teacup.
771
00:37:51,800 --> 00:37:54,120
I picked that up from Harry Potter World.
772
00:37:54,120 --> 00:37:58,120
I went down to Florida and visited Harry Potter World.
773
00:37:58,120 --> 00:38:03,120
And the reason that I was sorted by my students
774
00:38:03,120 --> 00:38:08,120
into Slytherin is also because I teach Python.
775
00:38:09,320 --> 00:38:13,360
And Python is like a snake.
776
00:38:13,360 --> 00:38:17,280
And so if you think about the people from Slytherin,
777
00:38:17,280 --> 00:38:20,560
they are capable of talking to snakes.
778
00:38:20,560 --> 00:38:22,080
And the class that we were doing the sorting
779
00:38:22,080 --> 00:38:23,160
was a Python class.
780
00:38:23,160 --> 00:38:26,160
And so it sort of made perfect sense
781
00:38:26,160 --> 00:38:28,800
that you would have to be in Slytherin
782
00:38:28,800 --> 00:38:30,680
if you were the Python teacher.
783
00:38:30,680 --> 00:38:33,920
And of course, your name is Charles Severance.
784
00:38:33,920 --> 00:38:36,720
And then that sounds kind of like Severus Snape.
785
00:38:36,720 --> 00:38:41,720
And so I just accepted that I'm in Slytherin, okay?
786
00:38:43,720 --> 00:38:46,480
So you all can be in Gryffindor, but I can't.
787
00:38:46,480 --> 00:38:47,320
I'm in Slytherin.
788
00:38:47,320 --> 00:38:49,720
So I'm the bad guy or the good guy.
789
00:38:49,720 --> 00:38:51,800
Depends on how you look at it, right?
790
00:38:52,800 --> 00:38:56,400
And so what I'm going to do now is I'm going to bring you
791
00:38:56,400 --> 00:38:59,720
into Slytherin as well.
792
00:38:59,720 --> 00:39:04,520
Because I'm going to teach you the Python language.
793
00:39:04,520 --> 00:39:09,120
Python is the language that we Pythonistas talk.
794
00:39:09,120 --> 00:39:11,640
It was invented over 20 years ago
795
00:39:11,640 --> 00:39:13,880
by a fellow named Guido van Rossum.
796
00:39:13,880 --> 00:39:16,400
And away we go.
797
00:39:16,400 --> 00:39:20,160
Now, even though I'm using this whole snake Slytherin thing,
798
00:39:20,160 --> 00:39:22,960
it turns out that Python was not at all named
799
00:39:22,960 --> 00:39:26,000
for Harry Potter because Python was invented
800
00:39:26,000 --> 00:39:29,120
almost two decades before Harry Potter was created.
801
00:39:29,120 --> 00:39:30,880
And it wasn't for the snake.
802
00:39:30,880 --> 00:39:33,560
It was actually, Monty Python's Flying Circus
803
00:39:33,560 --> 00:39:37,360
was the inspiration for Python, the name Python.
804
00:39:37,360 --> 00:39:40,240
And because Guido van Rossum really wanted
805
00:39:40,240 --> 00:39:41,800
to create a programming language,
806
00:39:41,800 --> 00:39:44,440
that while it was powerful underneath it
807
00:39:44,440 --> 00:39:47,120
in its very nature was a very powerful language,
808
00:39:47,120 --> 00:39:50,000
he wanted it to be a language that was fun.
809
00:39:50,000 --> 00:39:52,440
And he wanted it to be a language that was approachable.
810
00:39:52,440 --> 00:39:55,040
And so that's why Python recently has become
811
00:39:55,040 --> 00:39:59,360
so absolutely popular.
812
00:39:59,360 --> 00:40:02,480
And it's easy to learn.
813
00:40:02,480 --> 00:40:03,560
But it's also powerful.
814
00:40:03,560 --> 00:40:05,240
And that's sort of the magic of Python,
815
00:40:05,240 --> 00:40:09,600
is the ease of learning it, the brevity of the programs,
816
00:40:09,600 --> 00:40:14,320
the shortness of the programs, and the power.
817
00:40:14,320 --> 00:40:18,000
And so we are going to become Pythonistas.
818
00:40:18,000 --> 00:40:21,760
Now, as you learn to be a software developer using
819
00:40:21,760 --> 00:40:24,520
the Python programming language, you
820
00:40:24,520 --> 00:40:27,600
are going to encounter syntax errors.
821
00:40:27,600 --> 00:40:30,840
And I remember when I used to get syntax errors.
822
00:40:30,840 --> 00:40:34,840
And I remember my first programming class.
823
00:40:34,840 --> 00:40:37,680
And I would type on cards.
824
00:40:46,320 --> 00:40:51,800
And I would upload those cards to the computer.
825
00:40:51,800 --> 00:40:56,240
And the computer would say, you're not worthy.
826
00:40:56,240 --> 00:40:58,280
And I'm like, wait a sec, those are pretty good cards.
827
00:40:58,280 --> 00:41:01,280
How could you be so critical of me?
828
00:41:01,280 --> 00:41:02,400
It'd say syntax error.
829
00:41:02,400 --> 00:41:07,000
And I really got sort of a really bad attitude
830
00:41:07,000 --> 00:41:09,800
that somehow this computer didn't like me.
831
00:41:09,800 --> 00:41:12,680
And that I would make cards that would complain.
832
00:41:12,680 --> 00:41:14,320
And I would make changes to the cards.
833
00:41:14,320 --> 00:41:15,280
And it would still complain.
834
00:41:15,280 --> 00:41:17,280
And I'd make changes and it would still complain.
835
00:41:17,280 --> 00:41:20,200
I'm like, how can I win in this situation?
836
00:41:20,200 --> 00:41:21,760
And you're going to feel the same thing.
837
00:41:21,760 --> 00:41:23,760
You're going to absolutely feel the same thing.
838
00:41:23,760 --> 00:41:25,440
You're going to be struggling.
839
00:41:25,440 --> 00:41:28,480
You're going to be like, how come this computer hates me?
840
00:41:28,480 --> 00:41:31,240
Let me assure you right now the computer doesn't hate you.
841
00:41:31,240 --> 00:41:33,320
The computer actually loves you.
842
00:41:33,320 --> 00:41:36,760
It just is not very good at showing how it loves you
843
00:41:36,760 --> 00:41:40,080
or telling you how or why it loves you.
844
00:41:40,080 --> 00:41:45,120
And so syntax errors are not so much Python telling you
845
00:41:45,120 --> 00:41:47,560
that you're bad or that you're an inadequate programmer
846
00:41:47,560 --> 00:41:49,840
or you should find something else to do.
847
00:41:49,840 --> 00:41:52,680
It's really Python's admission that it doesn't understand
848
00:41:52,680 --> 00:41:54,280
what you're trying to say.
849
00:41:54,280 --> 00:41:56,080
And so you've got to get used to that.
850
00:41:56,080 --> 00:41:58,000
And it's frustrating, but you've got
851
00:41:58,000 --> 00:42:00,680
to get used to the fact that syntax errors are your friend.
852
00:42:00,680 --> 00:42:03,480
Python is saying, hey, I got to line seven.
853
00:42:03,480 --> 00:42:05,120
And I was doing fine up to line seven.
854
00:42:05,120 --> 00:42:08,360
But boy, in line seven, there's some little thing.
855
00:42:08,360 --> 00:42:12,600
I don't know what the word else means in this context.
856
00:42:12,600 --> 00:42:13,840
Or you didn't indent it.
857
00:42:13,840 --> 00:42:15,160
And so I'm kind of confused.
858
00:42:15,160 --> 00:42:15,880
What did you mean?
859
00:42:15,880 --> 00:42:18,560
Please, please, please help me.
860
00:42:18,560 --> 00:42:20,800
And so it's so much easier for you
861
00:42:20,800 --> 00:42:23,880
to learn Python than it is for Python
862
00:42:23,880 --> 00:42:26,840
to figure out what you mean when you're writing code.
863
00:42:26,840 --> 00:42:28,320
So we have a number of different ways
864
00:42:28,320 --> 00:42:30,480
to sort of encode our instructions
865
00:42:30,480 --> 00:42:32,040
when we talk to Python.
866
00:42:32,040 --> 00:42:35,120
One is we just run Python interactively on our computer.
867
00:42:35,120 --> 00:42:37,240
Hopefully, by now, you've got it installed.
868
00:42:37,240 --> 00:42:39,640
And you just type Python at a command prompt.
869
00:42:39,640 --> 00:42:42,120
So either a Windows command prompt or a Linux command
870
00:42:42,120 --> 00:42:44,160
prompt or a Macintosh command prompt.
871
00:42:44,160 --> 00:42:47,240
And I got some examples of how to sort of get this all started,
872
00:42:47,240 --> 00:42:50,160
get Python installed, and away you go.
873
00:42:50,160 --> 00:42:52,360
Now, you'll notice when you run the Python interpreter,
874
00:42:52,360 --> 00:42:57,000
the three-chevron prompt (>>>) is Python asking you what next.
875
00:42:57,000 --> 00:42:57,960
This is you.
876
00:42:57,960 --> 00:43:00,280
It's saying, I want to talk to you.
877
00:43:00,280 --> 00:43:02,960
I want you to tell me some Python to do.
878
00:43:02,960 --> 00:43:04,480
If you know the Python language, you
879
00:43:04,480 --> 00:43:07,080
know what to say right here.
880
00:43:07,080 --> 00:43:09,800
Now, if you know Python, you can type these statements.
881
00:43:09,800 --> 00:43:12,360
You can say, oh, x equals 1, which really means,
882
00:43:12,360 --> 00:43:14,640
go find a little piece of memory, label it x,
883
00:43:14,640 --> 00:43:15,880
and stick 1 in it.
884
00:43:15,880 --> 00:43:17,440
Print x is like, go find that thing
885
00:43:17,440 --> 00:43:19,720
where you labeled it x, and bring me back that number
886
00:43:19,720 --> 00:43:21,120
and tell me what I stored in there.
887
00:43:21,120 --> 00:43:23,560
Now, why you want to do this, that's a different question.
888
00:43:23,560 --> 00:43:24,960
These are very simple things.
889
00:43:24,960 --> 00:43:27,040
It's going to take you a while to get the big picture of why
890
00:43:27,040 --> 00:43:27,720
we're doing this.
891
00:43:27,720 --> 00:43:31,540
So just trust me that you want to learn these statements.
892
00:43:31,540 --> 00:43:33,520
And then later, we will successfully
893
00:43:33,520 --> 00:43:35,840
turn those into a program.
894
00:43:35,840 --> 00:43:39,280
So x equals x plus 1, the third line there.
895
00:43:39,280 --> 00:43:43,760
x equals x plus 1 is not, as it seems in math,
896
00:43:43,760 --> 00:43:46,320
it basically says, hey, go grab the old value of x,
897
00:43:46,320 --> 00:43:48,240
add 1 to it, and stick it back in x.
898
00:43:48,240 --> 00:43:49,240
That's what that means.
899
00:43:49,240 --> 00:43:52,720
So the equal sign really has kind of an arrow to it.
900
00:43:52,720 --> 00:43:54,880
And then we say, hey, go look up that x thing
901
00:43:54,880 --> 00:43:56,720
that we just did, and print that out,
902
00:43:56,720 --> 00:43:58,680
and then we're going to say, quit.
903
00:43:58,680 --> 00:44:00,880
So that's us talking to Python.
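As a minimal sketch, the interactive session described here could look like the following. It is written as plain statements rather than with the >>> prompt so it can also be run as a file, and the comments are added for illustration.

```python
# What you would type at the >>> prompt, one statement at a time.
x = 1          # find a piece of memory, label it x, and stick 1 in it
print(x)       # look up x and show what is stored there: 1
x = x + 1      # grab the old value of x, add 1, and stick it back in x
print(x)       # x now holds 2
# In the interactive interpreter you would finish by typing quit()
```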
904
00:44:00,880 --> 00:44:04,160
Now, you can type just about any crazy stuff you want in here,
905
00:44:04,160 --> 00:44:08,120
and Python will be unhappy and talk to you.
906
00:44:08,120 --> 00:44:10,880
So what we're going to do next is
907
00:44:10,880 --> 00:44:13,520
we're going to start talking about the actual language
908
00:44:13,520 --> 00:44:16,640
of Python and what it is that we have to say
909
00:44:16,640 --> 00:44:19,120
to make Python happy when we're talking to it.
910
00:44:24,080 --> 00:44:25,760
So now we're going to start learning
911
00:44:25,760 --> 00:44:28,380
the actual Python language.
912
00:44:28,380 --> 00:44:30,440
So what do we say?
913
00:44:30,440 --> 00:44:32,640
You can think of this as almost like writing,
914
00:44:32,640 --> 00:44:34,880
almost like writing a story.
915
00:44:34,880 --> 00:44:36,960
We're going to start with a basic vocabulary.
916
00:44:36,960 --> 00:44:40,200
We're going to talk a little bit about lines or sentences.
917
00:44:40,200 --> 00:44:41,680
And then we're going to start talking about
918
00:44:41,680 --> 00:44:44,200
how to put those sentences together
919
00:44:44,200 --> 00:44:47,640
to make a coherent paragraph, as it were.
920
00:44:47,640 --> 00:44:49,960
And you just have to accept the fact
921
00:44:49,960 --> 00:44:52,480
that when I start teaching you this stuff,
922
00:44:52,480 --> 00:44:53,800
it's not going to make sense
923
00:44:53,800 --> 00:44:56,800
for about six or seven more chapters.
924
00:44:56,800 --> 00:44:59,440
And so just sort of bear with me,
925
00:44:59,440 --> 00:45:02,520
except, I mean, I remember when I first learned,
926
00:45:02,520 --> 00:45:05,560
it went from me confused, confused, confused,
927
00:45:05,560 --> 00:45:09,120
confused, confused, holy mackerel, this is awesome.
928
00:45:09,120 --> 00:45:12,380
And so I expect many of you will go through that same thing.
929
00:45:12,380 --> 00:45:14,880
So just learn the first parts,
930
00:45:14,880 --> 00:45:17,960
accept the fact that it doesn't necessarily make sense
931
00:45:17,960 --> 00:45:22,960
in a big picture, and just bear with us, okay?
932
00:45:23,040 --> 00:45:24,280
So we'll start with vocabulary,
933
00:45:24,280 --> 00:45:25,440
we'll start to make sentences,
934
00:45:25,440 --> 00:45:29,120
and then we'll have little short stories and paragraphs, okay?
935
00:45:29,120 --> 00:45:31,240
And so this is a short story
936
00:45:31,240 --> 00:45:34,040
about how to count the words in Python.
937
00:45:34,040 --> 00:45:35,400
It's got a couple of paragraphs,
938
00:45:35,400 --> 00:45:39,520
and we are going to look at all of this stuff eventually.
939
00:45:39,520 --> 00:45:42,980
So we start with a set of reserved words.
940
00:45:42,980 --> 00:45:44,780
And what are reserved words?
941
00:45:44,780 --> 00:45:49,440
Well, they're words that Python expects
942
00:45:49,440 --> 00:45:51,720
when you use these words that they're gonna mean
943
00:45:51,720 --> 00:45:53,760
exactly what Python expects them to mean.
944
00:45:53,760 --> 00:45:55,520
And what it really means is you're not allowed
945
00:45:55,520 --> 00:45:57,040
to use them for any other purpose
946
00:45:57,040 --> 00:45:58,440
than the purpose that Python wants.
947
00:45:58,440 --> 00:46:00,440
It's sort of part of the contract.
948
00:46:00,440 --> 00:46:02,720
It's like when you have a dog,
949
00:46:02,720 --> 00:46:07,340
and you go, what did you think of that television program?
950
00:46:07,340 --> 00:46:08,960
And the dog has no idea what you're saying,
951
00:46:08,960 --> 00:46:13,200
and then you say, do you wanna wait until Saturday
952
00:46:13,200 --> 00:46:17,260
to go to the veterinarian?
953
00:46:17,260 --> 00:46:19,040
And the dog still doesn't know what you're saying.
954
00:46:19,040 --> 00:46:22,160
Then you go like, how would you like to take a walk?
955
00:46:22,160 --> 00:46:23,640
And then the dog goes, walk?
956
00:46:23,640 --> 00:46:25,880
I know what that means, and then hits the door, right?
957
00:46:25,880 --> 00:46:29,320
And so the way the dog sees you is blah, blah, blah,
958
00:46:29,320 --> 00:46:31,120
walk, blah, blah, blah, blah, food,
959
00:46:31,120 --> 00:46:34,480
blah, blah, blah, blah, treat, blah, blah, blah, blah, walk.
960
00:46:34,480 --> 00:46:37,200
That's kinda how Python looks at these reserved words.
961
00:46:37,200 --> 00:46:39,040
When you say class, it goes class.
962
00:46:39,040 --> 00:46:40,520
Oh, I know what that means.
963
00:46:40,520 --> 00:46:44,120
Now, if I say zap, it's like, oh, zap's something
964
00:46:44,120 --> 00:46:47,280
that you get to decide, or it's maybe a variable name.
965
00:46:47,280 --> 00:46:49,440
So reserved words are simply words
966
00:46:49,440 --> 00:46:52,240
that when you use these words in Python,
967
00:46:52,240 --> 00:46:53,960
and there's only a few of them,
968
00:46:53,960 --> 00:47:00,960
like and, or del, or if, maybe pass, maybe in.
969
00:47:02,320 --> 00:47:04,760
A lot of these, you won't end up using them,
970
00:47:04,760 --> 00:47:06,880
it's just these are reserved for Python
971
00:47:06,880 --> 00:47:08,960
and part of the Python vocabulary.
972
00:47:08,960 --> 00:47:11,160
This is Python vocabulary.
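One way to see this vocabulary for yourself is to ask Python for it. This is a small sketch using the standard library keyword module. The word zap is just a made-up name, as in the example above.

```python
import keyword

# The full list of reserved words for the Python version you are running.
print(keyword.kwlist)

# class is part of Python's vocabulary; zap is a name you are free to use.
print(keyword.iskeyword("class"))  # True
print(keyword.iskeyword("zap"))    # False
```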
973
00:47:11,160 --> 00:47:15,240
Now, when we move from words to sentences,
974
00:47:15,240 --> 00:47:17,760
you see that Python is a series of lines.
975
00:47:17,760 --> 00:47:20,400
A Python program is a series of statements.
976
00:47:20,400 --> 00:47:22,960
They have an order because the computer wants to know
977
00:47:22,960 --> 00:47:24,520
what next, what next, what next.
978
00:47:24,520 --> 00:47:27,880
So, what next is start at the beginning.
979
00:47:27,880 --> 00:47:30,040
So, I already talked about an assignment statement
980
00:47:30,040 --> 00:47:32,320
that basically says x equals two,
981
00:47:32,320 --> 00:47:33,880
this is not a mathematical statement,
982
00:47:33,880 --> 00:47:38,120
this is a directive to say, take this variable two,
983
00:47:38,120 --> 00:47:39,880
this value two, this constant two,
984
00:47:39,880 --> 00:47:42,520
and stick it in a location in your memory,
985
00:47:42,520 --> 00:47:45,400
and remember that I asked you to name it x.
986
00:47:45,400 --> 00:47:47,760
X is a variable, something you made up.
987
00:47:47,760 --> 00:47:52,140
You chose that, but it's Python's job to remember it.
988
00:47:52,140 --> 00:47:55,080
So, this says go, whatever that x is,
989
00:47:55,080 --> 00:47:58,120
there's a two in there, now pull that x back out,
990
00:47:58,120 --> 00:48:00,200
add two to it, which makes it four,
991
00:48:00,200 --> 00:48:02,600
and stick it back in x, and so that makes this a four.
992
00:48:02,600 --> 00:48:06,280
So, x is a four, and print x says,
993
00:48:06,280 --> 00:48:09,040
go look up that thing that was an x and print it out.
994
00:48:09,040 --> 00:48:12,360
And so, these are like, each line has something to it,
995
00:48:12,360 --> 00:48:15,000
I'm using a reserved word, well actually that's a function,
996
00:48:15,000 --> 00:48:17,460
so it's not actually a reserved word in Python 3.
997
00:48:18,480 --> 00:48:21,760
And so, there's reserved words and all these things,
998
00:48:21,760 --> 00:48:24,560
and you combine these, there are operators,
999
00:48:24,560 --> 00:48:26,680
plus is an operator, equals is an operator,
1000
00:48:26,680 --> 00:48:29,800
these things do things, and we'll learn all this stuff
1001
00:48:29,800 --> 00:48:33,980
in time, so the basic building blocks of lines of Python.
1002
00:48:35,920 --> 00:48:38,640
Now, as we take these lines of Python and build them up,
1003
00:48:38,640 --> 00:48:42,000
we end up making paragraphs, programming in paragraphs.
1004
00:48:42,000 --> 00:48:45,200
And so, one of the things that's important is
1005
00:48:45,200 --> 00:48:47,620
I showed you how to do interactive Python,
1006
00:48:47,620 --> 00:48:49,840
so you just type Python and you type a statement
1007
00:48:49,840 --> 00:48:52,860
and a statement, those get really tiring
1008
00:48:52,860 --> 00:48:54,940
after about three or four lines of Python,
1009
00:48:54,940 --> 00:48:57,840
because you start making mistakes and you have to start over.
1010
00:48:57,840 --> 00:49:00,900
So, the better thing to do is to, as your program
1011
00:49:00,900 --> 00:49:03,400
gets a little larger, to write a script,
1012
00:49:03,400 --> 00:49:05,560
put your Python instructions in a file,
1013
00:49:05,560 --> 00:49:08,520
and then tell Python to read from the file,
1014
00:49:08,520 --> 00:49:12,760
and then run the script as it's entered in that file.
1015
00:49:12,760 --> 00:49:15,480
We tend to name these files with .py,
1016
00:49:15,480 --> 00:49:18,300
and I've got a series of videos that you can watch
1017
00:49:18,300 --> 00:49:20,520
to figure out how this all works.
1018
00:49:20,520 --> 00:49:23,060
Like I said, you can type interactively to Python,
1019
00:49:23,060 --> 00:49:25,740
and it's a great way to experiment with Python,
1020
00:49:25,740 --> 00:49:28,240
check to see if a statement does what you think it does,
1021
00:49:28,240 --> 00:49:31,360
but script is the way, after we are past
1022
00:49:31,360 --> 00:49:33,700
one or two lines of code, we write it in files
1023
00:49:33,700 --> 00:49:35,060
and then run it separately.
1024
00:49:37,240 --> 00:49:39,880
So, there are a couple of basic patterns,
1025
00:49:39,880 --> 00:49:42,320
and it's really important to understand
1026
00:49:42,320 --> 00:49:43,920
each of these patterns, and like I said,
1027
00:49:43,920 --> 00:49:45,920
we'll teach you these patterns separately,
1028
00:49:45,920 --> 00:49:47,880
and then we'll combine them together.
1029
00:49:47,880 --> 00:49:49,360
And when you combine them together is when you say,
1030
00:49:49,360 --> 00:49:51,000
oh, that's what a program is.
1031
00:49:51,000 --> 00:49:53,960
So, you have to suspend disbelief.
1032
00:49:53,960 --> 00:49:55,640
We have a couple of different patterns.
1033
00:49:55,640 --> 00:49:57,640
One is a sequence of steps.
1034
00:49:57,640 --> 00:49:59,160
Do this, then do this, then do this.
1035
00:49:59,160 --> 00:50:01,440
Conditional is like skipping something.
1036
00:50:01,440 --> 00:50:03,640
Repeated does it over and over and over again.
1037
00:50:03,640 --> 00:50:05,900
Computers are really good at repeating stuff.
1038
00:50:05,900 --> 00:50:06,800
Much better than people.
1039
00:50:06,800 --> 00:50:09,600
People get tired going over and over doing the same thing.
1040
00:50:09,600 --> 00:50:12,800
And then we have store and repeated steps as well.
1041
00:50:12,800 --> 00:50:14,880
And so, if we take a look at this,
1042
00:50:14,880 --> 00:50:19,000
and we take a look at a Python program, this is a piece
1043
00:50:19,000 --> 00:50:20,080
of code, this is a little script.
1044
00:50:20,080 --> 00:50:23,460
If you type this in, put this Python code
1045
00:50:23,460 --> 00:50:26,760
into a file and run it, it starts at the beginning,
1046
00:50:26,760 --> 00:50:28,400
and then it goes to the next line, and the next line,
1047
00:50:28,400 --> 00:50:29,240
and the next line.
1048
00:50:29,240 --> 00:50:32,040
And Python executes the scripts as you write them.
1049
00:50:32,040 --> 00:50:36,320
So, it says, find a place
1050
00:50:36,320 --> 00:50:40,080
in your memory called x, stick two into that, okay.
1051
00:50:40,080 --> 00:50:41,760
Then go to the next one, print that out.
1052
00:50:41,760 --> 00:50:43,640
So, the program is producing output.
1053
00:50:43,640 --> 00:50:46,620
Now, go read x and add two to it, and stick it back in x.
1054
00:50:46,620 --> 00:50:48,800
So, x is four, then print that.
1055
00:50:48,800 --> 00:50:51,200
This side over here, this is called a flow chart.
1056
00:50:51,200 --> 00:50:52,760
I'm not gonna make you draw flow charts.
1057
00:50:52,760 --> 00:50:54,440
I'm only gonna draw them a few times
1058
00:50:54,440 --> 00:50:56,160
in ways that I think will help you.
1059
00:50:56,160 --> 00:50:58,040
But you can think of it as Python,
1060
00:50:58,040 --> 00:51:00,280
when it finishes something, it goes onto the next one,
1061
00:51:00,280 --> 00:51:01,740
unless you tell it otherwise.
1062
00:51:01,740 --> 00:51:03,280
Finishes this, goes onto the next one.
1063
00:51:03,280 --> 00:51:05,160
Finishes this, goes onto the next one.
1064
00:51:05,160 --> 00:51:08,400
Finishes this, and now the program is all done.
1065
00:51:08,400 --> 00:51:10,080
And so, that's sequential steps.
1066
00:51:10,080 --> 00:51:12,840
You just type them in, Python runs it.
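The sequential steps just walked through can be sketched as a four-line script. The values come from the walkthrough above, and the comments are added for illustration.

```python
x = 2        # stick 2 into the place in memory called x
print(x)     # the program produces output: 2
x = x + 2    # read x, add 2 to it, and stick the result back in x
print(x)     # x is now 4
```

Python runs the lines top to bottom, one after the other, and stops after the last one.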
1067
00:51:12,840 --> 00:51:15,840
They're important, but sort of uninteresting,
1068
00:51:15,840 --> 00:51:18,800
because you can only get so far.
1069
00:51:18,800 --> 00:51:20,320
And you can't really make them intelligent,
1070
00:51:20,320 --> 00:51:22,200
because it's always gonna do the next one.
1071
00:51:22,200 --> 00:51:24,520
So, the next thing we do is what are called conditional steps.
1072
00:51:24,520 --> 00:51:27,080
And this is where it starts to get intelligent.
1073
00:51:27,080 --> 00:51:30,400
I mean, where you are able to encode your brain
1074
00:51:30,400 --> 00:51:32,360
into the computer, like, oh wait a sec,
1075
00:51:32,360 --> 00:51:34,680
let's only do this step if something is true.
1076
00:51:34,680 --> 00:51:38,160
And the syntax that we tend to use here
1077
00:51:38,160 --> 00:51:42,880
is the reserved word if, if, okay?
1078
00:51:42,880 --> 00:51:46,360
And so, the if is like a little fork in the road.
1079
00:51:46,360 --> 00:51:48,440
You can go one way, or you can go another way,
1080
00:51:48,440 --> 00:51:50,040
and you're asking a question.
1081
00:51:50,040 --> 00:51:52,120
So, inside the if statement, right here,
1082
00:51:52,120 --> 00:51:54,800
there is a question, saying, is x less than 10?
1083
00:51:54,800 --> 00:51:58,600
That's a, that resolves to a true or false.
1084
00:51:58,600 --> 00:52:00,100
If it's less than 10, that's true.
1085
00:52:00,100 --> 00:52:02,000
If it's 10 or greater, it's false.
1086
00:52:02,000 --> 00:52:06,360
And so, then what we do is, if it's less than 10,
1087
00:52:06,360 --> 00:52:07,880
we have this indented block of code.
1088
00:52:07,880 --> 00:52:09,440
There's also this colon that tells us
1089
00:52:09,440 --> 00:52:11,280
we're in the beginning of an indented block of code.
1090
00:52:11,280 --> 00:52:13,120
And so, what it basically says is,
1091
00:52:13,120 --> 00:52:15,440
if this is true, run that code.
1092
00:52:15,440 --> 00:52:17,160
If it's false, skip that code.
1093
00:52:17,160 --> 00:52:18,960
So, it can either run it or skip it,
1094
00:52:18,960 --> 00:52:23,160
depending on this question that's being asked.
1095
00:52:23,160 --> 00:52:24,240
Now, if you look at this code,
1096
00:52:24,240 --> 00:52:27,180
it's pretty obvious what's going on.
1097
00:52:27,180 --> 00:52:29,300
It comes down, x is five.
1098
00:52:29,300 --> 00:52:31,600
If x is less than 10, that's true.
1099
00:52:31,600 --> 00:52:34,480
So, it runs this code and prints out smaller.
1100
00:52:34,480 --> 00:52:37,440
And then, it comes back here, deindents.
1101
00:52:37,440 --> 00:52:38,880
The next basic sequential,
1102
00:52:38,880 --> 00:52:40,920
this ends up being kind of a block.
1103
00:52:40,920 --> 00:52:43,080
If x is greater than 20,
1104
00:52:43,080 --> 00:52:46,080
if x is greater than 20, oh, come back, come back.
1105
00:52:47,640 --> 00:52:50,840
If x is greater than 20, this turns out to be false,
1106
00:52:50,840 --> 00:52:53,120
because x is five, and so it skips this.
1107
00:52:53,120 --> 00:52:54,760
So, the bigger never comes out,
1108
00:52:54,760 --> 00:52:57,120
and then it continues on and prints fini.
1109
00:52:57,120 --> 00:52:58,440
Oops, that's a typographical error.
1110
00:52:58,440 --> 00:53:01,120
Make that a lowercase print, and then prints fini.
1111
00:53:01,120 --> 00:53:06,040
So, it comes in, runs this, skips this, and then finishes.
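A minimal sketch of the conditional program described here. The exact printed strings are assumptions based on the narration, with the print statements written in lowercase as noted above.

```python
x = 5
if x < 10:            # true, so the indented block runs
    print('Smaller')
if x > 20:            # false, so the indented block is skipped
    print('Bigger')
print('Finis')        # de-indented, so it always runs
```

With x set to 5, only Smaller and Finis are printed.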
1112
00:53:06,040 --> 00:53:10,840
Okay, so here is the last one we'll talk about,
1113
00:53:10,840 --> 00:53:11,700
the repeated steps.
1114
00:53:11,700 --> 00:53:15,460
We'll get back to store and retrieve later,
1115
00:53:15,460 --> 00:53:18,360
but for now, we're just gonna talk about three of the four.
1116
00:53:19,280 --> 00:53:22,040
This is another program, and the key is,
1117
00:53:22,040 --> 00:53:24,880
that we're gonna use this same choice
1118
00:53:24,880 --> 00:53:25,820
where we're gonna go in,
1119
00:53:25,820 --> 00:53:28,400
but then we're gonna run for a while,
1120
00:53:28,400 --> 00:53:31,080
and then we'll have an exit condition where we get out.
1121
00:53:31,080 --> 00:53:34,760
So, this is repeated over and over and over and over again,
1122
00:53:34,760 --> 00:53:38,320
and this is the essence of how we make computers
1123
00:53:38,320 --> 00:53:40,120
do things that are seemingly difficult,
1124
00:53:40,120 --> 00:53:43,280
while they're more naturally difficult for people, okay?
1125
00:53:43,280 --> 00:53:45,720
And so, how do we encode this notion
1126
00:53:45,720 --> 00:53:49,040
that we wanna do something not forever, but for a while?
1127
00:53:49,040 --> 00:53:51,240
How do we encode that notion?
1128
00:53:51,240 --> 00:53:53,400
And so, we do it in this way.
1129
00:53:53,400 --> 00:53:55,000
So, we have our statement,
1130
00:53:55,000 --> 00:53:57,940
sequentially go to this while, while is a keyword,
1131
00:53:57,940 --> 00:53:59,480
and it's asking another question
1132
00:53:59,480 --> 00:54:00,720
that's a true false question.
1133
00:54:00,720 --> 00:54:02,580
Is n greater than zero?
1134
00:54:02,580 --> 00:54:06,080
I read this as, as long as n remains greater than zero,
1135
00:54:06,080 --> 00:54:07,800
keep doing this indented block,
1136
00:54:07,800 --> 00:54:10,120
and you have a colon at the end,
1137
00:54:10,120 --> 00:54:12,760
and then you have two lines of code that are indented,
1138
00:54:12,760 --> 00:54:14,660
so that tells us what the loop is,
1139
00:54:14,660 --> 00:54:16,800
and then this is now deindented.
1140
00:54:16,800 --> 00:54:20,320
And so, it comes in, and if this is true,
1141
00:54:20,320 --> 00:54:24,180
if this is true, if this is true, it runs these two lines.
1142
00:54:24,180 --> 00:54:26,400
Prints out n, n is five, and then it says,
1143
00:54:26,400 --> 00:54:29,560
n equals n minus one, which makes n be four,
1144
00:54:29,560 --> 00:54:31,400
and it goes back up, and it goes up,
1145
00:54:31,400 --> 00:54:33,120
and it asks this question again.
1146
00:54:33,120 --> 00:54:34,940
Is n greater than zero?
1147
00:54:34,940 --> 00:54:37,740
If it is, continue on, and prints four,
1148
00:54:37,740 --> 00:54:39,640
and then subtracts it, and it does that,
1149
00:54:39,640 --> 00:54:43,440
four, three, two, and prints out one,
1150
00:54:43,440 --> 00:54:45,900
then it comes up, and now, after this,
1151
00:54:45,900 --> 00:54:49,140
n is now zero, n is now zero,
1152
00:54:49,140 --> 00:54:51,320
and n is no longer greater than zero,
1153
00:54:51,320 --> 00:54:54,040
so it takes sort of the exit ramp, and goes down here.
1154
00:54:54,040 --> 00:54:56,400
So, it takes the exit ramp, and goes to here,
1155
00:54:56,400 --> 00:54:58,200
and runs the next line.
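The loop traced above can be sketched like this. The final print line is an assumption added to show where the exit ramp leads; the narration does not say what the line after the loop does.

```python
n = 5
while n > 0:      # as long as n remains greater than zero...
    print(n)      # ...print it (5, 4, 3, 2, 1)
    n = n - 1     # ...and count down by one
print('Done, n is', n)   # runs once the question n > 0 becomes false
```

The two indented lines are the body of the loop. The de-indented line runs only once, after n reaches zero.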
1156
00:54:58,200 --> 00:55:01,440
Now, we're gonna cover all this again.
1157
00:55:01,440 --> 00:55:03,960
So, I'm just trying to give you the big picture,
1158
00:55:03,960 --> 00:55:05,460
next couple of chapters, we're gonna hit
1159
00:55:05,460 --> 00:55:07,260
all these things again, and we're gonna hit them
1160
00:55:07,260 --> 00:55:11,220
in much more detail, with a lot better information.
1161
00:55:11,220 --> 00:55:13,480
This is now sort of like combining these,
1162
00:55:13,480 --> 00:55:18,480
and again, I don't want you to really know this stuff,
1163
00:55:18,900 --> 00:55:21,320
just, you will know this in a couple of weeks,
1164
00:55:21,320 --> 00:55:23,640
you will see this program again,
1165
00:55:23,640 --> 00:55:26,600
but this shows you how we combine those patterns
1166
00:55:26,600 --> 00:55:30,420
of repeated, sequential, and conditional together.
1167
00:55:31,560 --> 00:55:33,520
So, this is a bit of sequential code,
1168
00:55:33,520 --> 00:55:36,080
comes in here, runs this, which happens to ask
1169
00:55:36,080 --> 00:55:38,280
for a file name, then it opens the file,
1170
00:55:38,280 --> 00:55:40,380
it creates a data structure called a dictionary,
1171
00:55:40,380 --> 00:55:42,240
this is all sequential, now the for
1172
00:55:42,240 --> 00:55:45,720
is another form of loops, so this is gonna loop for a while,
1173
00:55:45,720 --> 00:55:47,120
and then this is, within a loop,
1174
00:55:47,120 --> 00:55:49,840
we can even have two indents, and that's another loop,
1175
00:55:49,840 --> 00:55:52,840
so these are like repeated, and then it goes,
1176
00:55:52,840 --> 00:55:54,880
it goes down to the next sequential bit,
1177
00:55:54,880 --> 00:55:57,040
then it does this, here's another loop,
1178
00:55:57,040 --> 00:55:58,680
it's gonna run, and then here's a conditional,
1179
00:55:58,680 --> 00:56:00,800
it's gonna run, and then once all done,
1180
00:56:00,800 --> 00:56:03,000
we print out the last thing, and this is, of course,
1181
00:56:03,000 --> 00:56:08,000
is the program that does the, it figures out
1182
00:56:09,800 --> 00:56:12,300
the most common word and prints that most common word out,
1183
00:56:12,300 --> 00:56:16,720
and so this is a Python short story, it reads some data,
1184
00:56:16,720 --> 00:56:18,960
it reads the name of a file, it opens that file,
1185
00:56:18,960 --> 00:56:21,560
it talks about how to make a histogram,
1186
00:56:21,560 --> 00:56:25,160
and then it looks through for the most common word,
1187
00:56:25,160 --> 00:56:26,960
so don't worry too much about this,
1188
00:56:27,880 --> 00:56:30,100
over the next couple weeks, we'll fill in the pieces
1189
00:56:30,100 --> 00:56:31,880
so that you absolutely understand
1190
00:56:31,880 --> 00:56:34,160
every single line of this code.
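Here is a hedged sketch of a most-common-word program along the lines described. It is not necessarily the exact code on the slide. To keep it runnable without a file, it assumes the lines come from a list of strings (the hypothetical sample below) instead of asking for a file name and opening it, and the function name most_common_word is made up for illustration.

```python
def most_common_word(lines):
    # Build the histogram: a dictionary mapping each word to its count.
    counts = dict()
    for line in lines:                 # outer loop: one line at a time
        for word in line.split():      # inner loop: one word at a time
            counts[word] = counts.get(word, 0) + 1

    # Look through the histogram for the largest count.
    bigword = None
    bigcount = None
    for word, count in counts.items():
        if bigcount is None or count > bigcount:   # the conditional
            bigword = word
            bigcount = count
    return bigword, bigcount


# Hypothetical sample data standing in for the contents of a file.
sample = ["the cat saw the dog", "the dog ran"]
print(most_common_word(sample))   # ('the', 3)
```

An open file can be looped over line by line in the same way, so passing the result of open(name) instead of sample would count the words in a real file.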
1191
00:56:35,200 --> 00:56:40,160
So, this is a quick overview, chapter one, stick with us,
1192
00:56:40,160 --> 00:56:42,640
you realize it will be chapter seven
1193
00:56:42,640 --> 00:56:44,520
before this makes too much sense,
1194
00:56:44,520 --> 00:56:47,640
you really have to trust that you are learning
1195
00:56:47,640 --> 00:56:50,740
important things, and that it all makes sense
1196
00:56:50,740 --> 00:56:52,680
when we bring it together like in chapter seven
1197
00:56:52,680 --> 00:56:53,600
in a few weeks.
1198
00:56:57,840 --> 00:57:00,600
Hello, and welcome to chapter two.
1199
00:57:00,600 --> 00:57:02,100
Now we're gonna continue to talk about
1200
00:57:02,100 --> 00:57:04,480
the building blocks of Python, variables,
1201
00:57:04,480 --> 00:57:07,600
constants, statements, expressions, et cetera.
1202
00:57:07,600 --> 00:57:09,960
The first thing we have to talk about is constants,
1203
00:57:09,960 --> 00:57:11,800
these are just things we call constants
1204
00:57:11,800 --> 00:57:14,200
because they don't change, they're numbers,
1205
00:57:14,200 --> 00:57:16,520
strings, et cetera, and we use them
1206
00:57:16,520 --> 00:57:19,560
to sort of start calculations, or, you know,
1207
00:57:19,560 --> 00:57:23,360
if something is greater than 40 hours,
1208
00:57:23,360 --> 00:57:24,880
we're gonna do something, and so 40
1209
00:57:24,880 --> 00:57:26,560
is the constant in that situation.
1210
00:57:26,560 --> 00:57:31,360
So, we have 123, we have 98.6, we have hello world,
1211
00:57:31,360 --> 00:57:33,800
which is a string by enclosing it in quotes,
1212
00:57:33,800 --> 00:57:36,360
we pass each of these things to the print function,
1213
00:57:36,360 --> 00:57:38,120
and a side effect of the print function
1214
00:57:38,120 --> 00:57:39,680
is that we see the output.
1215
00:57:39,680 --> 00:57:44,040
So, print 123, prints out 123, print 98.6, prints it out.
1216
00:57:44,040 --> 00:57:47,280
So, these are just really the syntax of constants,
1217
00:57:47,280 --> 00:57:50,360
and without constants, we can't write really much
1218
00:57:50,360 --> 00:57:51,840
of anything.
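As a small sketch, the three constants mentioned above can each be passed to the print function. The exact capitalization of the string is an assumption.

```python
print(123)            # an integer constant
print(98.6)           # a floating-point constant
print('Hello world')  # a string constant, enclosed in quotes
```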
1219
00:57:51,840 --> 00:57:54,040
The other sort of foundational notion
1220
00:57:54,040 --> 00:57:56,440
of any programming language are the reserved words,
1221
00:57:56,440 --> 00:57:58,680
and like I said before, reserved words are these
1222
00:57:58,680 --> 00:58:02,240
special words where Python is listening for them,
1223
00:58:02,240 --> 00:58:04,280
and they have very special meanings,
1224
00:58:04,280 --> 00:58:07,720
so when Python sees if, it's not just any other word,
1225
00:58:07,720 --> 00:58:11,400
it means how Python implements conditional execution.
1226
00:58:13,040 --> 00:58:15,980
Variables are the third building block,
1227
00:58:15,980 --> 00:58:19,320
and that is a way that you can ask Python
1228
00:58:19,320 --> 00:58:22,840
to allocate a piece of memory and then give it a name,
1229
00:58:22,840 --> 00:58:24,400
and you can put stuff in that.
1230
00:58:24,400 --> 00:58:26,840
Sometimes you just put one value, later we'll see,
1231
00:58:26,840 --> 00:58:29,700
when we do collections in chapters eight and nine,
1232
00:58:29,700 --> 00:58:31,340
we will see that more than one value
1233
00:58:31,340 --> 00:58:34,680
can be put into a variable, and the variable,
1234
00:58:34,680 --> 00:58:36,280
how we control the variable is through
1235
00:58:36,280 --> 00:58:38,800
the assignment statement, and as I said before,
1236
00:58:38,800 --> 00:58:41,260
it's important to think of the assignment statement
1237
00:58:41,260 --> 00:58:43,760
as having an arrow to it, so this is not saying
1238
00:58:43,760 --> 00:58:46,480
X for all time is the same as 12.2,
1239
00:58:46,480 --> 00:58:49,500
what it's saying is take 12.2, find a place,
1240
00:58:49,500 --> 00:58:52,600
find some memory in your computer there, Mr. Python,
1241
00:58:52,600 --> 00:58:55,020
give it a label X, we get to choose the X,
1242
00:58:55,020 --> 00:58:57,560
that's the variable part, we chose it, right?
1243
00:58:58,520 --> 00:59:01,600
And then stick 12.2 in it, and then the same is true for 14.
1244
00:59:01,600 --> 00:59:04,440
Go find another spot, name it Y,
1245
00:59:04,440 --> 00:59:08,640
and then put a 14 in there, so think of this as an arrow
1246
00:59:08,640 --> 00:59:10,920
every time you see that equality,
1247
00:59:10,920 --> 00:59:13,960
the assignment in an assignment statement.
1248
00:59:15,960 --> 00:59:18,520
Now, these variables hold one value,
1249
00:59:18,520 --> 00:59:22,720
so now if we have these three statements, these two,
1250
00:59:22,720 --> 00:59:25,560
and then the third one executes, it says put 100 into X,
1251
00:59:25,560 --> 00:59:28,800
but that wipes out the old value of 12.2,
1252
00:59:28,800 --> 00:59:31,560
and it rewrites it with 100, and so we can
1253
00:59:31,560 --> 00:59:33,320
change the variables, that's another reason
1254
00:59:33,320 --> 00:59:35,540
that we call them variable.
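The three assignment statements described here can be sketched as follows. The print calls are added for illustration so you can watch the value change.

```python
x = 12.2      # find some memory, label it x, and put 12.2 in it
y = 14        # find another spot, name it y, and put 14 in it
print(x, y)   # 12.2 14
x = 100       # wipes out the old value of x and replaces it with 100
print(x, y)   # 100 14
```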
1255
00:59:37,340 --> 00:59:40,680
There are some names, some rules for making variable names,
1256
00:59:40,680 --> 00:59:43,040
you can start with a letter or an underscore.
1257
00:59:43,040 --> 00:59:46,000
We tend not to, as normal programmers, use underscore,
1258
00:59:46,000 --> 00:59:49,240
we tend to reserve those for variables
1259
00:59:49,240 --> 00:59:51,620
that we use to communicate with Python itself,
1260
00:59:51,620 --> 00:59:52,860
so when we're making up a variable,
1261
00:59:52,860 --> 00:59:57,360
we tend not to use underscores as a first character.
1262
00:59:57,360 --> 01:00:00,080
You can have letters and numbers and underscores
1263
01:00:00,080 --> 01:00:02,600
after the first character, and they're case sensitive,
1264
01:00:02,600 --> 01:00:06,400
but it's really a bad idea to use case
1265
01:00:06,400 --> 01:00:07,980
as the only differentiator.
1266
01:00:07,980 --> 01:00:12,200
So, in this case, spam, eggs, spam23,
1267
01:00:12,200 --> 01:00:14,160
and _speed are all totally legit,
1268
01:00:14,160 --> 01:00:15,720
we would probably not use this one
1269
01:00:15,720 --> 01:00:17,320
unless we were actually doing it
1270
01:00:17,320 --> 01:00:20,000
because Python told us to use that variable.
1271
01:00:20,000 --> 01:00:21,520
23 spam starts with a number,
1272
01:00:21,520 --> 01:00:23,960
pound sign starts with a pound sign, and dot is not
1273
01:00:23,960 --> 01:00:25,760
a legitimate variable character.
1274
01:00:26,640 --> 01:00:30,400
And spam, capital spam and all caps spam are different,
1275
01:00:30,400 --> 01:00:32,520
but this is not something that you want
1276
01:00:32,520 --> 01:00:36,400
to sort of depend on too much, so.
1277
01:00:36,400 --> 01:00:38,000
Those are just the rules for names.
1278
01:00:38,000 --> 01:00:39,760
We tend to start them with a letter
1279
01:00:39,760 --> 01:00:41,480
and then use letters, numbers, and underscores.
1280
01:00:41,480 --> 01:00:43,240
Underscores other than the first character
1281
01:00:43,240 --> 01:00:45,400
are generally pretty common,
1282
01:00:45,400 --> 01:00:47,940
and you'll see those used a lot.
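One way to check these rules is the built-in string method `isidentifier()`. This is only a sketch: the names "#sign" and "var.12" are assumptions standing in for the "pound sign" and "dot" examples, and reserved words also pass this check even though they cannot be used as variable names.

```python
# str.isidentifier() reports whether a string follows Python's naming rules.
candidates = ["spam", "eggs", "spam23", "_speed", "23spam", "#sign", "var.12"]
for name in candidates:
    # The first four print True, the last three print False.
    print(name, name.isidentifier())

# Case matters: these are three different variables.
spam = 1
Spam = 2
SPAM = 3
print(spam, Spam, SPAM)
```

Relying on case alone to tell variables apart works, but it is easy to misread.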
1283
01:00:47,940 --> 01:00:49,800
Now, when we're choosing variable names,
1284
01:00:49,800 --> 01:00:50,800
one of the things about variables
1285
01:00:50,800 --> 01:00:51,920
is we get to choose the name.
1286
01:00:51,920 --> 01:00:54,960
We get to choose the name X, choose the name Y,
1287
01:00:54,960 --> 01:00:57,000
and so sometimes we like them short,
1288
01:00:57,000 --> 01:00:58,640
but sometimes we want them descriptive,
1289
01:00:58,640 --> 01:01:02,520
and the notion of making variables descriptive
1290
01:01:02,520 --> 01:01:04,520
is often confusing to beginning students.
1291
01:01:04,520 --> 01:01:07,280
Sometimes it's really helpful to,
1292
01:01:07,280 --> 01:01:08,840
if you're gonna have a line of text
1293
01:01:08,840 --> 01:01:11,600
and you name the variable line, that's great
1294
01:01:11,600 --> 01:01:13,440
because the next person reading your program
1295
01:01:13,440 --> 01:01:15,680
says, oh, that must be the line of text.
1296
01:01:15,680 --> 01:01:18,300
Whereas it also can become misleading
1297
01:01:18,300 --> 01:01:21,920
that line, the name of a variable somehow has meaning,
1298
01:01:21,920 --> 01:01:24,200
and so sometimes we'll have even singular variables
1299
01:01:24,200 --> 01:01:26,880
and plural variables like friend and friends.
1300
01:01:26,880 --> 01:01:28,240
You know, like, is plural,
1301
01:01:28,240 --> 01:01:30,840
does Python know about singular and plural?
1302
01:01:30,840 --> 01:01:32,120
And the answer is no.
1303
01:01:32,120 --> 01:01:35,240
So sometimes we pick variables that make no sense.
1304
01:01:35,240 --> 01:01:37,760
Sometimes we pick variables that make a lot of sense.
1305
01:01:37,760 --> 01:01:40,280
This is just something that you as a beginning programmer
1306
01:01:40,280 --> 01:01:42,140
are going to have to understand
1307
01:01:42,140 --> 01:01:44,480
that we can pick anything we want,
1308
01:01:45,380 --> 01:01:48,060
and so you'll see, I'll try to call attention to this
1309
01:01:48,060 --> 01:01:50,260
in the first few lectures as we go through.
1310
01:01:50,260 --> 01:01:53,580
So here's a bit of code with an assignment statement,
1311
01:01:53,580 --> 01:01:54,520
two assignment statements,
1312
01:01:54,520 --> 01:01:57,280
a multiplication, and a print statement,
1313
01:01:57,280 --> 01:01:59,120
and you can say, what is this doing?
1314
01:01:59,120 --> 01:02:02,920
Now, Python is perfectly happy with this code
1315
01:02:02,920 --> 01:02:04,000
because it assigns it in there.
1316
01:02:04,000 --> 01:02:06,760
You have said, please go give me this as a label,
1317
01:02:06,760 --> 01:02:08,200
and then we assign two variables,
1318
01:02:08,200 --> 01:02:10,840
and then we're carefully pulling these two variables
1319
01:02:10,840 --> 01:02:12,720
back out, multiplying them together
1320
01:02:12,720 --> 01:02:14,760
and sticking them into yet another variable,
1321
01:02:14,760 --> 01:02:16,000
and then printing that variable out.
1322
01:02:16,000 --> 01:02:18,680
That seems like, you know, we can figure out what it is.
1323
01:02:18,680 --> 01:02:20,280
You just have to look really careful,
1324
01:02:20,280 --> 01:02:22,120
and a single character mistake,
1325
01:02:22,120 --> 01:02:27,000
and Python is gonna be, you know, pretty unhappy, okay?
1326
01:02:27,000 --> 01:02:30,360
So that's one way to write this program.
1327
01:02:30,360 --> 01:02:32,700
It's hard, though, because any of those characters
1328
01:02:32,700 --> 01:02:35,520
are long variables and they're random stuff.
1329
01:02:35,520 --> 01:02:37,280
It's not very friendly to anyone
1330
01:02:37,280 --> 01:02:39,380
who might read your program.
1331
01:02:39,380 --> 01:02:40,700
Now, this looks a little friendlier.
1332
01:02:40,700 --> 01:02:41,840
It's the same program
1333
01:02:41,840 --> 01:02:44,440
because Python just wants a correspondence.
1334
01:02:44,440 --> 01:02:47,080
You pick A, you pick B, and you pick C,
1335
01:02:47,080 --> 01:02:50,760
and it's really much easier for us to see what's going on,
1336
01:02:50,760 --> 01:02:55,760
and so this is, in a way, going from here to here
1337
01:02:55,760 --> 01:02:59,400
is much friendlier, but we can be even friendlier
1338
01:02:59,400 --> 01:03:01,000
if we pick mnemonic variable names.
1339
01:03:01,000 --> 01:03:02,760
So this is not mnemonic.
1340
01:03:02,760 --> 01:03:04,600
This is short and convenient.
1341
01:03:04,600 --> 01:03:06,480
This is long and inconvenient.
1342
01:03:06,480 --> 01:03:08,600
Python is happy with any of these.
1343
01:03:09,520 --> 01:03:10,360
Here, on the other hand,
1344
01:03:10,360 --> 01:03:12,920
is another version of the exact same program,
1345
01:03:12,920 --> 01:03:16,120
and now you think to yourself, oh, yeah, now I get it.
1346
01:03:16,120 --> 01:03:17,920
35 is the number of hours.
1347
01:03:17,920 --> 01:03:20,040
12 dollars and 50 cents is the rate,
1348
01:03:20,040 --> 01:03:22,280
and then we're gonna multiply the hours and the rate
1349
01:03:22,280 --> 01:03:24,660
and come up with a pay, and we're putting out the pay.
1350
01:03:24,660 --> 01:03:27,440
Now, whoever wrote this program is much,
1351
01:03:27,440 --> 01:03:30,240
is helping us greatly understand what's going on,
1352
01:03:30,240 --> 01:03:31,600
and that's good.
1353
01:03:31,600 --> 01:03:33,200
Choosing variable names.
1354
01:03:33,200 --> 01:03:36,640
Python, again, all three of these are the same to Python.
1355
01:03:36,640 --> 01:03:39,260
Choosing variable names in a way that helps your reader
1356
01:03:39,260 --> 01:03:42,560
understand what's going on is a great thing.
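The three versions of the program might look like the sketch below. The hours (35) and rate (12.50) come from the lecture; the long gibberish names in the first version are made up here for illustration.

```python
# Version 1: long, random-looking names (hypothetical, hard to read).
x1q3z9ocd = 35.0
x1q3z9afd = 12.50
x1q3p9afd = x1q3z9ocd * x1q3z9afd
print(x1q3p9afd)

# Version 2: short and convenient, but not mnemonic.
a = 35.0
b = 12.50
c = a * b
print(c)

# Version 3: mnemonic names that help a human reader.
hours = 35.0
rate = 12.50
pay = hours * rate
print(pay)
```

All three print the same number. Python only needs each name to be used consistently; it attaches no meaning to words like hours or pay.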
1357
01:03:42,560 --> 01:03:45,040
The problem is, the danger is,
1358
01:03:46,420 --> 01:03:48,640
if you read this and you think that somehow
1359
01:03:48,640 --> 01:03:50,440
Python understands payroll,
1360
01:03:50,440 --> 01:03:52,020
that if you name a variable hours
1361
01:03:52,020 --> 01:03:54,160
that Python knows what hours means,
1362
01:03:54,160 --> 01:03:57,200
the answer is, Python really doesn't care
1363
01:03:57,200 --> 01:03:58,660
what you name the variable as long as
1364
01:03:58,660 --> 01:04:01,400
what you name it, you use it, right?
1365
01:04:01,400 --> 01:04:02,680
And so, you gotta be careful,
1366
01:04:02,680 --> 01:04:04,920
and so you'll see, I will,
1367
01:04:04,920 --> 01:04:09,120
when I write my code in these first few weeks,
1368
01:04:09,120 --> 01:04:11,920
first few lectures, I will sometimes write it
1369
01:04:11,920 --> 01:04:13,800
with gibberish, I'll sometimes write it
1370
01:04:13,800 --> 01:04:16,000
with extremely short but meaningless variable names,
1371
01:04:16,000 --> 01:04:18,600
and sometimes I'll use meaningful variable names,
1372
01:04:18,600 --> 01:04:20,440
and I'll call your attention to it,
1373
01:04:20,440 --> 01:04:22,120
and it will get you.
1374
01:04:22,120 --> 01:04:24,440
You'll start, when you look at this third kind,
1375
01:04:24,440 --> 01:04:28,200
it has meaningful variables or mnemonic variable names,
1376
01:04:28,200 --> 01:04:31,160
you'll just instinctively want to give Python
1377
01:04:31,160 --> 01:04:34,120
more intelligence than it sort of deserves,
1378
01:04:34,120 --> 01:04:36,680
I guess that's probably the best way to say that.
1379
01:04:36,680 --> 01:04:38,620
So, we've talked about constants,
1380
01:04:38,620 --> 01:04:40,040
we've talked about reserved words,
1381
01:04:40,040 --> 01:04:41,480
we've talked about variables.
1382
01:04:43,440 --> 01:04:45,560
And so, here we have a sentence,
1383
01:04:45,560 --> 01:04:47,280
like we've already done some of these things,
1384
01:04:47,280 --> 01:04:49,560
where we set x equals two,
1385
01:04:49,560 --> 01:04:52,560
we retrieve the old value of x and add two to it,
1386
01:04:52,560 --> 01:04:55,200
so that becomes four, and then we print four out,
1387
01:04:55,200 --> 01:04:57,520
print is a function that's built in,
1388
01:04:57,520 --> 01:04:59,520
and we pass in whatever we want to print out.
1389
01:04:59,520 --> 01:05:03,480
So, this parentheses is part of a function call.
1390
01:05:05,080 --> 01:05:07,320
Okay, so, an assignment statement,
1391
01:05:07,320 --> 01:05:10,720
you have to really get your head around the notion
1392
01:05:10,720 --> 01:05:14,000
that it has this arrow nature,
1393
01:05:14,000 --> 01:05:17,280
and that it evaluates this entire right-hand side
1394
01:05:17,280 --> 01:05:20,640
before we change the left-hand side.
1395
01:05:20,640 --> 01:05:22,440
And so, you can think of this sort of as,
1396
01:05:22,440 --> 01:05:24,320
at time step one, it does this,
1397
01:05:24,320 --> 01:05:26,480
and then at time step two, it does the copy.
1398
01:05:26,480 --> 01:05:29,120
And that's how you can have something like x
1399
01:05:29,120 --> 01:05:33,040
on both sides of an assignment statement.
1400
01:05:33,040 --> 01:05:34,980
And so, if, for example, we have x,
1401
01:05:34,980 --> 01:05:39,960
and x has 0.6 in it, x has 0.6 in it,
1402
01:05:39,960 --> 01:05:42,200
what happens is that it first,
1403
01:05:42,200 --> 01:05:44,400
it sort of ignores this part right here,
1404
01:05:44,400 --> 01:05:45,920
and evaluates the expression.
1405
01:05:45,920 --> 01:05:48,960
So, it pulls the 0.6, everywhere x appears,
1406
01:05:48,960 --> 01:05:52,680
it pulls 0.6 out, then it starts running these calculations,
1407
01:05:52,680 --> 01:05:54,720
and then it has the new value.
1408
01:05:54,720 --> 01:05:56,640
After all the calculations are done,
1409
01:05:56,640 --> 01:06:01,640
then and only then is it going to put that back into x.
1410
01:06:02,000 --> 01:06:05,320
And so, it sort of takes that and puts it back into x,
1411
01:06:05,320 --> 01:06:07,840
and then wipes out the old value.
1412
01:06:07,840 --> 01:06:09,920
At this point, this has all been taken care of,
1413
01:06:09,920 --> 01:06:13,200
and it's been reduced down to this 0.93,
1414
01:06:13,200 --> 01:06:16,860
and so that is what's put in as the new value.
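A small sketch of the two time steps. The first part is the x equals two example from the lecture. The expression in the second part is an assumption chosen so that 0.6 reduces to roughly the 0.93 mentioned above; the transcript does not spell the expression out.

```python
x = 2
x = x + 2              # step 1: evaluate x + 2 using the old x; step 2: store the result in x
print(x)               # 4

x = 0.6
new_value = 3.9 * x * (1 - x)   # right-hand side is fully evaluated with the old x (0.6)
x = new_value                   # only then is the old value wiped out
print(x)                        # roughly 0.936
```

Writing `x = 3.9 * x * (1 - x)` on one line does exactly the same thing; the temporary name new_value just makes the two steps visible.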
1415
01:06:18,360 --> 01:06:20,520
So, up next, we'll talk a little bit more
1416
01:06:20,520 --> 01:06:23,420
about making more complex expressions.
1417
01:06:27,400 --> 01:06:28,240
So, welcome back.
1418
01:06:28,240 --> 01:06:30,040
We're now going to talk about expressions.
1419
01:06:30,040 --> 01:06:33,120
Expressions are a little more complex calculations
1420
01:06:33,120 --> 01:06:33,960
that we can sort of do
1421
01:06:33,960 --> 01:06:37,800
on the right-hand side of an assignment statement.
1422
01:06:37,800 --> 01:06:41,480
So, one of the things about expressions is operators.
1423
01:06:41,480 --> 01:06:43,880
And then operators in computer programming
1424
01:06:43,880 --> 01:06:46,640
are often very much the same as the mathematical operators,
1425
01:06:46,640 --> 01:06:49,420
but we don't have all the fancy characters
1426
01:06:49,420 --> 01:06:51,480
that we have in mathematics,
1427
01:06:51,480 --> 01:06:54,800
and so we have to choose what's on the keyboard,
1428
01:06:54,800 --> 01:06:58,240
and then if we really go back to the 1960s and 1970s,
1429
01:06:58,240 --> 01:06:59,880
and then we used what was on the keyboard
1430
01:06:59,880 --> 01:07:03,200
in the 1960s and the 1970s to make these operators.
1431
01:07:03,200 --> 01:07:06,480
So, pluses addition, minuses subtraction,
1432
01:07:06,480 --> 01:07:09,160
we don't have a time sign or a dot in the middle,
1433
01:07:09,160 --> 01:07:12,120
so we use the asterisk as multiplication.
1434
01:07:12,120 --> 01:07:14,360
Division, we can't put two things over top of each other,
1435
01:07:14,360 --> 01:07:16,360
so we use slash for division.
1436
01:07:16,360 --> 01:07:17,400
Raising to the power,
1437
01:07:17,400 --> 01:07:19,060
because it didn't have little characters back then,
1438
01:07:19,060 --> 01:07:22,040
is star, star, which is raising to the power.
1439
01:07:22,040 --> 01:07:22,960
And then remainder.
1440
01:07:22,960 --> 01:07:26,680
Remainder is the, when you do integer division,
1441
01:07:26,680 --> 01:07:28,480
it's also called the modulo operator,
1442
01:07:28,480 --> 01:07:30,240
it's the remainder, not the quotient.
1443
01:07:30,240 --> 01:07:32,720
Now, I've got a picture of that coming up.
1444
01:07:32,720 --> 01:07:35,800
So, here's a whole series of little examples of this, right?
1445
01:07:35,800 --> 01:07:39,280
So, we've already seen, you know, the plus, x equals x plus one.
1446
01:07:39,280 --> 01:07:41,720
Keep remembering that these assignments are arrows,
1447
01:07:41,720 --> 01:07:44,480
basically, arrow, arrow, they have a direction.
1448
01:07:44,480 --> 01:07:47,200
Multiplication, 440 times 12.
1449
01:07:48,160 --> 01:07:53,160
Dividing this by, that's division over 1,000, 5.28.
1450
01:07:54,640 --> 01:07:56,400
Here, we're gonna put 23 into JJ,
1451
01:07:56,400 --> 01:07:57,340
and then we're gonna do modulo.
1452
01:07:57,340 --> 01:08:00,000
So, that says, take 23, divide it by five,
1453
01:08:00,000 --> 01:08:02,160
and give me back the remainder and put it in KK.
1454
01:08:02,160 --> 01:08:05,080
So, this is the expression that evaluates like this.
1455
01:08:05,080 --> 01:08:09,920
Take 23, divide five into 23, four, remainder, three.
1456
01:08:09,920 --> 01:08:12,600
The three is what comes back up here.
1457
01:08:12,600 --> 01:08:14,880
Okay, and so that is the remainder.
1458
01:08:14,880 --> 01:08:16,620
It's also called modulo operator.
1459
01:08:16,620 --> 01:08:19,880
It turns out that, for things like picking a random number
1460
01:08:19,880 --> 01:08:22,120
and then taking the modulo of 52
1461
01:08:22,120 --> 01:08:23,979
is a way to pick a card randomly.
1462
01:08:23,979 --> 01:08:26,239
So, this modulo operator is actually,
1463
01:08:26,240 --> 01:08:29,439
especially in games and other things, super useful.
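The operator examples can be sketched as follows. The names jj and kk come from the lecture; xx, yy, zz, and card are assumptions used for illustration, as is the choice of `random.randrange` to get the random number.

```python
import random

xx = 2
xx = xx + 2      # addition
yy = 440 * 12    # multiplication
zz = yy / 1000   # division
print(xx, yy, zz)

jj = 23
kk = jj % 5      # remainder: 5 goes into 23 four times, remainder 3
print(kk)

print(2 ** 3)    # raising to a power

# Picking a card: any non-negative integer modulo 52 lands in 0..51.
card = random.randrange(1_000_000) % 52
print(card)
```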
1464
01:08:29,439 --> 01:08:32,459
So, that's the various operators.
1465
01:08:32,460 --> 01:08:37,460
It's important to know which of these operators goes first.
1466
01:08:37,580 --> 01:08:39,540
It's called operator precedence.
1467
01:08:39,540 --> 01:08:42,020
Now, normally, we put parentheses in,
1468
01:08:42,020 --> 01:08:44,380
like, you know, so if I put the parentheses in here,
1469
01:08:44,380 --> 01:08:46,240
I'd say this goes first,
1470
01:08:46,240 --> 01:08:48,020
parentheses, then this goes first.
1471
01:08:48,020 --> 01:08:49,620
Oh, actually, not that one.
1472
01:08:49,620 --> 01:08:51,880
Oops, got that one wrong.
1473
01:08:51,880 --> 01:08:56,700
This happens first, this happens, then this happens.
1474
01:08:56,700 --> 01:09:00,340
Okay, and so, but it's important for us to be able to know
1475
01:09:00,340 --> 01:09:01,640
if there were no parentheses,
1476
01:09:01,640 --> 01:09:03,920
the order in which these things will happen.
1477
01:09:03,920 --> 01:09:07,520
So, the way things work in terms of operator precedence
1478
01:09:07,520 --> 01:09:10,220
is parentheses are the most important thing,
1479
01:09:10,220 --> 01:09:13,620
followed by raising to the power, all else being equal.
1480
01:09:13,620 --> 01:09:17,260
Multiplication and division are all both equal,
1481
01:09:17,260 --> 01:09:18,859
and then addition, and then within,
1482
01:09:18,859 --> 01:09:20,239
it's adding left to right.
1483
01:09:20,240 --> 01:09:23,260
So, let's see an example of how this works.
1484
01:09:23,260 --> 01:09:27,279
And so, if we take one plus two to raise to the three power,
1485
01:09:27,279 --> 01:09:29,219
divided by four times five,
1486
01:09:29,220 --> 01:09:31,340
and we print out what comes out of this.
1487
01:09:31,340 --> 01:09:35,939
So, the way I did this when I was taking exams back
1488
01:09:35,939 --> 01:09:38,899
many, many years ago when I was first in computer science,
1489
01:09:38,899 --> 01:09:40,019
is I'd write it all down,
1490
01:09:40,020 --> 01:09:41,660
and I'd look for the highest precedence thing.
1491
01:09:41,660 --> 01:09:43,620
Now, parentheses would make this easy,
1492
01:09:43,620 --> 01:09:45,500
but exponentiation is the first one.
1493
01:09:45,500 --> 01:09:47,260
So, that means we're gonna take this,
1494
01:09:47,260 --> 01:09:50,100
and that's gonna be eight, two to the third power,
1495
01:09:50,100 --> 01:09:55,100
two times two times two, two cubed is eight.
1496
01:09:55,500 --> 01:09:57,100
Then what I would do is I rewrite the whole thing
1497
01:09:57,100 --> 01:09:59,180
with the eight there, and now I look across,
1498
01:09:59,180 --> 01:10:00,940
and I'm looking for multiplications,
1499
01:10:00,940 --> 01:10:02,100
because the power's been done,
1500
01:10:02,100 --> 01:10:03,940
the multiplication's what I'm looking for next.
1501
01:10:03,940 --> 01:10:05,980
And then there is both multiplication and division,
1502
01:10:05,980 --> 01:10:08,140
they're equal, they're at the same level.
1503
01:10:08,140 --> 01:10:10,380
And so, what happens is they're done left to right.
1504
01:10:10,380 --> 01:10:14,500
Eight divided by four happens before four times five.
1505
01:10:14,500 --> 01:10:16,940
And so, the fact that it's not four times five,
1506
01:10:16,940 --> 01:10:18,420
but instead eight divided by four,
1507
01:10:18,420 --> 01:10:19,860
is because of the left to right rule.
1508
01:10:19,860 --> 01:10:22,420
So, then this gets rewritten to be two,
1509
01:10:22,420 --> 01:10:24,300
one plus two times five,
1510
01:10:24,300 --> 01:10:27,020
and this one, multiplication is the top one.
1511
01:10:27,020 --> 01:10:29,500
So, that does this next, two times five becomes 10,
1512
01:10:29,500 --> 01:10:32,180
I rewrite it again, and then one plus 10 addition
1513
01:10:32,180 --> 01:10:36,260
is the lowest thing, and that's how we end up with 11.
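The same rewrite-and-reduce approach can be checked directly in Python. This is a sketch in which the step names are made up here. Note that because division in Python 3 produces a floating point number, the result prints as 11.0 rather than 11.

```python
x = 1 + 2 ** 3 / 4 * 5
print(x)

# The same reduction, one rule at a time:
step1 = 1 + 8 / 4 * 5    # power first: 2 ** 3 is 8
step2 = 1 + 2.0 * 5      # / and * are equal, so left to right: 8 / 4 is 2.0
step3 = 1 + 10.0         # then the remaining multiplication: 2.0 * 5 is 10.0
print(step1, step2, step3)
```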
1514
01:10:36,260 --> 01:10:38,700
And so, that's how I would do these problems
1515
01:10:38,700 --> 01:10:41,720
if I ever saw the problem on an exam.
1516
01:10:41,720 --> 01:10:43,900
And it's a fun problem to put on exams,
1517
01:10:43,900 --> 01:10:46,420
because there is one and only one answer,
1518
01:10:46,420 --> 01:10:48,940
and every programming class has usually
1519
01:10:48,940 --> 01:10:51,260
at least one slide about this stuff.
1520
01:10:51,260 --> 01:10:53,780
So, like I said, the rules go top to bottom,
1521
01:10:53,780 --> 01:10:56,980
parentheses, power, multiplication, addition,
1522
01:10:56,980 --> 01:10:59,980
and then left to right within it.
1523
01:10:59,980 --> 01:11:03,420
So, we've talked about variables and computing values
1524
01:11:03,420 --> 01:11:06,380
to put inside variables, but the one thing you've kind of
1525
01:11:06,380 --> 01:11:08,180
also, maybe you noticed it as we go by,
1526
01:11:08,180 --> 01:11:10,680
is we have different kinds of data.
1527
01:11:10,680 --> 01:11:12,060
We call it type.
1528
01:11:12,060 --> 01:11:13,300
Is this of type integer?
1529
01:11:13,300 --> 01:11:15,340
Is this of type floating point number?
1530
01:11:15,340 --> 01:11:16,740
Is it of type string?
1531
01:11:16,740 --> 01:11:18,380
What is going on here?
1532
01:11:18,380 --> 01:11:21,420
And Python is pretty smart about various kinds
1533
01:11:21,420 --> 01:11:23,340
of types of data.
1534
01:11:23,340 --> 01:11:26,220
And so, you know, we're adding one plus four here,
1535
01:11:26,220 --> 01:11:28,180
and Python knows, as it looks at this,
1536
01:11:28,180 --> 01:11:30,100
that that's an integer and that's an integer,
1537
01:11:30,100 --> 01:11:32,020
and we'll add it together and make it an integer.
1538
01:11:32,020 --> 01:11:33,880
So, that thing is an integer.
1539
01:11:33,880 --> 01:11:37,180
We can also use this plus to concatenate two strings.
1540
01:11:37,180 --> 01:11:39,980
This is hello blank plus there,
1541
01:11:39,980 --> 01:11:42,020
and plus looks here, says, oh, that's a string,
1542
01:11:42,020 --> 01:11:43,000
and that's a string.
1543
01:11:43,000 --> 01:11:44,560
So, I know what to do with strings.
1544
01:11:44,560 --> 01:11:46,460
I will concatenate those two things together,
1545
01:11:46,460 --> 01:11:48,980
so it becomes another string that gets assigned
1546
01:11:48,980 --> 01:11:51,980
into EE, and it's hello space there.
1547
01:11:51,980 --> 01:11:53,380
The plus doesn't add the space.
1548
01:11:53,380 --> 01:11:56,140
I added the space by putting it right there.
1549
01:11:56,140 --> 01:11:58,100
And so, these operators are kind of smart
1550
01:11:58,100 --> 01:11:59,980
in that they kind of know what they're dealing with,
1551
01:11:59,980 --> 01:12:02,820
and sometimes they will do one thing or another
1552
01:12:02,820 --> 01:12:05,740
depending on the kinds of values, variables,
1553
01:12:05,740 --> 01:12:07,740
or constants that they're working with.
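A short sketch of plus doing two different jobs. The name ddd is an assumption; eee stands for the variable spoken as EE above.

```python
ddd = 1 + 4                  # int + int: ordinary addition
print(ddd)

eee = 'hello ' + 'there'     # str + str: concatenation
print(eee)                   # the space comes from 'hello ', not from the plus
```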
1554
01:12:09,380 --> 01:12:12,880
And so, sometimes type can get us in trouble.
1555
01:12:14,020 --> 01:12:16,740
So, here we have EE, which is hello there
1556
01:12:16,740 --> 01:12:18,900
because we've concatenated these two strings together,
1557
01:12:18,900 --> 01:12:20,140
and now we're adding one.
1558
01:12:20,140 --> 01:12:22,300
And the problem now is that it looks on one side
1559
01:12:22,300 --> 01:12:24,460
and says, that's a string, and that's a number,
1560
01:12:24,460 --> 01:12:26,340
and says, I don't know how to do that.
1561
01:12:26,340 --> 01:12:28,460
This is another one of those annoying errors
1562
01:12:28,460 --> 01:12:30,540
that you would like, you think that somehow
1563
01:12:30,540 --> 01:12:33,380
Python doesn't like you, but it just is confused.
1564
01:12:33,380 --> 01:12:35,460
If you look at these things, traceback,
1565
01:12:35,460 --> 01:12:37,500
traceback always means I quit.
1566
01:12:37,500 --> 01:12:40,100
It means I stopped, I ran, I'm quitting now
1567
01:12:40,100 --> 01:12:41,340
because I don't want to go any farther
1568
01:12:41,340 --> 01:12:43,040
because I've become confused.
1569
01:12:43,040 --> 01:12:46,100
So, your program stops running, and you say,
1570
01:12:46,100 --> 01:12:47,500
here's where I stopped running,
1571
01:12:47,500 --> 01:12:48,700
because we're typing interactively.
1572
01:12:48,700 --> 01:12:50,220
It's always line one here.
1573
01:12:50,220 --> 01:12:52,300
Type it, but you, if you read carefully
1574
01:12:52,300 --> 01:12:54,440
and you don't get too stuck on too much stuff,
1575
01:12:54,440 --> 01:12:58,260
line one that tells us something in module type error
1576
01:12:58,260 --> 01:13:01,500
can't convert int object to str implicitly.
1577
01:13:01,500 --> 01:13:03,940
So, that's an integer right there, and that's a string,
1578
01:13:03,940 --> 01:13:05,420
and that's what it's complaining about,
1579
01:13:05,420 --> 01:13:06,880
that little bit right there.
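The failing statement can be reproduced safely by catching the error, as in the sketch below. The exact wording of the message is not shown here because it differs between Python versions; the try/except wrapper is an addition for illustration, not part of the lecture's example.

```python
eee = 'hello ' + 'there'

try:
    eee = eee + 1            # str + int: Python does not know how to do this
except TypeError as err:
    # Without the try/except, the program would stop here with a traceback.
    print('TypeError:', err)

print(eee)                   # eee is unchanged because the assignment never happened
```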
1580
01:13:06,880 --> 01:13:09,700
If Python is so grumpy about types,
1581
01:13:09,700 --> 01:13:11,620
then we should be able to ask it about type.
1582
01:13:11,620 --> 01:13:15,560
So, it turns out that there is, inside Python,
1583
01:13:15,560 --> 01:13:18,940
a built-in function called type, T-Y-P-E.
1584
01:13:18,940 --> 01:13:20,780
So, we can pass into type.
1585
01:13:20,780 --> 01:13:24,640
So, the syntax is calling a built-in function named type.
1586
01:13:24,640 --> 01:13:27,620
Parenthesis is the parameter that we're passing to it.
1587
01:13:27,620 --> 01:13:29,820
We're saying, hey, hello, tell me something
1588
01:13:29,820 --> 01:13:32,900
about the type of the variable EEE.
1589
01:13:32,900 --> 01:13:34,180
And so, this is a function,
1590
01:13:34,180 --> 01:13:36,380
the parentheses are part of the function call,
1591
01:13:36,380 --> 01:13:39,900
and it says, oh, that would be of class string.
1592
01:13:39,900 --> 01:13:41,920
And then we can pass in a constant and says,
1593
01:13:41,920 --> 01:13:43,540
hey, what about hello?
1594
01:13:43,540 --> 01:13:46,140
The string hello, it's like, oh, that's a string too.
1595
01:13:46,140 --> 01:13:47,100
What about a one?
1596
01:13:47,100 --> 01:13:47,980
Well, that's an integer.
1597
01:13:47,980 --> 01:13:52,340
And so, we are asking Python, through the type function,
1598
01:13:52,340 --> 01:13:56,460
what the type of either a variable or a constant is.
1599
01:13:56,460 --> 01:13:58,340
And there are even several types of numbers,
1600
01:13:58,340 --> 01:14:01,240
and we'll even see Booleans and others later,
1601
01:14:02,500 --> 01:14:05,420
like one with no decimal, that's an integer number.
1602
01:14:05,420 --> 01:14:08,860
98.6 with a decimal, that's a floating point number.
1603
01:14:08,860 --> 01:14:13,860
And so, constants can be both integer and floating point.
1604
01:14:14,700 --> 01:14:16,820
And I'm just asking over and over and over again,
1605
01:14:16,820 --> 01:14:19,100
what is the type of, what's in xxx?
1606
01:14:19,100 --> 01:14:20,820
What's the type of what's in temp?
1607
01:14:20,820 --> 01:14:24,380
And what's the type of the constant one?
1608
01:14:24,380 --> 01:14:26,780
And what's the type of 1.0?
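Asking about types might look like this sketch. The values stored in xxx and temp are assumptions chosen to match the integer and floating point examples mentioned above.

```python
eee = 'hello ' + 'there'
print(type(eee))        # a variable holding a string
print(type('hello'))    # a string constant
print(type(1))          # an integer constant

xxx = 1
temp = 98.6
print(type(xxx))        # int
print(type(temp))       # float
print(type(1.0))        # float
```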
1609
01:14:28,060 --> 01:14:30,980
You can also use a set of built-in functions,
1610
01:14:30,980 --> 01:14:34,860
like float and int, to convert from one to another.
1611
01:14:34,860 --> 01:14:37,860
And so, this basically says, I wanna convert,
1612
01:14:37,860 --> 01:14:39,960
oops, let's go back.
1613
01:14:39,960 --> 01:14:43,780
I wanna convert 99 to a floating point number.
1614
01:14:43,780 --> 01:14:45,020
So, this is a function,
1615
01:14:45,020 --> 01:14:47,900
and it's participating in this plus,
1616
01:14:47,900 --> 01:14:49,500
but before it can finish the plus,
1617
01:14:49,500 --> 01:14:52,700
it turns this into a 99.0.
1618
01:14:52,700 --> 01:14:55,380
The difference between 99 as an integer and 99.0
1619
01:14:55,380 --> 01:14:56,860
is that it's a floating point number.
1620
01:14:56,860 --> 01:14:59,900
And that actually turns this computation,
1621
01:14:59,900 --> 01:15:01,940
as it looks to the left and looks to the right,
1622
01:15:01,940 --> 01:15:03,820
says, oh, I've got a floating point number
1623
01:15:03,820 --> 01:15:06,080
on one side, an integer on the other side,
1624
01:15:06,080 --> 01:15:08,260
and so I'm gonna make my calculation overall
1625
01:15:08,260 --> 01:15:10,540
via floating point calculation.
1626
01:15:10,540 --> 01:15:13,100
I can also pass into the float function.
1627
01:15:13,100 --> 01:15:15,420
I can say, take this variable i,
1628
01:15:15,420 --> 01:15:17,820
which has a 42, also an integer,
1629
01:15:17,820 --> 01:15:19,940
and then give me back a floating point.
1630
01:15:19,940 --> 01:15:23,340
So, that'll be 42.0, pass that into f,
1631
01:15:23,340 --> 01:15:27,700
we print it out, and it is indeed 42.0, and it's a float.
1632
01:15:27,700 --> 01:15:31,860
And so, it knows the type and value in any variable.
1633
01:15:31,860 --> 01:15:34,360
This is an integer of value 42.
1634
01:15:34,360 --> 01:15:37,500
This is a float of value 42.0.
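A sketch of the conversions just described. The integer 100 on the other side of the plus is an assumption, since the transcript only says there is an integer there; the name total is also made up here.

```python
total = float(99) + 100   # float(99) is 99.0, so the whole sum is floating point
print(total)

i = 42
print(type(i))            # int

f = float(i)              # convert the integer in i to a floating point number
print(f)
print(type(f))            # float
```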
1635
01:15:39,740 --> 01:15:42,240
Integer division in Python 2 was kinda weird,
1636
01:15:42,240 --> 01:15:43,860
and it was actually one of the big things
1637
01:15:43,860 --> 01:15:46,500
that they changed between Python 2 and Python 3.
1638
01:15:46,500 --> 01:15:47,820
This is a Python 3 course,
1639
01:15:47,820 --> 01:15:49,820
so we're not worried about that too much.
1640
01:15:49,820 --> 01:15:52,900
What's nice about integer division in Python 3
1641
01:15:52,900 --> 01:15:55,420
is it always produces a floating point result.
1642
01:15:55,420 --> 01:15:59,020
And that means that Python 3's division is more predictable,
1643
01:15:59,020 --> 01:16:02,120
and it works more like a calculator.
1644
01:16:02,120 --> 01:16:04,380
So, in this case, I mean, you can go back
1645
01:16:04,380 --> 01:16:05,780
and look at my Python 2 lectures
1646
01:16:05,780 --> 01:16:07,940
and see how crazy it was in Python 2.
1647
01:16:07,940 --> 01:16:10,940
10 divided by 2 is 5.0, and the weird thing here
1648
01:16:10,940 --> 01:16:13,420
is these are both integers, but the division
1649
01:16:13,420 --> 01:16:15,460
forces the result of the calculation
1650
01:16:15,460 --> 01:16:16,780
to be a floating point number.
1651
01:16:16,780 --> 01:16:19,780
And this, you know, 10 over 2 could be 5,
1652
01:16:19,780 --> 01:16:24,740
but 9 over 2 is 4.5, and so that is accurate.
1653
01:16:24,740 --> 01:16:27,180
In old Python 2, that would give us back 4,
1654
01:16:27,180 --> 01:16:30,940
which is completely unpredictable and weird.
1655
01:16:30,940 --> 01:16:33,140
The same with 99 over 100.
1656
01:16:33,140 --> 01:16:35,060
As you would expect if this were a calculator,
1657
01:16:35,060 --> 01:16:36,980
you get 0.99.
1658
01:16:36,980 --> 01:16:39,380
Actually, what you get in Python 2 is zero
1659
01:16:39,380 --> 01:16:40,660
because it would round it down.
1660
01:16:40,660 --> 01:16:43,180
It doesn't round at all, it truncates it.
1661
01:16:43,180 --> 01:16:47,180
So, 99 over 100 is 0.99, and then it truncates it to zero.
1662
01:16:47,180 --> 01:16:48,260
That's Python 2.
1663
01:16:48,260 --> 01:16:49,900
We're not talking about Python 2.
1664
01:16:49,900 --> 01:16:51,900
There's a good reason we're not talking about Python 2.
1665
01:16:51,900 --> 01:16:53,380
Welcome to Python 3.
1666
01:16:53,380 --> 01:16:55,700
Of course, if there is a floating point on either side,
1667
01:16:55,700 --> 01:16:57,580
the result is still a floating point,
1668
01:16:57,580 --> 01:17:00,060
floating point, and the result is still a floating point.
1669
01:17:00,060 --> 01:17:03,220
So, integer division produces a floating result
1670
01:17:03,220 --> 01:17:07,900
in Python 3.0, not in Python 2.0.
1671
01:17:07,900 --> 01:17:11,220
That is an improvement in Python 3.0.
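The division examples can be checked with a short sketch in Python 3. The variable names are made up here for illustration.

```python
a = 10 / 2        # both operands are integers, result is still a float
b = 9 / 2
c = 99 / 100
d = 10.0 / 2.0    # floating point on both sides
e = 99.0 / 100.0

print(a)          # 5.0
print(b)          # 4.5
print(c)          # 0.99
print(d)          # 5.0
print(e)          # 0.99
```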
1672
01:17:11,220 --> 01:17:13,180
And that's why we're recording these lectures.
1673
01:17:13,180 --> 01:17:15,740
I have a whole great set of lectures about Python 2,
1674
01:17:15,740 --> 01:17:17,300
and now I'm gonna have a great set of lectures
1675
01:17:17,300 --> 01:17:18,540
about Python 3.
1676
01:17:18,540 --> 01:17:19,980
Welcome to Python 3.
1677
01:17:21,140 --> 01:17:23,620
Okay, so, we've been talking about converting
1678
01:17:23,620 --> 01:17:24,780
from integer to floating point,
1679
01:17:24,780 --> 01:17:27,420
but you can also convert from string to integer
1680
01:17:27,420 --> 01:17:28,820
or string to floating point.
1681
01:17:29,860 --> 01:17:33,420
And so, here we start out with a little string value.
1682
01:17:33,420 --> 01:17:35,340
Now, it only works for strings that are made of digits.
1683
01:17:35,340 --> 01:17:38,860
So, quote one, two, three, quote is not an integer.
1684
01:17:38,860 --> 01:17:42,140
It is a three-character string that has one, two, three
1685
01:17:42,140 --> 01:17:43,660
as the characters in that string,
1686
01:17:43,660 --> 01:17:46,220
which is very different than 123.
1687
01:17:47,140 --> 01:17:48,580
We say, what is the type of this?
1688
01:17:48,580 --> 01:17:49,860
It's a string.
1689
01:17:49,860 --> 01:17:51,540
We say, let's add one to it.
1690
01:17:51,540 --> 01:17:54,740
And it says, can't convert int to string,
1691
01:17:54,740 --> 01:17:56,060
so that blows up, right?
1692
01:17:56,060 --> 01:17:57,300
Because this is a string.
1693
01:17:57,300 --> 01:17:58,340
It looks to both sides.
1694
01:17:58,340 --> 01:18:01,980
String plus an integer, not good, okay?
1695
01:18:01,980 --> 01:18:03,820
But we can convert this.
1696
01:18:03,820 --> 01:18:05,420
We can call the int function,
1697
01:18:05,420 --> 01:18:07,180
which is like the float function,
1698
01:18:07,180 --> 01:18:08,300
and pass a string in.
1699
01:18:08,300 --> 01:18:11,620
So, it says, hey, take this and turn it into an integer.
1700
01:18:11,620 --> 01:18:13,660
So, take the input of sval,
1701
01:18:13,660 --> 01:18:15,780
which is the string one, two, three,
1702
01:18:15,780 --> 01:18:18,540
and give me back an integer representation of that,
1703
01:18:18,540 --> 01:18:21,100
which is going to be 123.
1704
01:18:21,100 --> 01:18:22,780
So, we say, what kind of thing do we get back?
1705
01:18:22,780 --> 01:18:24,280
Well, we got back an integer.
1706
01:18:24,280 --> 01:18:27,260
We can now add one to it and get 124.
1707
01:18:27,260 --> 01:18:30,140
And so, you have to manage the type of things
1708
01:18:30,140 --> 01:18:33,580
and you can convert from one type to another.
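The steps just described can be sketched like this. The name sval comes from the lecture; ival is an assumed name for the converted value. The exact wording of the error message differs between Python versions, so the sketch prints whatever message it gets.

```python
sval = '123'
print(type(sval))            # <class 'str'>

# String plus integer is not allowed, so this raises a TypeError
try:
    sval + 1
except TypeError as err:
    print('TypeError:', err)

# int() gives back an integer version of the string
ival = int(sval)
print(type(ival))            # <class 'int'>
print(ival + 1)              # 124
```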
1709
01:18:33,580 --> 01:18:35,700
Now, int is not magic.
1710
01:18:35,700 --> 01:18:36,980
If you send something into it,
1711
01:18:36,980 --> 01:18:40,060
a string that doesn't consist of digits,
1712
01:18:40,060 --> 01:18:42,300
then you're gonna end up with another error.
1713
01:18:42,300 --> 01:18:44,620
Invalid literal for integer with base 10,
1714
01:18:44,620 --> 01:18:45,780
blah, blah, blah, blah, blah.
1715
01:18:45,780 --> 01:18:46,780
So, it's really complaining.
1716
01:18:46,780 --> 01:18:48,500
It says, I want these to be numbers here
1717
01:18:48,500 --> 01:18:49,900
and you just gave me letters.
1718
01:18:49,900 --> 01:18:52,260
So, that's going to cause this to fail.
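A small sketch of that failure, assuming a string of letters like 'hello bob'. The names nsv and niv are illustrative. The try and except lines are only there so the sketch keeps running after the error; they are covered later in the course.

```python
nsv = 'hello bob'

try:
    niv = int(nsv)           # letters, not digits, so this fails
except ValueError as err:
    niv = None
    print('ValueError:', err)
```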
1719
01:18:54,140 --> 01:18:55,900
Another thing that we're gonna do with variables
1720
01:18:55,900 --> 01:18:59,120
is just like the print function takes something,
1721
01:18:59,120 --> 01:19:00,980
a list of things, in this case,
1722
01:19:00,980 --> 01:19:02,620
a string, comma, a variable,
1723
01:19:02,620 --> 01:19:05,060
and then print some output in the program.
1724
01:19:05,060 --> 01:19:06,500
The opposite of that is input.
1725
01:19:06,500 --> 01:19:09,060
Actually, input generally happens before output.
1726
01:19:09,060 --> 01:19:12,860
Input is a built-in function and we pass to it a prompt,
1727
01:19:12,860 --> 01:19:15,200
a string of text that's going to be printed out
1728
01:19:15,200 --> 01:19:17,980
for the user and then it stops and waits.
1729
01:19:17,980 --> 01:19:19,560
So, it says, who are you?
1730
01:19:19,560 --> 01:19:21,580
And then right here, it just sits,
1731
01:19:21,580 --> 01:19:23,440
waiting for us to type something.
1732
01:19:23,440 --> 01:19:24,820
So, we type, blah, blah, blah, blah,
1733
01:19:24,820 --> 01:19:26,780
and then hit the enter key, right?
1734
01:19:26,780 --> 01:19:29,860
We hit the enter key and then this text
1735
01:19:29,860 --> 01:19:31,660
ends up in this variable.
1736
01:19:31,660 --> 01:19:33,700
So, this is an assignment statement
1737
01:19:33,700 --> 01:19:36,540
that chuck is the result of the input call,
1738
01:19:36,540 --> 01:19:38,620
gets copied into the nam variable.
1739
01:19:41,680 --> 01:19:43,680
So, let's do that again.
1740
01:19:43,680 --> 01:19:45,500
It's evaluating the assignment statement.
1741
01:19:45,500 --> 01:19:46,740
Remember, it's kind of this way
1742
01:19:46,740 --> 01:19:49,780
or you can think of it as do this right side first.
1743
01:19:49,780 --> 01:19:53,220
It writes this out, writes that out,
1744
01:19:53,220 --> 01:19:55,820
then it waits, wait, wait, wait, wait, wait,
1745
01:19:55,820 --> 01:20:00,020
until we hit the enter and takes this chuck
1746
01:20:00,020 --> 01:20:02,620
and that becomes the result of this input
1747
01:20:02,620 --> 01:20:05,960
which is then assigned in to nam.
1748
01:20:05,960 --> 01:20:08,460
Now, then we go sequentially to the next line.
1749
01:20:08,460 --> 01:20:10,980
It prints out welcome, comma,
1750
01:20:10,980 --> 01:20:12,840
n-a-m, contents of the variable nam.
1751
01:20:12,840 --> 01:20:14,740
Now, this one, this comma here,
1752
01:20:14,740 --> 01:20:16,960
actually does put the space in here automatically.
1753
01:20:16,960 --> 01:20:18,860
So, it says welcome space chuck.
1754
01:20:18,860 --> 01:20:20,860
So, it pulls the, there's no space in chuck,
1755
01:20:20,860 --> 01:20:23,220
just the c-h-u-c-k.
1756
01:20:23,220 --> 01:20:25,700
And so, print can take more than one thing
1757
01:20:25,700 --> 01:20:26,740
separated by commas.
1758
01:20:26,740 --> 01:20:28,420
Matter of fact, print can have,
1759
01:20:28,420 --> 01:20:31,960
you know, a whole bunch, oops, come back, come back, come back.
1760
01:20:34,220 --> 01:20:36,340
Print can have comma, comma, comma, parenthesis,
1761
01:20:36,340 --> 01:20:37,340
as many as you like.
1762
01:20:37,340 --> 01:20:38,260
Everything you've seen up to now
1763
01:20:38,260 --> 01:20:39,380
is kind of one thing in the print
1764
01:20:39,380 --> 01:20:42,340
but that doesn't mean that print only can do one thing.
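Here is a rough sketch of the input and print steps. In a real run the first line would be nam = input('Who are you? '), and the program would stop and wait for typing. The typed text is hard-coded here, as an assumption, so the sketch runs without a keyboard.

```python
# Stands in for: nam = input('Who are you? ')
nam = 'Chuck'

# The comma makes print put a space between the two things
print('Welcome', nam)                        # Welcome Chuck

# print accepts as many comma-separated things as you like
print('Welcome', nam, 'to', 'Python', 3)     # Welcome Chuck to Python 3
```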
1765
01:20:43,180 --> 01:20:44,700
So, I've talked about variables,
1766
01:20:44,700 --> 01:20:45,860
we've talked about constants,
1767
01:20:45,860 --> 01:20:46,900
we've talked about input,
1768
01:20:46,900 --> 01:20:47,780
we've talked about output,
1769
01:20:47,780 --> 01:20:51,580
and now it is time to write our first meaningful program.
1770
01:20:52,660 --> 01:20:55,740
And so, this program has to do with those of you
1771
01:20:55,740 --> 01:20:58,060
who have traveled internationally.
1772
01:20:58,060 --> 01:20:59,620
If you traveled to United States
1773
01:20:59,620 --> 01:21:01,080
and you traveled outside the United States,
1774
01:21:01,080 --> 01:21:03,980
you notice that there is an elevator convention
1775
01:21:03,980 --> 01:21:05,700
that is different inside the United States.
1776
01:21:05,700 --> 01:21:08,700
In the United States, you walk in the ground floor
1777
01:21:08,700 --> 01:21:10,460
in the elevator, that's one.
1778
01:21:10,460 --> 01:21:12,440
And if you walk in a ground floor in Europe
1779
01:21:12,440 --> 01:21:13,940
or many other places in the world,
1780
01:21:13,940 --> 01:21:15,960
then the elevator is zero.
1781
01:21:15,960 --> 01:21:17,580
So, we have written a small app
1782
01:21:17,580 --> 01:21:18,620
that we're gonna put on the app store
1783
01:21:18,620 --> 01:21:19,780
and get wealthy with,
1784
01:21:19,780 --> 01:21:23,780
called Elevator Floor Conversion App.
1785
01:21:23,780 --> 01:21:27,020
And it's gonna ask us, we're in Europe and we're lost,
1786
01:21:27,020 --> 01:21:28,900
and you say, well, what floor would this be
1787
01:21:28,900 --> 01:21:31,380
if I was in the United States of America?
1788
01:21:31,380 --> 01:21:33,660
And so, here's, we have to read the floor
1789
01:21:33,660 --> 01:21:35,680
that we are at in Europe,
1790
01:21:35,680 --> 01:21:38,500
and then we're going to convert it to a US floor,
1791
01:21:38,500 --> 01:21:39,520
and then we're gonna print it out.
1792
01:21:39,520 --> 01:21:41,580
This is very silly,
1793
01:21:41,580 --> 01:21:46,580
but it is a pure, essential program that has input,
1794
01:21:47,300 --> 01:21:49,620
does some kind of task on that input,
1795
01:21:49,620 --> 01:21:51,180
and then produces some output,
1796
01:21:51,180 --> 01:21:55,800
which is useful for some value of useful, okay?
1797
01:21:55,800 --> 01:21:57,700
So, let's take a look at how we combine
1798
01:21:57,700 --> 01:21:59,580
everything that we learned in this lecture,
1799
01:21:59,580 --> 01:22:01,340
input, processing, and output.
1800
01:22:01,340 --> 01:22:03,220
It's a three-line program,
1801
01:22:03,220 --> 01:22:05,100
but it's sort of the beginning
1802
01:22:05,100 --> 01:22:07,380
of something that programs do, okay?
1803
01:22:07,380 --> 01:22:09,100
You're gonna do lots of programs that do this.
1804
01:22:09,100 --> 01:22:11,560
So, here we go.
1805
01:22:11,560 --> 01:22:14,480
Program starts, we do the input side effect.
1806
01:22:14,480 --> 01:22:17,280
It prints out this and then waits.
1807
01:22:17,280 --> 01:22:20,220
We type in zero, that comes back here,
1808
01:22:20,220 --> 01:22:22,700
and the zero, which is a string.
1809
01:22:22,700 --> 01:22:24,500
Input gives you back a string.
1810
01:22:24,500 --> 01:22:26,620
It doesn't give you back a number.
1811
01:22:26,620 --> 01:22:27,780
It's a little different in Python 2,
1812
01:22:27,780 --> 01:22:29,860
but in Python 3, input gives you a string.
1813
01:22:29,860 --> 01:22:32,300
So, quote zero, quote, which is what we typed here.
1814
01:22:32,300 --> 01:22:34,300
We didn't type the quotes, it's a string.
1815
01:22:34,300 --> 01:22:35,980
It gets stored in the inp variable.
1816
01:22:37,020 --> 01:22:39,240
Then we move to the next statement,
1817
01:22:39,240 --> 01:22:40,500
and on this right-hand side,
1818
01:22:40,500 --> 01:22:43,060
we convert that string variable to an integer,
1819
01:22:43,060 --> 01:22:44,940
so that becomes the integer zero.
1820
01:22:44,940 --> 01:22:48,260
We add one to it, and then that becomes one,
1821
01:22:48,260 --> 01:22:50,260
and then we assign that into USF.
1822
01:22:50,260 --> 01:22:54,340
I've named this variable United States Floor, right?
1823
01:22:54,340 --> 01:22:57,220
So, inp is the input, and USF, that's mnemonic.
1824
01:22:57,220 --> 01:22:58,880
It doesn't know anything about elevators,
1825
01:22:58,880 --> 01:23:01,840
it's just I picked a variable that was quite friendly.
1826
01:23:03,220 --> 01:23:07,940
And so, at this point, USF has the United States Floor
1827
01:23:07,940 --> 01:23:09,740
that's equivalent to the European Floor,
1828
01:23:09,740 --> 01:23:12,460
and then I just fall down and I do a print statement.
1829
01:23:12,460 --> 01:23:15,520
Print out US floor, comma,
1830
01:23:15,520 --> 01:23:16,980
that's the space right here,
1831
01:23:16,980 --> 01:23:20,060
and then whatever the contents of the USFloor variable is.
1832
01:23:20,060 --> 01:23:22,560
And you could see that I could write this on four,
1833
01:23:22,560 --> 01:23:23,740
and it would say three.
1834
01:23:23,740 --> 01:23:26,940
I could write this and say seven, and it would say six.
1835
01:23:26,940 --> 01:23:28,300
This is an amazing program.
1836
01:23:28,300 --> 01:23:32,460
It converts floors in a European numbering scheme.
1837
01:23:32,460 --> 01:23:35,240
Wait, actually, no, I got that wrong.
1838
01:23:35,240 --> 01:23:37,780
Hang on, let me clear this.
1839
01:23:37,780 --> 01:23:39,820
I wasn't thinking clearly.
1840
01:23:39,820 --> 01:23:42,840
I could type in four, and it would give me back five.
1841
01:23:42,840 --> 01:23:45,540
I could type in six, and it would give me back seven.
1842
01:23:45,540 --> 01:23:47,220
See, I'm confused, haven't been in Europe
1843
01:23:47,220 --> 01:23:50,140
in a couple of months, and so I forgot all about the floors,
1844
01:23:50,140 --> 01:23:51,700
but that's the idea.
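The three-line program can be sketched as follows. The prompt text and the lowercase names inp and usf are a reconstruction from the narration, so treat them as assumptions. The input call is replaced by a fixed string so the sketch runs without waiting for the keyboard.

```python
# Input: stands in for  inp = input('Europe floor? ')
inp = '0'                    # input always gives back a string

# Processing: convert the string to an integer and add one
usf = int(inp) + 1

# Output
print('US floor', usf)       # US floor 1

# The corrected examples from the lecture: 4 gives 5, 6 gives 7
print('US floor', int('4') + 1)   # US floor 5
print('US floor', int('6') + 1)   # US floor 7
```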
1845
01:23:51,700 --> 01:23:55,900
Now, this is a super, super, super simple program.
1846
01:23:55,900 --> 01:23:58,520
Not super useful, but you get the idea
1847
01:23:58,520 --> 01:24:00,180
that we're gonna pull some data in,
1848
01:24:00,180 --> 01:24:02,900
we're gonna do some intelligent thing.
1849
01:24:02,900 --> 01:24:05,220
Soon this will be hundreds of lines of code
1850
01:24:05,220 --> 01:24:06,060
instead of one line of code,
1851
01:24:06,060 --> 01:24:08,660
and then we're gonna present the results to our user.
1852
01:24:12,140 --> 01:24:14,820
Now, another element of most any programming language
1853
01:24:14,820 --> 01:24:16,400
is what's called a comment.
1854
01:24:16,400 --> 01:24:20,700
A comment is a way for you to put in a program file
1855
01:24:20,700 --> 01:24:24,380
some text that's to be ignored by Python or C
1856
01:24:24,380 --> 01:24:26,340
or whatever language we happen to be using.
1857
01:24:26,340 --> 01:24:29,780
In Python, comments start with a pound sign.
1858
01:24:29,780 --> 01:24:32,660
So what you can do is put a pound sign anywhere in a line,
1859
01:24:32,660 --> 01:24:34,860
and then after the pound sign,
1860
01:24:34,860 --> 01:24:36,860
Python ignores everything after that pound sign.
1861
01:24:36,860 --> 01:24:38,360
It can be the first character.
1862
01:24:38,360 --> 01:24:43,180
So here's our recurring concept that we talk a lot about.
1863
01:24:43,180 --> 01:24:44,540
We're not gonna cover this.
1864
01:24:44,540 --> 01:24:45,460
Remember what this does.
1865
01:24:45,460 --> 01:24:47,860
This is counting how many letters, the, the, the.
1866
01:24:47,860 --> 01:24:50,260
There's 16 thes, and there's, in that file,
1867
01:24:50,260 --> 01:24:52,380
there was six twos or whatever it was.
1868
01:24:52,380 --> 01:24:53,220
This is that code.
1869
01:24:53,220 --> 01:24:54,960
We'll get back to this code.
1870
01:24:54,960 --> 01:24:57,300
But what we've done here is I've added some comments
1871
01:24:57,300 --> 01:25:00,740
that are really for human consumption.
1872
01:25:00,740 --> 01:25:03,060
So this first paragraph is get the name of the file
1873
01:25:03,060 --> 01:25:03,900
and open it.
1874
01:25:03,900 --> 01:25:06,820
The second paragraph is count the word frequency.
1875
01:25:06,820 --> 01:25:09,220
You know, maybe I should have said histogram here.
1876
01:25:10,140 --> 01:25:12,540
Count the word frequency and assemble a histogram.
1877
01:25:12,540 --> 01:25:15,180
And then here I'm putting this pound sign in,
1878
01:25:15,180 --> 01:25:16,260
find the most common word,
1879
01:25:16,260 --> 01:25:18,980
and then I'm all done, I print this stuff out, right?
1880
01:25:18,980 --> 01:25:23,140
And so all I'm saying is comments are for people to read.
1881
01:25:23,140 --> 01:25:25,700
Your next programmer or the person who's gonna change
1882
01:25:25,700 --> 01:25:28,340
your program after you're done with it.
1883
01:25:28,340 --> 01:25:29,180
And they're nice.
1884
01:25:29,180 --> 01:25:32,020
And you don't have to use any particularly weird syntax
1885
01:25:32,020 --> 01:25:33,860
or variable naming conventions.
1886
01:25:33,860 --> 01:25:36,780
You put a pound sign in and you can write anything you want
1887
01:25:36,780 --> 01:25:38,040
from that point forward.
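Here is a sketch of commented code in the same spirit as the word-counting program. It is not the lecture's code: the text is a made-up string rather than a file, and the names text, counts, bigword, and bigcount are assumptions. The point is only how the pound-sign comments label each paragraph.

```python
# Get the text (the lecture version asks for a file name and opens it)
text = 'the clown ran after the car and the car ran into the tent'

# Count word frequency
counts = dict()
for word in text.split():
    counts[word] = counts.get(word, 0) + 1   # a comment can follow code too

# Find the most common word
bigword = None
bigcount = None
for word, count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

# All done
print(bigword, bigcount)     # the 4
```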
1888
01:25:39,720 --> 01:25:42,560
Okay, so we've talked a little bit about variables
1889
01:25:42,560 --> 01:25:45,260
and types and mnemonics and how we would choose
1890
01:25:45,260 --> 01:25:47,780
variable names and how expressions work
1891
01:25:47,780 --> 01:25:49,440
and the various operators converting
1892
01:25:49,440 --> 01:25:52,540
between different types, printing,
1893
01:25:52,540 --> 01:25:54,660
input, output, and comments.
1894
01:25:54,660 --> 01:25:58,280
So that just kinda gets us sentences.
1895
01:25:58,280 --> 01:26:01,640
Coming up next we'll talk about conditional execution
1896
01:26:01,640 --> 01:26:03,940
where we're really starting to move up to paragraphs.
1897
01:26:03,940 --> 01:26:05,100
So see you in a bit.
1898
01:26:09,600 --> 01:26:12,180
Hello and welcome to chapter three, conditional execution.
1899
01:26:12,180 --> 01:26:15,300
In conditional execution we meet the if statement.
1900
01:26:15,300 --> 01:26:17,960
The if statement is where Python can go one way
1901
01:26:17,960 --> 01:26:19,180
or another way.
1902
01:26:19,180 --> 01:26:21,620
And it's the beginning of sort of our way
1903
01:26:21,620 --> 01:26:25,140
of making Python make decisions for us.
1904
01:26:25,140 --> 01:26:27,380
Sequential code, we just do some things.
1905
01:26:27,380 --> 01:26:28,500
Sometimes that's useful.
1906
01:26:28,500 --> 01:26:32,420
But now we can have our code check something
1907
01:26:32,420 --> 01:26:35,740
and then make a decision based on that thing.
1908
01:26:35,740 --> 01:26:37,820
So the conditional steps in Python
1909
01:26:37,820 --> 01:26:39,700
are pretty straightforward.
1910
01:26:39,700 --> 01:26:42,540
The keyword that we're going to use is the if statement.
1911
01:26:42,540 --> 01:26:44,460
And so if is a reserved word.
1912
01:26:44,460 --> 01:26:47,900
And the if statement has as part of it
1913
01:26:47,900 --> 01:26:48,980
a question that it asks.
1914
01:26:48,980 --> 01:26:51,780
And this is asking if x is less than 10.
1915
01:26:51,780 --> 01:26:53,740
And the colon is the end of the if statement.
1916
01:26:53,740 --> 01:26:56,540
And then we begin an indented block of text.
1917
01:26:56,540 --> 01:26:58,460
And the way this works in this particular thing
1918
01:26:58,460 --> 01:27:00,820
is this line is the conditional line.
1919
01:27:00,820 --> 01:27:03,820
If the question is true, the line executes.
1920
01:27:03,820 --> 01:27:06,580
And if the question is false, the line is skipped.
1921
01:27:06,580 --> 01:27:08,600
And you can think of it the way this is, right?
1922
01:27:08,600 --> 01:27:10,500
x is five, ask a question.
1923
01:27:10,500 --> 01:27:11,780
Is it 10 or not?
1924
01:27:11,780 --> 01:27:14,420
These questions do not harm the value of x.
1925
01:27:14,420 --> 01:27:16,700
If it is, then we run this code.
1926
01:27:16,700 --> 01:27:18,140
And then we sort of rejoin here.
1927
01:27:18,140 --> 01:27:20,420
And then we test this next if.
1928
01:27:20,420 --> 01:27:22,020
And if that's true, we do this code.
1929
01:27:22,020 --> 01:27:23,100
And then we do there.
1930
01:27:23,100 --> 01:27:24,580
But in this case, it's going to be false
1931
01:27:24,580 --> 01:27:26,060
because x is not greater than 20.
1932
01:27:26,060 --> 01:27:28,100
And so it just continues down here.
1933
01:27:28,100 --> 01:27:32,860
So if we look at how this works, it runs.
1934
01:27:32,860 --> 01:27:34,300
It runs this line.
1935
01:27:34,300 --> 01:27:35,740
Then it sees this question.
1936
01:27:35,740 --> 01:27:36,580
It skips that line.
1937
01:27:36,580 --> 01:27:38,220
So this line does not run.
1938
01:27:38,220 --> 01:27:39,940
And so smaller prints out.
1939
01:27:39,940 --> 01:27:42,020
And Finis prints out.
1940
01:27:42,020 --> 01:27:42,700
OK?
1941
01:27:42,700 --> 01:27:45,460
And so that's the basic idea of an if statement.
1942
01:27:45,460 --> 01:27:49,140
And the indentation, when we are done with an if statement,
1943
01:27:49,140 --> 01:27:50,220
we deindent back.
1944
01:27:50,220 --> 01:27:52,020
And there's this little block.
1945
01:27:52,020 --> 01:27:53,900
This is one sort of if statement.
1946
01:27:53,900 --> 01:27:56,480
And this is another if statement.
1947
01:27:56,480 --> 01:27:58,140
And these are the two conditional lines
1948
01:27:58,140 --> 01:27:59,840
that either run or they don't run,
1949
01:27:59,840 --> 01:28:04,180
depending on the answer to that question.
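The two if statements walked through above can be sketched like this. The printed strings are a reconstruction of the slide from the narration and may not match it exactly.

```python
x = 5

if x < 10:               # true, so the indented line runs
    print('Smaller')

if x > 20:               # false, so the indented line is skipped
    print('Bigger')

print('Finis')           # not indented, so it always runs
```

Running it prints Smaller and then Finis. Asking the questions does not change the value of x.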
1950
01:28:04,180 --> 01:28:06,500
So we have a number of different comparison operators
1951
01:28:06,500 --> 01:28:09,980
that we can use to ask these true-false questions that
1952
01:28:09,980 --> 01:28:11,660
say, is this true?
1953
01:28:11,660 --> 01:28:14,420
So again, we're kind of limited to the keys
1954
01:28:14,420 --> 01:28:19,860
that were on computer keyboards in the 1940s and 1950s.
1955
01:28:19,860 --> 01:28:21,780
Less than, less than or equal to.
1956
01:28:21,780 --> 01:28:24,180
So we didn't have fancy math characters.
1957
01:28:24,180 --> 01:28:26,700
So we just concatenated less than and equal
1958
01:28:26,700 --> 01:28:28,380
to be less than or equal to.
1959
01:28:28,380 --> 01:28:32,460
This double equals is the asking, is this equal to?
1960
01:28:32,460 --> 01:28:34,380
And so that's a little tricky.
1961
01:28:34,380 --> 01:28:37,260
The equals sign is that assignment operator.
1962
01:28:37,260 --> 01:28:39,340
If I was building a language today from scratch,
1963
01:28:39,340 --> 01:28:41,340
I would probably make assignment be arrow.
1964
01:28:41,340 --> 01:28:44,660
And the equals question to have an equals.
1965
01:28:44,660 --> 01:28:49,060
Or I might say somewhere I would say question equals.
1966
01:28:49,060 --> 01:28:51,860
But I'm not building this language.
1967
01:28:51,860 --> 01:28:54,020
So that's not up to me.
1968
01:28:54,020 --> 01:28:55,420
So this is the question.
1969
01:28:55,420 --> 01:28:59,660
Double equals is asking the question is equal to.
1970
01:28:59,660 --> 01:29:02,940
Greater than or equal, greater than, and not equal.
1971
01:29:02,940 --> 01:29:06,220
So this is the exclamation point is sort of like not equal.
1972
01:29:06,220 --> 01:29:07,940
So that's sort of not equal.
1973
01:29:07,940 --> 01:29:09,100
So that's how we do not equal.
1974
01:29:09,100 --> 01:29:12,580
So if we take a look at some of these in some examples,
1975
01:29:12,580 --> 01:29:17,500
all of these are going to be true because of the way x is set.
1976
01:29:17,500 --> 01:29:20,900
If x is equal to 5, that's the question version.
1977
01:29:20,900 --> 01:29:22,420
That's true or false.
1978
01:29:22,420 --> 01:29:23,660
It'll execute that.
1979
01:29:23,660 --> 01:29:26,580
If x is greater than 4, it's going to execute that.
1980
01:29:26,580 --> 01:29:29,180
If x is greater than or equal to 5, it's going to execute that.
1981
01:29:29,180 --> 01:29:31,300
Here's kind of a shorthand where if there's only
1982
01:29:31,300 --> 01:29:33,860
one line in this block, you can kind of pull it up right
1983
01:29:33,860 --> 01:29:35,780
on the same line after the colon.
1984
01:29:35,780 --> 01:29:38,540
If x is less than 6, which it is, true.
1985
01:29:38,540 --> 01:29:40,180
Execute that.
1986
01:29:40,180 --> 01:29:42,460
Then if x is less than or equal to 5, do that.
1987
01:29:42,460 --> 01:29:44,740
And if x is not equal to 6, do that.
1988
01:29:44,740 --> 01:29:47,020
Now like I said, all these questions
1989
01:29:47,020 --> 01:29:49,700
have been carefully constructed so that they're true.
1990
01:29:49,700 --> 01:29:52,900
Just to kind of show you the syntax of those comparison
1991
01:29:52,900 --> 01:29:53,980
operators.
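A short sketch of the comparison operators, with x set to 5 so that every question comes out true. The printed strings are illustrative assumptions.

```python
x = 5

if x == 5:
    print('Equals 5')
if x > 4:
    print('Greater than 4')
if x >= 5:
    print('Greater than or equals 5')
if x < 6: print('Less than 6')    # shorthand: one-line body after the colon
if x <= 5:
    print('Less than or equals 5')
if x != 6:
    print('Not equal 6')
```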
1992
01:29:53,980 --> 01:29:56,340
Now you don't just have to have a single line of text
1993
01:29:56,340 --> 01:29:57,940
in the indented block.
1994
01:29:57,940 --> 01:30:00,380
And this will be something you're going to get used to.
1995
01:30:00,380 --> 01:30:03,140
So if we indent more than one line,
1996
01:30:03,140 --> 01:30:08,500
then the conditional code is actually these three lines.
1997
01:30:08,500 --> 01:30:10,300
So the idea is you have an if statement.
1998
01:30:10,300 --> 01:30:12,060
You come in, you do an indent.
1999
01:30:12,060 --> 01:30:13,420
And as long as you stay indented,
2000
01:30:13,420 --> 01:30:15,100
you stay in that if block.
2001
01:30:15,100 --> 01:30:19,420
If it's false, it just skips all of those.
2002
01:30:19,420 --> 01:30:24,660
So the way this is going to execute, x is 5.
2003
01:30:24,660 --> 01:30:25,860
It prints Before 5.
2004
01:30:25,860 --> 01:30:26,940
Is x equal 5?
2005
01:30:26,940 --> 01:30:29,020
That's the question mark, and that's true.
2006
01:30:29,020 --> 01:30:31,060
So it's going to run all these.
2007
01:30:31,060 --> 01:30:33,940
And then come back, and then continue on, and then de-indent.
2008
01:30:33,940 --> 01:30:36,700
So all this stuff is running.
2009
01:30:36,700 --> 01:30:38,340
And then it says if x equals 6.
2010
01:30:38,340 --> 01:30:39,700
So that was false.
2011
01:30:39,700 --> 01:30:41,140
So that skips all of them.
2012
01:30:41,140 --> 01:30:43,700
So none of these lines of code run.
2013
01:30:43,700 --> 01:30:49,460
So these actually don't run, and it says afterwards 6.
2014
01:30:49,460 --> 01:30:50,540
So that's a mistake.
2015
01:30:50,540 --> 01:30:56,060
Those don't run right there, because x is not equal 6.
2016
01:30:56,060 --> 01:30:57,940
OK?
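Here is a sketch of an if block with more than one indented line. The strings are reconstructed from the narration and are assumptions. The first block runs all three lines; the second block is skipped entirely.

```python
x = 5
print('Before 5')
if x == 5:                   # true: all three indented lines run
    print('Is 5')
    print('Is Still 5')
    print('Third 5')
print('Afterwards 5')

print('Before 6')
if x == 6:                   # false: all three indented lines are skipped
    print('Is 6')
    print('Is Still 6')
    print('Third 6')
print('Afterwards 6')
```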
2017
01:30:57,940 --> 01:31:03,740
So indentation is an essential part of Python.
2018
01:31:03,740 --> 01:31:06,180
We use indentation in lots of programming languages,
2019
01:31:06,180 --> 01:31:10,060
often to demarcate blocks to show
2020
01:31:10,060 --> 01:31:11,820
where blocks start and stop.
2021
01:31:11,820 --> 01:31:15,220
But in Python, it's syntactically significant.
2022
01:31:15,220 --> 01:31:17,780
You can make an error if your indentation is wrong.
2023
01:31:17,780 --> 01:31:19,580
After an if, you must indent.
2024
01:31:19,580 --> 01:31:21,140
And you maintain the indent as long
2025
01:31:21,140 --> 01:31:23,740
as you want to be in that same if block.
2026
01:31:23,740 --> 01:31:25,540
And then when you're done with the if block,
2027
01:31:25,540 --> 01:31:27,060
you reduce the indent.
2028
01:31:27,060 --> 01:31:31,260
In this rule of indenting, comment lines and blank lines
2029
01:31:31,260 --> 01:31:34,380
are completely ignored.
2030
01:31:34,380 --> 01:31:36,780
So we're going to tend to put four spaces.
2031
01:31:36,780 --> 01:31:44,220
Four spaces ends up being the normal thing that we do.
2032
01:31:44,220 --> 01:31:46,340
And you'll see all the code that I write
2033
01:31:46,340 --> 01:31:48,180
has four spaces for each indent.
2034
01:31:48,180 --> 01:31:51,020
If I go in twice, I use eight spaces.
2035
01:31:51,020 --> 01:31:52,540
And we have this instinct of wanting
2036
01:31:52,540 --> 01:31:55,300
to hit the Tab key to move in four spaces.
2037
01:31:55,300 --> 01:31:57,660
Now, the problem is that it might
2038
01:31:57,660 --> 01:31:59,180
look the same on your screen.
2039
01:31:59,180 --> 01:32:02,580
A Tab and four spaces might line up the same place,
2040
01:32:02,580 --> 01:32:04,740
depending on how tabs are set.
2041
01:32:04,740 --> 01:32:06,460
But Python can get confused by that.
2042
01:32:06,460 --> 01:32:11,260
So we tend to avoid using actual tabs in files.
2043
01:32:11,260 --> 01:32:12,920
And so most programming text editors,
2044
01:32:12,920 --> 01:32:15,220
like if you're using Notepad++ or TextWrangler,
2045
01:32:15,220 --> 01:32:17,980
there is a place to set the tabs,
2046
01:32:17,980 --> 01:32:19,740
to say don't put tabs in this document.
2047
01:32:19,740 --> 01:32:22,140
But every time you hit Tab, move over four spaces.
2048
01:32:22,140 --> 01:32:24,460
And so if you hit a Tab, but it's like space, space, space,
2049
01:32:24,460 --> 01:32:25,460
space, space.
2050
01:32:25,460 --> 01:32:28,820
Now, the nice thing about Atom, and this is the text editor
2051
01:32:28,820 --> 01:32:30,300
we tend to recommend in this class,
2052
01:32:30,300 --> 01:32:33,700
A, because it works on Windows, Linux, and Mac,
2053
01:32:33,700 --> 01:32:36,060
but also because it automatically sets this up.
2054
01:32:36,060 --> 01:32:38,900
As soon as you save your file with a .py extension,
2055
01:32:38,900 --> 01:32:41,260
you can sort of hit the Tab key with impunity.
2056
01:32:41,260 --> 01:32:44,060
And everything works perfectly.
2057
01:32:44,060 --> 01:32:47,140
But the key thing here is that Python insists
2058
01:32:47,140 --> 01:32:48,380
that you get this right.
2059
01:32:48,380 --> 01:32:50,420
And if you don't get this right, you're
2060
01:32:50,420 --> 01:32:51,820
going to get indentation errors.
2061
01:32:51,820 --> 01:32:56,220
And they're just another syntax error.
2062
01:32:56,220 --> 01:33:01,540
So if you're using something like TextWrangler or Notepad++,
2063
01:33:01,540 --> 01:33:03,040
run around in the Preferences, and you'll
2064
01:33:03,040 --> 01:33:05,220
find something about expanding tabs,
2065
01:33:05,220 --> 01:33:09,260
or maybe how many spaces each tab stop is supposed to be.
2066
01:33:09,260 --> 01:33:10,380
And so you check these.
2067
01:33:10,380 --> 01:33:12,700
And what this really is doing is telling your text editor,
2068
01:33:12,700 --> 01:33:15,100
never put an actual tab in the document,
2069
01:33:15,100 --> 01:33:19,140
but somehow simulate tab stops using spaces.
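To see why real tabs cause trouble, here is a small sketch that hands Python a block where one line is indented with four spaces and the next with a tab character. The names source and outcome are made up for illustration. It uses the built-in compile function so the bad code is only checked, not run.

```python
# Line 2 is indented with four spaces, line 3 with one tab character
source = "if True:\n    a = 1\n\tb = 2\n"

try:
    compile(source, '<demo>', 'exec')
    outcome = 'compiled'
except TabError as err:
    outcome = 'TabError'
    print('TabError:', err)

# TabError is a kind of IndentationError, which is a kind of SyntaxError
print(issubclass(TabError, IndentationError))      # True
print(issubclass(IndentationError, SyntaxError))   # True
```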
2070
01:33:19,140 --> 01:33:21,780
And so here is a bit of code.
2071
01:33:21,780 --> 01:33:23,740
It's got some nested block.
2072
01:33:23,740 --> 01:33:27,060
But it gives you the sense that you have to be very explicit
2073
01:33:27,060 --> 01:33:30,580
when you're reading Python code of whether the indent is
2074
01:33:30,580 --> 01:33:36,300
the same between two lines, the same, increased or decreased.
2075
01:33:36,300 --> 01:33:38,700
And every time you increase it, you mean something.
2076
01:33:38,700 --> 01:33:40,260
And every time you decrease it, you mean something.
2077
01:33:40,260 --> 01:33:42,020
And literally, if it stays the same,
2078
01:33:42,020 --> 01:33:43,740
you mean something as well.
2079
01:33:43,740 --> 01:33:46,460
And so if we take a look at this, here we have a line.
2080
01:33:46,460 --> 01:33:48,460
And the next line has the same indent.
2081
01:33:48,460 --> 01:33:50,380
This is an if with a colon at the end.
2082
01:33:50,380 --> 01:33:52,460
So we have to increase the indent.
2083
01:33:52,460 --> 01:33:54,900
And now we're maintaining it.
2084
01:33:54,900 --> 01:33:56,980
So these two lines are part of that if.
2085
01:33:56,980 --> 01:33:58,540
But now we have deindented.
2086
01:33:58,540 --> 01:34:02,260
So whether you choose to deindent this word, or this word,
2087
01:34:02,260 --> 01:34:05,100
or whatever, where you do this deindent
2088
01:34:05,100 --> 01:34:09,380
affects the scope of how far this if statement lasts.
2089
01:34:09,380 --> 01:34:13,460
It lasts up to, but not including, the line that's
2090
01:34:13,460 --> 01:34:16,140
deindented to the same level as the if.
2091
01:34:16,140 --> 01:34:17,700
So this is a deindent.
2092
01:34:17,700 --> 01:34:19,840
Now we have a blank line, which doesn't matter.
2093
01:34:19,840 --> 01:34:20,980
And we maintain it.
2094
01:34:20,980 --> 01:34:23,340
And we have a for, which we'll learn about in the next chapter,
2095
01:34:23,340 --> 01:34:24,660
which is a looping structure.
2096
01:34:24,660 --> 01:34:25,460
Let's do a for.
2097
01:34:25,460 --> 01:34:27,380
For runs this five times.
2098
01:34:27,380 --> 01:34:28,180
It has a colon.
2099
01:34:28,180 --> 01:34:30,860
And it also expects an indented block.
2100
01:34:30,860 --> 01:34:32,740
Now we have what's called a nested block,
2101
01:34:32,740 --> 01:34:34,500
where we have an if and a colon.
2102
01:34:34,500 --> 01:34:35,820
We go into some more.
2103
01:34:35,820 --> 01:34:37,660
So this is like two indents.
2104
01:34:37,660 --> 01:34:39,260
So these are one indent.
2105
01:34:39,260 --> 01:34:40,300
And these are two indents.
2106
01:34:40,300 --> 01:34:43,380
And so this is a block within a block.
2107
01:34:43,380 --> 01:34:45,020
And then we deindent.
2108
01:34:45,020 --> 01:34:47,300
So that means this print is not part of the if statement,
2109
01:34:47,300 --> 01:34:49,700
but it's still part of the for statement.
2110
01:34:49,700 --> 01:34:51,580
And then we deindent again.
2111
01:34:51,580 --> 01:34:55,200
And then that means this print is on the same level
2112
01:34:55,200 --> 01:34:56,620
as that for statement.
2113
01:34:56,620 --> 01:34:59,460
So if you start thinking about this,
2114
01:34:59,460 --> 01:35:01,020
you want to be able to start thinking
2115
01:35:01,020 --> 01:35:04,340
that these blocks are the start of the block with the colon
2116
01:35:04,340 --> 01:35:08,220
line up to, but not including this line that's
2117
01:35:08,220 --> 01:35:09,380
been deindented.
2118
01:35:09,380 --> 01:35:12,340
So the for goes this far.
2119
01:35:12,340 --> 01:35:14,180
The for goes up to, but not including
2120
01:35:14,180 --> 01:35:15,660
the line that's deindented.
2121
01:35:15,660 --> 01:35:18,340
The if goes up to, but not including
2122
01:35:18,340 --> 01:35:20,260
the line that's deindented.
2123
01:35:20,260 --> 01:35:22,780
So as you do this, you'll sort of mentally
2124
01:35:22,780 --> 01:35:24,180
start drawing these blocks.
2125
01:35:24,180 --> 01:35:26,900
And pretty soon, you will start constructing them as blocks.
2126
01:35:26,900 --> 01:35:30,340
And it takes a while, but doesn't take forever.
2127
01:35:30,340 --> 01:35:43,460
But in Python, unlike other languages,
2128
01:35:43,460 --> 01:35:47,700
this indentation is very important, and it matters.
2129
01:35:47,700 --> 01:35:49,780
And you can have syntax errors if you get it wrong.
2130
01:35:49,780 --> 01:35:51,580
Because you're really communicating
2131
01:35:51,580 --> 01:35:53,580
the shape and structure of your code
2132
01:35:53,580 --> 01:35:56,840
using these indents and deindents.
2133
01:35:56,840 --> 01:35:58,380
We already saw a nested indent.
2134
01:35:58,380 --> 01:36:00,020
This is a nested if.
2135
01:36:00,020 --> 01:36:01,900
So you can put an if within an if.
2136
01:36:01,900 --> 01:36:03,820
And you can go as far deep as you want to go,
2137
01:36:03,820 --> 01:36:05,500
like Russian dolls.
2138
01:36:05,500 --> 01:36:08,140
And so here we have x equals 42.
2139
01:36:08,140 --> 01:36:10,060
If it's one, we indent one.
2140
01:36:10,060 --> 01:36:11,620
And then with this next thing we do,
2141
01:36:11,620 --> 01:36:13,460
these are at the same level of indent.
2142
01:36:13,460 --> 01:36:16,060
But now we see an if, and it has to indent further.
2143
01:36:16,060 --> 01:36:19,380
So this is like two in, eight spaces.
2144
01:36:19,380 --> 01:36:21,340
And then we deindent back.
2145
01:36:21,340 --> 01:36:22,820
Actually, we deindent back two.
2146
01:36:22,820 --> 01:36:24,500
And so if you'll watch this, and you
2147
01:36:24,500 --> 01:36:27,700
take a look at how this works, it runs to here.
2148
01:36:27,700 --> 01:36:30,140
Oops, back up.
2149
01:36:30,140 --> 01:36:30,820
Comes in here.
2150
01:36:30,820 --> 01:36:33,180
The answer is yes, x is greater than one.
2151
01:36:33,180 --> 01:36:33,740
Prints this.
2152
01:36:33,740 --> 01:36:34,860
Is x less than 100?
2153
01:36:34,860 --> 01:36:36,820
Well, it's 42, so the answer is yes.
2154
01:36:36,820 --> 01:36:39,900
So it runs this, and then it kind of continues back to there.
2155
01:36:39,900 --> 01:36:42,140
And you can also think of drawing boxes around this.
2156
01:36:42,140 --> 01:36:44,340
This is one if box.
2157
01:36:44,340 --> 01:36:47,860
And then within that if box, there is another if box.
2158
01:36:47,860 --> 01:36:51,260
And again, it's the indent block up to,
2159
01:36:51,260 --> 01:36:53,540
but not including where the deindent happens.
2160
01:36:53,540 --> 01:36:57,860
And this here is like two backwards deindents.
2161
01:36:57,860 --> 01:36:59,020
So it ends two blocks.
2162
01:36:59,020 --> 01:37:01,740
So two blocks are ended by where we place this.
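As a rough sketch of the nested if walked through here, with x equal to 42: the function name `nested_if` and the exact messages are assumptions for illustration. The last line sits at the outermost level, so that single placement ends both if blocks.

```python
def nested_if(x):
    """Return the messages produced for a given x."""
    out = []
    if x > 1:                           # outer if box
        out.append("More than one")
        if x < 100:                     # inner if box, indented twice
            out.append("Less than 100")
    out.append("All done")              # two deindents at once: ends both blocks
    return out


print(nested_if(42))
```

With x set to 0 the outer question is false, so the inner if is never even looked at and only the last line runs.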
2163
01:37:01,740 --> 01:37:03,860
We could move this in, or we could move this out.
2164
01:37:03,860 --> 01:37:05,460
We could have it all the way into here.
2165
01:37:05,460 --> 01:37:07,300
We could have it to here or here.
2166
01:37:07,300 --> 01:37:09,300
And where we put that line depends
2167
01:37:09,300 --> 01:37:13,500
on how the ends of these blocks are going to work out.
2168
01:37:13,500 --> 01:37:18,660
So that's one form, the one-branch if that we just saw,
2169
01:37:18,660 --> 01:37:21,140
but then you can also have what's called a two branch if.
2170
01:37:21,140 --> 01:37:23,140
And the basic idea of a two branch if
2171
01:37:23,140 --> 01:37:25,660
is that you're going to come in, you're going to ask a question,
2172
01:37:25,660 --> 01:37:27,900
and you're going to go one direction if it's yes,
2173
01:37:27,900 --> 01:37:29,460
and another direction if it's no.
2174
01:37:29,460 --> 01:37:30,820
We call this an if then else.
2175
01:37:30,820 --> 01:37:32,660
It's kind of like a fork in the road.
2176
01:37:32,660 --> 01:37:34,700
And the way to think about it is depending
2177
01:37:34,700 --> 01:37:35,980
on the output of this question, we're
2178
01:37:35,980 --> 01:37:37,540
going to pick one of these two.
2179
01:37:37,540 --> 01:37:40,020
But if we pick one, the other one's never going to happen.
2180
01:37:40,020 --> 01:37:41,420
So it's like an either or.
2181
01:37:41,420 --> 01:37:42,980
We're either going to go one way,
2182
01:37:42,980 --> 01:37:44,340
or we're going to go the other way.
2183
01:37:44,340 --> 01:37:45,940
But there is no path where we somehow
2184
01:37:45,940 --> 01:37:47,620
go through both of them.
2185
01:37:47,620 --> 01:37:49,820
That doesn't happen.
2186
01:37:49,820 --> 01:37:52,380
And the syntax that we use for this
2187
01:37:52,380 --> 01:37:55,060
is what we call the if then else.
2188
01:37:55,060 --> 01:37:59,580
And so the first part is normal if with an indent.
2189
01:37:59,580 --> 01:38:00,420
And then we deindent.
2190
01:38:00,420 --> 01:38:03,340
And then this is another reserved word else with a colon.
2191
01:38:03,340 --> 01:38:04,740
And then we reindent.
2192
01:38:04,740 --> 01:38:07,540
And so this really ends up being part of a whole block
2193
01:38:07,540 --> 01:38:08,220
here.
2194
01:38:08,220 --> 01:38:10,540
And the else is the part.
2195
01:38:10,540 --> 01:38:12,780
This is the part that runs if it's false.
2196
01:38:12,780 --> 01:38:14,580
And this is the part that runs if it's true.
2197
01:38:14,580 --> 01:38:17,900
The first branch of the if, the first indented block
2198
01:38:17,900 --> 01:38:19,380
is what runs if it's true.
2199
01:38:19,380 --> 01:38:23,660
And the second indented block is the one that runs if it's false.
2200
01:38:23,660 --> 01:38:24,500
And so here we go.
2201
01:38:24,500 --> 01:38:27,180
It's just if x is greater than 2, in this case it's yes.
2202
01:38:27,180 --> 01:38:28,380
We're going to print bigger.
2203
01:38:28,380 --> 01:38:29,820
And then we're going to be all done.
2204
01:38:29,820 --> 01:38:31,060
And so we do one.
2205
01:38:31,060 --> 01:38:32,220
And so this one did run.
2206
01:38:32,220 --> 01:38:33,740
And this one did not run.
2207
01:38:33,740 --> 01:38:35,460
So basically with an if then else,
2208
01:38:35,460 --> 01:38:37,660
one of the two branches is going to run.
2209
01:38:37,660 --> 01:38:41,180
But there's no case in which both branches run.
2210
01:38:41,180 --> 01:38:43,700
And again, you sort of draw these blocks
2211
01:38:43,700 --> 01:38:45,660
around these things mentally.
2212
01:38:45,660 --> 01:38:47,980
And in this one, you sort of take from the if,
2213
01:38:47,980 --> 01:38:50,580
and the else is really part of the block, up to,
2214
01:38:50,580 --> 01:38:52,060
but not including that print, which
2215
01:38:52,060 --> 01:38:57,900
is deindented back to the same level as the if statement.
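A small sketch of the two-branch if: the function name `two_way` and the "Smaller" message are assumptions for illustration, since only the "bigger" branch is spoken aloud here. Exactly one of the two indented blocks runs, and the final line is back at the level of the if.

```python
def two_way(x):
    """Return the messages produced for a given x."""
    out = []
    if x > 2:
        out.append("Bigger")            # runs when the question is true
    else:
        out.append("Smaller")           # runs when the question is false
    out.append("All done")              # deindented: not part of the if/else
    return out


print(two_way(4))
print(two_way(1))
```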
2216
01:38:57,900 --> 01:39:00,340
OK?
2217
01:39:00,340 --> 01:39:03,260
Python is actually one of the more elegant languages,
2218
01:39:03,260 --> 01:39:05,380
even though after a while this indenting,
2219
01:39:05,380 --> 01:39:09,660
and when you get too far in, it gets a little bit complex.
2220
01:39:09,660 --> 01:39:11,780
But this is a good way to visualize this with these
2221
01:39:11,780 --> 01:39:13,020
indents.
2222
01:39:13,020 --> 01:39:16,820
Coming up next, we're going to talk about some more complex
2223
01:39:16,820 --> 01:39:17,940
conditional structures.
2224
01:39:22,180 --> 01:39:23,140
So welcome back.
2225
01:39:23,140 --> 01:39:25,460
Let's talk a little bit more about some more complex
2226
01:39:25,460 --> 01:39:27,700
conditional statements that sort of build
2227
01:39:27,700 --> 01:39:30,140
on this concept of if and if then else.
2228
01:39:30,140 --> 01:39:31,660
The first thing we're going to look at
2229
01:39:31,660 --> 01:39:35,220
is the multi-way branch.
2230
01:39:35,220 --> 01:39:37,740
And so the idea is it's kind of like the if then else
2231
01:39:37,740 --> 01:39:39,380
where you're going to pick one of two,
2232
01:39:39,380 --> 01:39:41,500
but now we can pick one of three, or one of four,
2233
01:39:41,500 --> 01:39:43,380
or one of five.
2234
01:39:43,380 --> 01:39:46,060
And it introduces a new concept called the elif.
2235
01:39:46,060 --> 01:39:49,980
The elif is another reserved word in Python.
2236
01:39:49,980 --> 01:39:51,900
And the way it works is it's probably
2237
01:39:51,900 --> 01:39:54,820
best to look at this here, where it checks the first one,
2238
01:39:54,820 --> 01:39:58,100
and if it's a true, then it runs that, and then it's done.
2239
01:39:58,100 --> 01:39:59,260
It doesn't check them all.
2240
01:39:59,260 --> 01:40:02,900
It's not like it sees that there are two logical conditions.
2241
01:40:02,900 --> 01:40:04,620
It actually checks them, the first one,
2242
01:40:04,620 --> 01:40:08,940
and how you order these matters, as we'll see in a bit.
2243
01:40:08,940 --> 01:40:10,540
And so if the first one is true, it
2244
01:40:10,540 --> 01:40:15,900
runs. If the first one is false, and the second one is true,
2245
01:40:15,900 --> 01:40:17,740
it runs this one, and it's done.
2246
01:40:17,740 --> 01:40:21,700
And if neither of them are true, it falls through,
2247
01:40:21,700 --> 01:40:24,260
and there's an else clause that is otherwise,
2248
01:40:24,260 --> 01:40:25,420
and it runs that.
2249
01:40:25,420 --> 01:40:29,820
So basically, it's going to run one and then skip the other two,
2250
01:40:29,820 --> 01:40:35,320
or it is going to skip one, skip two, and then run this one.
2251
01:40:35,320 --> 01:40:37,940
But it only runs, in this case, one of them.
2252
01:40:37,940 --> 01:40:41,340
But the important thing is it checks these questions in order.
2253
01:40:41,340 --> 01:40:43,260
And it doesn't check the second question
2254
01:40:43,260 --> 01:40:45,100
until it finds that the first.
2255
01:40:45,100 --> 01:40:47,740
It doesn't check the second question
2256
01:40:47,740 --> 01:40:50,500
until it knows the first question is false.
2257
01:40:50,500 --> 01:40:52,540
So if the first question is true, you're done.
2258
01:40:52,540 --> 01:40:54,140
You're done, and you're done with this.
2259
01:40:54,140 --> 01:40:56,620
You're done with the whole block at that point.
2260
01:40:56,620 --> 01:41:00,340
So only one of these three is going to execute in that block.
2261
01:41:04,700 --> 01:41:07,380
So here's sort of some examples of this.
2262
01:41:07,380 --> 01:41:10,460
If we, for example, have x equals 0,
2263
01:41:10,460 --> 01:41:11,620
it's going to come down here.
2264
01:41:11,620 --> 01:41:12,580
x is less than 2.
2265
01:41:12,580 --> 01:41:13,580
That's true.
2266
01:41:13,580 --> 01:41:16,020
So it runs this code, and then it skips, skips, skips,
2267
01:41:16,020 --> 01:41:17,020
down to that.
2268
01:41:17,020 --> 01:41:19,180
And so it's like this, runs that code,
2269
01:41:19,180 --> 01:41:21,980
and then skips to the end.
2270
01:41:21,980 --> 01:41:27,460
On the other hand, if it's 5, then this is false,
2271
01:41:27,460 --> 01:41:29,380
and it skips that, and it checks this.
2272
01:41:29,380 --> 01:41:30,660
This is true.
2273
01:41:30,660 --> 01:41:33,580
It runs this code, and then it's done, skips to the end.
2274
01:41:33,580 --> 01:41:39,100
It was like false, true, run, end.
2275
01:41:39,100 --> 01:41:44,100
And then if x is like 20, for example, it runs, it runs,
2276
01:41:44,100 --> 01:41:48,180
false, false, run the else clause, and you're done.
2277
01:41:48,180 --> 01:41:52,620
So skip, skip, else, run that code, and you're done.
2278
01:41:52,620 --> 01:41:54,940
So in this case, we ran that, and we didn't run that,
2279
01:41:54,940 --> 01:41:56,300
and we didn't run that.
2280
01:41:56,300 --> 01:41:58,700
Again, one of them is going to run.
2281
01:41:58,700 --> 01:41:59,940
They're checked in order.
2282
01:41:59,940 --> 01:42:03,420
These questions are checked in order, not out of order.
2283
01:42:03,420 --> 01:42:04,500
It doesn't look ahead.
2284
01:42:04,500 --> 01:42:07,220
It just checks in the order that you wrote it.
2285
01:42:07,220 --> 01:42:09,420
You're the one that wrote that order.
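Here is a sketch of the three cases just traced (x as 0, 5, and 20). The cutoffs 2 and 10, the labels, and the function name `size_label` are assumptions chosen to match that walk-through, not necessarily what is on the slide.

```python
def size_label(x):
    """Pick exactly one of three labels; questions are asked top to bottom."""
    if x < 2:            # asked first; if true, nothing below is checked
        return "small"
    elif x < 10:         # only asked when the first question was false
        return "medium"
    else:                # catch-all when every question above was false
        return "large"


for value in (0, 5, 20):
    print(value, size_label(value))
```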
2286
01:42:09,420 --> 01:42:11,980
And so there's a couple of variations on this multi-way.
2287
01:42:14,700 --> 01:42:18,060
You can have no else.
2288
01:42:18,060 --> 01:42:20,260
You can have no else, as in this case.
2289
01:42:20,260 --> 01:42:23,860
And this just means that it might not run any of them.
2290
01:42:23,860 --> 01:42:27,260
In this case, x is 5, so it's not less than 2,
2291
01:42:27,260 --> 01:42:28,820
but then it runs this one.
2292
01:42:28,820 --> 01:42:34,340
But if x was like 50, for example, if x was 50,
2293
01:42:34,340 --> 01:42:36,540
then this would be false, then it would skip,
2294
01:42:36,540 --> 01:42:38,540
and this would still be false, and it would skip,
2295
01:42:38,540 --> 01:42:40,020
and neither of these two would run.
2296
01:42:40,020 --> 01:42:41,020
So if you don't have an else, you're
2297
01:42:41,020 --> 01:42:43,020
not guaranteed that one of them is going to run,
2298
01:42:43,020 --> 01:42:44,700
because else is like the catch-all.
2299
01:42:44,700 --> 01:42:46,980
If the other ones are all false, then the else
2300
01:42:46,980 --> 01:42:48,980
is the one that runs.
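A sketch of the no-else variation, assuming cutoffs of 2 and 10 (the second cutoff is a guess that fits the 5 and 50 examples) and a made-up function name `no_else`. Without the catch-all, it is possible for none of the branches to run.

```python
def no_else(x):
    """Return the messages produced; there is no else branch."""
    out = []
    if x < 2:
        out.append("small")
    elif x < 10:
        out.append("medium")
    # no else: when both questions are false, nothing is added here
    out.append("all done")
    return out


print(no_else(5))
print(no_else(50))
```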
2301
01:42:48,980 --> 01:42:52,820
Similarly, you can have many elifs,
2302
01:42:52,820 --> 01:42:55,260
but this is where it's really important for you
2303
01:42:55,260 --> 01:42:57,740
to make sure you know what order they're being taken in.
2304
01:42:57,740 --> 01:43:02,340
So if this is true, it runs.
2305
01:43:02,340 --> 01:43:04,540
It goes all the way to the bottom.
2306
01:43:04,540 --> 01:43:11,260
If it's false, false, false, true, it runs this one,
2307
01:43:11,260 --> 01:43:12,660
and it's done.
2308
01:43:12,660 --> 01:43:17,180
If, on the other hand, it looks at it as false,
2309
01:43:17,180 --> 01:43:19,140
go back, go back.
2310
01:43:19,140 --> 01:43:23,020
If it runs false, false, false, false, they're all false,
2311
01:43:23,020 --> 01:43:24,260
then it runs the else.
2312
01:43:24,260 --> 01:43:25,380
This one has an else.
2313
01:43:25,380 --> 01:43:26,580
This one didn't have an else.
2314
01:43:26,580 --> 01:43:27,660
They don't have to have them.
2315
01:43:27,660 --> 01:43:30,940
The key is you can have more than one of these elifs.
2316
01:43:30,940 --> 01:43:32,620
So I got a couple little things.
2317
01:43:32,620 --> 01:43:37,500
I'll let you pause right now and look at the question is,
2318
01:43:37,500 --> 01:43:43,660
are there looking at the three lines or four lines of code,
2319
01:43:43,660 --> 01:43:45,660
x equals something.
2320
01:43:45,660 --> 01:43:48,460
Are there lines of code that will never execute,
2321
01:43:48,460 --> 01:43:50,340
regardless of the value for x?
2322
01:43:50,340 --> 01:43:52,460
And I'll let you pause and think about it,
2323
01:43:52,460 --> 01:43:54,020
and then I'll explain it to you.
2324
01:43:54,020 --> 01:43:56,340
OK, hopefully you paused and thought about it
2325
01:43:56,340 --> 01:44:00,140
as long as you liked, but so let me now explain it to you.
2326
01:44:00,140 --> 01:44:03,780
So we come in here, and if x is less than 2,
2327
01:44:03,780 --> 01:44:05,020
it's going to run this first thing.
2328
01:44:05,020 --> 01:44:07,900
And if x is greater than or equal to 2, it's going to run this.
2329
01:44:07,900 --> 01:44:10,300
And if neither of those are true, then it's going to run this.
2330
01:44:10,300 --> 01:44:13,260
Well, the weird thing is, all numbers
2331
01:44:13,260 --> 01:44:15,900
are either less than 2 or greater than or equal to 2.
2332
01:44:15,900 --> 01:44:17,940
I carefully constructed this to the point
2333
01:44:17,940 --> 01:44:20,700
where it would never run this line of code.
2334
01:44:20,700 --> 01:44:24,140
It is either going to run this one or run that one,
2335
01:44:24,140 --> 01:44:25,700
but it's not going to ever run this one.
2336
01:44:25,700 --> 01:44:27,900
So that was kind of like a weird dysfunctional one
2337
01:44:27,900 --> 01:44:29,140
that I constructed.
2338
01:44:29,140 --> 01:44:31,700
This other one is a little different.
2339
01:44:31,700 --> 01:44:33,700
If x is less than 2, we do this.
2340
01:44:33,700 --> 01:44:35,660
If x is less than 20, we do that.
2341
01:44:35,660 --> 01:44:37,340
If x is less than 10, we do that.
2342
01:44:37,340 --> 01:44:39,060
And if none of those are true, we do that.
2343
01:44:39,060 --> 01:44:41,700
Well, the problem here is between these two lines.
2344
01:44:41,700 --> 01:44:44,180
The problem is, if something's less than 10, like 6,
2345
01:44:44,180 --> 01:44:47,380
for example, it's also less than 20.
2346
01:44:47,380 --> 01:44:49,380
So even though x is less than 20,
2347
01:44:49,380 --> 01:44:52,020
so even though there might be values
2348
01:44:52,020 --> 01:44:54,300
for which this is true, those also
2349
01:44:54,300 --> 01:44:55,260
are going to have this true.
2350
01:44:55,260 --> 01:44:58,380
So for something like 6, it's going to run here.
2351
01:44:58,380 --> 01:45:00,420
And it's not even going to look at this.
2352
01:45:00,420 --> 01:45:01,140
That's the point.
2353
01:45:01,140 --> 01:45:02,620
It doesn't even look at this.
2354
01:45:02,620 --> 01:45:05,540
And so that's, I mean, I could have made this more sensible
2355
01:45:05,540 --> 01:45:08,060
if I'd have moved this little block of code up to there.
2356
01:45:08,060 --> 01:45:12,620
So this is where the order in which you choose your questions,
2357
01:45:12,620 --> 01:45:14,780
the way you put these elifs together,
2358
01:45:14,780 --> 01:45:17,180
matters because it doesn't look at all of them.
2359
01:45:17,180 --> 01:45:19,100
It only looks as long as it can.
2360
01:45:19,100 --> 01:45:21,260
As long as it sees falses, then it
2361
01:45:21,260 --> 01:45:22,500
keeps on going to the next one.
2362
01:45:22,500 --> 01:45:26,580
But as soon as it doesn't see a false, it doesn't continue.
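The two puzzles can be sketched like this. The function names and message strings are made up for illustration; the comparisons follow what is described above. The third function is one possible reordering, not necessarily the fix the instructor had in mind.

```python
def puzzle_one(x):
    if x < 2:
        return "below 2"
    elif x >= 2:
        return "two or more"
    else:
        return "something else"   # never reached for any integer x


def puzzle_two(x):
    if x < 2:
        return "below 2"
    elif x < 20:
        return "below 20"
    elif x < 10:
        return "below 10"         # never reached: anything below 10 already matched x < 20
    else:
        return "something else"


def puzzle_two_reordered(x):
    if x < 2:
        return "below 2"
    elif x < 10:                  # narrower question asked first
        return "below 10"
    elif x < 20:
        return "below 20"
    else:
        return "something else"


print(puzzle_two(6))
print(puzzle_two_reordered(6))
```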
2363
01:45:26,580 --> 01:45:28,900
So the last conditional structure
2364
01:45:28,900 --> 01:45:31,140
we'll talk about is the try and except structure.
2365
01:45:31,140 --> 01:45:35,700
If you know any other languages like C++ or Java or JavaScript,
2366
01:45:35,700 --> 01:45:38,620
you're like, whoa, that's kind of an advanced concept.
2367
01:45:38,620 --> 01:45:42,500
But it turns out in Python, because of Python's propensity
2368
01:45:42,500 --> 01:45:47,380
to throw tracebacks in situations
2369
01:45:47,380 --> 01:45:49,300
where you kind of would like to recover,
2370
01:45:49,300 --> 01:45:51,860
it turns out you kind of have to use it a little more
2371
01:45:51,860 --> 01:45:55,140
and a little earlier in your programming skill.
2372
01:45:55,140 --> 01:45:58,140
So the problem is, what if there is a line of code
2373
01:45:58,140 --> 01:46:00,540
and you absolutely know it's going to cause a traceback.
2374
01:46:00,540 --> 01:46:02,700
It's going to blow up.
2375
01:46:02,700 --> 01:46:04,660
But you don't want to blow up.
2376
01:46:04,660 --> 01:46:06,660
I mean, I don't want to have code blow up.
2377
01:46:06,660 --> 01:46:08,120
If you're using my autograder and you
2378
01:46:08,120 --> 01:46:09,620
see a traceback in my autograder,
2379
01:46:09,620 --> 01:46:11,580
that's kind of like I consider that a failure.
2380
01:46:11,580 --> 01:46:14,500
I could put an error like, hey, you entered blank data
2381
01:46:14,500 --> 01:46:16,220
or you didn't enter a number.
2382
01:46:16,220 --> 01:46:18,140
But a traceback, that just seems
2383
01:46:18,140 --> 01:46:19,580
like I'm too lazy as a programmer.
2384
01:46:19,580 --> 01:46:21,380
So we as programmers are supposed
2385
01:46:21,380 --> 01:46:24,080
to anticipate parts of our code that
2386
01:46:24,080 --> 01:46:26,580
are going to blow up potentially based on perhaps the user's
2387
01:46:26,580 --> 01:46:28,780
input and then do something about it.
2388
01:46:28,780 --> 01:46:31,740
And that's what the try and except are for.
2389
01:46:31,740 --> 01:46:33,580
You take this little dangerous piece of code
2390
01:46:33,580 --> 01:46:35,780
that might break and might blow up,
2391
01:46:35,780 --> 01:46:39,560
and you surround it with a try and says, this might blow up.
2392
01:46:39,560 --> 01:46:43,100
And if it fails, run this code down here.
2393
01:46:43,100 --> 01:46:44,300
So that's the try.
2394
01:46:44,300 --> 01:46:46,140
And if you get an exception, the except
2395
01:46:46,140 --> 01:46:48,380
is kind of like if you get an exception.
2396
01:46:48,380 --> 01:46:52,100
And the problem is, is if you are running code,
2397
01:46:52,100 --> 01:46:54,580
here's a little bit of code, we put hello bob in
2398
01:46:54,580 --> 01:46:56,260
and we convert it to an integer, and we
2399
01:46:56,260 --> 01:47:01,100
know from past experience that this blows up.
2400
01:47:01,100 --> 01:47:03,220
You can't take hello bob and convert it to an integer.
2401
01:47:03,220 --> 01:47:04,500
It's just going to blow up.
2402
01:47:04,500 --> 01:47:06,860
The problem is, and here we are.
2403
01:47:06,860 --> 01:47:08,340
It says, oh, you blew up on line two.
2404
01:47:08,340 --> 01:47:09,260
That's great.
2405
01:47:09,260 --> 01:47:12,380
And I'm not very happy with hello bob and whatever.
2406
01:47:12,380 --> 01:47:17,660
But the important thing is your program stops.
2407
01:47:17,660 --> 01:47:22,540
These other lines, they don't exist.
2408
01:47:22,540 --> 01:47:24,780
It doesn't go any further.
2409
01:47:24,780 --> 01:47:27,740
Remember, the traceback is Python saying, I'm really confused,
2410
01:47:27,740 --> 01:47:29,620
and I don't know what to do next.
2411
01:47:29,620 --> 01:47:32,460
So Python is just going to be conservative and stop.
2412
01:47:32,460 --> 01:47:34,900
So Python stops, and your program stops.
2413
01:47:34,900 --> 01:47:37,180
No matter how much error checking you put down here,
2414
01:47:37,180 --> 01:47:38,860
it doesn't matter because it's gone.
2415
01:47:38,860 --> 01:47:40,060
It's all gone.
2416
01:47:40,060 --> 01:47:42,380
And like I said, we take this kind of personally
2417
01:47:42,380 --> 01:47:44,580
because the code that you write is
2418
01:47:44,580 --> 01:47:49,180
like you being put into the computer giving it instructions.
2419
01:47:49,180 --> 01:47:52,620
And if the code blows up, well, that sort of wipes you out.
2420
01:47:52,620 --> 01:47:54,100
You're not in the game anymore.
2421
01:47:54,100 --> 01:47:56,140
You're not able to do anything.
2422
01:47:56,140 --> 01:47:58,460
So we want to be able to, especially
2423
01:47:58,460 --> 01:48:00,500
in these situations where we can anticipate
2424
01:48:00,500 --> 01:48:03,100
that an error that might happen in the normal course
2425
01:48:03,100 --> 01:48:05,980
of your program's execution might be something
2426
01:48:05,980 --> 01:48:07,940
that you want to compensate for.
2427
01:48:07,940 --> 01:48:09,980
And that's what the try and except does.
2428
01:48:09,980 --> 01:48:14,140
So here's a bit of code for the try and except.
2429
01:48:14,140 --> 01:48:16,940
And we just have two little bits of straight line code.
2430
01:48:16,940 --> 01:48:19,020
And so we put a string in here that's hello bob,
2431
01:48:19,020 --> 01:48:21,020
and then we're going to convert it to an integer.
2432
01:48:21,020 --> 01:48:22,060
This is the dangerous code.
2433
01:48:22,060 --> 01:48:23,860
This code, in this case, with hello bob,
2434
01:48:23,860 --> 01:48:25,780
is going to cause a traceback.
2435
01:48:25,780 --> 01:48:28,820
And so we say try, and then we indent the dangerous code.
2436
01:48:28,820 --> 01:48:32,100
And then we add this little except bit.
2437
01:48:32,100 --> 01:48:33,940
If it works, the except is ignored.
2438
01:48:33,940 --> 01:48:35,900
If this blows up, it runs the except.
2439
01:48:35,900 --> 01:48:37,480
So in this code, it's going to come in.
2440
01:48:37,480 --> 01:48:39,820
It's going to try this.
2441
01:48:39,820 --> 01:48:41,260
This is going to blow up.
2442
01:48:41,260 --> 01:48:42,620
But instead of giving a traceback,
2443
01:48:42,620 --> 01:48:44,700
it's going to say, oh, I've got an available except.
2444
01:48:44,700 --> 01:48:46,540
I'm going to run this except code,
2445
01:48:46,540 --> 01:48:47,980
and then I'm going to continue on.
2446
01:48:47,980 --> 01:48:49,940
And so that prints out first negative 1.
2447
01:48:49,940 --> 01:48:52,480
So because we set this variable istr to negative 1,
2448
01:48:52,480 --> 01:48:55,860
like a little flag telling us that something went wrong.
2449
01:48:55,860 --> 01:48:57,180
And then we keep on going.
2450
01:48:57,180 --> 01:49:01,020
And now we have put in 1, 2, 3, the digits 1, 2, 3.
2451
01:49:01,020 --> 01:49:02,260
The digits 1, 2, 3.
2452
01:49:02,260 --> 01:49:03,900
And now it's going to work, but we still
2453
01:49:03,900 --> 01:49:05,020
have it in a try block.
2454
01:49:05,020 --> 01:49:06,380
And then this one works.
2455
01:49:06,380 --> 01:49:09,420
It does not blow up, and then ignores the except block.
2456
01:49:09,420 --> 01:49:11,740
So the except block is only triggered
2457
01:49:11,740 --> 01:49:13,820
when something goes wrong in the code.
2458
01:49:13,820 --> 01:49:16,140
It is ignored if something doesn't go wrong.
2459
01:49:16,140 --> 01:49:18,020
So it's like you bought an insurance policy
2460
01:49:18,020 --> 01:49:19,420
on this line of code.
2461
01:49:19,420 --> 01:49:22,180
And when things go wrong, your except block
2462
01:49:22,180 --> 01:49:24,620
springs into action and does whatever
2463
01:49:24,620 --> 01:49:28,060
it is that you want it to do in the case of an error.
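A minimal sketch of that insurance policy. The function name `to_int_or_flag` is made up for illustration, and this version catches `ValueError` specifically, which is the exception `int()` raises for text like "Hello Bob"; that is a slightly narrower choice than catching everything.

```python
def to_int_or_flag(text):
    """Convert text to an integer, or return -1 as a flag if that fails."""
    try:
        value = int(text)       # the dangerous line: may raise ValueError
    except ValueError:
        value = -1              # only runs when the conversion blew up
    return value


print("First", to_int_or_flag("Hello Bob"))   # First -1
print("Second", to_int_or_flag("123"))        # Second 123
```

One design caveat: using -1 as the flag means a real input of "-1" looks the same as a failure.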
2464
01:49:28,060 --> 01:49:30,180
So that's a pretty useful thing.
2465
01:49:30,180 --> 01:49:31,580
You got to be a little bit careful
2466
01:49:31,580 --> 01:49:34,300
that you don't overuse it, because if you put more
2467
01:49:34,300 --> 01:49:36,820
than one line inside the try part,
2468
01:49:36,820 --> 01:49:39,340
and one of the lines blows up, it
2469
01:49:39,340 --> 01:49:41,260
doesn't come back to the try block.
2470
01:49:41,260 --> 01:49:44,860
And so in this one here, we have kind of a simple, silly one
2471
01:49:44,860 --> 01:49:47,540
where we set the string, we're worried about some stuff.
2472
01:49:47,540 --> 01:49:49,060
Well, the print statement's never going to blow up,
2473
01:49:49,060 --> 01:49:51,820
so it's a bad idea to put it in try/except anyways.
2474
01:49:51,820 --> 01:49:55,060
Then we do this conversion, and that's the dangerous part.
2475
01:49:55,060 --> 01:49:58,140
And in this one, it's going to blow up.
2476
01:49:58,140 --> 01:50:00,700
And so then it's going to go to the except block,
2477
01:50:00,700 --> 01:50:02,980
and then run the except block, and then continue.
2478
01:50:02,980 --> 01:50:05,380
What it does not do, what it doesn't do,
2479
01:50:05,380 --> 01:50:07,740
is somehow go back and finish this.
2480
01:50:07,740 --> 01:50:09,740
So these lines are gone.
2481
01:50:09,740 --> 01:50:13,220
So if you look at it like this, this works, the try starts.
2482
01:50:13,220 --> 01:50:16,940
Hello, this blows up, it goes to the except,
2483
01:50:16,940 --> 01:50:19,060
it runs the except, and it continues on.
2484
01:50:19,060 --> 01:50:21,780
Never runs that code.
2485
01:50:21,780 --> 01:50:26,140
So it's not like you took out an insurance on the whole block.
2486
01:50:26,140 --> 01:50:28,140
Any of those lines can blow up in the block,
2487
01:50:28,140 --> 01:50:29,660
but whichever line blows up, that
2488
01:50:29,660 --> 01:50:34,500
is the last line that's executing in that block.
2489
01:50:34,500 --> 01:50:37,740
So you tend to want, in this particular example,
2490
01:50:37,740 --> 01:50:40,300
you would probably, the print statement would go out there,
2491
01:50:40,300 --> 01:50:42,140
and this print statement would come down here,
2492
01:50:42,140 --> 01:50:44,540
and you would only put in your try block
2493
01:50:44,540 --> 01:50:47,060
the single line of code that you think might blow up,
2494
01:50:47,060 --> 01:50:48,780
because you kind of know print statements
2495
01:50:48,780 --> 01:50:50,180
aren't going to blow up.
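To see the difference, here is a sketch with two made-up functions, `wide_try` and `narrow_try`, that record each step instead of printing it. The step names and the use of `ValueError` are assumptions for illustration.

```python
def wide_try(text):
    """Too much inside the try: steps after the failing line are lost."""
    steps = []
    try:
        steps.append("hello")
        number = int(text)          # if this raises, the rest of the try is skipped
        steps.append("there")
    except ValueError:
        number = -1
    steps.append(f"done {number}")
    return steps


def narrow_try(text):
    """Only the risky conversion sits inside the try."""
    steps = ["hello"]
    try:
        number = int(text)
    except ValueError:
        number = -1
    steps.append("there")           # always runs, whether or not int() failed
    steps.append(f"done {number}")
    return steps


print(wide_try("Bob"))
print(narrow_try("Bob"))
```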
2496
01:50:50,180 --> 01:50:53,660
So this is an example that's a more common real world
2497
01:50:53,660 --> 01:50:57,660
example, where the user is going to type some data,
2498
01:50:57,660 --> 01:50:59,740
and it's users that get us in trouble.
2499
01:50:59,740 --> 01:51:03,020
So our program starts by asking the user enter a number,
2500
01:51:03,020 --> 01:51:05,100
and we know that this could be dangerous.
2501
01:51:05,100 --> 01:51:09,540
So we're going to put the conversion from string
2502
01:51:09,540 --> 01:51:11,580
to integer in a try block, and we're
2503
01:51:11,580 --> 01:51:14,220
going to set negative 1 if that's a failure.
2504
01:51:14,220 --> 01:51:16,700
And then if it's greater than 0, we'll say nice work,
2505
01:51:16,700 --> 01:51:18,780
and if it's less than 0, well, not a number.
2506
01:51:18,780 --> 01:51:20,820
So first time we run this program,
2507
01:51:20,820 --> 01:51:22,860
out comes enter a number.
2508
01:51:22,860 --> 01:51:25,500
We type in 42, which is a string.
2509
01:51:25,500 --> 01:51:29,100
That 42 goes back into rawstr, runs in here.
2510
01:51:29,100 --> 01:51:29,940
This runs.
2511
01:51:29,940 --> 01:51:30,740
It's fine.
2512
01:51:30,740 --> 01:51:33,940
That becomes a 42 number, so we skip the except block,
2513
01:51:33,940 --> 01:51:35,180
and ival is greater than 0.
2514
01:51:35,180 --> 01:51:38,820
We print out nice work, and we skip the else.
2515
01:51:38,820 --> 01:51:41,420
So it says nice work.
2516
01:51:41,420 --> 01:51:46,180
On the other hand, if we run it again this time,
2517
01:51:46,180 --> 01:51:48,900
the input says enter a number, and we're silly.
2518
01:51:48,900 --> 01:51:53,620
We enter 42, but in words: forty, F-O-U-R-T-Y.
2519
01:51:53,620 --> 01:51:55,860
So that's a string, and that goes into rawstr.
2520
01:51:55,860 --> 01:51:57,220
And then the execution continues.
2521
01:51:57,220 --> 01:52:01,260
We run in here, and now this is going to blow up.
2522
01:52:01,260 --> 01:52:02,220
That's going to blow up.
2523
01:52:02,220 --> 01:52:04,980
Normally, we would see a traceback right there.
2524
01:52:04,980 --> 01:52:06,140
There would be a traceback.
2525
01:52:06,140 --> 01:52:08,420
But we're not going to, because we put this calculation
2526
01:52:08,420 --> 01:52:09,860
in a try and except block.
2527
01:52:09,860 --> 01:52:11,700
It's going to immediately run the except block,
2528
01:52:11,700 --> 01:52:14,580
set ival to negative 1, continue on with the program,
2529
01:52:14,580 --> 01:52:16,980
see, you have not blown up at this point.
2530
01:52:16,980 --> 01:52:19,220
And if ival is greater than 0, well, it's negative 1,
2531
01:52:19,220 --> 01:52:21,620
so we're going to hit the else clause and print out not
2532
01:52:21,620 --> 01:52:22,300
a number.
2533
01:52:22,300 --> 01:52:24,580
So we've done error detection.
2534
01:52:24,580 --> 01:52:27,580
The user set something that caused a line of our code
2535
01:52:27,580 --> 01:52:29,580
to kind of blow up, but we put that line
2536
01:52:29,580 --> 01:52:31,980
in a try and except block, and so we caught it.
2537
01:52:31,980 --> 01:52:36,100
And so we dealt with that fact.
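A sketch of that pattern, written as a function so it can be tried without typing anything in. The function name `check_number` is made up, the variable names `rawstr` and `ival` are assumed from how they sound in the walk-through, and `ValueError` is used as the specific exception that `int()` raises.

```python
def check_number(rawstr):
    """Return the message the program would print for the text the user typed."""
    try:
        ival = int(rawstr)      # the only risky line goes in the try
    except ValueError:
        ival = -1               # flag value: the text was not an integer
    if ival > 0:
        return "Nice work"
    else:
        return "Not a number"


# In a real program the text would come from the user, for example:
#     print(check_number(input("Enter a number:")))
print(check_number("42"))          # Nice work
print(check_number("forty-two"))   # Not a number
```

Note that with this flag-based check, zero and negative numbers also end up reported as "Not a number".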
2538
01:52:36,100 --> 01:52:39,140
So in summary in this, we talked about if statements.
2539
01:52:39,140 --> 01:52:40,340
We talked about else.
2540
01:52:40,340 --> 01:52:44,180
We talked about try and except, how important indentation
2541
01:52:44,180 --> 01:52:48,100
is to mark where blocks begin and end,
2542
01:52:48,100 --> 01:52:50,380
and elif, and try/except.
2543
01:52:50,380 --> 01:52:54,420
So up next, we're going to talk about loops and iteration.
2544
01:52:58,940 --> 01:53:01,780
Hello, and welcome to chapter four, functions.
2545
01:53:01,780 --> 01:53:04,340
This is the fourth of our basic patterns.
2546
01:53:04,340 --> 01:53:05,700
We'll get to iterations next.
2547
01:53:05,700 --> 01:53:07,740
Functions is the store and reuse.
2548
01:53:07,740 --> 01:53:10,980
One of the things in programming is that we never
2549
01:53:10,980 --> 01:53:12,300
like to repeat ourselves.
2550
01:53:12,300 --> 01:53:14,820
We don't like to, if we have four or five lines of code,
2551
01:53:14,820 --> 01:53:16,780
and we're going to do the same thing later,
2552
01:53:16,780 --> 01:53:19,740
we don't like to put the same four lines of code in,
2553
01:53:19,740 --> 01:53:24,380
and it even has to do with reliability.
2554
01:53:24,380 --> 01:53:26,580
If you find something wrong with those four lines of code
2555
01:53:26,580 --> 01:53:31,420
and you got them 12 different places in your program,
2556
01:53:31,420 --> 01:53:33,260
then you got to find all 12 places and fix them.
2557
01:53:33,260 --> 01:53:34,740
So we like to collect those in one place
2558
01:53:34,740 --> 01:53:36,700
and then call them and reuse them,
2559
01:53:36,700 --> 01:53:38,700
and that's the idea of store and reuse.
2560
01:53:39,580 --> 01:53:44,140
So this is how functions work inside of Python.
2561
01:53:44,140 --> 01:53:47,100
The first thing we notice is there is a new keyword def
2562
01:53:47,100 --> 01:53:49,500
that stands for define function,
2563
01:53:49,500 --> 01:53:51,980
and the def is like an if statement
2564
01:53:51,980 --> 01:53:55,940
or we'll see fors and whiles that they end in a colon,
2565
01:53:55,940 --> 01:53:57,500
and then they have an indented block
2566
01:53:57,500 --> 01:53:59,300
and then the indented block deindents,
2567
01:53:59,300 --> 01:54:01,500
and that's the end of the function.
2568
01:54:01,500 --> 01:54:05,500
And so there are two statements that make up this function.
2569
01:54:06,500 --> 01:54:09,740
The key thing that you have to understand and get used to
2570
01:54:09,740 --> 01:54:13,620
is this def part is actually not running any code whatsoever.
2571
01:54:13,620 --> 01:54:15,380
It's actually remembering the code,
2572
01:54:15,380 --> 01:54:17,220
and that's what I call the store phase.
2573
01:54:17,220 --> 01:54:22,220
The def creates a bit of code and records it like a macro,
2574
01:54:22,580 --> 01:54:24,980
although it's much more complex than a macro,
2575
01:54:24,980 --> 01:54:26,620
and it names it whatever you chose.
2576
01:54:26,620 --> 01:54:27,580
You gave it a name.
2577
01:54:27,580 --> 01:54:29,340
We named this one thing.
2578
01:54:29,340 --> 01:54:33,680
And so it has a side effect of Python reading
2579
01:54:33,680 --> 01:54:35,980
or parsing these three lines.
2580
01:54:35,980 --> 01:54:38,820
It doesn't do anything, but it remembers.
2581
01:54:38,820 --> 01:54:42,340
These two lines are what you would like to run
2582
01:54:42,340 --> 01:54:44,300
when you invoke thing.
2583
01:54:44,300 --> 01:54:46,300
So this is the definition of a function,
2584
01:54:46,300 --> 01:54:48,700
and this is the invoking of the function.
2585
01:54:48,700 --> 01:54:52,580
But so this doesn't do anything.
2586
01:54:52,580 --> 01:54:55,380
So there's no output here from that stuff right there.
2587
01:54:55,380 --> 01:54:57,780
But then what happens is you invoke it.
2588
01:54:57,780 --> 01:55:00,280
And this thing looks like it's part of Python,
2589
01:55:00,280 --> 01:55:02,340
but you in effect extended Python
2590
01:55:02,340 --> 01:55:04,140
with your def statement.
2591
01:55:04,140 --> 01:55:08,020
And so when it sees thing, it goes up and runs your code.
2592
01:55:08,020 --> 01:55:10,180
And so out comes hello fun,
2593
01:55:10,180 --> 01:55:14,500
and then it comes back and goes to the next line.
2594
01:55:14,500 --> 01:55:16,900
Does print, so print comes out.
2595
01:55:16,900 --> 01:55:18,060
And then it goes back and like, oh,
2596
01:55:18,060 --> 01:55:19,900
this is the reuse part, but we get to reuse it.
2597
01:55:19,900 --> 01:55:21,680
We define it once and we use it twice.
2598
01:55:21,680 --> 01:55:23,220
Then it runs this code again,
2599
01:55:23,220 --> 01:55:24,920
and it goes to the next line and it's all done.
2600
01:55:24,920 --> 01:55:27,420
So this little bit came out twice.
2601
01:55:27,420 --> 01:55:28,780
And of course this is really simple
2602
01:55:28,780 --> 01:55:30,860
so that I can fit it on a page.
2603
01:55:30,860 --> 01:55:33,780
But you get the idea that I don't want to repeat.
2604
01:55:33,780 --> 01:55:37,140
This might be 15 to 100 lines of code,
2605
01:55:37,140 --> 01:55:39,420
and I don't want to type those over and over again.
2606
01:55:39,420 --> 01:55:44,420
So I say, hey, store these in a name that I choose,
2607
01:55:44,520 --> 01:55:46,660
and then when I invoke them,
2608
01:55:46,660 --> 01:55:50,060
bring them back and then run them again, okay?
2609
01:55:50,060 --> 01:55:52,460
So that's the basic idea.
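A minimal sketch of the store-and-reuse idea just described. The function name thing and the Hello/Fun output come from the lecture; the "Zip" printed between the two calls is an assumption standing in for whatever the slide prints there.

```python
def thing():
    # Store phase: these two lines are remembered here, not run.
    print("Hello")
    print("Fun")

thing()       # first call: prints Hello, then Fun
print("Zip")  # an ordinary print between the two calls
thing()       # second call: the same stored code runs again
```

The def itself produces no output; only the two calls do.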
2610
01:55:52,460 --> 01:55:54,140
We actually have already been using functions
2611
01:55:54,140 --> 01:55:54,980
from the beginning.
2612
01:55:54,980 --> 01:55:56,660
The print is a function, right?
2613
01:55:56,660 --> 01:55:57,860
Print is a function.
2614
01:55:57,860 --> 01:56:01,140
Every time we see print, P-R-I-N-T,
2615
01:56:01,140 --> 01:56:03,540
parentheses, and then we have some stuff in here,
2616
01:56:03,540 --> 01:56:05,300
we are calling the print function.
2617
01:56:05,300 --> 01:56:08,780
This is the syntax with two little parentheses,
2618
01:56:08,780 --> 01:56:10,500
is the syntax for functions.
2619
01:56:11,820 --> 01:56:14,940
And so input's a function, type is a function,
2620
01:56:14,940 --> 01:56:17,340
float's a function, int's a function.
2621
01:56:17,340 --> 01:56:19,640
All these things are built-in functions
2622
01:56:19,640 --> 01:56:24,640
that come with Python at the moment that we started.
2623
01:56:24,660 --> 01:56:27,360
I mean, we installed Python and these came along.
2624
01:56:27,360 --> 01:56:31,580
And then there's other functions that we define and use,
2625
01:56:31,580 --> 01:56:33,420
and that's what the def is for.
2626
01:56:33,420 --> 01:56:37,380
And in effect we can create new reserved words
2627
01:56:37,380 --> 01:56:40,260
of our own making that extend the Python language
2628
01:56:40,260 --> 01:56:42,600
after we define the function.
2629
01:56:43,620 --> 01:56:45,740
So it's just this bit of reusable code
2630
01:56:45,740 --> 01:56:46,940
that takes some arguments.
2631
01:56:46,940 --> 01:56:48,140
We haven't seen any with arguments.
2632
01:56:48,140 --> 01:56:49,100
There's a little parentheses
2633
01:56:49,100 --> 01:56:50,820
and we'll see how that works in a bit.
2634
01:56:50,820 --> 01:56:54,420
We define using the def keyword and then we invoke it.
2635
01:56:54,420 --> 01:56:55,700
There's the defining phase,
2636
01:56:55,700 --> 01:56:56,780
which actually doesn't run the code,
2637
01:56:56,780 --> 01:56:57,780
it just remembers the code.
2638
01:56:57,780 --> 01:56:59,620
And then there's the invoking phase.
2639
01:56:59,620 --> 01:57:02,780
You define it once and then invoke it one or more times.
2640
01:57:02,780 --> 01:57:05,340
Calling the function or invoking the function,
2641
01:57:05,340 --> 01:57:07,940
we think of those two things as the same thing,
2642
01:57:07,940 --> 01:57:11,080
call and invoke are just the terms we use.
2643
01:57:11,080 --> 01:57:12,900
Most people just say call the function,
2644
01:57:12,900 --> 01:57:15,480
but invoking is a perhaps more descriptive way
2645
01:57:15,480 --> 01:57:16,720
to think about it.
2646
01:57:16,720 --> 01:57:19,500
So here's an example of a function.
2647
01:57:19,500 --> 01:57:20,620
It is built into Python.
2648
01:57:20,620 --> 01:57:22,260
It's called the max function.
2649
01:57:22,260 --> 01:57:25,420
And we can pass some parameters into the max function.
2650
01:57:25,420 --> 01:57:27,620
So we pass the hello world string.
2651
01:57:27,620 --> 01:57:29,140
Now, like much of Python,
2652
01:57:29,140 --> 01:57:32,660
max knows what kind of thing is being passed into it.
2653
01:57:32,660 --> 01:57:34,660
And it knows that it's looking for
2654
01:57:34,660 --> 01:57:36,060
the largest character,
2655
01:57:36,060 --> 01:57:40,900
the lexicographically largest character.
2656
01:57:40,900 --> 01:57:43,220
And in this case, it scans this little,
2657
01:57:43,220 --> 01:57:44,620
that's inside the max code,
2658
01:57:44,620 --> 01:57:46,900
it scans through and finds the largest character.
2659
01:57:46,900 --> 01:57:48,920
So apparently lowercase letters
2660
01:57:48,920 --> 01:57:50,820
are higher than uppercase letters
2661
01:57:50,820 --> 01:57:53,820
because in English we get back a lowercase w.
2662
01:57:53,820 --> 01:57:56,680
And so this is what's called the return value.
2663
01:57:56,680 --> 01:57:59,020
So this is an assignment statement.
2664
01:57:59,020 --> 01:58:00,960
Let me clear this and start over.
2665
01:58:00,960 --> 01:58:02,660
So this is an assignment statement.
2666
01:58:02,660 --> 01:58:05,180
So it has to evaluate this right-hand side.
2667
01:58:05,180 --> 01:58:08,460
And a function call is nothing more than like x plus one.
2668
01:58:08,460 --> 01:58:10,180
It's something to evaluate.
2669
01:58:10,180 --> 01:58:11,700
It runs the function code,
2670
01:58:11,700 --> 01:58:13,100
passes in this argument,
2671
01:58:13,100 --> 01:58:14,820
and then this residual value,
2672
01:58:14,820 --> 01:58:16,060
this so-called return value,
2673
01:58:16,060 --> 01:58:17,820
we'll look at this in more detail,
2674
01:58:17,820 --> 01:58:22,020
becomes the result of this little bit in the expression
2675
01:58:22,020 --> 01:58:23,180
and there's nothing else.
2676
01:58:23,180 --> 01:58:25,660
We could have W plus one or something.
2677
01:58:25,660 --> 01:58:29,100
And then the w is what's stored into big.
2678
01:58:29,100 --> 01:58:31,940
Okay, so we print big and big is a variable
2679
01:58:31,940 --> 01:58:34,660
that has the letter w inside of it.
2680
01:58:34,660 --> 01:58:36,700
And then we ask what is the smallest
2681
01:58:36,700 --> 01:58:37,940
and that finds the blank.
2682
01:58:37,940 --> 01:58:39,620
And so we get a blank, as you can see here.
2683
01:58:39,620 --> 01:58:41,740
There's a min function and a max function.
2684
01:58:41,740 --> 01:58:43,420
Both of these are built-in.
2685
01:58:45,540 --> 01:58:46,700
These are built-in functions.
2686
01:58:46,700 --> 01:58:48,140
They're always there for us.
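As a small sketch of the max and min calls described here (the variable name tiny is an assumption; big comes from the lecture):

```python
big = max("Hello world")   # scans the characters and returns the largest one
tiny = min("Hello world")  # the space sorts below every letter
print(big)         # w
print(repr(tiny))  # ' '
```

The lowercase w wins because lowercase letters compare as larger than uppercase ones.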
2687
01:58:50,180 --> 01:58:55,180
Okay, so here is another example of the max function.
2688
01:58:55,300 --> 01:58:58,220
And so we can think of this as invoking
2689
01:58:58,220 --> 01:58:59,420
or calling this function
2690
01:58:59,420 --> 01:59:01,900
as this right-hand side is being evaluated.
2691
01:59:02,740 --> 01:59:04,500
We are passing this variable in
2692
01:59:04,500 --> 01:59:06,100
and there's some code in here
2693
01:59:06,100 --> 01:59:08,420
and it's gonna do some stuff, yada, yada, yada,
2694
01:59:08,420 --> 01:59:12,340
and then it's gonna give us back a bit of stuff.
2695
01:59:12,340 --> 01:59:14,260
And that's its return value
2696
01:59:14,260 --> 01:59:17,300
and then that goes up into the big, right?
2697
01:59:17,300 --> 01:59:19,380
And so that's how this works.
2698
01:59:19,380 --> 01:59:21,740
And so this is actually built-in.
2699
01:59:23,700 --> 01:59:26,300
Built-in or burnt-in, I guess I can't draw.
2700
01:59:26,300 --> 01:59:30,340
And so you can think of this as some time a long time ago
2701
01:59:30,340 --> 01:59:32,740
when Python was being first formed,
2702
01:59:32,740 --> 01:59:34,460
somebody wrote some code.
2703
01:59:34,460 --> 01:59:35,900
And it's got some stuff in it.
2704
01:59:35,900 --> 01:59:38,860
It's got a little loop that reads through all the letters.
2705
01:59:38,860 --> 01:59:41,140
It has to figure out if it's a string or a list,
2706
01:59:41,140 --> 01:59:42,460
et cetera, et cetera, et cetera.
2707
01:59:42,460 --> 01:59:47,500
But this is store, except you didn't do the storing
2708
01:59:47,500 --> 01:59:48,500
because it's already built-in.
2709
01:59:48,500 --> 01:59:51,220
And then this is the reuse, store and reuse.
2710
01:59:51,220 --> 01:59:52,960
So we build these things into Python.
2711
01:59:52,960 --> 01:59:54,620
They're already pre-built
2712
01:59:54,620 --> 01:59:57,180
as if before the first line of your code executes
2713
01:59:57,180 --> 02:00:00,940
way up here, someone put all this code in for you
2714
02:00:00,940 --> 02:00:04,380
into Python and created a thing called max for you.
2715
02:00:05,740 --> 02:00:08,520
Now we've been using this already, built-in functions.
2716
02:00:08,520 --> 02:00:10,080
We've got type conversions.
2717
02:00:10,080 --> 02:00:13,580
We've got like the float that takes an integer
2718
02:00:13,580 --> 02:00:17,120
and returns a floating point version of that.
2719
02:00:17,120 --> 02:00:19,660
And again, this is kind of like an expression.
2720
02:00:19,660 --> 02:00:22,140
So it's like, I wanna divide this by 100,
2721
02:00:22,140 --> 02:00:24,820
but before I do that, I've gotta convert it to a float.
2722
02:00:24,820 --> 02:00:27,940
So it has to sort of do these function calls
2723
02:00:27,940 --> 02:00:32,460
as it's evaluating the expression, okay?
2724
02:00:32,460 --> 02:00:35,740
Sometimes like here, we just have,
2725
02:00:35,740 --> 02:00:38,560
it just prints out the return value.
2726
02:00:38,560 --> 02:00:39,400
That's what this is.
2727
02:00:39,400 --> 02:00:40,460
This is the return value.
2728
02:00:40,460 --> 02:00:42,940
If you just type a function and a parameter,
2729
02:00:42,940 --> 02:00:45,260
it can be a constant or it can be a variable.
2730
02:00:45,260 --> 02:00:46,300
And as we'll see in a second,
2731
02:00:46,300 --> 02:00:48,380
you can give it many of these if you like.
2732
02:00:48,380 --> 02:00:50,100
So you can either just run it
2733
02:00:50,100 --> 02:00:53,660
or take the result of this, this passes an integer in,
2734
02:00:53,660 --> 02:00:57,780
converts it to a float and then puts the float into that.
2735
02:00:57,780 --> 02:00:59,700
Type tells us what kind of thing that is
2736
02:00:59,700 --> 02:01:02,140
and you can use this inside of an expression.
2737
02:01:02,140 --> 02:01:03,820
And so it's like, what am I gonna do first?
2738
02:01:03,820 --> 02:01:05,820
Oh, I've gotta do two times this thing.
2739
02:01:05,820 --> 02:01:09,020
Oh, wait a sec, pause just briefly for a moment,
2740
02:01:09,020 --> 02:01:13,580
call out to some float code, pass a three into it
2741
02:01:13,580 --> 02:01:17,020
and then something comes back, the return value,
2742
02:01:17,020 --> 02:01:18,740
the residual value comes back
2743
02:01:18,740 --> 02:01:20,420
and then that participates,
2744
02:01:20,420 --> 02:01:22,600
in this case it's gonna be 3.0,
2745
02:01:22,600 --> 02:01:26,420
participates in this two times 3.0, okay?
2746
02:01:26,420 --> 02:01:29,660
And so two times 3.0 ends up being 6.0, et cetera, et cetera.
2747
02:01:29,660 --> 02:01:31,340
But you can see as it, it's like,
2748
02:01:31,340 --> 02:01:33,140
oh, wait a sec, I gotta figure out what this is,
2749
02:01:33,140 --> 02:01:34,740
call the function, get the return value
2750
02:01:34,740 --> 02:01:37,660
and then continue processing this expression.
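Here is a rough sketch of function calls taking part in an expression. The specific values 99 and 42 and the variable names are assumptions for illustration; the two times float(3) step is the one walked through above.

```python
as_float = float(99) / 100  # float(99) returns 99.0 before the division
print(as_float)             # 0.99

i = 42
f = float(i)                # pass an integer in, get a float back
print(type(f))              # <class 'float'>

doubled = 2 * float(3)      # pause, call float, then multiply 2 * 3.0
print(doubled)              # 6.0
```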
2751
02:01:39,780 --> 02:01:42,020
We've also done this with string conversions,
2752
02:01:42,020 --> 02:01:44,780
partly because just as an example,
2753
02:01:44,780 --> 02:01:46,420
the input always returns a string,
2754
02:01:46,420 --> 02:01:48,780
the input function returns a string.
2755
02:01:48,780 --> 02:01:51,100
And so, you know, here's this string,
2756
02:01:51,100 --> 02:01:54,860
could be coming from input, but we'll just take one, two, three.
2757
02:01:54,860 --> 02:01:58,060
We know that that's a string, it's not the number 123.
2758
02:01:58,060 --> 02:02:01,160
And if we try to add one to it, we get a traceback,
2759
02:02:01,160 --> 02:02:06,160
cannot concatenate string and integer, traceback,
2760
02:02:06,200 --> 02:02:08,020
but we can convert that string to an integer.
2761
02:02:08,020 --> 02:02:10,980
And so int can take like a floating point number
2762
02:02:10,980 --> 02:02:13,940
or an integer or even a string and it says,
2763
02:02:13,940 --> 02:02:15,180
oh, I know what I'm supposed to do with string,
2764
02:02:15,180 --> 02:02:18,180
I'm supposed to look at this, interpret these as numbers
2765
02:02:18,180 --> 02:02:20,700
and, you know, multiply by 10
2766
02:02:20,700 --> 02:02:22,260
and figure out what the hundreds place is
2767
02:02:22,260 --> 02:02:23,740
and all that stuff, there's a little bit of work to that
2768
02:02:23,740 --> 02:02:26,060
and it does it, but then it gives us back an integer
2769
02:02:26,060 --> 02:02:27,300
and we say, oh, what is that?
2770
02:02:27,300 --> 02:02:31,100
That's now the 123, but it is of type int.
2771
02:02:31,100 --> 02:02:34,260
And now we can add one to it and get 124.
2772
02:02:34,260 --> 02:02:37,340
And as before from this example that we're kind of reusing
2773
02:02:37,340 --> 02:02:42,340
from a previous chapter, you don't want to try to convert,
2774
02:02:42,340 --> 02:02:46,660
oops, sad face, sad face, sad face.
2775
02:02:46,660 --> 02:02:47,900
Don't want to try to convert something
2776
02:02:47,900 --> 02:02:49,420
that doesn't have digits using int
2777
02:02:49,420 --> 02:02:51,700
because it'll say, I don't know what to do
2778
02:02:51,700 --> 02:02:54,500
and then your program quits, right?
2779
02:02:54,500 --> 02:02:57,340
You don't want your program to stop with tracebacks,
2780
02:02:57,340 --> 02:03:00,260
and you can of course deal with that with try and except,
2781
02:03:00,260 --> 02:03:02,820
but that's like a previous lecture.
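The string conversion steps can be sketched as follows. The variable names sval and ival and the text "hello bob" are assumptions; the try/except wrappers are added only so the sketch runs to the end instead of stopping at the first traceback.

```python
sval = "123"
try:
    sval + 1  # adding an integer to a string is not allowed
except TypeError as err:
    print("Cannot add:", err)

ival = int(sval)   # interpret the digits and build an integer
print(type(ival))  # <class 'int'>
print(ival + 1)    # 124

try:
    int("hello bob")  # no digits, so int() does not know what to do
except ValueError as err:
    print("Cannot convert:", err)
```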
2782
02:03:02,820 --> 02:03:05,500
Okay, so up next, we're gonna talk about building
2783
02:03:05,500 --> 02:03:12,500
our own functions, not just using the predefined ones.
2784
02:03:12,940 --> 02:03:14,380
So welcome back, we're gonna continue
2785
02:03:14,380 --> 02:03:18,280
and start talking about building our own functions.
2786
02:03:18,280 --> 02:03:22,620
So again, we use the def keyword to define a function
2787
02:03:22,620 --> 02:03:25,220
and then later we're gonna invoke this
2788
02:03:25,220 --> 02:03:26,660
and there's a bit to it.
2789
02:03:26,660 --> 02:03:28,620
We are defining the name of the function
2790
02:03:28,620 --> 02:03:30,340
and in effect we're extending Python
2791
02:03:30,340 --> 02:03:33,020
and creating new predefined things that we can use
2792
02:03:33,020 --> 02:03:34,460
except it's our code.
2793
02:03:34,460 --> 02:03:37,500
It starts with a def keyword, has some optional arguments
2794
02:03:37,500 --> 02:03:39,860
which we'll see in a bit, that's what the parenthesis is
2795
02:03:39,860 --> 02:03:41,800
and then the name, and the function names follow
2796
02:03:41,800 --> 02:03:44,420
the same rules as variable names
2797
02:03:44,420 --> 02:03:46,620
and then you have an indented block,
2798
02:03:46,620 --> 02:03:47,820
whatever code you want to do
2799
02:03:47,820 --> 02:03:49,260
and then you have a deindented block
2800
02:03:49,260 --> 02:03:52,220
and that sort of defines the essence.
2801
02:03:52,220 --> 02:03:56,260
The key thing here is this is not calling,
2802
02:03:57,720 --> 02:03:59,860
it's not invoking, it's not executing,
2803
02:03:59,860 --> 02:04:03,540
it's remembering, it's storing, it's figuring things out.
2804
02:04:03,540 --> 02:04:06,860
So here is the output of a program that defines a function
2805
02:04:06,860 --> 02:04:08,100
but then doesn't use it.
2806
02:04:08,100 --> 02:04:10,620
So this is a sort of broken function.
2807
02:04:10,620 --> 02:04:12,860
So here we go, we start x equals five print.
2808
02:04:12,860 --> 02:04:15,580
You don't have to have all the defs at the beginning.
2809
02:04:15,580 --> 02:04:16,980
The def runs whenever.
2810
02:04:16,980 --> 02:04:19,340
So you know, out comes hello
2811
02:04:19,340 --> 02:04:21,740
and then we define a function and this says,
2812
02:04:21,740 --> 02:04:24,020
oh, oh, you wanna make a new thing here.
2813
02:04:24,020 --> 02:04:24,940
So I'll make a new thing.
2814
02:04:24,940 --> 02:04:26,580
It's kinda like a variable in a sense
2815
02:04:26,580 --> 02:04:28,420
and then it copies this stuff,
2816
02:04:28,420 --> 02:04:30,740
copies it up there and says later you probably
2817
02:04:30,740 --> 02:04:33,100
are gonna wanna use this so I'm gonna remember it
2818
02:04:33,100 --> 02:04:35,100
so it doesn't do anything there.
2819
02:04:35,100 --> 02:04:38,740
No output comes out, then it says print yo
2820
02:04:38,740 --> 02:04:41,580
and out comes yo and then it adds two to x
2821
02:04:41,580 --> 02:04:43,740
so x is now seven and then it prints x
2822
02:04:43,740 --> 02:04:45,640
and there's the seven.
2823
02:04:45,640 --> 02:04:48,700
These print statements never ran.
2824
02:04:48,700 --> 02:04:49,880
They never ran, why?
2825
02:04:49,880 --> 02:04:52,180
Because we did not invoke them down here.
2826
02:04:52,180 --> 02:04:55,340
We defined them but didn't invoke them.
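A sketch of the define-but-never-call program described above. The function name print_lyrics follows the lecture; the two lines of lyric text are an assumption, since the transcript does not quote them.

```python
x = 5
print("Hello")

def print_lyrics():
    # Stored, but never called in this program, so these never print.
    print("I'm a lumberjack, and I'm okay.")
    print("I sleep all night and I work all day.")

print("Yo")
x = x + 2
print(x)  # 7
```

Only Hello, Yo, and 7 come out; the two prints inside the function stay silent.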
2827
02:04:55,340 --> 02:04:58,820
So let's take a look at how you invoke a function, right?
2828
02:04:58,820 --> 02:05:00,560
You define it and then you use it.
2829
02:05:00,560 --> 02:05:02,420
Sometimes you define it once and use it once
2830
02:05:02,420 --> 02:05:04,660
but more commonly you define it once
2831
02:05:04,660 --> 02:05:06,220
and use it more than one time.
2832
02:05:06,220 --> 02:05:07,840
Again, the store and reuse pattern.
2833
02:05:07,840 --> 02:05:11,460
The def is the store and the invoking is the reuse.
2834
02:05:12,540 --> 02:05:14,300
So here's just a slightly different version
2835
02:05:14,300 --> 02:05:16,220
of that last program and so now
2836
02:05:16,220 --> 02:05:18,340
it's gonna actually invoke it.
2837
02:05:19,420 --> 02:05:22,260
So x equals five, print hello, def,
2838
02:05:22,260 --> 02:05:23,820
so out comes hello.
2839
02:05:23,820 --> 02:05:27,460
This produces, the def produces no output, right?
2840
02:05:27,460 --> 02:05:29,700
But because there's a deindent here,
2841
02:05:29,700 --> 02:05:32,860
that is the entire blob of the code
2842
02:05:32,860 --> 02:05:34,620
that is part of print lyrics.
2843
02:05:34,620 --> 02:05:38,020
So it prints out yo and now we're gonna invoke.
2844
02:05:38,020 --> 02:05:39,540
This is the call.
2845
02:05:39,540 --> 02:05:40,980
We're gonna call the function.
2846
02:05:40,980 --> 02:05:44,520
Now the function goes up, let's clear this.
2847
02:05:45,680 --> 02:05:47,760
Somewhere down to here.
2848
02:05:47,760 --> 02:05:51,020
Now this like suspends at this place.
2849
02:05:51,020 --> 02:05:53,940
It's like remember to come back to here when we're done.
2850
02:05:53,940 --> 02:05:57,800
Go up, run this code and then come back
2851
02:05:57,800 --> 02:05:59,020
and then continue on.
2852
02:05:59,020 --> 02:06:01,180
So it like leaves like a breadcrumb
2853
02:06:01,180 --> 02:06:02,940
of where it's supposed to come back to.
2854
02:06:02,940 --> 02:06:05,180
And then it runs and then the print lyrics
2855
02:06:05,180 --> 02:06:08,900
of course produces the two lines of output.
2856
02:06:08,900 --> 02:06:11,880
And yeah, that should probably not have,
2857
02:06:11,880 --> 02:06:13,620
that day should be up there.
2858
02:06:13,620 --> 02:06:16,100
And then x equals x plus two which makes it seven
2859
02:06:16,100 --> 02:06:17,500
and then prints out seven.
2860
02:06:17,500 --> 02:06:22,500
Okay, so this is the invoke or call the function.
2861
02:06:24,140 --> 02:06:26,340
You defined it and then later you called it.
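The same sketch with the call added. As before, the lyric text is an assumption; the point is the call line, where Python remembers its place, runs the stored code, and comes back.

```python
x = 5
print("Hello")

def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print("I sleep all night and I work all day.")

print("Yo")
print_lyrics()  # suspend here, run the function body, then return here
x = x + 2
print(x)  # 7
```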
2862
02:06:26,340 --> 02:06:31,340
Now, in addition to just call and return and invoking,
2863
02:06:32,860 --> 02:06:34,680
we can pass parameters in.
2864
02:06:34,680 --> 02:06:37,900
And the example of the parameter is in the max function
2865
02:06:37,900 --> 02:06:39,620
we have to say, this is the thing I want you
2866
02:06:39,620 --> 02:06:42,420
to find the maximum of, the largest thing.
2867
02:06:42,420 --> 02:06:46,300
And part of it is in the whole store and reuse pattern,
2868
02:06:46,300 --> 02:06:48,860
we have a few lines of code but sometimes we wanna do
2869
02:06:48,860 --> 02:06:51,000
ever so slightly different things
2870
02:06:51,000 --> 02:06:52,300
in different invocations.
2871
02:06:52,300 --> 02:06:55,900
And so we use the arguments to subtly adjust
2872
02:06:55,900 --> 02:07:00,220
like finding the maximum is a general thing
2873
02:07:00,220 --> 02:07:03,540
but what thing to find the maximum of that makes a function
2874
02:07:03,540 --> 02:07:06,300
that's much more useful and reusable
2875
02:07:06,300 --> 02:07:08,280
in a lot more situations.
2876
02:07:08,280 --> 02:07:10,740
So arguments are the things we pass in
2877
02:07:10,740 --> 02:07:13,980
and we defined for our functions that we're going to build,
2878
02:07:13,980 --> 02:07:18,540
we on the def statement, so we say def, greet,
2879
02:07:18,540 --> 02:07:21,140
name of the function, and then these are the arguments,
2880
02:07:21,140 --> 02:07:22,700
the things that are coming in.
2881
02:07:22,700 --> 02:07:27,420
Now, this lang variable in a sense only exists
2882
02:07:27,420 --> 02:07:29,340
during the life of the function
2883
02:07:29,340 --> 02:07:31,260
and it represents sort of a placeholder,
2884
02:07:31,260 --> 02:07:34,200
it's not a real variable in the same sense,
2885
02:07:34,200 --> 02:07:37,780
it's a placeholder that refers to how you touch
2886
02:07:37,780 --> 02:07:40,660
that first parameter that's sitting in there.
2887
02:07:40,660 --> 02:07:45,360
Okay, and so lang, so lang is our first parameter,
2888
02:07:45,360 --> 02:07:48,500
whatever it is, we don't need to see this part down here
2889
02:07:48,500 --> 02:07:51,180
right now, all we know is we're gonna make a function
2890
02:07:51,180 --> 02:07:53,820
and we're gonna take a first, we're gonna take a parameter
2891
02:07:53,820 --> 02:07:56,240
and this lang is the placeholder that tells us
2892
02:07:56,240 --> 02:07:59,220
what that parameter is, okay?
2893
02:07:59,220 --> 02:08:01,380
So within the function, we're gonna check to see
2894
02:08:01,380 --> 02:08:04,540
if the language is Spanish, and if it is, print hola,
2895
02:08:04,540 --> 02:08:07,880
else if the language is French, print bonjour,
2896
02:08:07,880 --> 02:08:09,060
otherwise print hello.
2897
02:08:09,060 --> 02:08:12,140
We have a very highly simplified
2898
02:08:12,140 --> 02:08:14,580
language translation system here.
2899
02:08:14,580 --> 02:08:17,580
So the def, of course, does nothing,
2900
02:08:17,580 --> 02:08:21,740
except it remembers that and defines the concept greet.
2901
02:08:24,660 --> 02:08:26,940
So that comes down and now we're gonna call it
2902
02:08:26,940 --> 02:08:28,300
and that says go look up the thing
2903
02:08:28,300 --> 02:08:29,660
that I defined called greet.
2904
02:08:29,660 --> 02:08:31,420
If you don't put this in, greet is gonna give you
2905
02:08:31,420 --> 02:08:34,660
a traceback, but because you extended and named it greet,
2906
02:08:34,660 --> 02:08:38,340
so it runs in, it starts, suspends the code here,
2907
02:08:38,340 --> 02:08:43,340
starts up here, but then lang is now an alias to en.
2908
02:08:43,340 --> 02:08:48,340
So now we can run if that is es, else if,
2909
02:08:49,900 --> 02:08:52,220
oop, I'm getting it all wrong now.
2910
02:08:55,500 --> 02:08:58,760
Right, so en comes in as lang, we're coming in the code.
2911
02:08:59,820 --> 02:09:03,980
If it's not es, it's not fr, else, it prints hello,
2912
02:09:03,980 --> 02:09:06,060
and then it comes back to the next line.
2913
02:09:07,260 --> 02:09:10,340
And then we call it again and this time es is lang
2914
02:09:10,340 --> 02:09:14,580
and so it runs this code and prints hola,
2915
02:09:14,580 --> 02:09:17,020
and then next time it calls with this,
2916
02:09:17,020 --> 02:09:21,380
and then prints bonjour, you get the idea.
2917
02:09:21,380 --> 02:09:26,380
So this is a placeholder so that on successive calls
2918
02:09:26,380 --> 02:09:30,340
or invocations of the function,
2919
02:09:30,340 --> 02:09:33,340
we can get at whatever the programmer put in
2920
02:09:33,340 --> 02:09:34,620
as that first parameter.
2921
02:09:34,620 --> 02:09:37,300
And so we are saying in this definition,
2922
02:09:37,300 --> 02:09:39,740
we are ready to receive a first parameter.
2923
02:09:39,740 --> 02:09:42,220
Please call us with a parameter
2924
02:09:42,220 --> 02:09:44,700
and then we will be able to do something slightly different
2925
02:09:44,700 --> 02:09:45,580
for the different values.
2926
02:09:45,580 --> 02:09:48,820
So this is a reusable bit of function that prints hello
2927
02:09:48,820 --> 02:09:51,380
in three different languages and then we tell it
2928
02:09:51,380 --> 02:09:54,740
what language at the moment that we're actually invoking it.
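A sketch of the greet function with a parameter, as walked through above. The exact capitalization of the printed strings is an assumption.

```python
def greet(lang):
    # lang is a placeholder for whatever the caller passes in.
    if lang == "es":
        print("Hola")
    elif lang == "fr":
        print("Bonjour")
    else:
        print("Hello")

greet("en")  # neither es nor fr, so the else branch prints Hello
greet("es")  # prints Hola
greet("fr")  # prints Bonjour
```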
2929
02:09:57,180 --> 02:09:59,300
So that's putting stuff into the function.
2930
02:09:59,300 --> 02:10:03,740
Now getting stuff back out is the concept of returning.
2931
02:10:03,740 --> 02:10:07,380
In the return statement, the return statement
2932
02:10:07,380 --> 02:10:12,380
is an executable statement that does two basic things.
2933
02:10:12,860 --> 02:10:16,420
The first thing that it does is it finishes.
2934
02:10:16,420 --> 02:10:19,460
Now this is a one line function so that's kind of redundant,
2935
02:10:19,460 --> 02:10:23,660
but when Python goes into the return statement,
2936
02:10:23,660 --> 02:10:26,180
it doesn't continue on to the next line.
2937
02:10:26,180 --> 02:10:27,340
It just returns.
2938
02:10:27,340 --> 02:10:29,300
That is the end of the invocation
2939
02:10:29,300 --> 02:10:30,980
of that particular function.
2940
02:10:30,980 --> 02:10:33,660
But even more importantly, it takes as its parameter.
2941
02:10:33,660 --> 02:10:36,220
You can say return without a parameter
2942
02:10:36,220 --> 02:10:38,460
and it will stop the execution of the function
2943
02:10:38,460 --> 02:10:41,100
kind of like a break does for a loop.
2944
02:10:41,100 --> 02:10:42,300
It's kind of a break for a loop.
2945
02:10:42,300 --> 02:10:43,260
Get out, we're done.
2946
02:10:43,260 --> 02:10:44,580
Don't run that next line.
2947
02:10:44,580 --> 02:10:45,500
Get out.
2948
02:10:45,500 --> 02:10:49,300
But it also allows the specification of what you want
2949
02:10:49,300 --> 02:10:51,460
as the residual value in an expression.
2950
02:10:51,460 --> 02:10:54,460
So we're doing a print and then we're saying greet.
2951
02:10:54,460 --> 02:10:59,060
And what's gonna show up here is whatever this function does
2952
02:10:59,060 --> 02:11:00,460
in its return statement.
2953
02:11:00,460 --> 02:11:02,900
And so that prints hello.
2954
02:11:02,900 --> 02:11:05,700
We call it again and it prints hello again.
2955
02:11:05,700 --> 02:11:06,540
Okay?
2956
02:11:10,180 --> 02:11:12,340
And so basically the return statement,
2957
02:11:13,740 --> 02:11:15,220
I call this the residual value.
2958
02:11:15,220 --> 02:11:18,740
It's like what shows up here when the function is all done
2959
02:11:18,740 --> 02:11:20,560
and it's the string hello.
2960
02:11:22,020 --> 02:11:24,620
We call functions that return values fruitful
2961
02:11:24,620 --> 02:11:27,620
because they produce something but you don't have to.
2962
02:11:27,620 --> 02:11:29,100
You can just say return.
2963
02:11:29,100 --> 02:11:30,860
Or you don't even have to have a return statement.
2964
02:11:30,860 --> 02:11:32,140
It goes to the last line of the function
2965
02:11:32,140 --> 02:11:33,660
and it does a return automatically
2966
02:11:33,660 --> 02:11:35,140
at the last line of the function.
2967
02:11:35,140 --> 02:11:36,560
So here's a little bit of a rewrite
2968
02:11:36,560 --> 02:11:39,660
of our little language program.
2969
02:11:39,660 --> 02:11:41,420
We are going to create a greeting program.
2970
02:11:41,420 --> 02:11:44,060
We're gonna take the language as the first parameter.
2971
02:11:44,060 --> 02:11:45,820
And instead of just doing a print statement,
2972
02:11:45,820 --> 02:11:46,960
which is what we did before,
2973
02:11:46,960 --> 02:11:49,380
this is now more like a function
2974
02:11:49,380 --> 02:11:52,860
because it takes some input and produces some output
2975
02:11:52,860 --> 02:11:54,860
as a return rather than just printing.
2976
02:11:54,860 --> 02:11:57,060
It's a little tacky for a function to print.
2977
02:11:58,060 --> 02:12:01,620
And so here we return Hola, Bonjour, and Hello
2978
02:12:01,620 --> 02:12:03,540
based on the right thing.
2979
02:12:03,540 --> 02:12:05,900
So now we say print greet en.
2980
02:12:05,900 --> 02:12:08,460
So it runs the code once, lang is en.
2981
02:12:08,460 --> 02:12:12,460
And then it runs this code and the residual value is hello.
2982
02:12:12,460 --> 02:12:14,620
So it says Hello Glen.
2983
02:12:14,620 --> 02:12:18,340
And similarly, when it runs this code,
2984
02:12:18,340 --> 02:12:20,940
it passes es in as lang, it runs through
2985
02:12:20,940 --> 02:12:23,020
and it runs this statement.
2986
02:12:23,020 --> 02:12:25,120
If there were more statements, it still wouldn't run them.
2987
02:12:25,120 --> 02:12:26,640
As soon as this return runs,
2988
02:12:26,640 --> 02:12:31,640
that says that this bit right here is now hola.
2989
02:12:31,640 --> 02:12:34,720
And the same with French, goes in, runs again,
2990
02:12:34,720 --> 02:12:38,400
out comes the return statement, and then bonjour, Michael.
2991
02:12:38,400 --> 02:12:42,400
So you see how we can control as we're writing the application,
2992
02:12:43,400 --> 02:12:45,200
we can control as we're writing the function
2993
02:12:45,200 --> 02:12:48,080
what the residual value that we want to see
2994
02:12:48,080 --> 02:12:50,240
in whatever expression is calling us.
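The rewrite being walked through can be sketched roughly like this. The parameter name lang and the names Glen and Michael come from the narration; the exact if/elif layout and the name used in the Spanish call are assumptions.

```python
def greet(lang):
    # The first return that runs ends the function; nothing after it executes
    if lang == "es":
        return "Hola"
    elif lang == "fr":
        return "Bonjour"
    else:
        return "Hello"

print(greet("en"), "Glen")     # Hello Glen
print(greet("es"), "Sally")    # Hola Sally (the name is illustrative)
print(greet("fr"), "Michael")  # Bonjour Michael
```

Because greet returns a string instead of printing, the caller decides what to do with the result.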
2995
02:12:50,240 --> 02:12:51,480
Sometimes we have returns
2996
02:12:51,480 --> 02:12:53,280
and sometimes we don't have returns.
2997
02:12:53,280 --> 02:12:57,280
So, if you think of the method as a function,
2998
02:12:57,280 --> 02:13:02,280
well, so if you think of the max code
2999
02:13:02,360 --> 02:13:03,600
that we talked about before,
3000
02:13:03,600 --> 02:13:06,180
we can kind of see that somewhere inside that max code,
3001
02:13:06,180 --> 02:13:07,020
there's a return.
3002
02:13:07,020 --> 02:13:09,840
And that's how it communicates the W back to us.
3003
02:13:09,840 --> 02:13:12,520
So we pass in as its argument, hello world.
3004
02:13:12,520 --> 02:13:13,880
It comes in as a parameter
3005
02:13:13,880 --> 02:13:17,040
and it's gonna loop through this inp somewhere.
3006
02:13:17,040 --> 02:13:19,040
It's gonna loop over and over through inp.
3007
02:13:19,040 --> 02:13:21,120
And then at some point it's gonna figure something out
3008
02:13:21,120 --> 02:13:23,880
and tell us what it wants to send back to us
3009
02:13:23,880 --> 02:13:24,920
in a return statement.
3010
02:13:24,920 --> 02:13:29,800
And so the W comes back and gets assigned into big.
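A short sketch of the call being described, using Python's built-in max. For a string, max returns the largest character, and lowercase letters compare greater than uppercase letters and the space.

```python
# "Hello world" is the argument; whatever max returns is assigned to big
big = max("Hello world")
print(big)  # w
```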
3011
02:13:31,400 --> 02:13:33,320
You can have more than one parameter
3012
02:13:33,320 --> 02:13:34,680
and there's just an order.
3013
02:13:34,680 --> 02:13:36,960
The first one and the second one, three and five.
3014
02:13:36,960 --> 02:13:41,080
So three becomes a and five becomes b and away we go.
3015
02:13:41,080 --> 02:13:43,040
So we just use this to add two numbers
3016
02:13:43,040 --> 02:13:45,120
and so three plus five is eight.
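A minimal sketch of a two-parameter function like the one described. The parameter names a and b come from the narration; the function name addtwo and the variable names are assumptions.

```python
def addtwo(a, b):
    # Arguments are matched to parameters by position: 3 -> a, 5 -> b
    added = a + b
    return added

x = addtwo(3, 5)
print(x)  # 8
```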
3017
02:13:47,920 --> 02:13:50,600
So you get as many as you like and the order matters.
3018
02:13:50,600 --> 02:13:53,940
And if you do things like you tell it you want parameters
3019
02:13:53,940 --> 02:13:54,960
and you don't give them to it,
3020
02:13:54,960 --> 02:13:58,000
then that'll become a traceback and it will blow up.
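To see that failure without stopping the program, here is a small sketch. The function name addtwo is hypothetical, and the try/except is only there so the example keeps running; without it Python would print a traceback and quit.

```python
def addtwo(a, b):
    return a + b

try:
    addtwo(3)  # two parameters declared, only one argument supplied
except TypeError as err:
    # Python raises TypeError when a required argument is missing
    print("Call failed:", err)
```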
3021
02:13:58,000 --> 02:14:01,280
We'll also talk about optional parameters later.
3022
02:14:01,280 --> 02:14:03,440
So you don't have to have return values
3023
02:14:03,440 --> 02:14:05,920
and that means that you simply don't call
3024
02:14:05,920 --> 02:14:07,400
the return with a value.
3025
02:14:07,400 --> 02:14:11,420
And return is always implicitly happening
3026
02:14:11,420 --> 02:14:13,900
as the last line of the function.
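A small sketch of both cases, with made-up function names. A function that reaches its last line, or runs a bare return, hands back the special value None.

```python
def print_greeting():
    print("Hello")
    # No return statement: the function returns after its last line

def stop_early():
    return  # bare return: stop here, with no value
    print("never runs")

result = print_greeting()
print(result)        # None
print(stop_early())  # None
```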
3027
02:14:14,800 --> 02:14:19,800
So that's kind of the basics of how functions operate.
3028
02:14:19,800 --> 02:14:23,340
But I don't want you to get too excited about writing
3029
02:14:23,340 --> 02:14:26,760
functions, some programming classes are like
3030
02:14:26,760 --> 02:14:28,560
gotta write a function, gotta write a function.
3031
02:14:28,560 --> 02:14:32,620
Functions to be clear are a very powerful mechanism.
3032
02:14:32,620 --> 02:14:37,620
And as we write programs of 150, 200 lines of code,
3033
02:14:37,800 --> 02:14:40,360
1000 lines of code, 10,000 lines of code,
3034
02:14:40,360 --> 02:14:43,080
the concept of a function is really important.
3035
02:14:43,080 --> 02:14:45,200
We would go crazy if we didn't have functions.
3036
02:14:45,200 --> 02:14:48,160
But if you're only writing 20 lines of code,
3037
02:14:48,160 --> 02:14:51,560
forcing yourself to write a function is kind of pointless.
3038
02:14:51,560 --> 02:14:56,560
So don't worry about maybe the lack of urge to use this.
3039
02:14:57,360 --> 02:14:59,480
We are calling lots of predefined functions
3040
02:14:59,480 --> 02:15:01,980
and we will for the next couple of lectures.
3041
02:15:01,980 --> 02:15:03,520
There will be a time when you go like,
3042
02:15:03,520 --> 02:15:05,300
oh I'm sick and tired of repeating myself.
3043
02:15:05,300 --> 02:15:07,000
Oh yeah, time to write a function.
3044
02:15:08,140 --> 02:15:11,260
So that's why we don't push functions prematurely.
3045
02:15:11,260 --> 02:15:14,200
We just want you to know what they are,
3046
02:15:14,200 --> 02:15:16,080
use them and at some moment you'll be like,
3047
02:15:16,080 --> 02:15:17,220
oh I wanna define one.
3048
02:15:17,220 --> 02:15:19,160
But don't worry about it, it might take a while
3049
02:15:19,160 --> 02:15:21,700
before you really wanna define a function.
3050
02:15:21,700 --> 02:15:25,740
So that kind of summarizes our lecture on functions
3051
02:15:25,740 --> 02:15:28,460
and up next we're gonna do iterations.
3052
02:15:32,120 --> 02:15:35,180
Hello and welcome to chapter five, loops and iteration.
3053
02:15:35,180 --> 02:15:39,840
Now we're going to work on our fourth basic pattern
3054
02:15:39,840 --> 02:15:43,340
of sequential, conditional, store and reuse,
3055
02:15:43,340 --> 02:15:44,540
and loops and iteration.
3056
02:15:44,540 --> 02:15:47,560
And this is the one where we teach the computer
3057
02:15:47,560 --> 02:15:49,440
how to do things a lot.
3058
02:15:49,440 --> 02:15:51,480
We can tell it to do something a million times.
3059
02:15:51,480 --> 02:15:56,240
And so that's where we get the doggedness of computers
3060
02:15:56,240 --> 02:15:58,960
or the fact that they're so good at doing work for us
3061
02:15:58,960 --> 02:16:01,240
because we can set them off to a task
3062
02:16:01,240 --> 02:16:03,340
and they'll do it until it's done.
3063
02:16:04,340 --> 02:16:07,960
So here's a very simple loop, a very simple loop.
3064
02:16:09,520 --> 02:16:11,280
Let's put the coffee over here.
3065
02:16:11,280 --> 02:16:16,280
The keyword that we're gonna start using is while.
3066
02:16:16,920 --> 02:16:18,920
We're also gonna use the for later on.
3067
02:16:20,000 --> 02:16:23,480
And the while loop functions very much like an if statement.
3068
02:16:23,480 --> 02:16:27,440
The while starts it and this is just like an if statement.
3069
02:16:27,440 --> 02:16:30,560
It's a question that leads to a true or a false answer.
3070
02:16:30,560 --> 02:16:33,200
And then there's a colon and then there's an indented block
3071
02:16:33,200 --> 02:16:36,200
and then we use the deindent to determine how long
3072
02:16:36,200 --> 02:16:38,940
the loop is and so this print is deindented
3073
02:16:38,940 --> 02:16:41,520
so that indicates the end of the loop.
3074
02:16:41,520 --> 02:16:45,920
And so at some level, what's gonna happen here
3075
02:16:45,920 --> 02:16:48,799
is it's just gonna run and if this is true,
3076
02:16:48,799 --> 02:16:51,319
it's gonna run this code and if it's false,
3077
02:16:51,320 --> 02:16:53,080
it's gonna skip the code and that way
3078
02:16:53,080 --> 02:16:54,280
it functions like an if.
3079
02:16:54,280 --> 02:16:56,559
The place that it doesn't function like an if
3080
02:16:56,559 --> 02:16:58,799
is after it's run the code once,
3081
02:16:58,799 --> 02:17:01,559
it goes up and then asks the question again
3082
02:17:01,559 --> 02:17:03,699
and so you can think of it going back up
3083
02:17:03,700 --> 02:17:05,520
kind of to the top of the while loop
3084
02:17:05,520 --> 02:17:07,719
and then re-asking the question like,
3085
02:17:07,719 --> 02:17:10,799
okay, is this going to run again?
3086
02:17:10,799 --> 02:17:13,679
And then it's gonna do that some number of times
3087
02:17:13,680 --> 02:17:15,000
and then it's gonna finish.
3088
02:17:15,000 --> 02:17:18,040
And so that's the loop, that's the iteration.
3089
02:17:18,040 --> 02:17:19,879
And we're going to make a variable,
3090
02:17:19,879 --> 02:17:21,919
we're gonna construct very carefully
3091
02:17:21,920 --> 02:17:24,840
a variable that we call the iteration variable
3092
02:17:24,840 --> 02:17:28,680
and that's n and it's a variable that's gonna change
3093
02:17:28,680 --> 02:17:30,320
and it's our way of running the loop
3094
02:17:30,320 --> 02:17:32,040
but not running the loop forever.
3095
02:17:33,240 --> 02:17:35,059
So let's just run this.
3096
02:17:35,059 --> 02:17:37,719
We come in, n is five, is n greater than zero?
3097
02:17:37,719 --> 02:17:40,119
Yes it is, so we're gonna run this code.
3098
02:17:40,120 --> 02:17:42,120
So we're gonna run this code, we're gonna print out five,
3099
02:17:42,120 --> 02:17:43,360
then we're gonna subtract one
3100
02:17:43,360 --> 02:17:45,040
and then we're gonna go back up,
3101
02:17:45,040 --> 02:17:47,200
go back up and ask the question,
3102
02:17:47,200 --> 02:17:48,639
is n greater than zero?
3103
02:17:48,639 --> 02:17:51,399
And the answer is, since it's four, the answer is yes.
3104
02:17:51,400 --> 02:17:53,000
So it runs again.
3105
02:17:53,000 --> 02:17:55,240
Then it prints out four, subtracts it again,
3106
02:17:55,240 --> 02:17:58,320
checks, prints three, subtracts it again,
3107
02:17:58,320 --> 02:18:00,480
prints two, subtracts it again,
3108
02:18:00,480 --> 02:18:02,620
prints one, subtracts it again.
3109
02:18:02,620 --> 02:18:05,760
Now n is zero and so it comes back up,
3110
02:18:05,760 --> 02:18:10,080
comes back up, and this question has now become false.
3111
02:18:10,080 --> 02:18:11,559
So it's gonna take the exit,
3112
02:18:11,559 --> 02:18:13,979
so it's gonna come down and run this line right here,
3113
02:18:13,980 --> 02:18:15,040
then it prints blast off
3114
02:18:15,040 --> 02:18:17,080
and we can kind of print out the residual value
3115
02:18:17,080 --> 02:18:19,360
of n just to sort of prove to ourselves
3116
02:18:19,360 --> 02:18:22,879
that it ran until n was no longer greater than zero
3117
02:18:22,879 --> 02:18:25,239
and then zero was the final value for n
3118
02:18:25,240 --> 02:18:30,240
and we carefully constructed this n, n equals, oops, go back.
3119
02:18:30,240 --> 02:18:32,860
We carefully constructed n, we set it to five,
3120
02:18:32,860 --> 02:18:36,139
then we carefully subtracted one each time through the loop
3121
02:18:36,139 --> 02:18:39,419
and then we're using that to control when to exit the loop.
3122
02:18:39,420 --> 02:18:41,379
And so you could think of this loop as,
3123
02:18:41,379 --> 02:18:43,499
for now, running five times,
3124
02:18:43,500 --> 02:18:47,620
true, true, true, true, true, and then false, finally.
3125
02:18:47,620 --> 02:18:50,260
So this question was true for a while
3126
02:18:50,260 --> 02:18:52,459
and as long as it was true, the loop ran
3127
02:18:52,459 --> 02:18:55,899
and then when it turned false, the loop stopped.
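The countdown traced above can be sketched like this. The variable n and the starting value five come from the narration; the exact text of the final message is an assumption.

```python
n = 5
while n > 0:       # asked before every pass: True five times, then False
    print(n)       # prints 5, 4, 3, 2, 1
    n = n - 1      # change the iteration variable so the loop can end
print("Blastoff!")
print(n)           # 0, the value that finally made the question False
```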
3128
02:18:57,180 --> 02:18:59,379
And so this variable that we construct
3129
02:18:59,379 --> 02:19:01,999
to control the loop was called the iteration variable
3130
02:19:02,000 --> 02:19:04,260
because it tells how many times this loop
3131
02:19:04,260 --> 02:19:06,299
is going to run over and over
3132
02:19:06,299 --> 02:19:08,519
or otherwise known as iterate.
3133
02:19:09,920 --> 02:19:12,260
So this is a badly constructed loop
3134
02:19:12,260 --> 02:19:15,500
with an iteration variable that we didn't do very well.
3135
02:19:15,500 --> 02:19:18,500
And so if we take a look at this,
3136
02:19:18,500 --> 02:19:21,540
we start it with n five and then this is greater than zero
3137
02:19:21,540 --> 02:19:23,500
so it's true so it runs it and then it runs it again
3138
02:19:23,500 --> 02:19:25,540
and then it's still greater than zero.
3139
02:19:25,540 --> 02:19:28,000
So you can pretty much see because we're not changing n,
3140
02:19:28,000 --> 02:19:30,139
this is gonna be true, true, true, true,
3141
02:19:30,139 --> 02:19:33,059
dot, dot, dot, dot, forever, true, forever.
3142
02:19:33,059 --> 02:19:35,579
And so this is an infinite loop
3143
02:19:35,580 --> 02:19:37,700
and it's just gonna run until your computer
3144
02:19:37,700 --> 02:19:40,379
runs out of battery or you hit the button.
3145
02:19:40,379 --> 02:19:42,099
This is the kind of thing where you often see
3146
02:19:42,100 --> 02:19:46,280
your computer spinning like a spinning beach ball
3147
02:19:46,280 --> 02:19:49,260
or some other indication that your computer's super busy.
3148
02:19:49,260 --> 02:19:51,180
It's in some kind of a loop, really tight
3149
02:19:51,180 --> 02:19:53,260
and it's running something and it's using up
3150
02:19:53,260 --> 02:19:55,920
all of the processing resources of your computer.
3151
02:19:55,920 --> 02:19:57,740
That's an infinite loop.
3152
02:19:57,740 --> 02:19:59,980
And so the problem is we did nothing
3153
02:19:59,980 --> 02:20:02,380
with the iteration variable.
3154
02:20:04,060 --> 02:20:05,460
Now here's a different loop.
3155
02:20:05,460 --> 02:20:08,180
And so this one demonstrates a different idea.
3156
02:20:08,180 --> 02:20:11,100
So in this case, we start out with n is zero
3157
02:20:11,100 --> 02:20:13,640
and it comes in here and is n greater than zero?
3158
02:20:13,640 --> 02:20:15,820
Question mark and the answer is false.
3159
02:20:15,820 --> 02:20:19,780
So it skips it, it doesn't run these lines of code at all.
3160
02:20:19,780 --> 02:20:21,700
And so this loop doesn't run at all
3161
02:20:21,700 --> 02:20:23,380
because it comes in, asks the question,
3162
02:20:23,380 --> 02:20:26,060
it says no and then it skips right around it.
3163
02:20:26,060 --> 02:20:27,980
So never run, never run.
3164
02:20:27,980 --> 02:20:31,420
And so this actually is, sometimes you write a while loop
3165
02:20:31,420 --> 02:20:34,700
on purpose like this, not quite as simple as this one.
3166
02:20:34,700 --> 02:20:38,320
But the idea is this emphasizes that these loops
3167
02:20:38,320 --> 02:20:39,960
are what we call zero trip.
3168
02:20:41,940 --> 02:20:45,220
They are not even guaranteed to run once.
3169
02:20:45,220 --> 02:20:46,780
They're gonna run maybe zero times.
3170
02:20:46,780 --> 02:20:49,260
And in this respect, it functions exactly
3171
02:20:49,260 --> 02:20:51,620
like an if statement, right?
3172
02:20:51,620 --> 02:20:53,860
Meaning the first time through the loop, if it's not true,
3173
02:20:53,860 --> 02:20:55,740
it's just gonna skip right by it.
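A minimal sketch of a zero-trip loop. The printed words and the trips counter are made up for illustration; the point is only that the body never runs when the question is false the first time.

```python
n = 0
trips = 0
while n > 0:            # False on the very first check
    print("inside the loop")
    trips = trips + 1
    n = n - 1
print("after the loop")
print(trips)            # 0, the body ran zero times
```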
3174
02:20:58,500 --> 02:21:01,340
So there's a couple of ways of getting out of loops.
3175
02:21:01,340 --> 02:21:03,580
In this case, I'm constructing an infinite loop
3176
02:21:03,580 --> 02:21:07,340
because remember the kind of definition of an infinite loop
3177
02:21:07,340 --> 02:21:10,220
is if this is gonna stay true.
3178
02:21:10,220 --> 02:21:13,060
Well, True is the constant True.
3179
02:21:13,060 --> 02:21:14,740
So this is gonna run forever.
3180
02:21:14,740 --> 02:21:16,780
And what it's gonna do is it's gonna prompt
3181
02:21:16,780 --> 02:21:21,080
with a little arrow and then let us type
3182
02:21:21,080 --> 02:21:24,460
and read whatever we type into the variable line.
3183
02:21:24,460 --> 02:21:26,660
And then if the line is done, we're gonna break.
3184
02:21:26,660 --> 02:21:29,220
Now break is an executable statement.
3185
02:21:29,220 --> 02:21:34,000
And if you hit the break, it exits the innermost loop
3186
02:21:34,000 --> 02:21:36,900
out to the place beyond the end of the loop.
3187
02:21:38,060 --> 02:21:43,060
So when this runs the first time and we say hello there,
3188
02:21:43,900 --> 02:21:45,560
line is not done, so it prints it.
3189
02:21:45,560 --> 02:21:47,840
So it prints out hello there and then goes up.
3190
02:21:47,840 --> 02:21:49,900
And then we type in again, we type finished.
3191
02:21:49,900 --> 02:21:52,980
And so it doesn't, it's not done, so it prints it.
3192
02:21:52,980 --> 02:21:54,460
So out comes that print statement.
3193
02:21:54,460 --> 02:21:57,660
Then we type in done and now this becomes true.
3194
02:21:57,660 --> 02:22:00,140
And it comes out and runs the code
3195
02:22:00,140 --> 02:22:02,180
beyond the end of the loop.
3196
02:22:02,180 --> 02:22:04,020
The key is it doesn't go back.
3197
02:22:04,020 --> 02:22:07,660
It's like once you've done a break, that loop is done.
3198
02:22:07,660 --> 02:22:11,740
And so you look at basically the block that is the loop.
3199
02:22:11,740 --> 02:22:14,180
So here's kind of the loop block.
3200
02:22:14,180 --> 02:22:15,940
And then the break goes to the line
3201
02:22:15,940 --> 02:22:20,940
after the end of the loop block.
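A sketch of the loop being described, under one stated assumption: so it can run without a keyboard, the typed lines are simulated with a list and a hypothetical read_line helper. In the real program that call would be input("> ").

```python
typed = ["hello there", "finished", "done"]  # stand-in for what we type

def read_line(prompt):
    # Hypothetical helper: returns the next simulated line instead of input(prompt)
    return typed.pop(0)

while True:                  # would run forever without the break
    line = read_line("> ")
    if line == "done":
        break                # jump to the first line after the loop block
    print(line)              # prints "hello there", then "finished"
print("Done!")
```

Note that "finished" is echoed like any other line; only the exact string "done" triggers the break.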
3202
02:22:23,180 --> 02:22:24,660
And you can think of this as sort of like
3203
02:22:24,660 --> 02:22:26,060
just a hyperspace jump.
3204
02:22:26,060 --> 02:22:28,820
There is nothing really, this could be literally
3205
02:22:28,820 --> 02:22:31,580
hundreds of lines with if statements.
3206
02:22:31,580 --> 02:22:33,720
And you could be running and doing all kinds of stuff
3207
02:22:33,720 --> 02:22:35,900
and running and doing all these things.
3208
02:22:35,900 --> 02:22:38,460
And these things could run all kinds of ways, right?
3209
02:22:38,460 --> 02:22:40,940
The point is as soon as you hit a break statement,
3210
02:22:40,940 --> 02:22:42,460
however much stuff is down here,
3211
02:22:42,460 --> 02:22:44,100
however much stuff is up here,
3212
02:22:44,100 --> 02:22:47,760
it exits to whatever the next line is
3213
02:22:47,760 --> 02:22:50,060
beyond the end of the loop.
3214
02:22:51,500 --> 02:22:54,280
Continue is another loop control statement,
3215
02:22:54,280 --> 02:22:56,560
but it works differently than break.
3216
02:22:56,560 --> 02:22:59,480
So break says get out of this loop.
3217
02:22:59,480 --> 02:23:02,700
Continue effectively says stop this iteration.
3218
02:23:02,700 --> 02:23:04,400
We're done with this iteration.
3219
02:23:04,400 --> 02:23:08,180
And so continue says go up back to the top of the loop.
3220
02:23:08,180 --> 02:23:09,560
Oops, yeah.
3221
02:23:09,560 --> 02:23:11,200
Go up back to the top of the loop.
3222
02:23:11,200 --> 02:23:14,760
And so here we read a line.
3223
02:23:14,760 --> 02:23:17,400
If the first character is a pound sign,
3224
02:23:17,400 --> 02:23:20,480
line sub zero, if that first character is a pound sign,
3225
02:23:20,480 --> 02:23:21,920
we're gonna skip it.
3226
02:23:21,920 --> 02:23:24,380
And this is a way for us to make like little comments
3227
02:23:24,380 --> 02:23:25,560
in our typing.
3228
02:23:25,560 --> 02:23:28,440
And then if the line is done, we get out
3229
02:23:28,440 --> 02:23:29,640
and otherwise we print it.
3230
02:23:29,640 --> 02:23:31,400
And so that's why there is no print out here
3231
02:23:31,400 --> 02:23:35,560
because it comes in, runs, oops.
3232
02:23:35,560 --> 02:23:40,560
It comes in, this is true and that goes back up,
3233
02:23:42,160 --> 02:23:45,360
but it comes back and prints out the next one
3234
02:23:45,360 --> 02:23:46,300
and does another thing.
3235
02:23:46,300 --> 02:23:47,880
And so the loop continues,
3236
02:23:47,880 --> 02:23:50,440
whereas the break ends the loop.
3237
02:23:50,440 --> 02:23:52,520
And so again, the same kind of notion
3238
02:23:52,520 --> 02:23:54,880
that you're sort of doing all kinds of complexity.
3239
02:23:54,880 --> 02:23:56,800
Wherever you're at in this loop,
3240
02:23:56,800 --> 02:24:00,400
you hit continue and it doesn't go any further.
3241
02:24:00,400 --> 02:24:02,880
It goes back up and runs the question mark.
3242
02:24:02,880 --> 02:24:04,600
It asks the question mark.
3243
02:24:04,600 --> 02:24:07,600
And so, I mean, ask the question
3244
02:24:07,600 --> 02:24:09,620
and it might exit the loop in that particular case.
3245
02:24:09,620 --> 02:24:11,000
But this one here is a while True,
3246
02:24:11,000 --> 02:24:13,300
this is an infinite loop that I've constructed.
3247
02:24:13,300 --> 02:24:15,600
This is not an infinite loop because at some point
3248
02:24:15,600 --> 02:24:17,000
the break gets us out of the loop.
3249
02:24:17,000 --> 02:24:20,560
And so it's an infinite loop with break to escape it.
3250
02:24:20,560 --> 02:24:23,340
And that's another common way to construct a loop.
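A sketch of the continue example, with the same assumption as before: the typed lines are simulated with a list and a hypothetical read_line helper in place of input("> "). The shown list is only there to record what got printed.

```python
typed = ["hello there", "# don't print this", "print this!", "done"]
shown = []

def read_line(prompt):
    # Hypothetical helper: returns the next simulated line instead of input(prompt)
    return typed.pop(0)

while True:
    line = read_line("> ")
    if line[0] == "#":   # first character; an empty line would fail here
        continue         # skip the rest of this iteration, go back to the top
    if line == "done":
        break            # leave the loop entirely
    print(line)
    shown.append(line)
print("Done!")
```

The line starting with the pound sign never reaches the print, but the loop keeps going; only done ends it.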
3251
02:24:26,100 --> 02:24:29,100
So these loops that we've been drawing so far,
3252
02:24:29,100 --> 02:24:31,160
the ones that use while as their keyword,
3253
02:24:32,600 --> 02:24:34,280
are what are called indefinite loops.
3254
02:24:34,280 --> 02:24:36,640
And that's because they kind of go for a while
3255
02:24:36,640 --> 02:24:41,120
till a break hits or until some value becomes true.
3256
02:24:41,120 --> 02:24:44,200
I mean, as long as that value remains true.
3257
02:24:44,200 --> 02:24:48,120
So all the ones we've done so far are easy to look at
3258
02:24:48,120 --> 02:24:50,720
and know that they look pretty good
3259
02:24:50,720 --> 02:24:52,500
and they're probably gonna finish.
3260
02:24:52,500 --> 02:24:55,440
But there are some times if they're long and complex
3261
02:24:55,440 --> 02:24:58,480
and their exit or termination conditions
3262
02:24:58,480 --> 02:24:59,760
are a little more complex,
3263
02:24:59,760 --> 02:25:02,240
it's not clear that they're really gonna terminate.
3264
02:25:02,240 --> 02:25:05,280
And so we can use while loops for a lot of things,
3265
02:25:05,280 --> 02:25:08,360
but for most of our looping,
3266
02:25:08,360 --> 02:25:10,320
we're gonna use what are called definite loops.
3267
02:25:10,320 --> 02:25:12,480
And that's what we're gonna talk about next.
3268
02:25:16,320 --> 02:25:19,120
So definite loops use the for keyword.
3269
02:25:19,120 --> 02:25:20,800
And the idea of a definite loop
3270
02:25:20,800 --> 02:25:23,200
is it's going to loop through some set of things.
3271
02:25:23,200 --> 02:25:25,280
It might be a set of lines in a file,
3272
02:25:25,280 --> 02:25:28,320
it might be a set of characters in a string,
3273
02:25:28,320 --> 02:25:31,380
it might be a set of strings in a list of strings.
3274
02:25:31,380 --> 02:25:34,760
But whatever it is, it's sort of gonna run
3275
02:25:34,760 --> 02:25:37,360
a finite number of times depending on the thing
3276
02:25:37,360 --> 02:25:38,920
that it's looping through.
3277
02:25:38,920 --> 02:25:40,800
And we like this.
3278
02:25:40,800 --> 02:25:43,760
And it's an easier way to construct it
3279
02:25:43,760 --> 02:25:45,960
and we actually don't have to deal with the iteration
3280
02:25:45,960 --> 02:25:48,620
variable, the for loop includes a mechanism
3281
02:25:48,620 --> 02:25:50,520
to construct the iteration variable for us.
3282
02:25:50,520 --> 02:25:54,440
So it's definite loops iterate through the members of a set.
3283
02:25:54,440 --> 02:25:57,160
So here's a very simple for loop.
3284
02:25:57,160 --> 02:26:02,160
And so you see the for keyword and in is also a keyword.
3285
02:26:03,600 --> 02:26:06,640
And the iteration variable is something we put right here.
3286
02:26:06,640 --> 02:26:11,040
This i is declared, this i is like an assignment statement.
3287
02:26:11,040 --> 02:26:13,720
And i is going to take on successive values.
3288
02:26:13,720 --> 02:26:17,240
So i is going to be five the first time through the loop.
3289
02:26:17,240 --> 02:26:19,800
Then i is gonna be four the second time through the loop.
3290
02:26:19,800 --> 02:26:22,320
Three, two, one.
3291
02:26:22,320 --> 02:26:25,360
So i is gonna be assigned five different times
3292
02:26:25,360 --> 02:26:26,880
to five different values.
3293
02:26:26,880 --> 02:26:29,880
And then the loop is going to run.
3294
02:26:29,880 --> 02:26:32,440
It's gonna run once with five, once with four,
3295
02:26:32,440 --> 02:26:35,320
once with three, once with two, and once with one.
3296
02:26:35,320 --> 02:26:38,240
And so this block of code we have contracted,
3297
02:26:38,240 --> 02:26:42,300
say execute it five times with these values of i.
3298
02:26:42,300 --> 02:26:43,900
i is that iteration variable.
3299
02:26:43,900 --> 02:26:47,560
i is the thing changing through each iteration of the loop.
3300
02:26:47,560 --> 02:26:48,400
Okay?
3301
02:26:48,400 --> 02:26:52,480
And so that's why this prints out five, four, three, two,
3302
02:26:52,480 --> 02:26:55,240
one, and then when it's done it finishes it.
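A minimal sketch of the definite loop described here. The iteration variable i and the values five down to one come from the narration; the final message is an assumption.

```python
for i in [5, 4, 3, 2, 1]:
    # i is assigned 5, then 4, then 3, then 2, then 1
    print(i)
print("Blastoff!")
```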
3303
02:26:55,240 --> 02:26:58,360
So this is a much more direct syntax
3304
02:26:58,360 --> 02:27:01,440
for looping five times and setting iteration variable.
3305
02:27:01,440 --> 02:27:05,600
You kind of all combine it into this one thing, right?
3306
02:27:05,600 --> 02:27:06,740
All into one thing.
3307
02:27:06,740 --> 02:27:08,860
So it's quite nice.
3308
02:27:08,860 --> 02:27:11,440
So you don't have to be going through a list of numbers.
3309
02:27:11,440 --> 02:27:12,560
There's all kinds of things
3310
02:27:12,560 --> 02:27:14,700
that we can iterate through with for.
3311
02:27:14,700 --> 02:27:16,640
And by the way, while I'm sitting here,
3312
02:27:16,640 --> 02:27:19,940
don't, I named my variable friends,
3313
02:27:19,940 --> 02:27:21,280
because that's a list of strings,
3314
02:27:21,280 --> 02:27:24,460
and friend, which is the iteration variable.
3315
02:27:24,460 --> 02:27:27,680
I'm using singular and plural because it helps you read it.
3316
02:27:27,680 --> 02:27:29,740
Python doesn't understand singular and plural.
3317
02:27:29,740 --> 02:27:31,920
So just because you say friends
3318
02:27:31,920 --> 02:27:33,560
doesn't mean Python knows it's a list.
3319
02:27:33,560 --> 02:27:35,340
Python does know it's a list,
3320
02:27:35,340 --> 02:27:37,560
but it doesn't know by the name of the variable I've chosen.
3321
02:27:37,560 --> 02:27:40,360
That's your basic mnemonic variable warning.
3322
02:27:40,360 --> 02:27:41,720
These are cool variable names,
3323
02:27:41,720 --> 02:27:44,360
but I don't want you to get confused by them.
3324
02:27:44,360 --> 02:27:46,400
So you can loop through a variable.
3325
02:27:46,400 --> 02:27:48,260
So we're gonna take this list of three strings
3326
02:27:48,260 --> 02:27:49,400
and stick it in friends.
3327
02:27:49,400 --> 02:27:51,500
And so friend is gonna iterate through that.
3328
02:27:51,500 --> 02:27:54,260
So the first time through, friend is gonna be Joseph.
3329
02:27:54,260 --> 02:27:56,140
Second time through, it's gonna be Glen.
3330
02:27:56,140 --> 02:27:58,520
Third time through, it's going to be Sally.
3331
02:27:58,520 --> 02:28:00,400
And so that just says run this loop,
3332
02:28:00,400 --> 02:28:02,540
run this code, the indented code,
3333
02:28:02,540 --> 02:28:05,660
three times each time the variable friend
3334
02:28:05,660 --> 02:28:08,840
takes on a successive version of,
3335
02:28:08,840 --> 02:28:12,820
a successive value that's in the friends array.
3336
02:28:13,740 --> 02:28:16,980
So it says happy birthday Joseph, Glen, Sally,
3337
02:28:16,980 --> 02:28:19,600
and then we come out of the loop and we print done.
3338
02:28:19,600 --> 02:28:24,600
So if we try to draw a picture of what this is really doing,
3339
02:28:29,040 --> 02:28:31,960
the for loop is actually doing a whole bunch of stuff
3340
02:28:31,960 --> 02:28:34,760
that we would have to do with maybe
3341
02:28:34,760 --> 02:28:37,480
separate statements in the while loop.
3342
02:28:37,480 --> 02:28:40,000
First it decides how many times to run the loop.
3343
02:28:40,000 --> 02:28:43,100
So it's answering the done question, which way do we go?
3344
02:28:43,100 --> 02:28:45,720
And it is also then moving i ahead.
3345
02:28:45,720 --> 02:28:47,880
It's managing the iteration variable.
3346
02:28:47,880 --> 02:28:51,640
If you go back to the, it's initializing it too.
3347
02:28:51,640 --> 02:28:53,200
If you go back to the while loop,
3348
02:28:53,200 --> 02:28:55,640
we had n equals five, while n greater than zero,
3349
02:28:55,640 --> 02:28:57,200
n equals n minus one.
3350
02:28:57,200 --> 02:29:00,720
So we had like three lines to control the loop
3351
02:29:00,720 --> 02:29:02,520
to manage the iteration variable.
3352
02:29:02,520 --> 02:29:04,760
But with a for loop, we don't have to do that.
3353
02:29:04,760 --> 02:29:07,120
And so that's all taken care of.
3354
02:29:07,120 --> 02:29:10,720
And so that basically says the for loop,
3355
02:29:10,720 --> 02:29:13,320
by you using a for loop, are we done?
3356
02:29:13,320 --> 02:29:14,720
No, we have five things to work.
3357
02:29:14,720 --> 02:29:16,980
Well, set i to the first one, run it.
3358
02:29:16,980 --> 02:29:18,520
We're not done, because we've got one more.
3359
02:29:18,520 --> 02:29:20,780
Set it to the second one, third one, fourth one,
3360
02:29:20,780 --> 02:29:22,560
fifth one, and now we're done.
3361
02:29:22,560 --> 02:29:26,240
And that is all handled in a single line of code
3362
02:29:26,240 --> 02:29:28,040
and that includes the iteration variable
3363
02:29:28,040 --> 02:29:30,200
and the set of things through which
3364
02:29:30,200 --> 02:29:32,040
we are going to iterate through.
3365
02:29:34,080 --> 02:29:37,360
I really like the word in.
3366
02:29:37,360 --> 02:29:42,040
It is mathematically, I mean, it reminds me of
3367
02:29:42,040 --> 02:29:45,800
the set theory where you say this is a member of this set
3368
02:29:45,800 --> 02:29:49,040
or the for each.
3369
02:29:49,040 --> 02:29:51,120
Math isn't important here, but if you do know math,
3370
02:29:51,120 --> 02:29:53,760
the vertical bar means such that, right,
3371
02:29:53,760 --> 02:29:56,600
is a member of this set and that kind of stuff,
3372
02:29:56,600 --> 02:29:59,600
member of the set, I'll erase the math stuff
3373
02:29:59,600 --> 02:30:00,980
so we don't over math.
3374
02:30:00,980 --> 02:30:04,520
But it's like for each of the values in the set,
3375
02:30:04,520 --> 02:30:07,600
five, four, three, two, one, run this loop,
3376
02:30:07,600 --> 02:30:10,980
setting the iteration variable i to the members of that set.
3377
02:30:10,980 --> 02:30:14,960
So in reminds me, for those of us who are math oriented,
3378
02:30:14,960 --> 02:30:19,120
in reminds me of a really nice concept in mathematics.
3379
02:30:23,360 --> 02:30:26,640
Now, you could think of this as sort of this
3380
02:30:26,640 --> 02:30:29,220
looping structure where the for loop,
3381
02:30:29,220 --> 02:30:30,920
and this is pretty much how it actually runs
3382
02:30:30,920 --> 02:30:34,320
inside the computer, right, where it initializes it,
3383
02:30:34,320 --> 02:30:37,320
i, it runs this, runs this thing five times,
3384
02:30:37,320 --> 02:30:38,500
and then exits.
3385
02:30:38,500 --> 02:30:40,080
That's one way to think about it.
3386
02:30:40,080 --> 02:30:43,040
You could also think about it in a somewhat
3387
02:30:43,040 --> 02:30:47,680
more abstract way, and think of it as all we're really doing
3388
02:30:47,680 --> 02:30:51,440
is we have a contract with Python that says i,
3389
02:30:51,440 --> 02:30:53,000
we're supposed to run this code five times,
3390
02:30:53,000 --> 02:30:56,800
and i's supposed to be five, four, three, two, and one.
3391
02:30:56,800 --> 02:30:59,200
So you could imagine this might be what's going on.
3392
02:30:59,200 --> 02:31:01,640
The for loop sets i to five, runs our code.
3393
02:31:01,640 --> 02:31:04,280
The for loop sets i to four, runs our code.
3394
02:31:04,280 --> 02:31:06,560
The for loop sets i to three, runs our code.
3395
02:31:06,560 --> 02:31:09,120
The for loop sets i to two, runs our code.
3396
02:31:09,120 --> 02:31:11,480
For loop sets i to one, and runs our code.
3397
02:31:11,480 --> 02:31:15,720
All we know is our code was run five, ran five times,
3398
02:31:15,720 --> 02:31:18,720
and by contract, each successive time,
3399
02:31:20,440 --> 02:31:22,040
we're getting a different value for i,
3400
02:31:22,040 --> 02:31:24,040
and the value for i is taken from this set.
3401
02:31:24,040 --> 02:31:27,040
And so this is just one way to think about it,
3402
02:31:27,040 --> 02:31:31,400
to say to yourself, oh yeah, this is one way to think about it
3403
02:31:31,400 --> 02:31:34,000
as it's actually, and this is how it really works,
3404
02:31:34,000 --> 02:31:36,400
but this is also kind of logically the contract
3405
02:31:36,400 --> 02:31:39,120
that Python is making for us.
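One way to picture that contract is to write the loop both ways and compare. This is only a sketch: body is a hypothetical stand-in for "our code", and the unrolled version is an illustration of the contract, not a claim about how Python runs the loop internally.

```python
def body(i):
    # Hypothetical stand-in for "our code"; it reports and returns i.
    print('running our code with i =', i)
    return i

# What we actually write: one for statement.
looped = []
for i in [5, 4, 3, 2, 1]:
    looped.append(body(i))

# What the contract amounts to: set i, run our code, five times over.
unrolled = []
i = 5
unrolled.append(body(i))
i = 4
unrolled.append(body(i))
i = 3
unrolled.append(body(i))
i = 2
unrolled.append(body(i))
i = 1
unrolled.append(body(i))
```

Both versions run the code five times, with i taking the values five, four, three, two, and one in that order.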
3406
02:31:39,120 --> 02:31:42,200
So up next, we're gonna talk about taking this notion
3407
02:31:42,200 --> 02:31:44,680
of doing something to a lot of items,
3408
02:31:44,680 --> 02:31:46,240
but accomplishing something with that,
3409
02:31:46,240 --> 02:31:48,740
and I call these loop idioms.
3410
02:31:52,040 --> 02:31:54,040
So now we're gonna talk about loop idioms,
3411
02:31:54,040 --> 02:31:57,040
and loop idioms are patterns
3412
02:31:57,040 --> 02:31:59,520
that have to do with how we construct loops.
3413
02:31:59,520 --> 02:32:03,160
We have the mechanics of fors and whiles,
3414
02:32:03,160 --> 02:32:05,680
but ultimately we wanna get something done.
3415
02:32:05,680 --> 02:32:08,000
We wanna solve a problem with a loop,
3416
02:32:08,000 --> 02:32:11,560
and often what we have to do is if we have a set of things,
3417
02:32:11,560 --> 02:32:14,780
whether it's lines, or strings, or characters, or numbers,
3418
02:32:14,780 --> 02:32:16,480
we're looking for something like the largest,
3419
02:32:16,480 --> 02:32:18,280
or the smallest, or we wanna add them up,
3420
02:32:18,280 --> 02:32:20,040
or something like that.
3421
02:32:20,040 --> 02:32:22,800
And so we can't just say add them up,
3422
02:32:22,800 --> 02:32:25,200
we have to say go through each one
3423
02:32:25,200 --> 02:32:26,640
and do something to each one,
3424
02:32:26,640 --> 02:32:28,800
and somehow achieve adding them up.
3425
02:32:28,800 --> 02:32:31,020
And the pattern that we're gonna follow is
3426
02:32:31,020 --> 02:32:33,320
we're gonna have this loop that's gonna do all,
3427
02:32:33,320 --> 02:32:38,320
run once for each thing in some chunk of data,
3428
02:32:38,480 --> 02:32:40,680
and then, but we're gonna set something at the beginning,
3429
02:32:40,680 --> 02:32:42,520
and then we're gonna do something to each one,
3430
02:32:42,520 --> 02:32:44,660
and then at the end we're gonna kinda get the payoff,
3431
02:32:44,660 --> 02:32:46,220
we're gonna get the result.
3432
02:32:46,220 --> 02:32:49,680
So if we're doing sort of summing things,
3433
02:32:49,680 --> 02:32:51,160
we're gonna have a running total,
3434
02:32:51,160 --> 02:32:54,040
and so this'll be like t equals zero,
3435
02:32:54,040 --> 02:32:58,280
and then this'll be t equals t plus the thing value.
3436
02:32:58,280 --> 02:33:00,400
And then, but this is not the real total,
3437
02:33:00,400 --> 02:33:02,480
it's the running total during the loop,
3438
02:33:02,480 --> 02:33:04,500
but at the end it is the real total.
3439
02:33:05,440 --> 02:33:07,960
And so we're gonna look at what you do
3440
02:33:07,960 --> 02:33:09,680
before the loop starts, during the loop,
3441
02:33:09,680 --> 02:33:12,120
and then what you get after the loop,
3442
02:33:12,120 --> 02:33:13,520
and how you can use that.
3443
02:33:14,600 --> 02:33:16,240
So we're gonna use this loop,
3444
02:33:16,240 --> 02:33:18,440
it's just gonna loop through a set of six numbers
3445
02:33:18,440 --> 02:33:20,840
over and over and over again, right?
3446
02:33:20,840 --> 02:33:22,480
So we're gonna do something before the loop,
3447
02:33:22,480 --> 02:33:23,680
we're gonna do something after the loop,
3448
02:33:23,680 --> 02:33:25,960
and then we're gonna run the loop some number of times,
3449
02:33:25,960 --> 02:33:28,600
and in this case thing is our iteration variable,
3450
02:33:28,600 --> 02:33:32,000
because I'm using un-mnemonic variables now.
3451
02:33:32,000 --> 02:33:36,960
So it's gonna run 9, 41, 12, three, 74, and 15,
3452
02:33:36,960 --> 02:33:38,840
so it's gonna run and print these things out.
3453
02:33:38,840 --> 02:33:41,480
So it runs this loop six times, and away we go.
3454
02:33:41,480 --> 02:33:44,200
Now this loop does nothing except print stuff out.
3455
02:33:44,200 --> 02:33:45,800
Of course I like to do that first,
3456
02:33:45,800 --> 02:33:47,660
is always print things out,
3457
02:33:47,660 --> 02:33:50,780
to make sure that sort of my brain is functioning.
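The print-only loop described here might look like the sketch below. The six numbers and the variable name thing are from the lecture; the 'Before' and 'After' labels are an assumption about what gets printed.

```python
print('Before')  # something before the loop
for thing in [9, 41, 12, 3, 74, 15]:
    print(thing)  # runs six times, once per number
print('After')   # something after the loop
```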
3458
02:33:52,080 --> 02:33:57,080
So, to kind of understand how these loops work,
3459
02:33:57,540 --> 02:34:00,200
I'm gonna ask you to function as a program,
3460
02:34:00,200 --> 02:34:03,000
and I'm gonna show you some numbers in succession,
3461
02:34:03,000 --> 02:34:06,520
and I want you to mentally figure out
3462
02:34:06,520 --> 02:34:08,320
what the largest number is, but more importantly,
3463
02:34:08,320 --> 02:34:11,320
think about how your brain is solving this problem
3464
02:34:11,320 --> 02:34:12,700
of what is the largest number,
3465
02:34:12,700 --> 02:34:14,100
given that I'm only gonna show them to you
3466
02:34:14,100 --> 02:34:15,920
one at a time for a little while,
3467
02:34:15,920 --> 02:34:17,200
and your brain has to do something,
3468
02:34:17,200 --> 02:34:19,880
and imagine I was gonna show you thousands of numbers,
3469
02:34:19,880 --> 02:34:21,680
I'm not, but imagine I was.
3470
02:34:21,680 --> 02:34:24,440
How would you organize yourself in a way,
3471
02:34:24,440 --> 02:34:26,160
so that for like an hour and a half,
3472
02:34:26,160 --> 02:34:28,200
you could sit here as I showed you numbers,
3473
02:34:28,200 --> 02:34:30,920
and you keep track of the largest number
3474
02:34:30,920 --> 02:34:33,760
that you've seen of all the numbers, okay?
3475
02:34:33,760 --> 02:34:35,920
So here we go, here's your first number,
3476
02:34:39,760 --> 02:34:44,760
second number, third number, fourth number,
3477
02:34:47,860 --> 02:34:52,860
fifth number, sixth and last number.
3478
02:34:52,860 --> 02:34:57,860
What was the largest number, hmm? What was it?
3479
02:34:59,940 --> 02:35:04,780
Well, it wasn't too hard, it was 74,
3480
02:35:04,780 --> 02:35:06,640
but that's not the question.
3481
02:35:06,640 --> 02:35:11,320
How did your brain arrive at 74?
3482
02:35:11,320 --> 02:35:12,480
So here's all the numbers,
3483
02:35:12,480 --> 02:35:13,920
if I was showing you all the numbers,
3484
02:35:13,920 --> 02:35:17,360
and asked you what's the largest number,
3485
02:35:17,360 --> 02:35:19,280
your eyes would have sort of gone,
3486
02:35:19,280 --> 02:35:22,580
zer, zer, zer, zer, zer, and then you got to 74,
3487
02:35:22,580 --> 02:35:25,720
and you wouldn't do it in any particular order,
3488
02:35:25,720 --> 02:35:27,960
your eyes would just like see the 74,
3489
02:35:27,960 --> 02:35:30,280
and it would just throw smaller numbers away,
3490
02:35:30,280 --> 02:35:32,800
and it would move really quickly to what the answer is.
3491
02:35:32,800 --> 02:35:36,160
Even if there was several hundred numbers on the screen,
3492
02:35:36,160 --> 02:35:39,160
your mind would sort of move fluidly
3493
02:35:39,160 --> 02:35:42,040
wherever it felt like moving, and then arrive at it.
3494
02:35:42,040 --> 02:35:43,840
And probably what it would do is,
3495
02:35:43,840 --> 02:35:45,680
it would do something like, you know,
3496
02:35:45,680 --> 02:35:47,600
kind of move like this, find this,
3497
02:35:47,600 --> 02:35:49,800
and then sort of check to make sure that it's okay,
3498
02:35:49,800 --> 02:35:52,880
and then say like, okay I got 74, I'm done.
3499
02:35:53,900 --> 02:35:55,320
That's not how computers do it,
3500
02:35:55,320 --> 02:35:57,280
that is not how computers do it.
3501
02:35:57,280 --> 02:35:59,240
They do not move fluidly,
3502
02:35:59,240 --> 02:36:01,640
but they are highly dedicated,
3503
02:36:01,640 --> 02:36:02,960
they're gonna do something,
3504
02:36:02,960 --> 02:36:06,600
gee, gee, gee, gee, gee, gee, gee.
3505
02:36:06,600 --> 02:36:11,300
74, but how would you construct a loop to achieve this?
3506
02:36:11,300 --> 02:36:12,400
So let's take a look.
3507
02:36:13,440 --> 02:36:15,960
You could create a variable called largest so far,
3508
02:36:15,960 --> 02:36:17,780
and this is the largest variable,
3509
02:36:17,780 --> 02:36:19,740
the value that you've seen in the list so far,
3510
02:36:19,740 --> 02:36:21,540
I don't know, I haven't shown you any numbers yet,
3511
02:36:21,540 --> 02:36:24,420
so we'll just set this to negative one to get us started.
3512
02:36:24,420 --> 02:36:27,460
So now, we see three, and we're like,
3513
02:36:27,460 --> 02:36:29,160
oh, that's better than negative one,
3514
02:36:29,160 --> 02:36:30,800
it's our first number, so it's probably the largest
3515
02:36:30,800 --> 02:36:32,620
we've seen so far, right?
3516
02:36:32,620 --> 02:36:35,260
Great, 41, oh, that's bigger than the largest
3517
02:36:35,260 --> 02:36:37,200
we've seen so far, so we'll keep it.
3518
02:36:37,200 --> 02:36:40,320
12 is not bigger than 41, so we're not gonna keep it.
3519
02:36:40,320 --> 02:36:42,300
Notice this keeping thing.
3520
02:36:42,300 --> 02:36:44,360
Nine is not bigger than 41, so there's no point
3521
02:36:44,360 --> 02:36:48,000
to keeping it, 74 is bigger than 41, so we'll keep it.
3522
02:36:48,000 --> 02:36:49,040
Is this the largest number?
3523
02:36:49,040 --> 02:36:51,460
We don't know, we don't know until we're done.
3524
02:36:51,460 --> 02:36:55,340
15, not better than 74, so now, we're all done,
3525
02:36:55,340 --> 02:37:00,100
and hooray, hooray, hooray, we have the largest number.
3526
02:37:00,100 --> 02:37:03,680
And we had this variable that we kept the largest number
3527
02:37:03,680 --> 02:37:06,160
that we'd seen up to this point,
3528
02:37:06,160 --> 02:37:08,960
and then when we know that we're done at the end,
3529
02:37:08,960 --> 02:37:11,460
then that becomes the largest.
3530
02:37:12,800 --> 02:37:13,900
So if you look at all the numbers,
3531
02:37:13,900 --> 02:37:15,320
keeping track of the largest so far,
3532
02:37:15,320 --> 02:37:17,060
at the end of all the numbers, the largest so far
3533
02:37:17,060 --> 02:37:19,820
and the largest are the same thing.
3534
02:37:19,820 --> 02:37:22,460
And so that's how you get this idea
3535
02:37:22,460 --> 02:37:24,760
of something you're doing during the loop
3536
02:37:24,760 --> 02:37:28,160
is not really the answer, but by the time the loop is done,
3537
02:37:28,160 --> 02:37:30,360
you will have the answer.
3538
02:37:30,360 --> 02:37:32,440
And so here's a bit of code that does this.
3539
02:37:32,440 --> 02:37:35,040
Use it with our numbers, right?
3540
02:37:35,040 --> 02:37:36,780
So let's take a look.
3541
02:37:36,780 --> 02:37:38,440
So I have this variable called largest so far,
3542
02:37:38,440 --> 02:37:41,040
I set it to negative one, before the loop.
3543
02:37:41,040 --> 02:37:42,880
Remember, there's a loop before and a loop after
3544
02:37:42,880 --> 02:37:45,320
and loop in the middle, before it's negative one.
3545
02:37:45,320 --> 02:37:49,060
So now the_num, remember underscores are okay,
3546
02:37:49,060 --> 02:37:51,060
that's my iteration variable.
3547
02:37:51,060 --> 02:37:53,700
If nine is greater than largest so far,
3548
02:37:53,700 --> 02:37:55,100
well largest so far is negative one,
3549
02:37:55,100 --> 02:37:56,800
so that's true, so this code's gonna run.
3550
02:37:56,800 --> 02:37:59,300
So we're gonna remember the new number.
3551
02:37:59,300 --> 02:38:02,860
So this is nine, and so nine ends up in largest so far,
3552
02:38:02,860 --> 02:38:05,900
and then we print it out, and so largest so far is nine
3553
02:38:05,900 --> 02:38:08,280
after we saw the number nine.
3554
02:38:08,280 --> 02:38:09,780
Then we do it again.
3555
02:38:09,780 --> 02:38:13,960
Do it again, so now 41 comes in, and is 41 greater than nine?
3556
02:38:15,860 --> 02:38:18,600
The answer is yes it is, so we're gonna run this code,
3557
02:38:18,600 --> 02:38:23,340
copy 41 into largest so far, and then print it out,
3558
02:38:23,340 --> 02:38:27,500
and largest so far is 41 after we saw the number 41.
3559
02:38:29,000 --> 02:38:32,180
Now we're gonna run the loop again with 12, okay?
3560
02:38:32,180 --> 02:38:33,740
And you get the idea, I hope.
3561
02:38:33,740 --> 02:38:36,340
Is 12 greater than 41, which is the largest we've seen
3562
02:38:36,340 --> 02:38:39,620
so far, and the answer is no it is not, so we skip.
3563
02:38:39,620 --> 02:38:43,580
So the largest so far stays 41 even though we saw 12,
3564
02:38:43,580 --> 02:38:45,780
meaning we're sort of like ratcheting up,
3565
02:38:45,780 --> 02:38:48,060
but we never ratchet back down.
3566
02:38:48,060 --> 02:38:51,060
So we run it again with three and 41,
3567
02:38:52,420 --> 02:38:56,560
and we skip this, and then the largest so far is 41
3568
02:38:56,560 --> 02:38:59,240
even though we just saw three,
3569
02:38:59,240 --> 02:39:03,200
and now we see 74. Is 74 greater than 41?
3570
02:39:03,200 --> 02:39:05,300
See, we never are looking at all the numbers.
3571
02:39:05,300 --> 02:39:07,140
We're only looking at the window on the numbers
3572
02:39:07,140 --> 02:39:10,480
of the current number that we're looking at.
3573
02:39:10,480 --> 02:39:13,020
So is 74 greater than 41?
3574
02:39:13,020 --> 02:39:15,220
The answer is yes, so we run this code,
3575
02:39:15,220 --> 02:39:17,580
and then we capture the 74.
3576
02:39:17,580 --> 02:39:22,060
So we've seen, we just saw 74, and it is the largest so far.
3577
02:39:22,060 --> 02:39:24,800
And then we run it again with 15,
3578
02:39:24,800 --> 02:39:28,600
but 74 is our largest so far, and so it skips.
3579
02:39:28,600 --> 02:39:32,440
So 74 remains largest so far after 15,
3580
02:39:32,440 --> 02:39:33,600
and now we're finished,
3581
02:39:33,600 --> 02:39:35,140
because we just ran the last thing,
3582
02:39:35,140 --> 02:39:37,200
the for loop takes care of everything,
3583
02:39:37,200 --> 02:39:38,880
and jumps to this print statement, and says,
3584
02:39:38,880 --> 02:39:41,340
afterwards, largest so far is 74,
3585
02:39:41,340 --> 02:39:45,420
but at this point, it's also the largest, right?
3586
02:39:45,420 --> 02:39:49,580
So largest so far became largest when our loop finished.
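The walkthrough above can be sketched as follows. The names largest_so_far and the_num follow the spoken description; the exact layout of the print calls is an assumption.

```python
largest_so_far = -1  # starting guess, set before the loop
print('Before', largest_so_far)
for the_num in [9, 41, 12, 3, 74, 15]:
    if the_num > largest_so_far:
        largest_so_far = the_num  # keep it: it beats everything seen so far
    print(largest_so_far, the_num)
print('After', largest_so_far)
```

Note that starting at negative one only works because every number in this list is larger than negative one; a list of all-negative numbers would need a different starting value.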
3587
02:39:50,880 --> 02:39:52,680
So that sort of gives you this notion
3588
02:39:52,680 --> 02:39:56,760
of how we construct something at the beginning,
3589
02:39:57,620 --> 02:39:58,920
some kind of thing that we're gonna do
3590
02:39:58,920 --> 02:40:00,860
over and over and over again,
3591
02:40:00,860 --> 02:40:03,060
and then something at the end.
3592
02:40:03,060 --> 02:40:04,360
And we put some print statements in
3593
02:40:04,360 --> 02:40:08,840
just so we can watch it and see what's going on.
3594
02:40:08,840 --> 02:40:10,980
So coming up next, we're gonna talk about
3595
02:40:10,980 --> 02:40:13,280
some more loop patterns, some counting,
3596
02:40:13,280 --> 02:40:17,440
totaling, averaging, and finding the smallest number.
3597
02:40:20,580 --> 02:40:22,280
So now we're gonna look at some more patterns
3598
02:40:22,280 --> 02:40:24,660
of the different things we can do at the top of the loop,
3599
02:40:24,660 --> 02:40:26,360
in the middle of the loop, and at the bottom of the loop.
3600
02:40:26,360 --> 02:40:28,460
And the first one we're going to do is counting.
3601
02:40:28,460 --> 02:40:31,960
Now we're gonna take a look at the number of something,
3602
02:40:31,960 --> 02:40:33,300
the number of things in our list.
3603
02:40:33,300 --> 02:40:35,500
Now we could just inspect it and see six,
3604
02:40:35,500 --> 02:40:37,960
but you'll have for loops like you're reading through
3605
02:40:37,960 --> 02:40:41,600
a file or scanning through some data.
3606
02:40:41,600 --> 02:40:43,900
And so the notion of counting,
3607
02:40:43,900 --> 02:40:46,680
but you have to assume that you don't really know
3608
02:40:46,680 --> 02:40:47,840
dot, dot, dot, dot, dot,
3609
02:40:47,840 --> 02:40:50,240
that there's gonna be a lot more than just six.
3610
02:40:50,240 --> 02:40:51,940
But for now, we're just gonna do six,
3611
02:40:51,940 --> 02:40:53,380
and we're gonna count how many things
3612
02:40:53,380 --> 02:40:55,720
that we see in this loop.
3613
02:40:55,720 --> 02:40:57,660
And the pattern is simple.
3614
02:40:57,660 --> 02:41:00,860
You set a variable, zork to zero at the beginning.
3615
02:41:00,860 --> 02:41:03,920
We often call this variable count, to be mnemonic.
3616
02:41:03,920 --> 02:41:06,220
And now we're gonna run this loop six times.
3617
02:41:06,220 --> 02:41:08,600
One, two, three, four, five, six.
3618
02:41:08,600 --> 02:41:10,600
And each time through, we're just gonna add one to zork.
3619
02:41:10,600 --> 02:41:13,440
So zork starts at zero, then it goes one, two, three,
3620
02:41:13,440 --> 02:41:14,760
four, five, six.
3621
02:41:14,760 --> 02:41:16,000
And we're gonna print it out.
3622
02:41:16,000 --> 02:41:18,700
So we see the nine and zork is one.
3623
02:41:18,700 --> 02:41:20,380
See 41, zork is two.
3624
02:41:20,380 --> 02:41:21,880
And in the end, zork is six.
3625
02:41:21,880 --> 02:41:24,380
When we see the 15, the for stops.
3626
02:41:24,380 --> 02:41:25,520
And we print out afterwards.
3627
02:41:25,520 --> 02:41:30,220
And this six is then the ultimate count that we got.
3628
02:41:30,220 --> 02:41:32,280
So that's very, very simple.
3629
02:41:32,280 --> 02:41:36,460
The pattern is that set it to zero at the beginning,
3630
02:41:36,460 --> 02:41:39,260
add one to it, and if you run that enough times,
3631
02:41:39,260 --> 02:41:42,600
then this is how many times that happened.
3632
02:41:42,600 --> 02:41:46,160
And in a sense, it's how many times this line ran, right?
3633
02:41:46,160 --> 02:41:48,100
Sometimes you put this in an if statement,
3634
02:41:48,100 --> 02:41:50,640
et cetera, et cetera, et cetera, okay?
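A sketch of the counting pattern just described, using the lecture's deliberately silly name zork; the print layout is an assumption.

```python
zork = 0  # the count so far, zero before we have seen anything
print('Before', zork)
for thing in [9, 41, 12, 3, 74, 15]:
    zork = zork + 1  # one more thing seen
    print(zork, thing)
print('After', zork)
```

Putting the zork = zork + 1 line inside an if statement would count only the items that pass the test.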
3635
02:41:52,700 --> 02:41:53,540
Oops.
3636
02:41:54,640 --> 02:41:57,420
Now, we can do the same thing to get a total.
3637
02:41:57,420 --> 02:42:00,820
And the way the total works is you compute a running total
3638
02:42:00,820 --> 02:42:03,420
of the number of the items that you've seen so far.
3639
02:42:03,420 --> 02:42:04,780
And at the end, the running total
3640
02:42:04,780 --> 02:42:06,820
in effect becomes the total.
3641
02:42:08,420 --> 02:42:09,660
A better variable name for this
3642
02:42:09,660 --> 02:42:11,380
would be like sum or total or something.
3643
02:42:11,380 --> 02:42:13,220
But zork, I'll use zork again.
3644
02:42:13,220 --> 02:42:16,460
So you set zork to zero, and it starts up.
3645
02:42:16,460 --> 02:42:19,300
The total we've seen so far is indeed zero.
3646
02:42:19,300 --> 02:42:22,020
And then we're gonna run this one, two, three, four,
3647
02:42:22,020 --> 02:42:23,140
five, six times.
3648
02:42:23,140 --> 02:42:25,700
And thing is gonna be the iteration variable.
3649
02:42:25,700 --> 02:42:27,700
It's gonna take on the successive values.
3650
02:42:27,700 --> 02:42:29,300
And each time through, we're just gonna take
3651
02:42:29,300 --> 02:42:32,740
our running total and add to it the thing we've seen.
3652
02:42:32,740 --> 02:42:35,340
So we see nine, and the running total is nine.
3653
02:42:35,340 --> 02:42:37,980
We see 41, and the running total becomes 50.
3654
02:42:37,980 --> 02:42:40,580
We see 12, the running total becomes 62.
3655
02:42:40,580 --> 02:42:43,780
We get a three, it becomes 65, we get 74,
3656
02:42:43,780 --> 02:42:45,500
running total is 139.
3657
02:42:45,500 --> 02:42:48,060
How many more, how many more are we gonna see?
3658
02:42:48,060 --> 02:42:49,920
We don't know, it could be a million, could be one.
3659
02:42:49,920 --> 02:42:50,940
Oh, it's only one.
3660
02:42:50,940 --> 02:42:53,700
We get a 15, our running total is 154.
3661
02:42:53,700 --> 02:42:55,820
And what's true at any moment here
3662
02:42:55,820 --> 02:42:59,680
is the running total is right, as of what we've seen so far.
3663
02:42:59,680 --> 02:43:02,940
Now, when we're done, the for loop quits for us,
3664
02:43:02,940 --> 02:43:06,880
and afterwards 154 is indeed the total.
3665
02:43:06,880 --> 02:43:09,500
So the running total while we're in the loop,
3666
02:43:09,500 --> 02:43:11,620
at the end of the loop, after the end of the loop,
3667
02:43:11,620 --> 02:43:13,940
we have the actual total.
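The totaling pattern, sketched with the same numbers. The only change from counting is that the loop adds the value itself rather than one; the print layout is an assumption.

```python
zork = 0  # running total, zero before the loop
print('Before', zork)
for thing in [9, 41, 12, 3, 74, 15]:
    zork = zork + thing  # add the value we just saw
    print(zork, thing)   # running total goes 9, 50, 62, 65, 139, 154
print('After', zork)
```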
3668
02:43:13,940 --> 02:43:17,340
So it's not very difficult to convert this to the average,
3669
02:43:17,340 --> 02:43:18,740
because we've calculated the count,
3670
02:43:18,740 --> 02:43:20,580
and we've calculated the running total,
3671
02:43:20,580 --> 02:43:22,300
and now we're gonna have the average
3672
02:43:22,300 --> 02:43:24,780
by simply dividing those, okay?
3673
02:43:24,780 --> 02:43:29,400
So, now this time I've used mnemonic variables.
3674
02:43:29,400 --> 02:43:30,900
Don't get confused by this,
3675
02:43:30,900 --> 02:43:33,360
mnemonic variables are just friendly names I chose
3676
02:43:33,360 --> 02:43:35,280
for you to read the code easier.
3677
02:43:35,280 --> 02:43:37,660
I am not communicating to Python in any way
3678
02:43:37,660 --> 02:43:42,060
by naming this count and sum, but count and sum is nice.
3679
02:43:42,060 --> 02:43:44,680
Okay, so I set count to zero and sum to zero,
3680
02:43:44,680 --> 02:43:46,220
oh, go back up.
3681
02:43:46,220 --> 02:43:48,980
I set count to zero and sum to zero at the beginning,
3682
02:43:48,980 --> 02:43:51,340
and the count is zero and the sum is zero,
3683
02:43:51,340 --> 02:43:53,140
and then I'm gonna run this loop six times,
3684
02:43:53,140 --> 02:43:54,900
one, two, three, four, five, six,
3685
02:43:54,900 --> 02:43:59,260
and each time value is the iteration variable.
3686
02:43:59,260 --> 02:44:01,140
I count, every time I run the loop,
3687
02:44:01,140 --> 02:44:02,980
I count equals count plus one,
3688
02:44:02,980 --> 02:44:04,300
sum equals sum plus value,
3689
02:44:04,300 --> 02:44:06,980
so I have a running count and a running total,
3690
02:44:06,980 --> 02:44:09,620
and they show up here, one, two, three, four, five, six,
3691
02:44:09,620 --> 02:44:10,940
and then the running total,
3692
02:44:10,940 --> 02:44:12,620
and then at some point the for loop,
3693
02:44:12,620 --> 02:44:15,220
we do the last one and the for loop jumps out,
3694
02:44:15,220 --> 02:44:19,420
and it divides. Six and 154 are the count and running total,
3695
02:44:19,420 --> 02:44:22,820
and then it computes the average, sum over count, okay?
3696
02:44:22,820 --> 02:44:25,380
So that's just, again, a pattern of something
3697
02:44:25,380 --> 02:44:27,460
in the beginning, something in the middle,
3698
02:44:27,460 --> 02:44:28,720
something in the end.
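A sketch of the averaging pattern. The lecture names the two variables count and sum; this sketch calls the second one total instead, so that it does not shadow Python's built-in sum function. That renaming, and the print layout, are choices made here, not the lecture's.

```python
count = 0
total = 0  # the lecture calls this variable sum
print('Before', count, total)
for value in [9, 41, 12, 3, 74, 15]:
    count = count + 1      # running count
    total = total + value  # running total
    print(count, total, value)
average = total / count    # safe here because the list is not empty
print('After', count, total, average)
```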
3699
02:44:31,260 --> 02:44:33,040
Another kind of thing we tend to do in loops
3700
02:44:33,040 --> 02:44:36,500
is we look for things, we hunt for things,
3701
02:44:36,500 --> 02:44:38,860
and so this is where we have an if statement
3702
02:44:38,860 --> 02:44:40,220
inside of a loop, and of course,
3703
02:44:40,220 --> 02:44:42,220
I've created a silly, simple thing.
3704
02:44:42,220 --> 02:44:46,920
In this code, I am looking for large values
3705
02:44:46,920 --> 02:44:48,560
that are values that are greater than 20,
3706
02:44:48,560 --> 02:44:51,220
and again, don't think of this as just six numbers,
3707
02:44:51,220 --> 02:44:52,660
but I'm looking for all the values,
3708
02:44:52,660 --> 02:44:54,020
and I'm gonna print them out.
3709
02:44:54,020 --> 02:44:56,940
So, you know, it says before, it's gonna run this,
3710
02:44:56,940 --> 02:44:59,940
nine, well, if nine's greater than 20, it's false,
3711
02:44:59,940 --> 02:45:03,500
so it goes back up, 41, true,
3712
02:45:03,500 --> 02:45:05,540
so it prints out 41, then goes back up,
3713
02:45:05,540 --> 02:45:08,980
12, false, goes back up,
3714
02:45:08,980 --> 02:45:12,820
three, false, goes back up, 74, true,
3715
02:45:12,820 --> 02:45:15,740
so it runs this, so out comes that little print statement,
3716
02:45:15,740 --> 02:45:18,500
goes back up, and then 15 is the last one,
3717
02:45:18,500 --> 02:45:20,420
and that's false, it goes back up,
3718
02:45:20,420 --> 02:45:23,500
and the for says we're done, and then we do afterwards,
3719
02:45:23,500 --> 02:45:26,540
and so this is just the notion of having
3720
02:45:26,540 --> 02:45:30,980
an if statement inside of a for loop,
3721
02:45:30,980 --> 02:45:34,220
where we're sort of picking, or choosing,
3722
02:45:34,220 --> 02:45:37,120
or selecting, or looking for something
3723
02:45:37,120 --> 02:45:40,100
in a large set of things that we're looping through.
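The if-inside-a-for pattern might be sketched like this. The threshold of 20 and the numbers are from the lecture; the message text and the large list (which just collects what was printed) are additions of this sketch.

```python
large = []  # hypothetical helper: collects the values that pass the test
print('Before')
for value in [9, 41, 12, 3, 74, 15]:
    if value > 20:
        print('Large number', value)
        large.append(value)
print('After')
```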
3724
02:45:41,780 --> 02:45:45,920
We can also say I wanna know if a particular value is there,
3725
02:45:45,920 --> 02:45:48,420
and so we're gonna use a Boolean variable,
3726
02:45:48,420 --> 02:45:52,420
and we've talked about integer variables like one, 42,
3727
02:45:52,420 --> 02:45:55,020
and then floating point variables like 98.6,
3728
02:45:55,020 --> 02:45:57,740
and then string variables like hello world,
3729
02:45:57,740 --> 02:45:58,580
that have quotes in them.
3730
02:45:58,580 --> 02:46:02,460
This is a fourth type, a kind of variable.
3731
02:46:03,460 --> 02:46:06,620
It's called a Boolean variable, and it only has two values.
3732
02:46:06,620 --> 02:46:10,220
It has true and false.
3733
02:46:10,220 --> 02:46:12,180
Matter of fact, these if statements,
3734
02:46:12,180 --> 02:46:15,300
they return Boolean values, value equal equal three,
3735
02:46:15,300 --> 02:46:17,940
that is returning a true or a false
3736
02:46:17,940 --> 02:46:21,220
based on the value of value.
3737
02:46:21,220 --> 02:46:23,700
There's a mnemonic confusion there, but I'm using,
3738
02:46:23,700 --> 02:46:26,220
so I'm gonna make a variable called found,
3739
02:46:26,220 --> 02:46:27,840
and that's a decent name for a variable,
3740
02:46:27,840 --> 02:46:29,660
so don't get hung up on that,
3741
02:46:29,660 --> 02:46:32,920
and I'm gonna initially say found is gonna indicate to me
3742
02:46:32,920 --> 02:46:35,780
whether or not I found a three in my list,
3743
02:46:35,780 --> 02:46:37,700
and I'm gonna start before the loop starts,
3744
02:46:37,700 --> 02:46:40,620
let's say false, because we haven't found anything yet,
3745
02:46:40,620 --> 02:46:44,040
so found equals false, and so at the beginning of the loop,
3746
02:46:44,040 --> 02:46:47,020
found is false, before the loop starts, found is false,
3747
02:46:47,020 --> 02:46:49,580
and now we're gonna run this loop a bunch of times.
3748
02:46:49,580 --> 02:46:51,320
Nine, is that true?
3749
02:46:51,320 --> 02:46:52,700
No, skip.
3750
02:46:52,700 --> 02:46:54,580
41, is that true?
3751
02:46:54,580 --> 02:46:56,140
Skip.
3752
02:46:56,140 --> 02:46:59,900
12, skip, right, so nine, 41, 12,
3753
02:46:59,900 --> 02:47:01,640
and found has remained false,
3754
02:47:01,640 --> 02:47:03,360
because we haven't done anything to it,
3755
02:47:03,360 --> 02:47:06,140
but now in comes a three, and this becomes true,
3756
02:47:06,140 --> 02:47:09,280
so it runs this code, so found becomes true,
3757
02:47:09,280 --> 02:47:10,740
and then we print it, and you'll notice
3758
02:47:10,740 --> 02:47:13,340
that when we see a three, we get true,
3759
02:47:13,340 --> 02:47:16,420
and then it runs again, we get 74, it's still false,
3760
02:47:16,420 --> 02:47:19,760
15, it's still false, run, run, run, quit,
3761
02:47:19,760 --> 02:47:22,800
and the residual afterwards is true,
3762
02:47:22,800 --> 02:47:25,040
and in fact, if you didn't know any of this,
3763
02:47:25,040 --> 02:47:26,760
and you don't print that out,
3764
02:47:26,760 --> 02:47:28,660
all you know is that afterwards,
3765
02:47:28,660 --> 02:47:30,260
we loop through all those things,
3766
02:47:30,260 --> 02:47:32,840
and we know that there was a three in there.
3767
02:47:32,840 --> 02:47:36,500
That's what we're doing, so we searched all of them,
3768
02:47:36,500 --> 02:47:39,460
we checked for threes when we found a three,
3769
02:47:39,460 --> 02:47:43,940
and you can see basically that the found remains false
3770
02:47:43,940 --> 02:47:45,460
until it flips to true,
3771
02:47:45,460 --> 02:47:47,340
but then there's nothing to set it back to false,
3772
02:47:47,340 --> 02:47:48,220
there's nothing in this loop
3773
02:47:48,220 --> 02:47:49,700
that's gonna set it back to false,
3774
02:47:49,700 --> 02:47:52,480
so once it sort of catches the three,
3775
02:47:52,480 --> 02:47:54,860
then it remains true for the rest of the loop,
3776
02:47:54,860 --> 02:47:57,500
and then it just finds its way out.
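The search described above can be sketched with a Boolean variable. The names found and value are from the lecture; the print layout is an assumption.

```python
found = False  # we have not found a three yet
print('Before', found)
for value in [9, 41, 12, 3, 74, 15]:
    if value == 3:
        found = True  # flips once; nothing in the loop sets it back
    print(found, value)
print('After', found)
```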
3777
02:47:57,500 --> 02:47:59,640
Now if you wanna think about it for a moment,
3778
02:47:59,640 --> 02:48:03,180
ask yourself, how might we make this loop more efficient
3779
02:48:03,180 --> 02:48:06,380
by putting a statement right in here?
3780
02:48:06,380 --> 02:48:10,540
Think about a way to, once you've found it,
3781
02:48:10,540 --> 02:48:14,500
and it's true, there is sort of no reason to keep on going,
3782
02:48:14,500 --> 02:48:18,540
so what would you put there to perhaps make this loop,
3783
02:48:18,540 --> 02:48:21,160
to look for threes, just to tell you whether or not
3784
02:48:21,160 --> 02:48:23,980
there was at least one three in there,
3785
02:48:23,980 --> 02:48:25,220
how to make that more efficient?
3786
02:48:25,220 --> 02:48:26,320
Just think about that.
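One possible answer, offered only as a sketch: a break statement placed right after found is set to True stops the loop as soon as the first three turns up. The checked counter is a hypothetical addition, there only to show how many values were actually looked at.

```python
found = False
checked = 0  # hypothetical helper: how many values we examined
for value in [9, 41, 12, 3, 74, 15]:
    checked = checked + 1
    if value == 3:
        found = True
        break  # no reason to keep going once we know the answer
print(found, checked)
```

With this list the loop stops after four values instead of six; on a much longer list the saving could be far larger.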
3787
02:48:28,260 --> 02:48:32,820
Okay, so now let's look back at the largest value
3788
02:48:32,820 --> 02:48:34,820
that we started out with, right?
3789
02:48:34,820 --> 02:48:36,720
And so if you think about this,
3790
02:48:36,720 --> 02:48:41,140
let's kind of give it a sort of a rough look here.
3791
02:48:41,140 --> 02:48:44,180
Largest so far is our kind of, like a running total,
3792
02:48:44,180 --> 02:48:47,660
but it's our hypothesis is the best large number.
3793
02:48:47,660 --> 02:48:49,560
And we have this if statement that says,
3794
02:48:49,560 --> 02:48:51,500
if the number we just see right now
3795
02:48:51,500 --> 02:48:54,820
is greater than the largest so far, then capture it, right?
3796
02:48:54,820 --> 02:48:56,700
Take whatever number we saw and capture it.
3797
02:48:56,700 --> 02:49:00,040
So when we see a nine, it's better, we capture it.
3798
02:49:00,040 --> 02:49:02,600
We see a 41, it's better, we capture it.
3799
02:49:02,600 --> 02:49:04,180
We don't capture this, we don't capture this,
3800
02:49:04,180 --> 02:49:07,340
we capture the 74, and we don't capture the 15,
3801
02:49:07,340 --> 02:49:08,220
and that's how we do it.
3802
02:49:08,220 --> 02:49:10,740
So you could think of this as better.
3803
02:49:10,740 --> 02:49:15,500
When the number we're looking at is greater
3804
02:49:15,500 --> 02:49:18,780
than our working hypothesis of the largest,
3805
02:49:18,780 --> 02:49:20,180
we grab it because it's better.
3806
02:49:20,180 --> 02:49:25,180
So this line right here is the grab line, grab it, okay?
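A minimal sketch of the loop being described, assuming the numbers are 9, 41, 12, 3, 74, 15 (the 12 is assumed) and that the variables are spelled largest_so_far and the_num, which the spoken text does not confirm:

```python
largest_so_far = -1  # starting hypothesis; quietly assumes every value is positive
for the_num in [9, 41, 12, 3, 74, 15]:
    if the_num > largest_so_far:
        largest_so_far = the_num  # the "grab it" line
    print(largest_so_far, the_num)
print('After', largest_so_far)
```

The grab line runs for 9, 41 and 74, and is skipped for 12, 3 and 15.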
3807
02:49:28,100 --> 02:49:31,540
So then the question is how would you modify this code
3808
02:49:31,540 --> 02:49:33,820
to teach it to find the smallest value
3809
02:49:33,820 --> 02:49:35,420
in this list of numbers?
3810
02:49:36,960 --> 02:49:39,260
Think of it as you have a starting number,
3811
02:49:39,260 --> 02:49:43,940
you have a sort of what's better in this grabbing notion.
3812
02:49:43,940 --> 02:49:45,420
How could you do that?
3813
02:49:45,420 --> 02:49:46,260
Take a look.
3814
02:49:51,940 --> 02:49:53,940
Okay, so let's take a look.
3815
02:49:53,940 --> 02:49:55,700
So let's do a couple things.
3816
02:49:55,700 --> 02:49:59,700
Like if you look at this if statement that's better,
3817
02:49:59,700 --> 02:50:03,020
well, it's better now if the number is less than.
3818
02:50:03,020 --> 02:50:05,700
So if the, but then we should probably change this
3819
02:50:05,700 --> 02:50:08,660
to be smallest so far, smallest so far,
3820
02:50:08,660 --> 02:50:11,080
smallest so far, smallest so far,
3821
02:50:11,080 --> 02:50:13,900
smallest so far, smallest so far, right?
3822
02:50:14,820 --> 02:50:16,520
Matter of fact, that's what this is.
3823
02:50:16,520 --> 02:50:20,060
We've changed the word largest so far to smallest so far,
3824
02:50:20,060 --> 02:50:24,940
and we've changed the greater than to a less than.
3825
02:50:24,940 --> 02:50:26,140
Is that gonna fix it?
3826
02:50:27,780 --> 02:50:30,820
Give you a second to look at it, pause if you need.
3827
02:50:30,820 --> 02:50:31,740
It's not gonna fix it.
3828
02:50:31,740 --> 02:50:33,700
It's not gonna find our smallest number.
3829
02:50:33,700 --> 02:50:38,700
The answer is, of course, no, it's not.
3830
02:50:41,040 --> 02:50:43,040
So if we run this code,
3831
02:50:43,040 --> 02:50:45,000
so we set the smallest so far to negative one
3832
02:50:45,000 --> 02:50:46,400
and it starts out negative one.
3833
02:50:46,400 --> 02:50:49,160
We run it, and it's nine.
3834
02:50:49,160 --> 02:50:51,680
Is nine less than negative one?
3835
02:50:51,680 --> 02:50:53,100
No, it's not.
3836
02:50:53,100 --> 02:50:57,140
So after we see a nine, the smallest so far is negative one.
3837
02:50:57,140 --> 02:50:58,760
Now we're gonna run 41.
3838
02:50:58,760 --> 02:51:01,320
Is 41 less than negative one?
3839
02:51:01,320 --> 02:51:04,320
No, it is not.
3840
02:51:04,320 --> 02:51:06,160
So the smallest so far is still negative one.
3841
02:51:06,160 --> 02:51:08,160
As a matter of fact, it isn't the smallest so far anymore.
3842
02:51:08,160 --> 02:51:09,840
Just because we named it smallest so far
3843
02:51:09,840 --> 02:51:11,920
doesn't mean it is the smallest so far.
3844
02:51:11,920 --> 02:51:13,720
It didn't work out so well.
3845
02:51:13,720 --> 02:51:16,240
And so you see that none of these,
3846
02:51:16,240 --> 02:51:18,400
because they're never less than negative one,
3847
02:51:18,400 --> 02:51:20,400
do anything, and we claim that afterwards,
3848
02:51:20,400 --> 02:51:23,120
the smallest we've seen so far is negative one.
3849
02:51:23,120 --> 02:51:25,620
And that is because, of course,
3850
02:51:25,620 --> 02:51:28,600
negative one is smaller than any of the numbers that we saw.
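Here is a sketch of the version that does not work, under the same assumed list and assumed variable names. Only the name and the comparison changed, and the starting value of negative one is still there:

```python
smallest_so_far = -1  # leftover starting value from the largest version
for the_num in [9, 41, 12, 3, 74, 15]:  # assumed list of numbers
    if the_num < smallest_so_far:
        smallest_so_far = the_num  # never runs: nothing here is below -1
    print(smallest_so_far, the_num)
print('After', smallest_so_far)  # still -1, which is not even in the list
```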
3851
02:51:28,600 --> 02:51:31,620
So how could we fix this?
3852
02:51:31,620 --> 02:51:34,120
Well, if we started the smallest so far
3853
02:51:34,120 --> 02:51:37,260
with some like arbitrary big number, then it'd be better.
3854
02:51:37,260 --> 02:51:39,720
So if we made this 100, whoops, come back.
3855
02:51:41,540 --> 02:51:44,420
If we made this be like 100, that'd be good,
3856
02:51:44,420 --> 02:51:46,380
because the first time through the nine
3857
02:51:46,380 --> 02:51:48,780
would be less than 100, so we would capture the nine
3858
02:51:48,780 --> 02:51:51,500
and then the rest of the loop would work just fine.
3859
02:51:51,500 --> 02:51:53,600
But then what if we didn't know
3860
02:51:53,600 --> 02:51:54,660
how big these numbers were?
3861
02:51:54,660 --> 02:51:57,340
As a matter of fact, the largest so far wouldn't have worked
3862
02:51:57,340 --> 02:51:59,700
if all the numbers were negative.
3863
02:51:59,700 --> 02:52:01,500
Think about that.
3864
02:52:01,500 --> 02:52:02,940
We just assumed they were positive,
3865
02:52:02,940 --> 02:52:04,620
and so we kind of wrote lazy code
3866
02:52:04,620 --> 02:52:06,020
that assumed all numbers were positive.
3867
02:52:06,020 --> 02:52:07,260
That might not be a good assumption
3868
02:52:07,260 --> 02:52:10,520
depending on the numbers that you're dealing with, right?
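To see the same weakness in the largest loop, here is a small sketch with a made-up list of negative numbers (these values are hypothetical, not from the slide):

```python
values = [-9, -41, -3]  # hypothetical all-negative data
largest_so_far = -1     # same starting guess as before
for the_num in values:
    if the_num > largest_so_far:
        largest_so_far = the_num  # never runs: no value beats -1
print('After', largest_so_far)    # reports -1, but the real largest is -3
```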
3869
02:52:10,520 --> 02:52:12,420
So maybe 100's a good number to start with,
3870
02:52:12,420 --> 02:52:16,500
or maybe like 1,000, or 10,000,
3871
02:52:16,500 --> 02:52:19,980
or like some number with lots of zeros in it.
3872
02:52:19,980 --> 02:52:21,940
How big should we make this?
3873
02:52:21,940 --> 02:52:24,300
And the answer is we're kind of solving
3874
02:52:24,300 --> 02:52:26,500
this problem the wrong way.
3875
02:52:26,500 --> 02:52:29,940
And the thing we really want to do to solve the problem
3876
02:52:29,940 --> 02:52:34,940
is to just accept the fact that if we're looking
3877
02:52:35,100 --> 02:52:38,380
for the smallest number so far,
3878
02:52:38,380 --> 02:52:42,740
that the right hypothesis is the first number.
3879
02:52:42,740 --> 02:52:46,080
And if we just knew what that first number was, the nine,
3880
02:52:46,080 --> 02:52:49,500
that would either, because it's the first number,
3881
02:52:49,500 --> 02:52:52,380
we know that it's both the largest so far
3882
02:52:52,380 --> 02:52:54,560
and the smallest so far, as soon as you see the first number.
3883
02:52:54,560 --> 02:52:57,440
But we don't know here before the loop starts
3884
02:52:57,440 --> 02:52:58,640
what that first number is.
3885
02:52:58,640 --> 02:53:01,220
I mean, you can look at it, but assume this is just data
3886
02:53:01,220 --> 02:53:02,640
that's coming from somewhere else,
3887
02:53:02,640 --> 02:53:04,940
and we don't know it until we start reading it.
3888
02:53:04,940 --> 02:53:07,680
So we have to construct a loop that deals with the fact
3889
02:53:07,680 --> 02:53:10,400
that we want to capture the first value
3890
02:53:10,400 --> 02:53:13,400
as our hypothesis for smallest so far.
3891
02:53:14,580 --> 02:53:15,860
So how do we do that?
3892
02:53:15,860 --> 02:53:16,700
Let's take a look.
3893
02:53:17,700 --> 02:53:22,700
So what we do is we use yet another type.
3894
02:53:22,700 --> 02:53:25,900
So we have integer, floating point, string, Boolean,
3895
02:53:25,900 --> 02:53:27,940
and now we have a thing called the None type.
3896
02:53:27,940 --> 02:53:32,380
None type is a special marker in that it only has one value.
3897
02:53:32,380 --> 02:53:34,260
Boolean has true and false.
3898
02:53:34,260 --> 02:53:37,100
You know, floating point has an infinite number of values
3899
02:53:37,100 --> 02:53:38,700
and integer has an infinite number of values,
3900
02:53:38,700 --> 02:53:41,820
but None type has one value, None.
3901
02:53:41,820 --> 02:53:42,860
None is a constant.
3902
02:53:42,860 --> 02:53:45,540
Capital-N None is a constant.
3903
02:53:47,260 --> 02:53:49,340
The difference is, is we can check to see
3904
02:53:49,340 --> 02:53:51,060
if we have stored None.
3905
02:53:51,060 --> 02:53:54,460
None is often used to indicate emptiness.
3906
02:53:54,460 --> 02:53:58,380
Not non-existence, because smallest doesn't exist
3907
02:53:58,380 --> 02:54:00,060
until we assign it, but we're gonna assign it
3908
02:54:00,060 --> 02:54:03,500
to like a mark, a flag, a marker.
3909
02:54:03,500 --> 02:54:06,180
Some way to say, oh, this is not even a number.
3910
02:54:06,180 --> 02:54:08,060
It's nothing.
3911
02:54:08,060 --> 02:54:09,580
And so we're gonna, and you can do this.
3912
02:54:09,580 --> 02:54:12,340
So that's like, makes a variable called smallest,
3913
02:54:12,340 --> 02:54:14,460
and then it puts None.
3914
02:54:14,460 --> 02:54:15,300
It sticks it right in.
3915
02:54:15,300 --> 02:54:16,140
It's not a string "None".
3916
02:54:16,140 --> 02:54:19,740
It's like a special type, okay?
3917
02:54:19,740 --> 02:54:22,540
So that actually captures the notion
3918
02:54:22,540 --> 02:54:24,340
that before the loop starts,
3919
02:54:24,340 --> 02:54:27,860
the smallest number that we've seen so far is None.
3920
02:54:27,860 --> 02:54:29,980
We haven't seen any numbers, okay?
3921
02:54:31,980 --> 02:54:35,780
So, then we come in and we have an if statement.
3922
02:54:35,780 --> 02:54:38,340
And we have a new operator called is.
3923
02:54:38,340 --> 02:54:40,940
Is is stronger than equal sign.
3924
02:54:40,940 --> 02:54:44,660
And so if smallest is None, that becomes true.
3925
02:54:44,660 --> 02:54:46,300
It runs this case.
3926
02:54:46,300 --> 02:54:48,820
And so then what it does is it copies this first value,
3927
02:54:48,820 --> 02:54:50,900
which is nine, into smallest.
3928
02:54:50,900 --> 02:54:53,780
And so we see a nine and the smallest so far is nine,
3929
02:54:53,780 --> 02:54:55,380
which is the first value.
3930
02:54:55,380 --> 02:54:57,420
And again, we're assuming we don't know
3931
02:54:57,420 --> 02:54:59,940
what the first value is before the loop starts.
3932
02:54:59,940 --> 02:55:02,400
So we use the first iteration through the loop
3933
02:55:02,400 --> 02:55:05,780
as the moment where we capture that, okay?
3934
02:55:05,780 --> 02:55:10,340
So smallest is the value,
3935
02:55:10,340 --> 02:55:12,340
and then we print it and we go back up.
3936
02:55:12,340 --> 02:55:15,380
And now it runs again with 41.
3937
02:55:15,380 --> 02:55:17,300
41 is not none.
3938
02:55:17,300 --> 02:55:19,300
None is, there's only one thing that's none.
3939
02:55:19,300 --> 02:55:22,020
So it is not equal to none.
3940
02:55:22,020 --> 02:55:24,700
Smallest is not equal to none or is not none.
3941
02:55:24,700 --> 02:55:27,600
So this is false, so it skips over here.
3942
02:55:27,600 --> 02:55:29,180
Then it asks the question,
3943
02:55:29,180 --> 02:55:32,140
is the value we're looking at 41 less than smallest?
3944
02:55:32,140 --> 02:55:35,860
Well, smallest is nine in this case, and this is 41.
3945
02:55:35,860 --> 02:55:38,540
So that's false, so it skips that and goes on.
3946
02:55:38,540 --> 02:55:41,500
So we see 41, we don't take it.
3947
02:55:41,500 --> 02:55:44,980
And then you can see that this will never become true again.
3948
02:55:44,980 --> 02:55:48,400
This is pretty much false for the rest of the iterations
3949
02:55:48,400 --> 02:55:49,260
of the loop.
3950
02:55:51,540 --> 02:55:53,740
It's false for the rest of the iterations for the loop.
3951
02:55:53,740 --> 02:55:56,460
So it just is gonna run down here and ask this question.
3952
02:55:56,460 --> 02:55:58,300
And at some point, we see a three,
3953
02:55:58,300 --> 02:56:00,420
and we run this code, we capture it.
3954
02:56:00,420 --> 02:56:02,120
We see 74, we don't capture it.
3955
02:56:02,120 --> 02:56:04,180
We see 15, we don't capture it.
3956
02:56:04,180 --> 02:56:07,100
So then the for loop skips out.
3957
02:56:07,100 --> 02:56:09,060
At the end, we have the smallest.
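The walkthrough above can be sketched like this, again assuming the list 9, 41, 12, 3, 74, 15 and the variable names smallest and value:

```python
smallest = None  # marker meaning "we have not seen any numbers yet"
for value in [9, 41, 12, 3, 74, 15]:
    if smallest is None:
        smallest = value  # first iteration only: capture the first value
    elif value < smallest:
        smallest = value  # a later value is better, so grab it
    print(smallest, value)
print('After', smallest)
```

The first test is only true on the first trip through the loop. After that smallest holds a number, so every later value goes to the less-than comparison.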
3958
02:56:09,060 --> 02:56:11,580
And actually, this would be a good technique
3959
02:56:11,580 --> 02:56:13,340
for the largest as well.
3960
02:56:13,340 --> 02:56:15,100
Because it really is just a technique
3961
02:56:15,100 --> 02:56:17,100
to put a marker in this variable
3962
02:56:17,100 --> 02:56:19,620
so that we snag that first number,
3963
02:56:19,620 --> 02:56:23,500
or first whatever as we read and parse through them.
3964
02:56:25,740 --> 02:56:30,080
So the is and is not operators are very useful in Python.
3965
02:56:30,080 --> 02:56:32,680
You can think of them as like the double equal sign,
3966
02:56:32,680 --> 02:56:34,180
they're asking a question.
3967
02:56:35,100 --> 02:56:37,700
And they're asking a question,
3968
02:56:37,700 --> 02:56:40,340
they return a true, you know, blank is blank,
3969
02:56:40,340 --> 02:56:42,000
returns a true or a false.
3970
02:56:42,000 --> 02:56:43,100
It is stronger.
3971
02:56:44,380 --> 02:56:49,380
Double equal says are these things equal in type and value?
3972
02:56:50,380 --> 02:56:52,980
So just as an example,
3973
02:56:55,060 --> 02:56:56,260
if I were to say
3974
02:56:59,660 --> 02:57:03,720
is zero equal to 0.0,
3975
02:57:03,720 --> 02:57:05,780
it would say, yeah, that's true.
3976
02:57:05,780 --> 02:57:10,780
But then if I say zero is 0.0, that would be false.
3977
02:57:10,780 --> 02:57:15,740
So that's because these two are the same value-wise,
3978
02:57:15,740 --> 02:57:17,740
and these two are not the same type-wise.
3979
02:57:17,740 --> 02:57:20,060
So is is stronger than equals,
3980
02:57:20,060 --> 02:57:23,320
meaning that it demands equality
3981
02:57:23,320 --> 02:57:25,300
in both the type of the variable
3982
02:57:25,300 --> 02:57:27,340
and the value of the variable.
3983
02:57:27,340 --> 02:57:29,020
And no conversion is done.
3984
02:57:29,020 --> 02:57:31,000
And so that's just a very strong.
3985
02:57:31,000 --> 02:57:32,900
Don't overuse is.
3986
02:57:32,900 --> 02:57:35,020
If you're dealing with numbers or even strings,
3987
02:57:35,020 --> 02:57:36,780
use double equals, don't use is,
3988
02:57:36,780 --> 02:57:40,620
because sometimes it gets a little confusing.
3989
02:57:40,620 --> 02:57:42,780
So use is sparingly.
3990
02:57:42,780 --> 02:57:47,020
I tend to only use is on Booleans and on None types.
3991
02:57:47,020 --> 02:57:48,580
I don't use is on integers,
3992
02:57:48,580 --> 02:57:51,420
and I don't use is on floats,
3993
02:57:51,420 --> 02:57:53,260
and I don't use is on strings.
3994
02:57:53,260 --> 02:57:56,300
Just None or True/False.
3995
02:57:58,100 --> 02:58:00,580
And also is not is also an operator.
3996
02:58:00,580 --> 02:58:02,980
So you just say blah, blah, blah, is not None,
3997
02:58:02,980 --> 02:58:04,860
or blah, blah, blah, is not False.
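A short sketch of the difference, using hypothetical variable names. Strictly speaking, Python's is asks whether the two sides are the very same object, which is why it is dependable for None, True and False and can be surprising for numbers and strings. Variables are used here rather than literals because recent Python versions warn when is is applied directly to a literal.

```python
zero_int = 0
zero_float = 0.0

print(zero_int == zero_float)  # True: equal in value after conversion
print(zero_int is zero_float)  # False: an int object and a float object

smallest = None
print(smallest is None)        # True: the marker is still in place
print(smallest is not None)    # False
```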
3998
02:58:06,580 --> 02:58:09,020
Okay, so we've been looping around
3999
02:58:09,020 --> 02:58:11,620
and doing loops and loops of loops.
4000
02:58:11,620 --> 02:58:14,540
We looked at the indefinite loops,
4001
02:58:14,540 --> 02:58:16,900
the while loops that kind of run for a while.
4002
02:58:16,900 --> 02:58:19,740
The definite loop, and we looked at break and continue
4003
02:58:19,740 --> 02:58:23,160
as a way to either escape completely from the loop
4004
02:58:23,160 --> 02:58:27,320
or go back up and discard the current iteration of the loop.
4005
02:58:28,260 --> 02:58:29,380
We looked at none.
4006
02:58:29,380 --> 02:58:32,220
We looked at Boolean variables with for loops,
4007
02:58:32,220 --> 02:58:34,580
definite loops, where you've got some kind of a set
4008
02:58:34,580 --> 02:58:36,900
or a list or some kind of sequence
4009
02:58:36,900 --> 02:58:37,980
that you're looping through.
4010
02:58:37,980 --> 02:58:39,580
And then the concept of loop idioms
4011
02:58:39,580 --> 02:58:41,100
where you do something at the top,
4012
02:58:41,100 --> 02:58:42,360
something to each item,
4013
02:58:42,360 --> 02:58:45,860
and then you sort of get a benefit at the bottom.
4014
02:58:45,860 --> 02:58:49,020
And so that gets us through iterations.
4015
02:58:53,020 --> 02:58:54,460
Hello and welcome to chapter six.
4016
02:58:54,460 --> 02:58:56,420
In this chapter we're gonna talk about strings.
4017
02:58:56,420 --> 02:58:58,780
And chapter seven is the payoff chapter.
4018
02:58:58,780 --> 02:59:01,940
So up to this point we're still learning
4019
02:59:01,940 --> 02:59:03,800
sort of basic building blocks,
4020
02:59:03,800 --> 02:59:06,900
and actually we're gonna write a real program in chapter seven.
4021
02:59:06,900 --> 02:59:11,060
So just learn this, and the payoff's in chapter seven.
4022
02:59:11,060 --> 02:59:12,700
So we've actually been using strings
4023
02:59:12,700 --> 02:59:14,020
from the very first lecture,
4024
02:59:14,020 --> 02:59:17,380
because if you print Hello World, well, that's a string.
4025
02:59:17,380 --> 02:59:19,060
And so we've been doing things.
4026
02:59:19,060 --> 02:59:22,260
This slide here is all review.
4027
02:59:22,260 --> 02:59:24,100
We use plus to concatenate strings.
4028
02:59:24,100 --> 02:59:25,820
We use print to print them out.
4029
02:59:25,820 --> 02:59:28,180
Print's just a function that takes as a parameter
4030
02:59:28,180 --> 02:59:30,700
something, strings, integers, et cetera.
4031
02:59:31,700 --> 02:59:35,860
We can put digits in strings, but we can't add to them.
4032
02:59:35,860 --> 02:59:37,460
By now you've figured this out,
4033
02:59:37,460 --> 02:59:40,660
but you can use things like int to convert the strings
4034
02:59:40,660 --> 02:59:42,980
to integers and then print things out.
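A quick review sketch of those points. The variable names and the sample strings are made up for illustration:

```python
greeting = 'Hello ' + 'there'  # plus concatenates two strings
print(greeting)

text = '123'                   # digits inside a string are still a string
try:
    text + 1                   # adding a number to a string is not allowed
except TypeError as err:
    print('Cannot add:', err)

number = int(text) + 1         # convert to an integer first, then add
print(number)
```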
4035
02:59:42,980 --> 02:59:45,540
So we've been doing this for a while,
4036
02:59:45,540 --> 02:59:47,540
and we've been talking about strings all along.
4037
02:59:47,540 --> 02:59:51,780
Now today what we're gonna do is going to just get
4038
02:59:51,780 --> 02:59:54,860
into strings in more detail.
4039
02:59:54,860 --> 02:59:59,860
We're reading the input data with the input function.
4040
02:59:59,860 --> 03:00:01,780
Input returns us a string.
4041
03:00:03,180 --> 03:00:04,920
And if we want to input a number,
4042
03:00:04,920 --> 03:00:06,740
we have to run some kind of conversion,
4043
03:00:06,740 --> 03:00:10,580
like we have to do an int before we take this data
4044
03:00:10,580 --> 03:00:13,140
that we read from input, you know?
4045
03:00:13,140 --> 03:00:15,540
And so there's things that we've gotta do,
4046
03:00:15,540 --> 03:00:19,060
and we've been doing all these things in programs so far.
4047
03:00:19,060 --> 03:00:21,140
But if we look in a little more detail
4048
03:00:21,140 --> 03:00:25,220
inside strings, we can index within strings each character.
4049
03:00:25,220 --> 03:00:27,940
So each character has a separate position
4050
03:00:27,940 --> 03:00:29,940
and a separate index.
4051
03:00:29,940 --> 03:00:34,940
And basically the letters have positions,
4052
03:00:35,300 --> 03:00:37,180
and the positions start at zero,
4053
03:00:37,180 --> 03:00:39,860
and the best way I explain this to remember this
4054
03:00:39,860 --> 03:00:42,060
is it's the elevators.
4055
03:00:42,060 --> 03:00:44,660
As we used in one of our examples a long time ago,
4056
03:00:44,660 --> 03:00:46,300
elevators in Europe start at zero,
4057
03:00:46,300 --> 03:00:48,960
and so strings start at zero as well.
4058
03:00:49,860 --> 03:00:52,180
Turns out in the old days there's some efficiency
4059
03:00:52,180 --> 03:00:55,380
with the notion of lists of things starting with zero.
4060
03:00:55,380 --> 03:00:57,620
These days the efficiency isn't the issue,
4061
03:00:57,620 --> 03:01:00,300
but there's a certain elegance starting at zero,
4062
03:01:00,300 --> 03:01:02,100
even though intellectually you might think
4063
03:01:02,100 --> 03:01:06,420
one would be the first character in the string
4064
03:01:06,420 --> 03:01:08,400
might make most sense to be sub one,
4065
03:01:08,400 --> 03:01:09,780
but it's not, it's sub zero.
4066
03:01:09,780 --> 03:01:11,580
But just remember that.
4067
03:01:12,580 --> 03:01:15,280
And so we have this operator called the index operator,
4068
03:01:15,280 --> 03:01:16,620
and it's square brackets.
4069
03:01:16,620 --> 03:01:21,620
So fruit is a variable that contains the string banana,
4070
03:01:21,620 --> 03:01:26,260
and then fruit sub one is the character
4071
03:01:26,260 --> 03:01:27,580
that's in position one.
4072
03:01:27,580 --> 03:01:29,820
Now that actually is the second character.
4073
03:01:29,820 --> 03:01:31,940
I'll keep reminding you until I get tired of reminding you.
4074
03:01:31,940 --> 03:01:35,780
So that assigns A, the letter A,
4075
03:01:35,780 --> 03:01:40,780
into, I mean A, the letter A into the variable letter.
4076
03:01:41,380 --> 03:01:42,820
Of course that's a badly chosen,
4077
03:01:42,820 --> 03:01:44,460
it's either a well chosen variable name
4078
03:01:44,460 --> 03:01:46,940
or a badly chosen variable name.
4079
03:01:46,940 --> 03:01:48,740
And the thing that goes inside this
4080
03:01:48,740 --> 03:01:50,900
can either be a constant or it can be an expression,
4081
03:01:50,900 --> 03:01:52,740
so this is x equals three,
4082
03:01:52,740 --> 03:01:54,420
and then fruit sub x minus one,
4083
03:01:54,420 --> 03:01:56,940
well that means two, which is position two,
4084
03:01:56,940 --> 03:02:00,220
which is an n, and so that gives us an n.
4085
03:02:00,220 --> 03:02:02,260
So the index is an operator,
4086
03:02:02,260 --> 03:02:04,380
and you can add this bracket syntax
4087
03:02:04,380 --> 03:02:06,500
to the end of a string variable.
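As a small sketch of the index operator with the banana example (the name w for the second result is an assumption):

```python
fruit = 'banana'
letter = fruit[1]  # position one is the second character
print(letter)      # a

x = 3
w = fruit[x - 1]   # the index can be an expression; this is position two
print(w)           # n
```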
4088
03:02:08,140 --> 03:02:11,020
You can't index beyond the length of the string,
4089
03:02:11,020 --> 03:02:13,140
so if I say zot sub five,
4090
03:02:13,140 --> 03:02:14,420
well there's only three characters,
4091
03:02:14,420 --> 03:02:16,420
which means zero, one, two,
4092
03:02:16,420 --> 03:02:18,500
but sub five doesn't work,
4093
03:02:18,500 --> 03:02:22,180
and of course we get a happy little traceback.
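Here is a sketch of that failure. The string 'abc' is assumed as the three-character value of zot, and the try and except are added only so the example keeps running instead of stopping at the error:

```python
zot = 'abc'      # assumed three-character string: positions 0, 1 and 2
message = None
try:
    print(zot[5])  # there is no position five
except IndexError as err:
    message = str(err)
    print('IndexError:', message)
```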
4094
03:02:23,540 --> 03:02:24,660
So you have to be careful
4095
03:02:24,660 --> 03:02:26,340
when you're starting to pull stuff out of strings,
4096
03:02:26,340 --> 03:02:27,980
although some of the things allow it,
4097
03:02:27,980 --> 03:02:31,060
some of them don't, and you'll kind of get used to that.
4098
03:02:31,060 --> 03:02:33,900
We can ask how long a string is,
4099
03:02:33,900 --> 03:02:35,780
and so we use the len function,
4100
03:02:35,780 --> 03:02:37,860
we pass the string variable,
4101
03:02:37,860 --> 03:02:40,140
and we pass it into len as a parameter,
4102
03:02:40,140 --> 03:02:42,540
and len gives us back the length of the string,
4103
03:02:42,540 --> 03:02:47,540
not the position, so it's zero through len minus one.
4104
03:02:47,540 --> 03:02:51,080
So it's zero through len minus one.
4105
03:02:51,080 --> 03:02:53,520
So len is just another function
4106
03:02:53,520 --> 03:02:55,880
that we've been doing functions now for a while,
4107
03:02:55,880 --> 03:02:57,820
you pass in a parameter,
4108
03:02:57,820 --> 03:02:59,200
and then len does some work,
4109
03:02:59,200 --> 03:03:02,360
and out comes six, and that goes back into x,
4110
03:03:02,360 --> 03:03:04,360
because the function has a residual value,
4111
03:03:04,360 --> 03:03:07,460
it just happens to be a built-in function.
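A minimal sketch of len with the banana string, showing that the length is one more than the last valid position (the name last is hypothetical):

```python
fruit = 'banana'
x = len(fruit)       # len returns the number of characters
print(x)             # 6

last = fruit[x - 1]  # valid positions run from 0 through len minus one
print(last)          # a
```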
4112
03:03:08,360 --> 03:03:11,720
And so, you know, somewhere deep inside Python,
4113
03:03:11,720 --> 03:03:14,520
there is code that takes this,
4114
03:03:14,520 --> 03:03:16,960
and somebody wrote a loop, or looked something up,
4115
03:03:16,960 --> 03:03:18,640
and then returned a return value,
4116
03:03:18,640 --> 03:03:23,640
and sent back a six to go into our x variable.
4117
03:03:24,080 --> 03:03:25,880
And so a function is there,
4118
03:03:25,880 --> 03:03:29,040
and like I said, we've been using this for a while.
4119
03:03:29,040 --> 03:03:31,280
Another thing we tend to do is to look through strings,
4120
03:03:31,280 --> 03:03:34,480
and look at strings, and dig data out of strings.
4121
03:03:34,480 --> 03:03:36,760
Python is excellent for doing
4122
03:03:36,760 --> 03:03:39,240
sort of these kinds of lookups.
4123
03:03:39,240 --> 03:03:42,360
And so we can write a simple loop,
4124
03:03:42,360 --> 03:03:44,120
we can write a loop that
4125
03:03:44,120 --> 03:03:49,120
creates some kind of iteration variable, like index,
4126
03:03:49,680 --> 03:03:52,120
and given that we know that these positions are zero
4127
03:03:52,120 --> 03:03:54,520
through five, we can set this to be zero,
4128
03:03:54,520 --> 03:03:56,080
and then write a while loop,
4129
03:03:56,080 --> 03:03:58,160
while the iteration variable is less than
4130
03:03:58,160 --> 03:04:00,680
the length of fruit, and remember, this is six,
4131
03:04:00,680 --> 03:04:03,380
so it's gonna be zero through five.
4132
03:04:03,380 --> 03:04:07,800
Zero through five are the values we wanna generate,
4133
03:04:07,800 --> 03:04:09,680
and then we can look up one at a time,
4134
03:04:09,680 --> 03:04:12,220
pull out fruit sub index, so fruit sub zero,
4135
03:04:12,220 --> 03:04:14,360
fruit sub one, two, three, four, five,
4136
03:04:14,360 --> 03:04:17,920
and then print out the position and the letter, index,
4137
03:04:17,920 --> 03:04:20,040
and then add one to index, and it runs,
4138
03:04:20,040 --> 03:04:22,600
this'll run six times, zero through five,
4139
03:04:22,600 --> 03:04:26,040
and out we go to produce this output right here.
4140
03:04:26,040 --> 03:04:28,660
And so that's one way of looping through strings.
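The while version described above might look like this sketch, with index as the hand-built iteration variable:

```python
fruit = 'banana'
index = 0
while index < len(fruit):  # len(fruit) is 6, so index goes 0 through 5
    letter = fruit[index]
    print(index, letter)
    index = index + 1      # without this line the loop would never end
```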
4141
03:04:28,660 --> 03:04:32,980
That is a basic indeterminate loop,
4142
03:04:32,980 --> 03:04:35,760
but we construct carefully an iteration value,
4143
03:04:36,960 --> 03:04:38,520
construct an iteration value,
4144
03:04:38,520 --> 03:04:43,520
and work our way through that loop data.
4145
03:04:43,680 --> 03:04:46,760
The other way is to use a determinate loop, a for loop,
4146
03:04:46,760 --> 03:04:50,320
and generally when we are able to use a while loop
4147
03:04:50,320 --> 03:04:52,880
or a for loop, all else being equal,
4148
03:04:52,880 --> 03:04:55,320
we generally prefer a for loop.
4149
03:04:55,320 --> 03:04:58,400
And so here we have the for keyword and fruit,
4150
03:04:58,400 --> 03:05:01,920
and it's an in, and so for letter in fruit,
4151
03:05:01,920 --> 03:05:04,380
well that just says letter is our iteration variable
4152
03:05:04,380 --> 03:05:06,440
and it's gonna take on the successive values
4153
03:05:06,440 --> 03:05:08,200
of each of the characters.
4154
03:05:08,200 --> 03:05:10,320
So this loop is gonna run six times,
4155
03:05:10,320 --> 03:05:14,840
and letter's gonna be B-A-N-A-N-A, banana.
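The same traversal as a for loop, sketched with the banana string:

```python
fruit = 'banana'
for letter in fruit:  # letter takes on each character in turn
    print(letter)
```

There is no index to set up, test, or advance, so there are fewer places to make a mistake.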
4156
03:05:14,840 --> 03:05:16,680
I'm always terrified when I make these slides
4157
03:05:16,680 --> 03:05:18,200
that I'm gonna misspell banana,
4158
03:05:18,200 --> 03:05:22,020
because somehow I always think that there are two Ns,
4159
03:05:22,020 --> 03:05:24,240
somewhere, I don't know.
4160
03:05:24,240 --> 03:05:25,980
It's not one of my favorite words to spell.
4161
03:05:25,980 --> 03:05:28,960
I actually didn't choose banana as the constant.
4162
03:05:28,960 --> 03:05:32,000
The authors who I borrowed the textbook from,
4163
03:05:32,000 --> 03:05:34,580
Allen Downey and Jeff Elkner, they used banana,
4164
03:05:34,580 --> 03:05:36,080
and so I'm still using banana.
4165
03:05:36,080 --> 03:05:38,560
So some of the jokes in the book aren't my book,
4166
03:05:38,560 --> 03:05:42,680
aren't my jokes, they are the jokes of Jeff and Allen.
4167
03:05:42,680 --> 03:05:45,760
So here are just two equivalent,
4168
03:05:45,760 --> 03:05:47,040
so you can have the while loop,
4169
03:05:47,040 --> 03:05:48,520
they sort of both do the same thing,
4170
03:05:48,520 --> 03:05:51,640
they both just print the letters out one time through,
4171
03:05:51,640 --> 03:05:53,580
each of these loops runs six times,
4172
03:05:53,580 --> 03:05:56,360
but you can see how the determinate loop,
4173
03:05:56,360 --> 03:05:58,640
the for loop is a prettier loop,
4174
03:05:58,640 --> 03:06:01,040
unless you truly somehow need to know this number
4175
03:06:01,040 --> 03:06:02,040
as you're going through the loop.
4176
03:06:02,040 --> 03:06:03,320
But if all you're doing is going through
4177
03:06:03,320 --> 03:06:05,880
and you wanna touch in order each of the characters
4178
03:06:05,880 --> 03:06:09,360
of the string, you then simply write a for loop
4179
03:06:09,360 --> 03:06:10,480
because it's more elegant.
4180
03:06:10,480 --> 03:06:13,240
The less code you write, the less code you write,
4181
03:06:13,240 --> 03:06:15,600
the less chance there is for you to make a mistake.
4182
03:06:15,600 --> 03:06:17,800
And so the fact that these are equivalent,
4183
03:06:17,800 --> 03:06:20,120
this is three lines, well, two lines of a loop
4184
03:06:20,120 --> 03:06:21,440
and this is four lines of a loop,
4185
03:06:21,440 --> 03:06:24,320
that's twice as many places as you could make a mistake
4186
03:06:24,320 --> 03:06:27,100
because you might misspell index or something.
4187
03:06:27,100 --> 03:06:29,680
I mean, why even make an iteration variable
4188
03:06:29,680 --> 03:06:33,120
if you don't need to make an iteration variable?
4189
03:06:33,120 --> 03:06:36,680
And so we can do things that harken back
4190
03:06:36,680 --> 03:06:39,200
to our iterations and loops chapter
4191
03:06:39,200 --> 03:06:41,520
where anything that you can do in those things
4192
03:06:41,520 --> 03:06:43,000
like look for the largest letter,
4193
03:06:43,000 --> 03:06:44,520
look for the smallest letter,
4194
03:06:44,520 --> 03:06:46,280
search to see if a letter exists
4195
03:06:46,280 --> 03:06:50,320
or say count the number of A's in the word banana.
4196
03:06:50,320 --> 03:06:52,700
And so that's what this is doing.
4197
03:06:52,700 --> 03:06:55,880
And so we have a counter.
4198
03:06:55,880 --> 03:06:58,320
So again, we do something at the top of the loop,
4199
03:06:58,320 --> 03:06:59,680
we're gonna do something in the middle of the loop
4200
03:06:59,680 --> 03:07:00,840
and then we're gonna print it out at the bottom.
4201
03:07:00,840 --> 03:07:02,440
So we start our counter at zero,
4202
03:07:02,440 --> 03:07:05,360
we're gonna loop through all the letters
4203
03:07:05,360 --> 03:07:06,960
and then if the letter is A,
4204
03:07:06,960 --> 03:07:08,360
then count equals count plus one.
4205
03:07:08,360 --> 03:07:10,680
This is kind of a pattern in a loop
4206
03:07:10,680 --> 03:07:12,400
where we're noticing something
4207
03:07:12,400 --> 03:07:14,080
and instead of like we did it earlier
4208
03:07:14,080 --> 03:07:15,800
where we said found equals true,
4209
03:07:15,800 --> 03:07:17,100
well, we're gonna count them this time.
4210
03:07:17,100 --> 03:07:18,680
So if we have one, we'll get one,
4211
03:07:18,680 --> 03:07:20,080
if we have zero, we get zero
4212
03:07:20,080 --> 03:07:24,040
and how many there are but there should be three
4213
03:07:24,040 --> 03:07:25,940
because it's gonna run three times
4214
03:07:25,940 --> 03:07:28,240
and there's three A's in banana.
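A sketch of the counting pattern, assuming the variable names word, count and letter:

```python
word = 'banana'
count = 0                  # set up the counter at the top
for letter in word:
    if letter == 'a':
        count = count + 1  # runs once for each a that we notice
print(count)               # print the result at the bottom
```

The loop body runs six times, once per character, but the line that adds one only runs for the three a's.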
4215
03:07:28,240 --> 03:07:32,480
And so this is a conditional within count.
4216
03:07:32,480 --> 03:07:35,600
We've seen counts, we've seen conditionals in loops
4217
03:07:35,600 --> 03:07:37,080
in prior chapters.
4218
03:07:37,080 --> 03:07:42,080
And so again, I love the in keyword in Python.
4219
03:07:42,760 --> 03:07:46,000
It again reminds me of a set notation in algebra.
4220
03:07:46,000 --> 03:07:48,960
If you're a math whiz, if you're not, don't worry about it
4221
03:07:48,960 --> 03:07:50,600
or maybe you will be a math whiz
4222
03:07:50,600 --> 03:07:52,080
and you'll say, whoa, this set notation
4223
03:07:52,080 --> 03:07:57,080
reminds me a lot of the in keyword in Python.
4224
03:07:59,240 --> 03:08:04,240
So again, it's for iteration variable letter.
4225
03:08:04,960 --> 03:08:06,800
Again, don't get stuck with letter.
4226
03:08:06,800 --> 03:08:09,520
I just happen to be using it here in banana.
4227
03:08:09,520 --> 03:08:14,520
And that is for each character in the string banana,
4228
03:08:14,760 --> 03:08:18,400
run this loop once, changing the variable letter
4229
03:08:18,400 --> 03:08:21,000
to be the particular character that we're pointing at.
4230
03:08:21,000 --> 03:08:23,520
And so, it's taking care of, for is taking care
4231
03:08:23,520 --> 03:08:25,200
of a lot for us, right?
4232
03:08:25,200 --> 03:08:27,840
And so this is sort of this really smart for loop.
4233
03:08:27,840 --> 03:08:31,160
The for loop is both deciding how many times
4234
03:08:31,160 --> 03:08:33,680
to run the loop, in this case six,
4235
03:08:33,680 --> 03:08:35,160
and it's advancing the letter.
4236
03:08:35,160 --> 03:08:38,400
So advance print.
4237
03:08:38,400 --> 03:08:40,800
Decide whether you're done, advance print.
4238
03:08:40,800 --> 03:08:43,200
Decide whether you're done, advance print.
4239
03:08:43,200 --> 03:08:45,400
Decide whether you're done, advance print.
4240
03:08:45,400 --> 03:08:47,080
Decide whether you're done, advance print.
4241
03:08:47,080 --> 03:08:47,980
Decide whether you're done,
4242
03:08:47,980 --> 03:08:48,820
advance print.
4243
03:08:48,820 --> 03:08:50,720
Decide whether you're done, I am now done
4244
03:08:50,720 --> 03:08:53,520
because I, whoop, you know,
4245
03:08:53,520 --> 03:08:55,600
we're done with that particular string.
4246
03:08:55,600 --> 03:08:59,920
And so, you can think of the for as, you know,
4247
03:08:59,920 --> 03:09:02,800
magically doing all of this for you,
4248
03:09:02,800 --> 03:09:05,400
of both deciding how long to run the loop,
4249
03:09:05,400 --> 03:09:06,680
when you're done or not,
4250
03:09:06,680 --> 03:09:09,680
and moving down through all the successive
4251
03:09:09,680 --> 03:09:11,800
letters in the loop.
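To see roughly what the for loop is doing on our behalf, here is a hand-written while loop that does the same "decide whether you're done, advance, print" steps. This is only a sketch for comparison, not how Python implements for internally; the names index and letters_seen are made up for this example.

```python
fruit = 'banana'
index = 0
letters_seen = []              # illustrative: remember what we visited
while index < len(fruit):      # decide whether we're done
    letter = fruit[index]      # advance to the current character
    print(letter)              # print it
    letters_seen.append(letter)
    index = index + 1          # move on to the next position
```

With for letter in fruit: all of that bookkeeping disappears.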
4252
03:09:11,800 --> 03:09:13,680
So up next, we'll talk a little bit
4253
03:09:13,680 --> 03:09:20,680
about additional things that we can do with strings.
4254
03:09:20,800 --> 03:09:22,800
So now we're gonna dig into strings a bit,
4255
03:09:22,800 --> 03:09:24,800
and we've already looked at how you can pull out
4256
03:09:24,800 --> 03:09:26,160
a single character in a string,
4257
03:09:26,160 --> 03:09:28,120
and now we're going to look at what we call slicing,
4258
03:09:28,120 --> 03:09:30,800
and that is pulling chunks of a string out.
4259
03:09:30,800 --> 03:09:33,960
And again, we're gonna use the square bracket operator,
4260
03:09:33,960 --> 03:09:38,960
and so S, and the way I say it is sub, S sub zero through
4261
03:09:39,740 --> 03:09:41,320
four, that's how I read this.
4262
03:09:41,320 --> 03:09:44,120
S sub zero through four.
4263
03:09:44,120 --> 03:09:46,800
So I look at the colon as through,
4264
03:09:46,800 --> 03:09:49,400
I look at the brackets as sub.
4265
03:09:50,600 --> 03:09:53,680
And so, S sub zero through four says,
4266
03:09:53,680 --> 03:09:55,680
start at position zero,
4267
03:09:57,360 --> 03:10:00,360
and then go up through, but not including four, right?
4268
03:10:00,360 --> 03:10:01,760
So we don't include four.
4269
03:10:01,760 --> 03:10:04,800
So that's probably the hardest part of this,
4270
03:10:04,800 --> 03:10:07,640
up to but not including, up to but not including.
4271
03:10:08,640 --> 03:10:09,840
This seems counterintuitive,
4272
03:10:09,840 --> 03:10:11,760
kind of like starting at zero seems counterintuitive,
4273
03:10:11,760 --> 03:10:15,160
but after a while, you'll kind of get used to it,
4274
03:10:15,160 --> 03:10:17,200
and there'll be situations where you're writing code like,
4275
03:10:17,200 --> 03:10:18,920
oh, that's why that works better.
4276
03:10:18,920 --> 03:10:22,100
But just for now, remember it, up to but not including.
4277
03:10:22,100 --> 03:10:23,880
It's just kind of a little thing.
4278
03:10:25,520 --> 03:10:29,880
We'll come back to when that is useful for us.
4279
03:10:30,920 --> 03:10:34,720
Six through seven, well that ends up being starting at six,
4280
03:10:34,720 --> 03:10:36,400
up to but not including seven.
4281
03:10:36,400 --> 03:10:38,220
So that's why we only get the P out.
4282
03:10:38,220 --> 03:10:41,600
Now one thing that Python is pretty nice about,
4283
03:10:41,600 --> 03:10:43,600
is it's not gonna give you a traceback.
4284
03:10:44,720 --> 03:10:47,040
We might expect that six through 20,
4285
03:10:47,040 --> 03:10:48,720
well there's no 20 characters, but it's like,
4286
03:10:48,720 --> 03:10:51,860
ah, that's okay, we'll just let you stop at the end,
4287
03:10:51,860 --> 03:10:54,080
and we'll start at six and go all the way to the end.
4288
03:10:54,080 --> 03:10:56,040
Oh, no traceback.
4289
03:10:56,040 --> 03:10:57,400
It's almost disappointing sometimes
4290
03:10:57,400 --> 03:11:00,320
when Python doesn't trace back when you think,
4291
03:11:00,320 --> 03:11:02,320
ah, you know, if you're so obsessed about everything,
4292
03:11:02,320 --> 03:11:04,080
now I would have traced back in that situation.
4293
03:11:04,080 --> 03:11:07,600
But hey, I guess if you're allowed, you're allowed.
4294
03:11:07,600 --> 03:11:08,640
And so there we go.
4295
03:11:09,800 --> 03:11:14,320
Now you can eliminate or omit the first or last.
4296
03:11:14,320 --> 03:11:15,720
If you eliminate the first,
4297
03:11:15,720 --> 03:11:17,180
it assumes the beginning of string.
4298
03:11:17,180 --> 03:11:19,640
If you eliminate the second,
4299
03:11:19,640 --> 03:11:21,200
it assumes the end of the string.
4300
03:11:21,200 --> 03:11:23,640
And why you would do this, I don't know,
4301
03:11:23,640 --> 03:11:25,680
but that's from beginning to end,
4302
03:11:25,680 --> 03:11:26,680
so it's the whole string.
4303
03:11:26,680 --> 03:11:31,360
So whole string, eight through the end is thon,
4304
03:11:31,360 --> 03:11:35,920
and up to but not including two is Mo, all right?
4305
03:11:35,920 --> 03:11:37,400
So you get that.
4306
03:11:37,400 --> 03:11:40,080
So just, that's pretty simple.
4307
03:11:40,080 --> 03:11:41,720
Once you've got the rest of slicing
4308
03:11:41,720 --> 03:11:43,440
and the rest of string indexing,
4309
03:11:43,440 --> 03:11:46,080
the notion of eliminating the first or the last
4310
03:11:46,080 --> 03:11:47,880
of the colon expression,
4311
03:11:47,880 --> 03:11:50,200
the first or second of the colon expression,
4312
03:11:50,200 --> 03:11:53,080
I think is actually pretty intuitive, pretty nice.
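Here is a short sketch of the slices just described. The transcript does not show the string itself; 'Monty Python' is assumed because it matches the P, thon and Mo results mentioned above.

```python
s = 'Monty Python'   # assumed example string

first_four = s[0:4]  # start at 0, up to but not including 4
one_char = s[6:7]    # start at 6, up to but not including 7
past_end = s[6:20]   # the end is past the string: no traceback
start_omitted = s[:2]  # omitted start means the beginning
end_omitted = s[8:]    # omitted end means the end of the string
whole = s[:]           # both omitted: the whole string

print(first_four)     # Mont
print(one_char)       # P
print(past_end)       # Python
print(start_omitted)  # Mo
print(end_omitted)    # thon
print(whole)          # Monty Python
```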
4313
03:11:54,080 --> 03:11:56,320
We've already been concatenating strings together.
4314
03:11:56,320 --> 03:11:59,040
We overload the plus operator,
4315
03:11:59,040 --> 03:12:01,180
and there is no space added.
4316
03:12:01,180 --> 03:12:05,160
Remember when you're doing print, x comma y,
4317
03:12:05,160 --> 03:12:07,540
this comma does turn into a space,
4318
03:12:07,540 --> 03:12:09,040
but that's not what's happening here.
4319
03:12:09,040 --> 03:12:11,280
There is no automatic space being added,
4320
03:12:11,280 --> 03:12:12,920
and so we see Hello and There,
4321
03:12:12,920 --> 03:12:14,800
and it's just HelloThere with no space.
4322
03:12:14,800 --> 03:12:15,760
And so if we want,
4323
03:12:15,760 --> 03:12:18,660
we just have to concatenate the space explicitly
4324
03:12:18,660 --> 03:12:21,240
if we wanna put spaces into strings.
4325
03:12:21,240 --> 03:12:23,360
The problem is, is if this,
4326
03:12:23,360 --> 03:12:24,680
you might think it's more convenient
4327
03:12:24,680 --> 03:12:26,600
to add a space with a concatenation,
4328
03:12:26,600 --> 03:12:27,440
but then you have to think,
4329
03:12:27,440 --> 03:12:29,800
well, what about if I wanna concatenate things
4330
03:12:29,800 --> 03:12:31,200
and not put the space in,
4331
03:12:31,200 --> 03:12:32,840
then I'd need a different operator.
4332
03:12:32,840 --> 03:12:35,800
So that's kind of why it works that way.
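A small sketch of the difference, with illustrative variable names: the plus operator adds no space, while the comma in print does.

```python
a = 'Hello'
b = a + 'There'        # concatenation adds nothing in between
c = a + ' ' + 'There'  # the space has to be concatenated explicitly

print(b)           # HelloThere
print(c)           # Hello There
print(a, 'There')  # Hello There  (print puts a space at each comma)
```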
4333
03:12:37,600 --> 03:12:41,080
We can use in differently as a logical operator,
4334
03:12:41,080 --> 03:12:44,540
so we're using it as an iteration structure in for loops,
4335
03:12:44,540 --> 03:12:48,600
but we can also use it as a logical operator in if statements.
4336
03:12:48,600 --> 03:12:51,680
So it's kind of like the double equals,
4337
03:12:51,680 --> 03:12:54,520
or not equals, or less than or equals,
4338
03:12:54,520 --> 03:12:55,360
or something like that.
4339
03:12:55,360 --> 03:12:57,000
It's like those guys.
4340
03:12:57,000 --> 03:13:01,080
And so, and it returns a true or a false,
4341
03:13:01,080 --> 03:13:02,440
is n in fruit.
4342
03:13:02,440 --> 03:13:03,760
So that's a question,
4343
03:13:03,760 --> 03:13:04,880
and the answer is true.
4344
03:13:04,880 --> 03:13:06,720
Is m in fruit?
4345
03:13:06,720 --> 03:13:08,760
No, that's the answer to a question.
4346
03:13:08,760 --> 03:13:09,840
Is nan in fruit?
4347
03:13:09,840 --> 03:13:11,000
Doesn't have to be single character,
4348
03:13:11,000 --> 03:13:12,320
can be more than one character,
4349
03:13:12,320 --> 03:13:13,820
and the answer is true.
4350
03:13:13,820 --> 03:13:15,520
And then you say something like,
4351
03:13:15,520 --> 03:13:16,760
if a in fruit.
4352
03:13:16,760 --> 03:13:18,440
And so this is the logical value
4353
03:13:18,440 --> 03:13:20,320
that returns a true or a false,
4354
03:13:20,320 --> 03:13:21,680
and yes, we found it.
4355
03:13:21,680 --> 03:13:24,040
So that becomes true in this particular case,
4356
03:13:24,040 --> 03:13:26,720
so it runs the little indented bit.
4357
03:13:26,720 --> 03:13:30,600
So in is an operator in this particular situation.
4358
03:13:30,600 --> 03:13:32,680
In a for loop, in means something different.
4359
03:13:32,680 --> 03:13:35,600
And we'll use in for other things as operators,
4360
03:13:35,600 --> 03:13:38,740
as logical operators, coming up in a bit.
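The questions asked above look like this in code. This is a sketch assuming fruit is 'banana'; the message variable is made up for the example.

```python
fruit = 'banana'

has_n = 'n' in fruit      # True
has_m = 'm' in fruit      # False
has_nan = 'nan' in fruit  # True: it can be more than one character

message = 'Not found'
if 'a' in fruit:          # in gives back True or False, like == or <
    message = 'Found it!'

print(has_n, has_m, has_nan)  # True False True
print(message)                # Found it!
```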
4361
03:13:40,240 --> 03:13:41,500
You can compare strings,
4362
03:13:41,500 --> 03:13:45,200
and this has to do with the character set of your computer,
4363
03:13:45,200 --> 03:13:47,320
the character set that Python is using.
4364
03:13:47,320 --> 03:13:48,780
But in general,
4365
03:13:50,160 --> 03:13:52,240
it is lexicographically less than
4366
03:13:52,240 --> 03:13:54,400
and lexicographically greater than.
4367
03:13:54,400 --> 03:13:57,040
Uppercase and lowercase are a little weird.
4368
03:13:57,040 --> 03:14:00,620
I think when we used the max function earlier,
4369
03:14:00,620 --> 03:14:02,400
the way my computer was set up,
4370
03:14:03,660 --> 03:14:07,580
uppercase was less than lowercase.
4371
03:14:07,580 --> 03:14:12,580
But in general, uppercase is less than lowercase.
4372
03:14:12,900 --> 03:14:16,140
But in general, it's bad to assume case,
4373
03:14:16,140 --> 03:14:19,100
but there is a deterministic way to sort strings.
4374
03:14:20,060 --> 03:14:23,260
You can have something equal to or less than
4375
03:14:23,260 --> 03:14:24,900
or greater than,
4376
03:14:24,900 --> 03:14:28,340
and all those operations work naturally,
4377
03:14:28,340 --> 03:14:29,420
the less than and greater than.
4378
03:14:29,420 --> 03:14:32,140
You have to kind of be aware of uppercase, lowercase,
4379
03:14:32,140 --> 03:14:36,380
things like where punctuation
4380
03:14:36,380 --> 03:14:39,300
sorts less than or greater than letters.
4381
03:14:39,300 --> 03:14:40,980
That's kind of unpredictable
4382
03:14:40,980 --> 03:14:44,100
and depends on the character set of your computer
4383
03:14:44,100 --> 03:14:45,340
and something you just play with
4384
03:14:45,340 --> 03:14:47,800
and figure out if you're doing sorting stuff
4385
03:14:47,800 --> 03:14:49,100
by first name and last name,
4386
03:14:49,100 --> 03:14:54,100
as long as the case is kind of the same, you know, if,
4387
03:14:55,980 --> 03:14:59,980
if you were sorting chuck with uppercase and glen,
4388
03:15:01,300 --> 03:15:02,700
the fact that these uppercases,
4389
03:15:02,700 --> 03:15:05,240
they'd sort right and these lowercases would sort right,
4390
03:15:05,240 --> 03:15:07,260
but if you were to, say, do instead,
4391
03:15:08,320 --> 03:15:12,000
lowercase chuck and uppercase glen,
4392
03:15:12,940 --> 03:15:15,180
then that would sort weird as a matter of fact,
4393
03:15:15,180 --> 03:15:16,860
the G would come before that.
4394
03:15:16,860 --> 03:15:18,880
And so case can mess this up,
4395
03:15:18,880 --> 03:15:21,620
but in general, other than case
4396
03:15:21,620 --> 03:15:24,260
and special characters and other things,
4397
03:15:24,260 --> 03:15:26,220
it technically works.
4398
03:15:26,220 --> 03:15:28,060
It's just hard to kind of predict it.
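Here is a sketch of the case problem just described. In Python 3, strings compare by the Unicode code point of each character, so every uppercase ASCII letter sorts before every lowercase one. The names list and the key=str.lower workaround are additions for illustration, not something shown in the lecture.

```python
print('Chuck' < 'Glenn')    # True: 'C' comes before 'G'
print('chuck' < 'Glenn')    # False: lowercase 'c' sorts after 'G'
print(ord('G'), ord('c'))   # 71 99

names = ['chuck', 'Glenn']
by_code_point = sorted(names)             # the G comes first
ignoring_case = sorted(names, key=str.lower)  # compare lowercase copies

print(by_code_point)   # ['Glenn', 'chuck']
print(ignoring_case)   # ['chuck', 'Glenn']
```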
4399
03:15:28,940 --> 03:15:32,080
A lot of what we do is use the string library.
4400
03:15:32,080 --> 03:15:35,120
And so the strings are objects,
4401
03:15:35,120 --> 03:15:37,740
and we'll talk later about what that really means.
4402
03:15:37,740 --> 03:15:40,940
And objects have these things we call methods.
4403
03:15:43,580 --> 03:15:47,100
So a string object has some built-in capabilities.
4404
03:15:47,100 --> 03:15:49,660
And one of the built-in capabilities
4405
03:15:49,660 --> 03:15:53,380
that the string object has is here is a string object.
4406
03:15:53,380 --> 03:15:55,340
And because greet is a string object,
4407
03:15:55,340 --> 03:15:58,080
if we said type, we'd see that it was an str.
4408
03:15:58,080 --> 03:16:01,260
Dot lower says, hey, dear string,
4409
03:16:01,260 --> 03:16:03,680
make a lowercase version of yourself.
4410
03:16:03,680 --> 03:16:05,040
It's like calling this function lower
4411
03:16:05,040 --> 03:16:07,380
and passing greet into it.
4412
03:16:07,380 --> 03:16:08,860
And then give that back to me.
4413
03:16:08,860 --> 03:16:10,620
Now it doesn't actually change greet.
4414
03:16:10,620 --> 03:16:12,420
It gives me a lowercase copy.
4415
03:16:12,420 --> 03:16:15,220
So here I have Hello Bob with an H and a B uppercase.
4416
03:16:15,220 --> 03:16:18,140
And what I get back in zap is hello bob, all lowercase.
4417
03:16:18,140 --> 03:16:21,100
And note that greet is unchanged.
4418
03:16:21,100 --> 03:16:23,020
So hello Bob is still there.
4419
03:16:23,020 --> 03:16:25,580
And you can even call these methods on constants.
4420
03:16:25,580 --> 03:16:28,620
So this is a string object, quote, hi there, quote.
4421
03:16:28,620 --> 03:16:33,620
Dot lower, that says call lower on this bit of string
4422
03:16:33,840 --> 03:16:35,580
and give me back a lowercase version of it.
4423
03:16:35,580 --> 03:16:39,260
And so it prints out as the residual return value.
4424
03:16:39,260 --> 03:16:40,460
This is like a function call.
4425
03:16:40,460 --> 03:16:44,060
A method call is a kind of special form of a function call.
4426
03:16:44,060 --> 03:16:46,700
It's a function call where you say the thing dot
4427
03:16:46,700 --> 03:16:49,220
the function name rather than function name
4428
03:16:49,220 --> 03:16:50,920
passed in as a parameter.
4429
03:16:50,920 --> 03:16:54,380
Like len, for example, is non-object oriented.
4430
03:16:54,380 --> 03:16:57,260
You know, len of x, that's non-object oriented.
4431
03:16:57,260 --> 03:17:00,680
Object oriented would be x dot something, parenthesis.
4432
03:17:03,420 --> 03:17:06,400
But, so constants are objects as well.
4433
03:17:06,400 --> 03:17:09,420
And taking the lower gives us back lowercase, hi there.
4434
03:17:09,420 --> 03:17:11,520
And so that's just one of the things
4435
03:17:11,520 --> 03:17:13,180
that you can do in the string library.
4436
03:17:13,180 --> 03:17:17,220
These are built into string variables and constants.
4437
03:17:17,220 --> 03:17:18,420
They're just always there.
4438
03:17:18,420 --> 03:17:21,540
As soon as you make a string, they're part of it.
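A minimal sketch of the method calls described above, using the greet and zap names from the lecture; the constant variable is made up for the example.

```python
greet = 'Hello Bob'
zap = greet.lower()        # ask the string for a lowercase copy

print(zap)                 # hello bob
print(greet)               # Hello Bob  (greet itself is unchanged)

constant = 'Hi There'.lower()  # methods work on constants too
print(constant)            # hi there

print(len(greet))          # 9: function style, len(x)
print(type(greet))         # <class 'str'>
```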
4439
03:17:21,540 --> 03:17:24,540
And when you do type and it says it's class str,
4440
03:17:26,180 --> 03:17:27,500
we'll get to object oriented, don't worry.
4441
03:17:27,500 --> 03:17:29,420
We'll get to object oriented.
4442
03:17:29,420 --> 03:17:32,240
Okay, and so you can do things like use the type.
4443
03:17:33,380 --> 03:17:36,580
If you're just, this used to say type str
4444
03:17:36,580 --> 03:17:39,340
but it's class str, this is kind of more of an OO term.
4445
03:17:39,340 --> 03:17:41,740
The word class is an object oriented concept.
4446
03:17:41,740 --> 03:17:42,900
But it is a string.
4447
03:17:42,900 --> 03:17:44,260
And you can use the dir and of course
4448
03:17:44,260 --> 03:17:45,540
there's extra stuff up here.
4449
03:17:45,540 --> 03:17:48,540
And this is showing all the different methods
4450
03:17:50,940 --> 03:17:53,700
or capabilities, things we can do to strings.
4451
03:17:53,700 --> 03:17:58,060
So, you know, x dot something, parenthesis.
4452
03:17:58,060 --> 03:17:59,500
Well, what can we do there?
4453
03:17:59,500 --> 03:18:02,780
This is all of those things that we can do to x's
4454
03:18:02,780 --> 03:18:05,660
that are built in and come with x's,
4455
03:18:05,660 --> 03:18:09,100
I mean come with strings when we build them.
4456
03:18:09,100 --> 03:18:12,580
And Python of course has great documentation online
4457
03:18:12,580 --> 03:18:14,780
for all of these string methods and what they do
4458
03:18:14,780 --> 03:18:17,620
and how they work and why they work the way they do.
4459
03:18:17,620 --> 03:18:20,020
And so here's some of that Python documentation.
4460
03:18:20,020 --> 03:18:22,540
We'll look at a few of these.
4461
03:18:22,540 --> 03:18:25,860
But, you know, don't hesitate to say Python string upper case
4462
03:18:25,860 --> 03:18:29,460
and then we're like oh yeah, yeah, that is upper, right?
4463
03:18:29,460 --> 03:18:31,820
And so here's a few things that we can
4464
03:18:34,540 --> 03:18:37,060
do and use, some of the ones I use a lot.
4465
03:18:37,060 --> 03:18:39,060
And we'll look at each one of these things.
4466
03:18:39,060 --> 03:18:44,060
So, the find operation says find me a substring
4467
03:18:45,820 --> 03:18:48,060
within a string, right?
4468
03:18:48,060 --> 03:18:49,580
Find me a substring within a string,
4469
03:18:49,580 --> 03:18:53,340
so find me the first na and give me back the position.
4470
03:18:53,340 --> 03:18:55,220
So that gives me back two.
4471
03:18:56,100 --> 03:18:59,140
And then I can say go find a z in there.
4472
03:18:59,140 --> 03:19:02,060
Well, there's no z and so it returns me negative one.
4473
03:19:02,060 --> 03:19:03,220
So that's what the find does.
4474
03:19:03,220 --> 03:19:07,180
So we're gonna use this kind of stuff a lot
4475
03:19:07,180 --> 03:19:09,580
and we do a lot of looking in strings.
4476
03:19:09,580 --> 03:19:11,340
Converting things to upper or lower case,
4477
03:19:11,340 --> 03:19:14,380
there is an upper method and a lower method.
4478
03:19:14,380 --> 03:19:17,100
So greet, greet dot upper and that means
4479
03:19:17,100 --> 03:19:21,580
the uppercase nnn is HELLO BOB, greet dot lower,
4480
03:19:21,580 --> 03:19:24,420
that means that www is the lowercase hello bob
4481
03:19:24,420 --> 03:19:26,260
and greet is unchanged.
4482
03:19:26,260 --> 03:19:29,020
Greet is still hello bob with upper and lower
4483
03:19:29,020 --> 03:19:31,660
because each of these methods basically say
4484
03:19:31,660 --> 03:19:34,940
I'm going to give you back an uppercase copy
4485
03:19:34,940 --> 03:19:37,220
or a lower case copy of the original thing
4486
03:19:37,220 --> 03:19:39,380
without changing the original thing.
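The find, upper and lower calls above can be sketched as follows, assuming fruit is 'banana' and greet is 'Hello Bob' as in the lecture; pos and missing are illustrative names.

```python
fruit = 'banana'
pos = fruit.find('na')     # position of the first 'na'
missing = fruit.find('z')  # no 'z' in the string, so find gives -1

greet = 'Hello Bob'
nnn = greet.upper()        # uppercase copy
www = greet.lower()        # lowercase copy

print(pos)      # 2
print(missing)  # -1
print(nnn)      # HELLO BOB
print(www)      # hello bob
print(greet)    # Hello Bob  (still unchanged)
```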
4487
03:19:42,740 --> 03:19:46,860
Search and replace is super useful, super duper useful.
4488
03:19:46,860 --> 03:19:48,860
And it's pretty clean.
4489
03:19:48,860 --> 03:19:51,420
Here we have a string and we use the replace method.
4490
03:19:51,420 --> 03:19:55,340
In this case, we're passing in the old and the new: Bob,
4491
03:19:55,340 --> 03:19:57,060
replace all Bobs with Janes.
4492
03:19:57,060 --> 03:19:59,300
And so that takes this hello bob
4493
03:19:59,300 --> 03:20:01,380
and turns it to hello jane.
4494
03:20:01,380 --> 03:20:06,380
Again, greet is unchanged, greet is unchanged
4495
03:20:07,500 --> 03:20:09,220
and it does more than one thing.
4496
03:20:09,220 --> 03:20:12,740
So this says go find, well, let's clear that.
4497
03:20:12,740 --> 03:20:14,860
This says go find all the o's
4498
03:20:14,860 --> 03:20:16,620
and replace all the o's with x's.
4499
03:20:16,620 --> 03:20:18,620
And so it goes and finds two of them
4500
03:20:18,620 --> 03:20:21,100
and then out come two x's.
4501
03:20:21,100 --> 03:20:23,300
And so that really is a replace,
4502
03:20:23,300 --> 03:20:25,220
it's not just replace the first one
4503
03:20:25,220 --> 03:20:26,880
but replace all of them.
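A sketch of the two replace calls described above, assuming greet is 'Hello Bob'; the result variable names are illustrative.

```python
greet = 'Hello Bob'

nstr = greet.replace('Bob', 'Jane')  # old text first, new text second
xstr = greet.replace('o', 'X')       # every 'o' is replaced

print(nstr)   # Hello Jane
print(xstr)   # HellX BXb
print(greet)  # Hello Bob  (the original is unchanged)
```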
4504
03:20:26,880 --> 03:20:30,680
White space, as we'll see, is a big deal.
4505
03:20:32,920 --> 03:20:34,420
And white space is not just blanks
4506
03:20:34,420 --> 03:20:35,660
although that's the most common thing
4507
03:20:35,660 --> 03:20:37,560
but it's also sort of non-printing characters
4508
03:20:37,560 --> 03:20:40,240
like tabs and new lines and other kinds of things.
4509
03:20:40,240 --> 03:20:42,520
And so we have a number of different ways
4510
03:20:42,520 --> 03:20:43,960
to strip white space.
4511
03:20:45,080 --> 03:20:46,800
So here we've got some spaces at the beginning
4512
03:20:46,800 --> 03:20:48,280
and spaces at the end.
4513
03:20:48,280 --> 03:20:50,400
And we print out, we do an L strip
4514
03:20:50,400 --> 03:20:52,200
and that throws away the spaces at the beginning.
4515
03:20:52,200 --> 03:20:54,280
That's the left, so that's the left strip.
4516
03:20:54,280 --> 03:20:56,560
It takes away any; if there's nothing there
4517
03:20:56,560 --> 03:20:58,000
it doesn't harm it.
4518
03:20:58,000 --> 03:21:01,080
rstrip means throw away all the blanks on the far end.
4519
03:21:01,080 --> 03:21:04,940
And then strip says go take both sides,
4520
03:21:04,940 --> 03:21:07,260
both sides for strip and so that pulls out
4521
03:21:07,260 --> 03:21:09,220
all the spaces on both sides.
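Here is a sketch of the three strip methods. The padded greet string and the tabbed example are assumptions made up to show the behavior.

```python
greet = '   Hello Bob  '   # spaces at the beginning and at the end

left = greet.lstrip()      # throw away whitespace on the left
right = greet.rstrip()     # throw away whitespace on the right
both = greet.strip()       # throw away whitespace on both sides

print(repr(left))    # 'Hello Bob  '
print(repr(right))   # '   Hello Bob'
print(repr(both))    # 'Hello Bob'

tabbed = '\tHello Bob\n'.strip()  # tabs and newlines count too
print(repr(tabbed))  # 'Hello Bob'
```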
4522
03:21:09,220 --> 03:21:10,240
This will be useful
4523
03:21:10,240 --> 03:21:12,080
because sometimes when you're tearing stuff apart
4524
03:21:12,080 --> 03:21:15,120
you'll find yourself getting extra spaces.
4525
03:21:15,120 --> 03:21:17,220
Sometimes at the beginning, sometimes at the end.
4526
03:21:17,220 --> 03:21:21,500
And it can be tab or new line.
4527
03:21:23,400 --> 03:21:26,000
It's sort of white space.
4528
03:21:26,000 --> 03:21:30,400
Space that is kind of not visible, clear.
4529
03:21:30,400 --> 03:21:31,720
That's what white space is.
4530
03:21:31,720 --> 03:21:33,720
It's like if you were on a piece of paper
4531
03:21:33,720 --> 03:21:35,200
it's the white space.
4532
03:21:35,200 --> 03:21:37,000
It's like X, well that's not white space
4533
03:21:37,000 --> 03:21:39,040
but right here, oh that's white space.
4534
03:21:39,040 --> 03:21:43,920
It's any character that doesn't cause printing to happen.
4535
03:21:43,920 --> 03:21:46,000
If that makes any sense.
4536
03:21:46,000 --> 03:21:48,140
It's any character where nothing would be printed.
4537
03:21:48,140 --> 03:21:49,600
And there are characters like that.
4538
03:21:49,600 --> 03:21:51,640
There's like even bell characters
4539
03:21:51,640 --> 03:21:53,240
but we don't use them very much.
4540
03:21:53,240 --> 03:21:56,040
We can ask very conveniently we can say
4541
03:21:56,040 --> 03:21:59,760
hey, does this line start with a particular string?
4542
03:21:59,760 --> 03:22:03,600
And so line, this is a question,
4543
03:22:03,600 --> 03:22:05,240
gonna return a true or false.
4544
03:22:06,160 --> 03:22:07,760
Does this line start with please?
4545
03:22:07,760 --> 03:22:10,100
And the answer is true, it does start with please.
4546
03:22:10,100 --> 03:22:12,280
Does this line start with a lowercase p?
4547
03:22:12,280 --> 03:22:14,400
No, it does not.
4548
03:22:14,400 --> 03:22:16,240
And so again you'll use this in the context
4549
03:22:16,240 --> 03:22:19,360
of if something colon some block of text.
4550
03:22:19,360 --> 03:22:20,560
It's a block of code.
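A sketch of startswith used as a question. The exact line is not shown in the transcript, so 'Please have a nice day' is an assumed example; the polite variable is made up.

```python
line = 'Please have a nice day'   # assumed example line

starts_upper = line.startswith('Please')  # True
starts_lower = line.startswith('p')       # False: case matters

polite = False
if line.startswith('Please'):     # use the answer to guard a block
    polite = True

print(starts_upper, starts_lower, polite)  # True False True
```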
4551
03:22:20,560 --> 03:22:25,560
So we can combine these things to tear stuff out.
4552
03:22:26,280 --> 03:22:29,740
And so let's assume that what we wanna do in this case
4553
03:22:29,740 --> 03:22:32,240
is we wanna take a from line.
4554
03:22:32,240 --> 03:22:36,060
This is from an email format from a mailbox.
4555
03:22:37,480 --> 03:22:40,200
And this has got the from with a space
4556
03:22:40,200 --> 03:22:42,680
and the person's email and then at sign
4557
03:22:42,680 --> 03:22:45,240
and the school they're from and a space
4558
03:22:45,240 --> 03:22:47,720
and then the rest of the stuff like when this mail was sent.
4559
03:22:47,720 --> 03:22:50,600
And this is a real mail message from this guy Steven
4560
03:22:50,600 --> 03:22:52,880
from the University of Cape Town in South Africa.
4561
03:22:52,880 --> 03:22:55,600
It's really Steven and this really is the first line
4562
03:22:55,600 --> 03:22:57,380
of a file that you'll get to know pretty well
4563
03:22:57,380 --> 03:22:58,560
by the rest of this course.
4564
03:22:58,560 --> 03:23:01,480
Hi Steven, you, we like you.
4565
03:23:01,480 --> 03:23:03,040
You are the example in my class
4566
03:23:03,040 --> 03:23:05,360
and have been for a long time.
4567
03:23:05,360 --> 03:23:07,440
People actually who know Steven have taken this class
4568
03:23:07,440 --> 03:23:09,780
and they're like Steven, I saw your picture in the class.
4569
03:23:09,780 --> 03:23:12,200
So if you're ever in Cape Town at the University of Cape Town
4570
03:23:12,200 --> 03:23:14,340
say hi to Steven and tell him that you saw him in the class.
4571
03:23:14,340 --> 03:23:16,640
But okay, that's neither here nor there.
4572
03:23:16,640 --> 03:23:20,480
What I really want to do is I want to extract his school
4573
03:23:20,480 --> 03:23:23,320
from this email line.
4574
03:23:23,320 --> 03:23:26,720
Okay, so now eventually we will do things
4575
03:23:26,720 --> 03:23:28,440
like the data will come from files
4576
03:23:28,440 --> 03:23:29,760
but this is still chapter six.
4577
03:23:29,760 --> 03:23:31,720
So this is the data we're going to search through.
4578
03:23:31,720 --> 03:23:36,440
And so we can say, hey, let's go find the at sign.
4579
03:23:36,440 --> 03:23:38,680
Search up to this position and find the at sign.
4580
03:23:38,680 --> 03:23:42,280
So data.find at sign and give me back where that's at.
4581
03:23:42,280 --> 03:23:46,120
That's in position 21, counting from position zero.
4582
03:23:46,120 --> 03:23:48,840
Then what we're going to do is we're going to look
4583
03:23:48,840 --> 03:23:51,440
for the next space after the at sign.
4584
03:23:51,440 --> 03:23:53,560
So we're going to start at the at sign
4585
03:23:53,560 --> 03:23:56,060
and tell find to start here and look forward
4586
03:23:56,060 --> 03:23:57,920
until it finds a space.
4587
03:23:57,920 --> 03:24:00,640
So data.find, look for a space starting
4588
03:24:00,640 --> 03:24:02,640
at the position of the at sign
4589
03:24:02,640 --> 03:24:05,440
and then that'll be in position 31.
4590
03:24:05,440 --> 03:24:07,920
So 31 is what we get in the space position.
4591
03:24:07,920 --> 03:24:11,480
So now what we have is we have in two variables,
4592
03:24:11,480 --> 03:24:15,000
we have the position of the at sign
4593
03:24:15,000 --> 03:24:16,840
and the position of the space after the at sign.
4594
03:24:16,840 --> 03:24:19,600
Now what we really want is this bit right here.
4595
03:24:19,600 --> 03:24:22,360
So we have to go one beyond the at sign
4596
03:24:22,360 --> 03:24:24,660
and we don't want the space.
4597
03:24:24,660 --> 03:24:27,000
So we say we're going to use slicing here,
4598
03:24:27,000 --> 03:24:30,680
data sub at position plus one up to
4599
03:24:30,680 --> 03:24:32,400
but not including the space.
4600
03:24:32,400 --> 03:24:36,080
Oh, smiley face, because we didn't have to say space minus one
4601
03:24:36,080 --> 03:24:41,080
because that is up to but not including.
4602
03:24:41,080 --> 03:24:45,000
And so we get that little bit right there.
4603
03:24:45,000 --> 03:24:47,920
So we don't have to say minus one there
4604
03:24:47,920 --> 03:24:49,840
because this is not actually included.
4605
03:24:49,840 --> 03:24:51,400
The thing that's at the position of the space
4606
03:24:51,400 --> 03:24:52,320
is not included.
4607
03:24:52,320 --> 03:24:54,040
So that's already a little benefit
4608
03:24:54,040 --> 03:24:56,040
for the up to but not including.
4609
03:24:56,040 --> 03:24:58,300
And so when we print this variable out host,
4610
03:24:58,300 --> 03:25:02,640
we get exactly just the school that Steven works at
4611
03:25:02,640 --> 03:25:04,900
and probably went to as a matter of fact.
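The whole extraction can be sketched as below. The From line itself is not in the transcript; the one used here is assumed from the course's sample mailbox data and agrees with the positions 21 and 31 mentioned above. The names atpos and sppos are illustrative.

```python
# Assumed sample line; the at sign is at 21 and the next space at 31.
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

atpos = data.find('@')          # where the at sign is
sppos = data.find(' ', atpos)   # first space at or after the at sign

# One past the at sign, up to but not including the space.
host = data[atpos + 1:sppos]

print(atpos)   # 21
print(sppos)   # 31
print(host)    # uct.ac.za
```

Because the slice end is not included, there is no need to write sppos - 1.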
4612
03:25:06,080 --> 03:25:07,760
I don't know if he went there or not.
4613
03:25:07,760 --> 03:25:12,760
So this is just kind of a note for non-Latin character sets.
4614
03:25:15,360 --> 03:25:17,600
All programming languages from the 60s on
4615
03:25:17,600 --> 03:25:21,960
tended to work in what we call the Latin character set
4616
03:25:21,960 --> 03:25:25,520
which is United States and England and Europe
4617
03:25:25,520 --> 03:25:28,800
and lots of places use this ABC character set
4618
03:25:28,800 --> 03:25:30,680
and the special characters.
4619
03:25:30,680 --> 03:25:35,320
But it's really common to want to use different characters.
4620
03:25:35,320 --> 03:25:38,720
And so if you're going from Python two to Python three
4621
03:25:38,720 --> 03:25:40,240
and we'll talk about this a little later
4622
03:25:40,240 --> 03:25:44,160
when it matters more, luckily we're in Python three
4623
03:25:44,160 --> 03:25:48,060
and so one of the big things about Python three
4624
03:25:48,060 --> 03:25:51,120
is that all the internal strings are Unicode.
4625
03:25:51,120 --> 03:25:55,440
In Python two, there was sort of some confusion
4626
03:25:55,440 --> 03:25:56,960
as you went between strings
4627
03:25:56,960 --> 03:25:58,560
and this is just a little bit of code
4628
03:25:58,560 --> 03:26:01,340
and so I'm putting a in here,
4629
03:26:01,340 --> 03:26:06,100
some Asian characters, this is Korean actually,
4630
03:26:06,100 --> 03:26:08,860
Asian characters into X and I say
4631
03:26:08,860 --> 03:26:13,020
what kind of a thing this is and that is a string
4632
03:26:13,020 --> 03:26:15,020
and then there's this Unicode
4633
03:26:15,020 --> 03:26:17,300
and this comes from Python two.
4634
03:26:17,300 --> 03:26:20,940
If it's a Unicode operation, it's still a string
4635
03:26:20,940 --> 03:26:23,180
whereas in Python two, if you put
4636
03:26:23,180 --> 03:26:26,580
international characters into X, then it was a string
4637
03:26:26,580 --> 03:26:29,000
and then there was a separate kind of a constant
4638
03:26:29,000 --> 03:26:30,620
called a Unicode constant
4639
03:26:30,620 --> 03:26:33,280
and it was a different type and there was ways
4640
03:26:33,280 --> 03:26:36,900
that you had to mess with these Unicode variables
4641
03:26:36,900 --> 03:26:39,340
as you did things like read them from files
4642
03:26:39,340 --> 03:26:41,860
and put them back into files and did other things.
4643
03:26:41,860 --> 03:26:43,740
So it was much more difficult
4644
03:26:46,740 --> 03:26:49,620
in Python two but we're doing in Python three
4645
03:26:49,620 --> 03:26:53,700
and in Python three, it natively understands
4646
03:26:53,700 --> 03:26:57,460
non-Latin character sets, international Asian character sets,
4647
03:26:57,460 --> 03:26:59,060
Spanish, French character sets
4648
03:26:59,060 --> 03:27:01,540
and so this is a good thing for Python three
4649
03:27:01,540 --> 03:27:04,620
and this is one of the real benefits of using Python three
4650
03:27:04,620 --> 03:27:07,220
and as we start doing stuff where we're exchanging data
4651
03:27:07,220 --> 03:27:10,460
with the outside world, this will come into play
4652
03:27:10,460 --> 03:27:13,180
and I'll have to show you how to use it.
4653
03:27:13,180 --> 03:27:14,580
There was weird things that you had to do,
4654
03:27:14,580 --> 03:27:18,180
it just makes a lot more sense in Python three, okay?
4655
03:27:18,180 --> 03:27:20,540
So we've talked about strings,
4656
03:27:20,540 --> 03:27:23,500
we learned about the string, we're converting it,
4657
03:27:23,500 --> 03:27:24,620
we've done a whole bunch of stuff
4658
03:27:24,620 --> 03:27:28,740
and this is again, we're not yet doing anything
4659
03:27:28,740 --> 03:27:31,180
super useful, we're learning sort of how to like slice
4660
03:27:31,180 --> 03:27:33,980
and dice even though we're sort of not making the meal yet.
4661
03:27:33,980 --> 03:27:36,500
Up next, we're gonna talk about files,
4662
03:27:36,500 --> 03:27:38,180
we're gonna read some data and we're gonna slice
4663
03:27:38,180 --> 03:27:41,740
and dice and use all the things in the next chapter
4664
03:27:41,740 --> 03:27:43,420
that we've learned up to this point.
4665
03:27:43,420 --> 03:27:45,020
So see you in a bit.
4666
03:27:49,100 --> 03:27:51,260
Hello and welcome to chapter seven.
4667
03:27:51,260 --> 03:27:53,940
This is the chapter where it all really starts to pay off.
4668
03:27:53,940 --> 03:27:56,980
We have been learning bits and pieces
4669
03:27:56,980 --> 03:28:01,820
and doing little two lines, three lines, four lines of code
4670
03:28:01,820 --> 03:28:04,420
to learn the basic building blocks of Python
4671
03:28:04,420 --> 03:28:07,660
and learn some of the syntax and define lots of terms
4672
03:28:07,660 --> 03:28:11,140
but now we're actually going to start doing something.
4673
03:28:11,140 --> 03:28:14,060
So if you look at what we've been doing so far,
4674
03:28:14,060 --> 03:28:17,580
you know, we have been, we're inside this little computer
4675
03:28:17,580 --> 03:28:20,660
and you type up, you know, the Python says what next
4676
03:28:20,660 --> 03:28:22,620
and you give it its command and it does something
4677
03:28:22,620 --> 03:28:24,900
and you do something else and does something
4678
03:28:24,900 --> 03:28:26,500
and you do this three or four times
4679
03:28:26,500 --> 03:28:28,260
unless you write a loop and then it goes like,
4680
03:28:28,260 --> 03:28:30,500
you know, 10, 20 times and that's it.
4681
03:28:30,500 --> 03:28:33,380
And then maybe we write a thing that reads something
4682
03:28:33,380 --> 03:28:35,580
from our keyboard, gives us something back
4683
03:28:35,580 --> 03:28:37,740
and then we write something and print something out,
4684
03:28:37,740 --> 03:28:39,300
print a few things out
4685
03:28:39,300 --> 03:28:42,260
and so we've been pretty much using the keyboard,
4686
03:28:42,260 --> 03:28:45,460
the screen, the CPU and the memory.
4687
03:28:45,460 --> 03:28:47,740
That's kind of where we've been living.
4688
03:28:47,740 --> 03:28:49,620
And while it's important to talk to the keyboard
4689
03:28:49,620 --> 03:28:53,580
and the screen, the real world is things like databases
4690
03:28:53,580 --> 03:28:56,380
that live out here, files live on our systems
4691
03:28:56,380 --> 03:28:58,580
and, you know, connecting to the network
4692
03:28:58,580 --> 03:29:01,380
and reading data from the network.
4693
03:29:01,380 --> 03:29:03,420
And so that's what we're starting to do right now
4694
03:29:03,420 --> 03:29:07,160
is we're starting to be able to work outside
4695
03:29:07,160 --> 03:29:09,860
kind of our code and create things that are permanent.
4696
03:29:10,860 --> 03:29:12,220
And so we're gonna be talking,
4697
03:29:12,220 --> 03:29:14,300
initially we're gonna work on files.
4698
03:29:14,300 --> 03:29:16,500
We'll later talk to databases and the network
4699
03:29:16,500 --> 03:29:20,020
and other stuff, but for now we are talking about files.
4700
03:29:20,020 --> 03:29:22,820
And so really kind of, we're stepping out a little bit
4701
03:29:22,820 --> 03:29:25,340
and creating, reading things that are permanent
4702
03:29:25,340 --> 03:29:28,020
and creating things that are permanent.
4703
03:29:28,020 --> 03:29:30,620
The kinds of files that we're going to talk about mostly
4704
03:29:30,620 --> 03:29:33,360
are text files and you can think of these
4705
03:29:33,360 --> 03:29:36,300
as a sequence of lines in a file
4706
03:29:36,300 --> 03:29:38,060
that are easily read by Python.
4707
03:29:39,060 --> 03:29:40,920
You've been making text files all along.
4708
03:29:40,920 --> 03:29:42,900
Your, you know, hello.py.
4709
03:29:44,860 --> 03:29:46,100
That file's a text file too.
4710
03:29:46,100 --> 03:29:48,620
You're using a text editor to create that file.
4711
03:29:48,620 --> 03:29:50,140
You put your Python commands in a file,
4712
03:29:50,140 --> 03:29:52,540
you run those files and that's what it is.
4713
03:29:52,540 --> 03:29:55,980
And so a file can be thought of as a bunch of lines,
4714
03:29:55,980 --> 03:29:58,020
you know, one, two, three, four, five, six, seven,
4715
03:29:58,020 --> 03:29:59,300
a blank line here.
4716
03:29:59,300 --> 03:30:03,500
That's possible and, but the reality is,
4717
03:30:03,500 --> 03:30:05,500
is that these are actually just lines
4718
03:30:05,500 --> 03:30:07,540
and we have a special character called the new line
4719
03:30:07,540 --> 03:30:09,300
that we'll talk about in a second.
4720
03:30:10,660 --> 03:30:14,180
So to read a file, you have to call the open function.
4721
03:30:14,180 --> 03:30:16,960
And open returns what we call a file handle.
4722
03:30:16,960 --> 03:30:19,320
Open doesn't actually read the file.
4723
03:30:19,320 --> 03:30:23,540
Open makes it possible so that you can read the file.
4724
03:30:25,100 --> 03:30:27,380
So the parameters to open are,
4725
03:30:27,380 --> 03:30:29,320
it takes one parameter that's required,
4726
03:30:29,320 --> 03:30:30,780
which is the name of the file,
4727
03:30:30,780 --> 03:30:31,960
another parameter that's optional,
4728
03:30:31,960 --> 03:30:33,900
whether or not to read it or write it.
4729
03:30:33,900 --> 03:30:36,100
If we're reading the file, it doesn't harm it.
4730
03:30:36,100 --> 03:30:37,100
You can read it over and over.
4731
03:30:37,100 --> 03:30:38,500
If you write it, it actually,
4732
03:30:38,500 --> 03:30:39,660
if there's already data in that file,
4733
03:30:39,660 --> 03:30:41,140
it truncates it and writes something.
4734
03:30:41,140 --> 03:30:42,580
And we're not gonna really write files,
4735
03:30:42,580 --> 03:30:43,980
we're mostly gonna read them.
4736
03:30:43,980 --> 03:30:46,680
And so open, sort of, you pass it in a file,
4737
03:30:46,680 --> 03:30:48,260
it gives you back this file handle
4738
03:30:48,260 --> 03:30:50,820
and then you have a variable in which you store it.
4739
03:30:50,820 --> 03:30:54,700
I often call it fhand to be mnemonic.
4740
03:30:54,700 --> 03:30:57,420
You'll see my code, I use fhand all the time
4741
03:30:57,420 --> 03:31:00,140
to indicate that that is a file handle.
4742
03:31:00,140 --> 03:31:04,780
And so if we were to run this in an interactive mode,
4743
03:31:04,780 --> 03:31:08,940
we'll open mbox.txt and that is a function
4744
03:31:08,940 --> 03:31:11,300
built into Python and then it gives us back a handle.
4745
03:31:11,300 --> 03:31:12,820
It does not give the data.
4746
03:31:12,820 --> 03:31:15,940
You can kinda see this when we print out the file handle
4747
03:31:15,940 --> 03:31:17,460
using the print statement.
4748
03:31:17,460 --> 03:31:20,240
It doesn't print the lines that are in the file.
4749
03:31:20,240 --> 03:31:21,860
The lines that are in the file are sort of out there.
4750
03:31:21,860 --> 03:31:24,380
There could be like, you know, 10 million lines
4751
03:31:24,380 --> 03:31:26,760
for all we know, lines in the file.
4752
03:31:27,700 --> 03:31:30,700
The handle's like a little opening outside of your program
4753
03:31:30,700 --> 03:31:32,900
and you can talk to the file by opening it,
4754
03:31:32,900 --> 03:31:34,460
then you can read stuff, you could,
4755
03:31:34,460 --> 03:31:36,500
if you're writing the file, you can write stuff
4756
03:31:36,500 --> 03:31:38,660
and then you close the file to shut the handle down.
4757
03:31:38,660 --> 03:31:42,460
But handle is a thing that allows you to get to the file.
4758
03:31:42,460 --> 03:31:45,540
It is not the file itself and it's not the data in the file,
4759
03:31:45,540 --> 03:31:48,460
it's just a wrapper that kind of allows you.
4760
03:31:48,460 --> 03:31:49,740
So this, if you print it out, it's like,
4761
03:31:49,740 --> 03:31:51,780
that's the file we opened, we're reading it
4762
03:31:51,780 --> 03:31:53,700
and then encoding has to do with the different kinds
4763
03:31:53,700 --> 03:31:55,620
of character sets, which we talked about
4764
03:31:55,620 --> 03:31:57,940
at the end of last lecture, the Unicode character set,
4765
03:31:57,940 --> 03:32:01,540
et cetera, UTF-8 is a great character set.
4766
03:32:01,540 --> 03:32:04,820
It's probably the most typical character set
4767
03:32:04,820 --> 03:32:05,900
that you will run into,
4768
03:32:05,900 --> 03:32:08,740
although you can have different character sets of files,
4769
03:32:08,740 --> 03:32:10,220
but most of them are UTF-8.
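A small sketch of what opening a file gives you, under stated assumptions: the lecture opens mbox.txt, but here a stand-in file named sample.txt (a hypothetical name) is created in a temporary directory so the example runs anywhere.

```python
import os
import tempfile

# Create a small stand-in file; the path and contents are illustrative only.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('first line\nsecond line\n')

# The first argument is the file name; the optional second argument
# says whether to read ('r') or write ('w').
fhand = open(path, 'r', encoding='utf-8')

print(fhand)           # shows the name, mode and encoding, not the lines
print(fhand.mode)      # r
print(fhand.encoding)  # utf-8

fhand.close()          # shut the handle down when finished
```

Printing the handle describes the connection to the file. None of the file's data has been read at this point.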
4770
03:32:11,940 --> 03:32:14,220
So, of course, this is Python.
4771
03:32:14,220 --> 03:32:16,980
If you make a mistake and there's a file that doesn't exist,
4772
03:32:16,980 --> 03:32:19,800
we get a traceback and it blows up.
4773
03:32:23,340 --> 03:32:25,620
We'll show you in a second how to deal with that.
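One common way to handle a missing file is to catch the error that open raises. This is a sketch, not necessarily the approach shown later in the lecture, and the file name is a made-up one that is assumed not to exist.

```python
missing_name = 'no-such-file-for-this-demo.txt'  # hypothetical file name

try:
    fhand = open(missing_name)
    opened = True
except FileNotFoundError as err:
    # Without the try/except, Python would stop here with a traceback.
    opened = False
    print('Could not open file:', err.filename)
```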
4774
03:32:25,620 --> 03:32:28,420
Now, the newline character is an important part
4775
03:32:28,420 --> 03:32:32,580
of file reading and in strings,
4776
03:32:32,580 --> 03:32:34,480
we can put the newline character in
4777
03:32:34,480 --> 03:32:36,740
by this backslash n character.
4778
03:32:36,740 --> 03:32:39,500
And the backslash n is the character that indicates
4779
03:32:39,500 --> 03:32:42,100
that we're supposed to go to another line.
4780
03:32:42,100 --> 03:32:44,940
Go to a newline, go to a newline.
4781
03:32:44,940 --> 03:32:46,140
And so we have, what is this?
4782
03:32:46,140 --> 03:32:48,540
Well, that's a backslash n, that's a backslash n.
4783
03:32:50,580 --> 03:32:53,140
And so, if we print it out, we print it this way,
4784
03:32:53,140 --> 03:32:54,940
we see that the backslash n is in there.
4785
03:32:54,940 --> 03:32:55,940
This is how we type it.
4786
03:32:55,940 --> 03:32:58,740
We actually type backslash n to Python
4787
03:32:58,740 --> 03:33:01,960
to indicate that we're supposed to put that there.
4788
03:33:03,140 --> 03:33:04,900
But if we do a print statement,
4789
03:33:04,900 --> 03:33:06,620
it actually interprets the backslash n,
4790
03:33:06,620 --> 03:33:09,460
so the backslash n causes this movement to the beginning.
4791
03:33:09,460 --> 03:33:11,860
Now, the print actually, at the end of this,
4792
03:33:11,860 --> 03:33:13,180
adds another backslash n.
4793
03:33:13,180 --> 03:33:15,900
So, the backslash n that we put in
4794
03:33:15,900 --> 03:33:17,940
by putting it into the string is that one.
4795
03:33:17,940 --> 03:33:20,740
And then print always puts a backslash n at the end.
4796
03:33:21,700 --> 03:33:25,100
There's actually a way to override that backslash n behavior
4797
03:33:25,100 --> 03:33:26,900
by putting something on the print statement,
4798
03:33:26,900 --> 03:33:28,580
which we'll talk about later.
4799
03:33:28,580 --> 03:33:30,540
Now, it's important to note
4800
03:33:30,540 --> 03:33:33,460
that the backslash n is one character, right?
4801
03:33:33,460 --> 03:33:37,900
And so, even though this x backslash n y prints this,
4802
03:33:37,900 --> 03:33:40,500
and then print adds another new line to go down to here,
4803
03:33:40,500 --> 03:33:41,940
if you ask how many characters,
4804
03:33:41,940 --> 03:33:44,860
what is the length of this, well, it's only three.
4805
03:33:44,860 --> 03:33:46,740
That's because that's a character,
4806
03:33:46,740 --> 03:33:49,260
the backslash n is a character, and the y is a character.
4807
03:33:49,260 --> 03:33:50,820
So, it's a three character string.
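The three-character claim can be checked directly. A minimal sketch, with the variable name stuff chosen only for illustration:

```python
stuff = 'X\nY'       # typed as four keystrokes, stored as three characters

print(repr(stuff))   # repr shows the escape as typed: 'X\nY'
print(stuff)         # print interprets it: X and Y land on separate lines
print(len(stuff))    # 3
```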
4808
03:33:50,820 --> 03:33:52,620
So, the backslash n is a character
4809
03:33:52,620 --> 03:33:54,380
like all the rest of the characters,
4810
03:33:54,380 --> 03:33:59,100
but it's only, we encode it by typing backslash n.
4811
03:33:59,100 --> 03:34:01,620
It's called an escape, where the backslash is the escape.
4812
03:34:01,620 --> 03:34:04,220
Backslash n is a way to say new line,
4813
03:34:04,220 --> 03:34:05,500
because we can't see it.
4814
03:34:05,500 --> 03:34:08,060
It's a way for us to encode in a string
4815
03:34:08,060 --> 03:34:11,660
this non-printable character, this invisible character.
4816
03:34:11,660 --> 03:34:13,780
The white space, it's part of white space.
4817
03:34:14,900 --> 03:34:16,500
So, as we're reading through the file,
4818
03:34:16,500 --> 03:34:18,260
we can think of it as a sequence of lines,
4819
03:34:18,260 --> 03:34:20,180
and we can read these a line at a time.
4820
03:34:20,180 --> 03:34:22,700
We can also read them a character at a time if we want.
4821
03:34:22,700 --> 03:34:25,040
And so, but it's more common to say read this line,
4822
03:34:25,040 --> 03:34:27,180
read the next line, read the line after that,
4823
03:34:27,180 --> 03:34:28,980
et cetera, et cetera, et cetera.
4824
03:34:28,980 --> 03:34:31,060
But the way to best think about this,
4825
03:34:31,940 --> 03:34:33,140
it doesn't really matter.
4826
03:34:33,140 --> 03:34:34,540
You can think about it as lines,
4827
03:34:34,540 --> 03:34:36,940
and we will in most of the programs that we write.
4828
03:34:36,940 --> 03:34:39,820
But realize that the way when we see this,
4829
03:34:41,380 --> 03:34:44,980
we see it like this, it comes back to the beginning,
4830
03:34:44,980 --> 03:34:45,820
it comes back to the beginning.
4831
03:34:45,820 --> 03:34:47,980
There's a character in the file.
4832
03:34:47,980 --> 03:34:50,180
At each of these points to say go back to the beginning.
4833
03:34:50,180 --> 03:34:53,380
It's like hitting the enter key on your computer.
4834
03:34:53,380 --> 03:34:54,500
And that is a new line.
4835
03:34:54,500 --> 03:34:56,940
So you have to think that in the file,
4836
03:34:56,940 --> 03:35:00,300
in order for your text editor and Python and everybody
4837
03:35:00,300 --> 03:35:03,660
to know where the lines end, you put new lines in the file.
4838
03:35:03,660 --> 03:35:05,420
And that's another character.
4839
03:35:05,420 --> 03:35:08,740
So, you know, this looks like an empty line.
4840
03:35:08,740 --> 03:35:10,420
This line here looks like an empty line,
4841
03:35:10,420 --> 03:35:11,780
but really it has a single character,
4842
03:35:11,780 --> 03:35:13,300
and the character is a new line.
4843
03:35:13,300 --> 03:35:14,980
And it turns out that in a bit,
4844
03:35:14,980 --> 03:35:17,620
we're gonna need to keep track of the fact that
4845
03:35:17,620 --> 03:35:20,440
every line is ended by a new line.
4846
03:35:20,440 --> 03:35:22,200
So up next, I'm gonna talk a little bit
4847
03:35:22,200 --> 03:35:24,660
about how to read files in Python.
4848
03:35:28,480 --> 03:35:30,560
So we're gonna find that there's a number of different ways
4849
03:35:30,560 --> 03:35:31,740
that we can read through the file.
4850
03:35:31,740 --> 03:35:33,300
But the most common way that we're gonna read
4851
03:35:33,300 --> 03:35:35,980
through the file is to treat it as a sequence of lines.
4852
03:35:35,980 --> 03:35:38,100
And we're gonna use the determinate loop,
4853
03:35:38,100 --> 03:35:40,660
the for loop, to do this.
4854
03:35:40,660 --> 03:35:43,740
And so what happens here is we get back this handle,
4855
03:35:43,740 --> 03:35:45,500
that opens the file and gives us back the handle.
4856
03:35:45,500 --> 03:35:49,100
That handle xfile is the variable I named,
4857
03:35:49,100 --> 03:35:51,020
I just named it xfile.
4858
03:35:51,020 --> 03:35:52,260
That's not the data.
4859
03:35:52,260 --> 03:35:54,860
But it is a sequence.
4860
03:35:54,860 --> 03:35:59,100
It is that file handle represents to Python a sequence
4861
03:35:59,100 --> 03:36:00,820
that we can potentially walk through
4862
03:36:00,820 --> 03:36:02,140
and then get all the lines.
4863
03:36:02,140 --> 03:36:04,940
And it's the simplest, most beautiful, elegant way
4864
03:36:04,940 --> 03:36:07,060
to read all the lines in a file.
4865
03:36:07,060 --> 03:36:09,700
We use the for loop and we have an iteration variable.
4866
03:36:09,700 --> 03:36:12,540
This is going to take, when we talk about the file,
4867
03:36:12,540 --> 03:36:14,540
cheese is gonna be the first line, then the second line,
4868
03:36:14,540 --> 03:36:15,780
then the third line, then the fourth line.
4869
03:36:15,780 --> 03:36:17,300
So it's like going through a string,
4870
03:36:17,300 --> 03:36:18,500
but you're going through a file now
4871
03:36:18,500 --> 03:36:19,780
and you're getting it line by line.
4872
03:36:19,780 --> 03:36:21,300
So that's each line.
4873
03:36:21,300 --> 03:36:22,840
I just picked a variable named cheese
4874
03:36:22,840 --> 03:36:23,900
so you didn't get confused.
4875
03:36:23,900 --> 03:36:25,500
Later I'll call this line.
4876
03:36:25,500 --> 03:36:28,300
But Python doesn't know anything special
4877
03:36:28,300 --> 03:36:30,340
by naming that variable line.
4878
03:36:30,340 --> 03:36:32,980
Okay, and so this is, it's the for and the in.
4879
03:36:32,980 --> 03:36:36,820
And so I read this as for each line
4880
03:36:36,820 --> 03:36:40,260
in the file handle xfile.
4881
03:36:40,260 --> 03:36:43,460
So run this loop one time for every line
4882
03:36:43,460 --> 03:36:44,500
and then print it out.
4883
03:36:44,500 --> 03:36:47,920
So it's actually really quite simple, okay?
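A sketch of the loop described above, using the names xfile and cheese from the lecture. The stand-in file and its contents are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Stand-in file with three lines (illustrative contents).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('line one\nline two\nline three\n')

xfile = open(path, encoding='utf-8')
seen = []
for cheese in xfile:      # cheese is the first line, then the second, ...
    seen.append(cheese)
    print(cheese)         # each line still carries its own newline
xfile.close()

print(len(seen))          # 3
```

The variable name has no special meaning to Python. The loop runs once per line whatever the iteration variable is called.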
4884
03:36:49,060 --> 03:36:53,340
Other languages like C or C++ or other languages,
4885
03:36:53,340 --> 03:36:55,860
they have to write while loops with end of file conditions
4886
03:36:55,860 --> 03:36:58,740
and all kinds of things that make this very difficult.
4887
03:36:58,740 --> 03:37:02,300
But this is one of the prettiest things that Python has.
4888
03:37:02,300 --> 03:37:05,020
It's a very, very pretty thing.
4889
03:37:07,540 --> 03:37:09,960
Okay, so let's talk about what we might do.
4890
03:37:09,960 --> 03:37:12,060
And we're going kind of back to iterations now.
4891
03:37:12,060 --> 03:37:14,180
What if we wanted to count the number of lines in a file?
4892
03:37:14,180 --> 03:37:16,540
Well, this is a basic loop counting pattern.
4893
03:37:17,500 --> 03:37:20,500
So we open the file and then like in all these loops,
4894
03:37:20,500 --> 03:37:23,260
we do something to sort of prime the loop to get it started,
4895
03:37:23,260 --> 03:37:24,940
set a variable count to zero.
4896
03:37:24,940 --> 03:37:26,860
And I'm gonna use the variable line
4897
03:37:26,860 --> 03:37:29,260
that's gonna go through each of the lines in the file
4898
03:37:29,260 --> 03:37:32,460
for line in fhand, down the file.
4899
03:37:32,460 --> 03:37:33,980
And it's gonna run this loop once
4900
03:37:33,980 --> 03:37:35,060
for each line in the file
4901
03:37:35,060 --> 03:37:37,260
and the variable line is gonna change.
4902
03:37:37,260 --> 03:37:39,620
But all I'm gonna do is add count equals count plus one.
4903
03:37:39,620 --> 03:37:41,540
And so that's just like from counters,
4904
03:37:41,540 --> 03:37:43,420
that's just how you detect.
4905
03:37:43,420 --> 03:37:44,620
So every time we see a line,
4906
03:37:44,620 --> 03:37:45,700
we're just gonna add one to the counter.
4907
03:37:45,700 --> 03:37:46,580
We're not printing the line,
4908
03:37:46,580 --> 03:37:48,820
we're not even looking at its data at this point.
4909
03:37:48,820 --> 03:37:49,940
And then when the loop is done,
4910
03:37:49,940 --> 03:37:51,500
however many times it has to go,
4911
03:37:51,500 --> 03:37:54,340
out it comes and we print out line count equals count.
4912
03:37:54,340 --> 03:37:56,700
And so if we open mbox.txt,
4913
03:37:56,700 --> 03:37:58,180
this is gonna do all this work
4914
03:37:58,180 --> 03:37:59,620
and then print this line out
4915
03:37:59,620 --> 03:38:03,140
and say line count is 132,045.
4916
03:38:03,140 --> 03:38:05,940
So this is a little five line program
4917
03:38:05,940 --> 03:38:08,180
that shows you how to count the lines
4918
03:38:08,180 --> 03:38:10,660
in a text file using Python.
4919
03:38:10,660 --> 03:38:12,900
Again, simple and elegant
4920
03:38:12,900 --> 03:38:15,620
and not too much syntax for you to have to learn.
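The counting pattern can be sketched like this. The lecture runs it against mbox.txt; a four-line stand-in file is assumed here so the example runs on its own.

```python
import os
import tempfile

# Stand-in for mbox.txt (illustrative contents, four lines).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('a\nb\nc\nd\n')

fhand = open(path, encoding='utf-8')
count = 0                  # prime the loop
for line in fhand:         # runs once per line; the data is not examined
    count = count + 1
fhand.close()

print('Line Count:', count)   # Line Count: 4
```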
4921
03:38:16,800 --> 03:38:18,920
Now it's also possible to read the file
4922
03:38:18,920 --> 03:38:22,120
as a series of characters all in one go.
4923
03:38:22,120 --> 03:38:23,180
Read the whole file in.
4924
03:38:23,180 --> 03:38:25,940
Now you gotta be careful depending on the size of the file,
4925
03:38:25,940 --> 03:38:28,340
this is gonna lead to a string variable
4926
03:38:28,340 --> 03:38:29,740
with a lot of data in it.
4927
03:38:29,740 --> 03:38:32,180
Now if it's 100,000 characters,
4928
03:38:32,180 --> 03:38:33,980
that's actually kind of a small thing.
4929
03:38:33,980 --> 03:38:36,840
But if it was 10 million lines,
4930
03:38:36,840 --> 03:38:38,100
that would probably not be good.
4931
03:38:38,100 --> 03:38:40,020
You'd wanna read it one line at a time
4932
03:38:40,020 --> 03:38:42,620
and process each line and then do something.
4933
03:38:42,620 --> 03:38:46,140
But mbox-short.txt is a small little file.
4934
03:38:46,140 --> 03:38:50,260
So we open it and we get back a file object,
4935
03:38:50,260 --> 03:38:53,060
file handle object, and we call the read method.
4936
03:38:53,060 --> 03:38:55,620
And that says go through and read all the text
4937
03:38:55,620 --> 03:38:59,220
and give it back in one big blob, one big string,
4938
03:38:59,220 --> 03:39:00,660
and I'll put it in inp.
4939
03:39:00,660 --> 03:39:03,140
And so that's where you have a line, a new line,
4940
03:39:03,140 --> 03:39:05,500
a line, a new line, a line, a new line.
4941
03:39:05,500 --> 03:39:08,240
So not really lines, it's just a sequence of characters
4942
03:39:08,240 --> 03:39:10,620
with new lines in there to punctuate them.
4943
03:39:10,620 --> 03:39:11,900
And now you can split that,
4944
03:39:11,900 --> 03:39:14,060
later we'll see how to split that
4945
03:39:14,060 --> 03:39:16,500
into separate lines if you want.
4946
03:39:16,500 --> 03:39:19,060
Now I picked a file that was short,
4947
03:39:19,060 --> 03:39:22,660
and so this inp variable now has a string in it
4948
03:39:22,660 --> 03:39:24,480
and I can use the len function,
4949
03:39:24,480 --> 03:39:26,100
pass a string into the len function,
4950
03:39:26,100 --> 03:39:29,040
it says oh 94,626 characters.
4951
03:39:29,040 --> 03:39:32,820
That's kind of a small little file.
4952
03:39:32,820 --> 03:39:35,280
And perfectly okay to read it all in one go.
4953
03:39:36,380 --> 03:39:38,900
And so now I say just print the first 20 characters,
4954
03:39:38,900 --> 03:39:41,300
that's beginning to up to but not including 20,
4955
03:39:41,300 --> 03:39:44,520
and so it shows the first 20 characters
4956
03:39:44,520 --> 03:39:46,500
of that little file is a from line,
4957
03:39:46,500 --> 03:39:48,140
because this is a mailbox file.
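A sketch of reading a whole file in one go with the read method. The mailbox-style text below is invented for illustration and stands in for the lecture's short mailbox file.

```python
import os
import tempfile

# Invented mailbox-style contents; the address is a placeholder.
sample_text = ('From alice@example.com Sat Jan  5 09:14:16 2008\n'
               'Subject: hello\n'
               '\n'
               'Body text\n')
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write(sample_text)

fhand = open(path, encoding='utf-8')
inp = fhand.read()       # the entire file as one string, newlines included
fhand.close()

print(len(inp))          # total number of characters
print(inp[:20])          # the first 20 characters: positions 0 through 19
```

This suits small files. For very large files, reading one line at a time keeps memory use low.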
4958
03:39:51,080 --> 03:39:53,560
Now let's say we're going to do a searching,
4959
03:39:53,560 --> 03:39:56,140
and we did this loop where you're looking for something.
4960
03:39:56,140 --> 03:39:58,100
And so we're going to search for lines
4961
03:39:58,100 --> 03:40:01,540
that have a prefix of from, okay?
4962
03:40:01,540 --> 03:40:02,380
That's what we're going to do,
4963
03:40:02,380 --> 03:40:03,500
and we're going to print those lines out.
4964
03:40:03,500 --> 03:40:05,460
So there's lots of lines in this file,
4965
03:40:07,140 --> 03:40:09,200
line, line, line, line, from,
4966
03:40:09,200 --> 03:40:11,180
line, line, line, line, from, right?
4967
03:40:11,180 --> 03:40:12,060
On and on and on.
4968
03:40:12,060 --> 03:40:13,640
And we only want to show these lines,
4969
03:40:13,640 --> 03:40:14,640
the ones that match, right?
4970
03:40:14,640 --> 03:40:16,140
That's what we want to do.
4971
03:40:16,140 --> 03:40:20,900
And so we are going to write an open statement
4972
03:40:20,900 --> 03:40:22,720
and then we're going to loop through,
4973
03:40:22,720 --> 03:40:23,980
and we're going to ask the question,
4974
03:40:23,980 --> 03:40:26,980
if the line starts with from, print it.
4975
03:40:26,980 --> 03:40:29,340
So sometimes it's going to skip, skip, skip, skip,
4976
03:40:29,340 --> 03:40:30,180
and then it's going to run it,
4977
03:40:30,180 --> 03:40:32,260
and skip, skip, skip, skip, skip,
4978
03:40:32,260 --> 03:40:34,060
and it's going to run it, skip, skip, skip,
4979
03:40:34,060 --> 03:40:35,940
and then it's going to run it, okay?
4980
03:40:35,940 --> 03:40:38,580
So that's the basic idea,
4981
03:40:38,580 --> 03:40:40,620
and then it'll finish when it's all said and done.
4982
03:40:40,620 --> 03:40:44,220
And so this is like a criteria, this is like a search.
4983
03:40:44,220 --> 03:40:47,040
We're looking for lines that match the string,
4984
03:40:47,040 --> 03:40:50,200
that have the string from as their prefix.
4985
03:40:50,200 --> 03:40:52,960
Now, when we look at the output of this,
4986
03:40:52,960 --> 03:40:54,420
it's kind of weird.
4987
03:40:54,420 --> 03:40:58,120
We see kind of these little blank lines that show up.
4988
03:40:58,120 --> 03:41:01,380
Blank, blank, blank, blank, blank, blank, blank.
4989
03:41:01,380 --> 03:41:03,120
What's going on here?
4990
03:41:04,080 --> 03:41:04,920
What's going on?
4991
03:41:04,920 --> 03:41:06,160
So let's take a quick look.
4992
03:41:07,360 --> 03:41:09,240
The problem is, is new lines.
4993
03:41:09,240 --> 03:41:13,480
Well, I mentioned that the file has new lines in them.
4994
03:41:13,480 --> 03:41:15,640
And so when you do the for loop,
4995
03:41:15,640 --> 03:41:17,200
it doesn't throw the new lines away.
4996
03:41:17,200 --> 03:41:19,000
As you might expect,
4997
03:41:19,000 --> 03:41:20,840
it would be kind of nice if it did, but it doesn't.
4998
03:41:20,840 --> 03:41:23,560
It actually shows you when you read,
4999
03:41:23,560 --> 03:41:26,480
it reads that first line up to and including the new line
5000
03:41:26,480 --> 03:41:28,320
and gives you that back as the variable.
5001
03:41:28,320 --> 03:41:29,960
So that is the first new line.
5002
03:41:29,960 --> 03:41:31,600
So that means it's going to go down.
5003
03:41:31,600 --> 03:41:34,740
And then the print statement actually adds another new line.
5004
03:41:34,740 --> 03:41:36,600
So that's the second line of the file
5005
03:41:36,600 --> 03:41:38,200
has a new line at the end of it,
5006
03:41:38,200 --> 03:41:40,000
and the print statement adds another new line.
5007
03:41:40,000 --> 03:41:41,600
So if we take a look at the code,
5008
03:41:43,040 --> 03:41:46,440
there is a new line, oops, come back.
5009
03:41:46,440 --> 03:41:50,560
If we take a look at the code,
5010
03:41:50,560 --> 03:41:53,600
this variable line has a new line in it, oops,
5011
03:41:53,600 --> 03:41:54,440
where am I at?
5012
03:41:54,440 --> 03:41:56,160
I'm in the wrong slide, there we go.
5013
03:41:58,500 --> 03:42:01,040
Yeah, this is what I want to do.
5014
03:42:01,040 --> 03:42:03,440
If we look at the code, there's a new line in here,
5015
03:42:03,440 --> 03:42:05,200
and then the print adds another new line.
5016
03:42:05,200 --> 03:42:07,600
So the print adds a separate new line.
5017
03:42:07,600 --> 03:42:09,600
And that's how we get two new lines.
5018
03:42:09,600 --> 03:42:10,800
The print statement's new line
5019
03:42:10,800 --> 03:42:12,560
and the new line from the file.
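The source of the blank lines can be made visible with repr. This is a sketch with invented file contents; the prefix 'From:' is an assumption about the exact string being matched.

```python
import os
import tempfile

# Invented stand-in contents with two matching lines.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('From: alice@example.com\nSubject: first\n'
              'From: bob@example.com\nSubject: second\n')

fhand = open(path, encoding='utf-8')
matches = []
for line in fhand:
    if line.startswith('From:'):
        matches.append(line)
        print(repr(line))   # repr shows the trailing \n kept from the file
        print(line)         # print adds a second newline: a blank line follows
fhand.close()
```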
5020
03:42:13,840 --> 03:42:14,760
Here's how we fix it.
5021
03:42:14,760 --> 03:42:16,040
And you're going to write this code a lot
5022
03:42:16,040 --> 03:42:18,080
because when you're reading text files,
5023
03:42:18,080 --> 03:42:18,960
you end up with a new line.
5024
03:42:18,960 --> 03:42:20,440
And often you don't want the new line.
5025
03:42:20,440 --> 03:42:23,880
But thankfully, as we saw in the previous chapter,
5026
03:42:23,880 --> 03:42:28,340
there is a nice little function in Python for strings
5027
03:42:28,340 --> 03:42:31,040
called strip that allows you to throw away white space.
5028
03:42:33,080 --> 03:42:37,520
And to review, remember white space
5029
03:42:37,520 --> 03:42:39,160
is anything that doesn't print.
5030
03:42:39,160 --> 03:42:41,900
And this new line is a non-printing character.
5031
03:42:41,900 --> 03:42:43,440
So rstrip gets rid of it.
5032
03:42:43,440 --> 03:42:45,360
So it's a way to get rid of white space.
5033
03:42:45,360 --> 03:42:47,720
And rstrip does it from the right end.
5034
03:42:47,720 --> 03:42:50,320
So it's the right end of the string.
5035
03:42:51,880 --> 03:42:54,520
And so if we just are going to loop
5036
03:42:54,520 --> 03:42:55,800
through all the lines in the file,
5037
03:42:55,800 --> 03:42:57,520
we say line equals line our strip.
5038
03:42:57,520 --> 03:43:00,040
And then this variable no longer has the new line
5039
03:43:00,040 --> 03:43:01,080
at the end of it.
5040
03:43:01,080 --> 03:43:02,120
We have our little if statement.
5041
03:43:02,120 --> 03:43:04,680
And if we print it, then this line,
5042
03:43:04,680 --> 03:43:05,880
the data has no thing.
5043
03:43:05,880 --> 03:43:08,200
And then the data has no new line in it.
5044
03:43:08,200 --> 03:43:09,520
So the print only goes down one.
5045
03:43:09,520 --> 03:43:12,160
And so now we have single spaced output.
5046
03:43:12,160 --> 03:43:13,460
And so you're going to be doing that a lot.
5047
03:43:13,460 --> 03:43:15,960
It's really common to read through a file
5048
03:43:15,960 --> 03:43:18,360
and then just strip the new line
5049
03:43:18,360 --> 03:43:21,740
or any trailing space off the end of that.
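A sketch of the fix, assuming the same kind of invented stand-in file and the prefix 'From:'. Calling rstrip on each line removes the trailing newline before printing.

```python
import os
import tempfile

# Invented stand-in contents with two matching lines.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('From: alice@example.com\nSubject: first\n'
              'From: bob@example.com\nSubject: second\n')

fhand = open(path, encoding='utf-8')
matches = []
for line in fhand:
    line = line.rstrip()            # drop whitespace from the right end
    if line.startswith('From:'):
        matches.append(line)
        print(line)                 # only print's own newline remains
fhand.close()
```

The output is now single spaced, because each printed line ends with exactly one newline.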
5050
03:43:22,640 --> 03:43:25,520
Now, there's a couple of ways to do a loop like this.
5051
03:43:25,520 --> 03:43:28,160
And let's just think of this as
5052
03:43:29,200 --> 03:43:31,940
we're looking for a line, a file
5053
03:43:31,940 --> 03:43:33,640
with lots of different lines in it.
5054
03:43:33,640 --> 03:43:35,000
And we want to ignore all the lines
5055
03:43:35,000 --> 03:43:36,480
except some say good lines.
5056
03:43:36,480 --> 03:43:38,320
And we want to do something with those good lines
5057
03:43:38,320 --> 03:43:39,840
or the lines we're looking for.
5058
03:43:39,840 --> 03:43:40,800
Needle in a haystack.
5059
03:43:40,800 --> 03:43:44,040
This is like searching for a needle in a haystack.
5060
03:43:44,040 --> 03:43:45,640
So if you look at this code at high level,
5061
03:43:45,640 --> 03:43:47,080
we're going to loop through everything.
5062
03:43:47,080 --> 03:43:49,800
And then we're sort of picking which lines are.
5063
03:43:49,800 --> 03:43:52,460
And these are the good lines down here.
5064
03:43:52,460 --> 03:43:54,680
Now, often we have a bunch more code that we want to do.
5065
03:43:54,680 --> 03:43:55,840
And we're not just printing them,
5066
03:43:55,840 --> 03:43:57,160
but we're going to do a lot of code.
5067
03:43:57,160 --> 03:43:59,240
So sometimes you actually structure the loop
5068
03:43:59,240 --> 03:44:01,400
a little bit differently.
5069
03:44:01,400 --> 03:44:02,520
And so the way to do it,
5070
03:44:02,520 --> 03:44:04,280
and this is going to do the exact same thing,
5071
03:44:04,280 --> 03:44:06,400
it's just a little different way
5072
03:44:06,400 --> 03:44:07,680
of thinking about this loop.
5073
03:44:07,680 --> 03:44:09,360
So the top part is the same.
5074
03:44:09,360 --> 03:44:10,400
We're stripping it.
5075
03:44:10,400 --> 03:44:12,840
And what we're doing here is everything's the same here
5076
03:44:12,840 --> 03:44:13,960
except we add this "not".
5077
03:44:13,960 --> 03:44:16,160
If the line does not start with from,
5078
03:44:16,160 --> 03:44:18,080
that's the translation of that.
5079
03:44:18,080 --> 03:44:21,320
If the line does not start with from, continue.
5080
03:44:21,320 --> 03:44:24,380
So basically we have a skipping pattern.
5081
03:44:24,380 --> 03:44:27,880
So the lines we're not interested in, we skip.
5082
03:44:27,880 --> 03:44:30,760
So we come down, we skip a lot of lines.
5083
03:44:30,760 --> 03:44:32,400
Choo, choo, choo, choo, choo.
5084
03:44:32,400 --> 03:44:34,040
And then we find a line that's good,
5085
03:44:34,040 --> 03:44:35,480
and then we fall through.
5086
03:44:35,480 --> 03:44:37,040
So this is the good code.
5087
03:44:37,040 --> 03:44:38,680
And then we have all the other good code
5088
03:44:38,680 --> 03:44:40,080
that we want to do to that line.
5089
03:44:40,080 --> 03:44:41,940
We have that showing up down here.
5090
03:44:43,320 --> 03:44:44,760
And so there's just two patterns
5091
03:44:44,760 --> 03:44:47,920
that are two ways to do the exact same thing.
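The two loop shapes discussed above can be sketched side by side. The sample lines are hypothetical and stand in for lines read from a file; both loops select the same lines.

```python
# Hypothetical sample lines standing in for the contents of a mail file.
lines = ["From: a@example.org\n", "Subject: hi\n", "From: b@example.org\n", "X-Header: 1\n"]

# Pattern 1: do the work inside the if.
picked_if = []
for line in lines:
    line = line.rstrip()
    if line.startswith("From:"):
        picked_if.append(line)

# Pattern 2: skip the uninteresting lines with continue, then fall through.
picked_skip = []
for line in lines:
    line = line.rstrip()
    if not line.startswith("From:"):
        continue
    # Everything below here only runs for the good lines.
    picked_skip.append(line)

print(picked_if == picked_skip)
```

The skipping form keeps the "good line" code at the top indentation level of the loop, which helps when that code grows long.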
5092
03:44:47,920 --> 03:44:50,040
So another way to select the lines
5093
03:44:50,040 --> 03:44:52,120
that we're interested in is to use the in operator.
5094
03:44:52,120 --> 03:44:54,280
So we talked before about the in operator
5095
03:44:54,280 --> 03:44:55,600
and how that works.
5096
03:44:55,600 --> 03:44:59,240
So we're basically gonna use the continue skipping method.
5097
03:44:59,240 --> 03:45:00,440
So we're gonna read all the lines,
5098
03:45:00,440 --> 03:45:01,360
these first few lines.
5099
03:45:01,360 --> 03:45:06,360
If uct.ac.za is not in the line, skip it.
5100
03:45:06,960 --> 03:45:09,280
And so this is gonna print out all the lines
5101
03:45:09,280 --> 03:45:14,000
that have the string uct.ac.za in them.
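A small sketch of selecting lines with the in operator and the continue skipping method. The sample lines and addresses are invented for illustration.

```python
# Hypothetical sample lines; only two of them contain the search string.
lines = [
    "From: someone@uct.ac.za",
    "From: someone.else@example.edu",
    "X-Note: sender is another.person@uct.ac.za",
]

matches = []
for line in lines:
    line = line.rstrip()
    if "uct.ac.za" not in line:
        continue          # skip lines that do not contain the string
    matches.append(line)
    print(line)
```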
5102
03:45:14,000 --> 03:45:16,080
And so you see this is the output of the program,
5103
03:45:16,080 --> 03:45:17,240
dot, dot, dot, dot, dot.
5104
03:45:20,040 --> 03:45:22,080
Sometimes you'll have programs
5105
03:45:22,080 --> 03:45:24,400
that want to read different files.
5106
03:45:24,400 --> 03:45:26,280
Often I give assignments where I say,
5107
03:45:26,280 --> 03:45:28,280
show me how this program runs on the short file,
5108
03:45:28,280 --> 03:45:30,280
and then show me again how it runs on the long file,
5109
03:45:30,280 --> 03:45:31,320
just like this.
5110
03:45:31,320 --> 03:45:33,640
And so the way we do that to input the file name,
5111
03:45:33,640 --> 03:45:34,720
instead of making the file name
5112
03:45:34,720 --> 03:45:36,700
be a constant to the open call,
5113
03:45:36,700 --> 03:45:39,760
we make the file name be an input.
5114
03:45:39,760 --> 03:45:41,840
So we just run an input statement,
5115
03:45:41,840 --> 03:45:43,360
which gives us a prompt.
5116
03:45:43,360 --> 03:45:45,240
And then we type mbox.txt,
5117
03:45:45,240 --> 03:45:47,200
and then that shows up in this variable fname.
5118
03:45:47,200 --> 03:45:48,880
It's of course a string all the time.
5119
03:45:48,880 --> 03:45:51,160
And we pass that into open, and then we open it,
5120
03:45:51,160 --> 03:45:53,880
and then we do the count operation.
5121
03:45:53,880 --> 03:45:56,880
So if we enter mbox.txt, it counts 1797
5122
03:46:00,600 --> 03:46:02,000
subject lines in mbox.
5123
03:46:02,000 --> 03:46:03,240
And if we give it mbox short,
5124
03:46:03,240 --> 03:46:05,540
it says there are 27 subject lines in mbox.
5125
03:46:05,540 --> 03:46:07,760
And again, this is another one of those ifs,
5126
03:46:07,760 --> 03:46:09,120
and it's just counting,
5127
03:46:09,120 --> 03:46:13,280
but only counting lines that match a particular pattern.
5128
03:46:16,400 --> 03:46:19,040
Okay, so now the user can also type bad file names,
5129
03:46:19,040 --> 03:46:21,200
and we need to be able to deal with that as well.
5130
03:46:21,200 --> 03:46:25,200
And so we're making a small change to the code.
5131
03:46:26,600 --> 03:46:29,160
The dangerous code is this line right here.
5132
03:46:29,160 --> 03:46:31,440
This line right here is gonna trace back
5133
03:46:31,440 --> 03:46:32,800
if that file doesn't exist.
5134
03:46:32,800 --> 03:46:34,200
So what do we do?
5135
03:46:34,200 --> 03:46:35,840
Well, we're gonna just expand that.
5136
03:46:35,840 --> 03:46:37,640
The rest of this program is exactly the same.
5137
03:46:37,640 --> 03:46:40,380
The only thing different is we've got this line.
5138
03:46:40,380 --> 03:46:42,840
We took out insurance on it,
5139
03:46:42,840 --> 03:46:44,660
and we know that it might blow up,
5140
03:46:44,660 --> 03:46:48,160
and so we have it in a try and except block.
5141
03:46:50,160 --> 03:46:52,500
So here's how the code runs.
5142
03:46:54,160 --> 03:46:56,080
So, you know, the input runs.
5143
03:46:56,080 --> 03:46:57,400
We type in a good file name.
5144
03:46:57,400 --> 03:46:58,880
It comes in here.
5145
03:46:58,880 --> 03:47:01,200
This works, and so it skips the except,
5146
03:47:01,200 --> 03:47:03,260
so it runs the code and prints out the count.
5147
03:47:03,260 --> 03:47:05,020
So that's the good pattern.
5148
03:47:05,020 --> 03:47:06,940
The bad pattern is here,
5149
03:47:08,300 --> 03:47:09,840
we type in a bad file name.
5150
03:47:09,840 --> 03:47:11,440
It comes in the try/except.
5151
03:47:11,440 --> 03:47:14,120
This file name is na na boo boo,
5152
03:47:14,120 --> 03:47:17,160
and it's gonna blow up, so this line blows up.
5153
03:47:17,160 --> 03:47:19,040
So it jumps down into the except code,
5154
03:47:19,040 --> 03:47:21,120
prints out, file cannot be opened.
5155
03:47:21,120 --> 03:47:22,360
So it prints this out.
5156
03:47:22,360 --> 03:47:24,300
Now this quit is really important,
5157
03:47:24,300 --> 03:47:25,860
because if we don't put this quit in here,
5158
03:47:25,860 --> 03:47:27,340
it's gonna continue down here,
5159
03:47:27,340 --> 03:47:28,340
and that's gonna blow up here,
5160
03:47:28,340 --> 03:47:31,880
because file handle is not defined properly at this point.
5161
03:47:31,880 --> 03:47:33,840
And so what we have is,
5162
03:47:33,840 --> 03:47:36,760
we have this quit is a special function
5163
03:47:36,760 --> 03:47:39,480
where it comes in and never returns.
5164
03:47:39,480 --> 03:47:43,040
So this is a way to terminate the entire Python program
5165
03:47:43,040 --> 03:47:45,440
silently with no traceback, right?
5166
03:47:45,440 --> 03:47:47,600
So we put in our own error message,
5167
03:47:47,600 --> 03:47:48,960
so we look like we're professionals,
5168
03:47:48,960 --> 03:47:51,080
saying we could not open this file,
5169
03:47:51,080 --> 03:47:52,600
and then we stop.
5170
03:47:52,600 --> 03:47:54,200
If you don't, it's gonna come down here,
5171
03:47:54,200 --> 03:47:56,120
and it's gonna trace back,
5172
03:47:56,120 --> 03:47:58,160
trace back right there, it's gonna blow up.
5173
03:47:58,160 --> 03:48:03,160
So the quit is useful when you want to stop executing,
5174
03:48:03,260 --> 03:48:06,060
because you've detected some kind of an error.
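A hedged sketch of guarding the open call. It is wrapped in a function (count_subject_lines is a made-up name) so it can be tried safely: where the lecture's script calls quit(), this version returns None. It also catches OSError rather than using a bare except, which is an assumption about what you want to catch. The temporary file exists only to give the good path something to read.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if the file cannot be opened."""
    try:
        fhand = open(fname)          # the "dangerous" line
    except OSError:
        print("File cannot be opened:", fname)
        return None                  # a top-level script would call quit() here instead
    count = 0
    with fhand:
        for line in fhand:
            if line.startswith("Subject:"):
                count += 1
    return count

# Good path: a small temporary file with two subject lines (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Subject: one\nFrom: x\nSubject: two\n")
good_count = count_subject_lines(tmp.name)
os.remove(tmp.name)

# Bad path: a name that should not exist on disk.
bad_count = count_subject_lines("na na boo boo")
```

Without the early exit in the except branch, execution would continue to the loop with no usable file handle and fail there instead.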
5175
03:48:07,220 --> 03:48:09,320
So that's a quick zoom through opening
5176
03:48:09,320 --> 03:48:11,860
and reading through files and doing some patterns.
5177
03:48:13,880 --> 03:48:15,840
Most of the rest of the programs in this course
5178
03:48:15,840 --> 03:48:20,840
are going to say open, for, rstrip,
5179
03:48:21,000 --> 03:48:23,880
do look for, and then do something interesting.
5180
03:48:23,880 --> 03:48:25,920
That's going to be our loop that we're gonna do
5181
03:48:25,920 --> 03:48:28,420
over and over and over again.
5182
03:48:28,420 --> 03:48:32,800
And now we see how this looping and if and iteration
5183
03:48:32,800 --> 03:48:36,600
and variables are starting to come together,
5184
03:48:36,600 --> 03:48:38,560
and you can actually sort of do a program
5185
03:48:38,560 --> 03:48:40,540
that does something useful.
5186
03:48:40,540 --> 03:48:43,480
But before we get to too many more programs,
5187
03:48:43,480 --> 03:48:45,500
we gotta switch a little bit, switch gears
5188
03:48:45,500 --> 03:48:48,000
and talk up next about data structures,
5189
03:48:48,000 --> 03:48:50,080
and that is the shape of data,
5190
03:48:50,080 --> 03:48:54,160
and how we can use more intricate and complex variables
5191
03:48:54,160 --> 03:48:55,720
to help solve our problems.
5192
03:48:59,240 --> 03:49:01,240
Hello and welcome to chapter eight.
5193
03:49:01,240 --> 03:49:03,840
We're gonna talk about lists in this chapter.
5194
03:49:03,840 --> 03:49:07,000
Up to now, we've been talking about algorithms.
5195
03:49:07,000 --> 03:49:09,640
Algorithms are the concept in computer science
5196
03:49:09,640 --> 03:49:13,080
of using the programming language to express the steps
5197
03:49:13,080 --> 03:49:14,320
that you want the computer to go through
5198
03:49:14,320 --> 03:49:16,080
to solve the problem.
5199
03:49:16,080 --> 03:49:19,040
Read some data, convert it to a floating point number,
5200
03:49:19,040 --> 03:49:20,520
check to see if it's greater than 40,
5201
03:49:20,520 --> 03:49:21,820
do one thing if it's greater than 40,
5202
03:49:21,820 --> 03:49:23,120
do another thing if it's not,
5203
03:49:23,120 --> 03:49:24,440
then print out the result.
5204
03:49:24,440 --> 03:49:28,080
Or open a file, read everything.
5205
03:49:28,080 --> 03:49:30,180
If the first line starts with something, do something.
5206
03:49:30,180 --> 03:49:33,320
If not, skip it and then add all the things up.
5207
03:49:33,320 --> 03:49:35,940
Those are steps, those are a series of steps,
5208
03:49:35,940 --> 03:49:37,900
and hopefully by now you're getting to the point
5209
03:49:37,900 --> 03:49:39,960
where you have a good understanding of steps.
5210
03:49:39,960 --> 03:49:43,000
But there's a whole other side of computer programming
5211
03:49:43,000 --> 03:49:44,200
and we call it data structures.
5212
03:49:44,200 --> 03:49:46,520
And data structures is not the steps,
5213
03:49:46,520 --> 03:49:50,520
but instead clever ways that you lay out the data
5214
03:49:50,520 --> 03:49:52,320
and clever ways that you make sure
5215
03:49:52,320 --> 03:49:54,920
that the data does what you want it to do.
5216
03:49:54,920 --> 03:49:57,320
And so that's what we're gonna start talking about now.
5217
03:49:57,320 --> 03:50:00,320
Lists are the first and simplest data structure.
5218
03:50:00,320 --> 03:50:02,320
Strings are kind of like data structures,
5219
03:50:02,320 --> 03:50:05,440
but lists are probably our first real data structure
5220
03:50:05,440 --> 03:50:06,960
that we're gonna think about and design
5221
03:50:06,960 --> 03:50:08,840
and make use of effectively.
5222
03:50:08,840 --> 03:50:11,860
But before we talk about what is a collection,
5223
03:50:11,860 --> 03:50:13,720
we should talk about what is not a collection.
5224
03:50:13,720 --> 03:50:15,360
So we're familiar with what a variable is.
5225
03:50:15,360 --> 03:50:17,500
We know that a variable is a little piece of memory
5226
03:50:17,500 --> 03:50:19,160
that's got a label on it.
5227
03:50:19,160 --> 03:50:21,080
And then an assignment statement, you know,
5228
03:50:21,080 --> 03:50:23,720
sticks a two into x and then x is,
5229
03:50:23,720 --> 03:50:25,840
and then two is in this little cupboard.
5230
03:50:25,840 --> 03:50:27,560
And then it goes to the next line
5231
03:50:27,560 --> 03:50:29,480
and then four goes into x
5232
03:50:29,480 --> 03:50:31,800
and so the two goes away and the four is there.
5233
03:50:31,800 --> 03:50:35,060
A key thing is you can't have more than one variable
5234
03:50:35,060 --> 03:50:36,920
at any given moment, right?
5235
03:50:36,920 --> 03:50:39,320
And more than one value in a variable.
5236
03:50:39,320 --> 03:50:41,140
So when we move to collections,
5237
03:50:41,140 --> 03:50:43,300
collections are more like suitcases.
5238
03:50:43,300 --> 03:50:44,880
We can put lots of things in them.
5239
03:50:44,880 --> 03:50:46,880
We have ways of organizing them.
5240
03:50:46,880 --> 03:50:49,360
And as we go through lists and dictionaries and tuples,
5241
03:50:49,360 --> 03:50:51,840
we'll see how there are different ways to organize them.
5242
03:50:51,840 --> 03:50:52,680
And as a matter of fact,
5243
03:50:52,680 --> 03:50:55,400
we've been talking about lists for a while.
5244
03:50:55,400 --> 03:50:58,320
Every time we use one of these square bracket syntaxes
5245
03:50:58,320 --> 03:51:00,760
in earlier programs, we've been working with lists.
5246
03:51:00,760 --> 03:51:03,720
And so this is technically a three item list
5247
03:51:03,720 --> 03:51:05,800
with three strings, got commas here,
5248
03:51:05,800 --> 03:51:08,120
Joseph is one string, Glenn and Sally are other strings.
5249
03:51:08,120 --> 03:51:10,880
And here's another one that is another thing.
5250
03:51:10,880 --> 03:51:14,000
And the list is basically, it's a list constant
5251
03:51:14,000 --> 03:51:15,440
and it's being assigned into a variable.
5252
03:51:15,440 --> 03:51:18,200
So this friends variable has three things in it.
5253
03:51:18,200 --> 03:51:21,780
So that's different than what we've been talking about before.
5254
03:51:22,640 --> 03:51:24,920
So these brackets and bracket structures
5255
03:51:24,920 --> 03:51:27,560
with square brackets are those lists.
5256
03:51:27,560 --> 03:51:29,120
And so the print is just a print
5257
03:51:29,120 --> 03:51:31,040
with parentheses to get the print to work.
5258
03:51:31,040 --> 03:51:35,520
But 1, 24, 76 is a three item integer list.
5259
03:51:35,520 --> 03:51:38,940
Red, yellow and blue is a three item string list.
5260
03:51:38,940 --> 03:51:41,160
But it doesn't all have to be integers or strings.
5261
03:51:41,160 --> 03:51:43,860
Python can handle different things
5262
03:51:43,860 --> 03:51:45,080
and different kinds of data
5263
03:51:45,080 --> 03:51:46,360
in different positions in the list.
5264
03:51:46,360 --> 03:51:50,760
So red, 24, 98.6, a three item list with a string,
5265
03:51:50,760 --> 03:51:53,440
an integer and a floating point number.
5266
03:51:53,440 --> 03:51:56,280
And while we're not gonna use this too much for now,
5267
03:51:56,280 --> 03:51:59,000
this outer list is a three item list
5268
03:51:59,000 --> 03:52:02,000
and the second item is another list.
5269
03:52:02,000 --> 03:52:04,480
So this is kind of alluding toward what we'll do
5270
03:52:04,480 --> 03:52:06,640
when we start talking about data structures.
5271
03:52:06,640 --> 03:52:08,200
And that is we have a structure
5272
03:52:08,200 --> 03:52:09,560
and then we have another structure inside of it.
5273
03:52:09,560 --> 03:52:11,640
And sometimes this can get quite complex.
5274
03:52:11,640 --> 03:52:13,520
And we're doing this for a reason.
5275
03:52:13,520 --> 03:52:16,000
This here has no reason other than just to show you
5276
03:52:16,000 --> 03:52:18,880
that it's possible that lists can be made up
5277
03:52:18,880 --> 03:52:21,360
of lots of things, including other lists.
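The kinds of list constants mentioned here can be written out as follows. The variable names and the values inside the nested list are illustrative assumptions, not taken from the slide.

```python
numbers = [1, 24, 76]                # three integers
colors = ["red", "yellow", "blue"]   # three strings
mixed = ["red", 24, 98.6]            # a string, an integer, and a float
nested = [1, [5, 6], 7]              # the second item is itself a list
empty = []                           # the empty list

print(len(nested))   # the inner list counts as a single item
print(nested[1])     # the inner list itself
```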
5278
03:52:21,360 --> 03:52:25,080
And of course, there is also the notion of the empty list.
5279
03:52:25,080 --> 03:52:28,040
And like I said, I have had to be able to tell you
5280
03:52:28,040 --> 03:52:29,760
about lists all along.
5281
03:52:29,760 --> 03:52:31,480
We use them in for loops.
5282
03:52:31,480 --> 03:52:32,760
We can put lots of things here.
5283
03:52:32,760 --> 03:52:34,180
We can put file handle here.
5284
03:52:34,180 --> 03:52:35,560
We can go through the file.
5285
03:52:35,560 --> 03:52:36,460
We can put a string there.
5286
03:52:36,460 --> 03:52:38,120
We can go through the characters in the string
5287
03:52:38,120 --> 03:52:38,960
and then the list.
5288
03:52:38,960 --> 03:52:40,520
And the iteration variable then goes through
5289
03:52:40,520 --> 03:52:42,680
the successive elements of the list.
5290
03:52:42,680 --> 03:52:45,480
And that's why this prints out 5, 4, 3, 2, 1.
5291
03:52:45,480 --> 03:52:48,040
And then the loop is done and it prints out a blast off.
5292
03:52:48,040 --> 03:52:50,440
So we've been using them and we've been actually iterating
5293
03:52:50,440 --> 03:52:53,640
through lists with for statements all along.
5294
03:52:54,660 --> 03:52:59,660
So the for statement has been something we use with lists.
5295
03:53:00,620 --> 03:53:03,020
And when you just need to go iterate through the list
5296
03:53:03,020 --> 03:53:05,240
and go through every item in order,
5297
03:53:05,240 --> 03:53:07,240
the for is a great way to do that.
5298
03:53:07,240 --> 03:53:09,340
So friend is our iteration variable.
5299
03:53:09,340 --> 03:53:11,400
Friends is our list variable.
5300
03:53:11,400 --> 03:53:14,040
And so that says friend is gonna successively take
5301
03:53:14,040 --> 03:53:17,600
on the value Joseph, Glenn, and Sally and print out,
5302
03:53:17,600 --> 03:53:20,040
you know, Happy New Year, Joseph, Glenn, and Sally.
5303
03:53:20,040 --> 03:53:22,400
It runs three times once for each of the values
5304
03:53:22,400 --> 03:53:24,600
and the iteration variable advances.
5305
03:53:24,600 --> 03:53:28,000
Now, I do wanna make it really clear
5306
03:53:28,000 --> 03:53:32,840
that the choice of friends and friend,
5307
03:53:32,840 --> 03:53:35,440
singular and plural, is arbitrary and capricious.
5308
03:53:35,440 --> 03:53:38,760
It happens to be convenient and intuitive
5309
03:53:38,760 --> 03:53:41,240
that the iteration variable is one
5310
03:53:41,240 --> 03:53:43,520
and the list variable is more than one.
5311
03:53:43,520 --> 03:53:46,160
But Python has no idea about singular and plurals.
5312
03:53:46,160 --> 03:53:47,960
Matter of fact, Python doesn't care.
5313
03:53:47,960 --> 03:53:50,280
It would be totally equivalent for Python
5314
03:53:50,280 --> 03:53:52,800
to do the same thing, to have the list variable be z
5315
03:53:52,800 --> 03:53:54,860
and the iteration variable be x.
5316
03:53:54,860 --> 03:53:58,080
X will take on the successive values of these three things.
5317
03:53:58,080 --> 03:54:01,200
Now, am I being nice to you by calling this list friends
5318
03:54:01,200 --> 03:54:03,320
and this iteration variable friend?
5319
03:54:03,320 --> 03:54:05,480
I am, but I also don't want it to confuse you
5320
03:54:05,480 --> 03:54:07,320
if you're just a beginning developer.
5321
03:54:08,880 --> 03:54:11,780
So just like strings, we can sort of look within lists.
5322
03:54:11,780 --> 03:54:14,200
Part of the thing is when you put more than one thing
5323
03:54:14,200 --> 03:54:17,160
in a data structure, you need to get them out.
5324
03:54:17,160 --> 03:54:21,200
And so lists have positions, they maintain order,
5325
03:54:21,200 --> 03:54:22,720
and so the first thing in the list
5326
03:54:22,720 --> 03:54:25,520
is the sub-zero position, sub-one, sub-two.
5327
03:54:25,520 --> 03:54:27,460
Just like strings, they're zero-based.
5328
03:54:27,460 --> 03:54:30,280
Just like European elevators, they're zero-based.
5329
03:54:30,280 --> 03:54:33,360
So if we take a look and we say, oh, friends sub-one,
5330
03:54:33,360 --> 03:54:35,760
that's how I read that, the little square brackets,
5331
03:54:35,760 --> 03:54:38,240
when you take a variable here and you say friends sub-one.
5332
03:54:38,240 --> 03:54:40,700
Remember, singular and plural don't matter.
5333
03:54:40,700 --> 03:54:43,840
Friends sub-one means Glenn, because this is the zero
5334
03:54:43,840 --> 03:54:46,440
and that's the one, and then Sally's the sub-two,
5335
03:54:46,440 --> 03:54:51,040
and so that's what prints Glenn out in this particular thing.
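A short sketch of zero-based indexing, using the three names from the lecture.

```python
friends = ["Joseph", "Glenn", "Sally"]

first = friends[0]    # "friends sub-zero" is the first item
second = friends[1]   # "friends sub-one" is the second item
last = friends[2]     # "friends sub-two" is the third item

print(second)
```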
5336
03:54:51,040 --> 03:54:52,840
Now, lists are mutable.
5337
03:54:52,840 --> 03:54:54,480
Mutable is another word for changeable.
5338
03:54:54,480 --> 03:54:57,440
They can be changed, meaning that a list has three things.
5339
03:54:57,440 --> 03:55:00,040
You can change this thing right in the middle if you want.
5340
03:55:00,040 --> 03:55:01,840
To take a look at what's not mutable,
5341
03:55:01,840 --> 03:55:03,040
strings are not mutable.
5342
03:55:03,040 --> 03:55:06,140
So if I take a look at assigning banana into fruit,
5343
03:55:06,140 --> 03:55:08,640
well, fruit sub-zero is a capital letter B.
5344
03:55:08,640 --> 03:55:10,240
Could we imagine for the moment
5345
03:55:10,240 --> 03:55:14,400
that we could change fruit sub-zero to lowercase b?
5346
03:55:14,400 --> 03:55:16,320
Well, the syntax would be how you would do it
5347
03:55:16,320 --> 03:55:17,760
if you could do it, but it turns out
5348
03:55:17,760 --> 03:55:20,960
that strings are not mutable,
5349
03:55:20,960 --> 03:55:23,320
meaning they're not changeable once you create them.
5350
03:55:23,320 --> 03:55:25,720
And that's why when we do things like lowercase
5351
03:55:25,720 --> 03:55:28,760
or uppercase, we take a look at the fruit
5352
03:55:28,760 --> 03:55:30,680
and we say, give me a lowercase copy of that,
5353
03:55:30,680 --> 03:55:32,600
and then we take the return value from this
5354
03:55:32,600 --> 03:55:33,960
and we store that in x,
5355
03:55:33,960 --> 03:55:36,680
and that's how x becomes a lowercase banana.
5356
03:55:36,680 --> 03:55:39,320
But fruit is still the original one.
5357
03:55:39,320 --> 03:55:41,400
So fruit has not changed.
5358
03:55:41,400 --> 03:55:45,760
Compare and contrast that with a list, though.
5359
03:55:45,760 --> 03:55:49,600
Here we have a five-item list, 2, 14, 26, 41, 63.
5360
03:55:49,600 --> 03:55:52,640
And we're gonna do the sub-two position.
5361
03:55:52,640 --> 03:55:55,000
And the sub-two is zero, one, two.
5362
03:55:55,000 --> 03:55:56,280
So that's that one right there.
5363
03:55:56,280 --> 03:55:58,880
And we're going to assign a 28 into it.
5364
03:55:58,880 --> 03:56:00,680
So that 28 is going in here.
5365
03:56:00,680 --> 03:56:02,820
Gonna wipe that out and put 28 in.
5366
03:56:02,820 --> 03:56:05,760
So we can do item assignment in lists
5367
03:56:05,760 --> 03:56:09,280
by putting a bracket syntax on the left-hand side
5368
03:56:09,280 --> 03:56:11,080
to say, don't just put it in a variable,
5369
03:56:11,080 --> 03:56:13,040
put it in this position within the variable.
5370
03:56:13,040 --> 03:56:14,600
So that's what that's doing.
5371
03:56:14,600 --> 03:56:16,080
And when you print that out, the 28,
5372
03:56:16,080 --> 03:56:17,600
everything else is unchanged.
5373
03:56:17,600 --> 03:56:18,840
Meaning the whole list is there.
5374
03:56:18,840 --> 03:56:20,760
There could be 1,000 items in the list.
5375
03:56:20,760 --> 03:56:22,900
And then you're changing just the one at position two.
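A sketch contrasting an immutable string with a mutable list. The fifth list value (63) is an assumption, since the transcript only reads out four of the five numbers.

```python
fruit = "Banana"
try:
    fruit[0] = "b"            # strings do not support item assignment
    string_changed = True
except TypeError:
    string_changed = False

x = fruit.lower()             # lower() returns a new string; fruit is untouched

lotto = [2, 14, 26, 41, 63]   # last value is assumed
lotto[2] = 28                 # positions are 0, 1, 2, so this replaces 26
print(lotto)
```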
5376
03:56:25,320 --> 03:56:26,760
We have a function called len.
5377
03:56:26,760 --> 03:56:29,500
We've been using this len function all along
5378
03:56:29,500 --> 03:56:31,360
to take a look at how long strings are.
5379
03:56:31,360 --> 03:56:32,900
It counts the number of characters in the string.
5380
03:56:32,900 --> 03:56:34,800
So that's a nine-character string.
5381
03:56:34,800 --> 03:56:36,880
If we have items in a list,
5382
03:56:36,880 --> 03:56:38,800
len tells us how many items there are.
5383
03:56:38,800 --> 03:56:40,500
It's not like how many characters there are.
5384
03:56:40,500 --> 03:56:42,400
It's the number of things.
5385
03:56:42,400 --> 03:56:44,480
And each thing doesn't have to be a number.
5386
03:56:44,480 --> 03:56:46,600
It could be a number, a string, or even another list.
5387
03:56:46,600 --> 03:56:47,840
And len is the way to say,
5388
03:56:47,840 --> 03:56:49,540
hey, how many things are in there?
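A quick sketch of len on a string versus a list. The string "Hello Bob" and the list contents are assumed examples; the string was chosen because it has nine characters.

```python
greet = "Hello Bob"              # assumed example string
items = [1, "two", [3, 4], 5.0]  # a number, a string, a list, and a float

string_length = len(greet)       # counts characters, including the space
item_count = len(items)          # counts top-level items, not what is inside them

print(string_length, item_count)
```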
5389
03:56:52,640 --> 03:56:55,600
There's a function that returns a list of numbers.
5390
03:56:55,600 --> 03:56:57,680
And we use it, as we'll see in a second,
5391
03:56:57,680 --> 03:56:59,920
to construct specialized loops to go through lists.
5392
03:56:59,920 --> 03:57:01,720
So let's take a look at this range function
5393
03:57:01,720 --> 03:57:03,000
just for a minute.
5394
03:57:03,000 --> 03:57:04,720
So range takes as its parameter
5395
03:57:04,720 --> 03:57:07,560
the number of numbers that you want returned.
5396
03:57:07,560 --> 03:57:10,440
So I'd like a four-item list
5397
03:57:10,440 --> 03:57:13,840
with the numbers zero, up to, but not including four.
5398
03:57:13,840 --> 03:57:16,680
And so it just turns out that that is really useful
5399
03:57:16,680 --> 03:57:19,560
for constructing for loops
5400
03:57:19,560 --> 03:57:20,720
that are counted for loops
5401
03:57:20,720 --> 03:57:22,880
that go to zero, to the one, to the two,
5402
03:57:22,880 --> 03:57:25,340
as compared to the definite loops
5403
03:57:25,340 --> 03:57:28,700
that go through each one.
5404
03:57:28,700 --> 03:57:30,440
And so it's a common thing to say,
5405
03:57:30,440 --> 03:57:32,680
okay, we know how many things are in this list.
5406
03:57:32,680 --> 03:57:34,080
There are three friends.
5407
03:57:34,080 --> 03:57:37,040
And if I combine range and len,
5408
03:57:37,040 --> 03:57:39,200
so I take len friends, which is three,
5409
03:57:39,200 --> 03:57:41,280
and then I take range of three,
5410
03:57:41,280 --> 03:57:42,960
I get zero, one, and two.
5411
03:57:42,960 --> 03:57:44,160
And so the interesting thing is
5412
03:57:44,160 --> 03:57:46,360
this zero corresponds to the first one,
5413
03:57:46,360 --> 03:57:47,960
one corresponds to the second one,
5414
03:57:47,960 --> 03:57:51,260
and two corresponds to the third one, okay?
5415
03:57:51,260 --> 03:57:55,320
And so we'll use this to construct loops,
5416
03:57:55,320 --> 03:57:58,320
especially when we need to go through an array
5417
03:58:00,920 --> 03:58:02,960
and remember what position we're at.
5418
03:58:02,960 --> 03:58:05,800
And so here's just an example of two different loops.
5419
03:58:05,800 --> 03:58:09,560
This is a for loop that's just gonna go through
5420
03:58:09,560 --> 03:58:10,680
whatever's in this list.
5421
03:58:10,680 --> 03:58:13,660
So friend is just gonna take on the successive values,
5422
03:58:13,660 --> 03:58:15,840
and so it's gonna print out these three things
5423
03:58:15,840 --> 03:58:17,480
just as you would expect.
5424
03:58:17,480 --> 03:58:18,920
And if you don't need to,
5425
03:58:18,920 --> 03:58:19,800
while you're going through the loop,
5426
03:58:19,800 --> 03:58:21,600
know the position, your relative position
5427
03:58:21,600 --> 03:58:24,720
from the top in the loop, that's okay.
5428
03:58:24,720 --> 03:58:27,520
But sometimes you want a little more sophisticated loop.
5429
03:58:27,520 --> 03:58:30,720
And instead, you want to be able to
5430
03:58:31,920 --> 03:58:34,520
loop through where you know the position.
5431
03:58:34,520 --> 03:58:35,880
And so what we do instead is,
5432
03:58:35,880 --> 03:58:38,040
instead of looping through that list itself,
5433
03:58:38,040 --> 03:58:42,800
we do range(len(friends)), which gives us zero, one, two.
5434
03:58:42,800 --> 03:58:46,360
And then i takes on the successive values zero, one,
5435
03:58:46,360 --> 03:58:47,440
and then two.
5436
03:58:47,440 --> 03:58:49,160
So this loop is gonna run three times,
5437
03:58:49,160 --> 03:58:50,720
and i is zero the first time.
5438
03:58:50,720 --> 03:58:53,640
And we might even just look up the value inside
5439
03:58:53,640 --> 03:58:57,480
that sub-zero value so we get Joseph the first time.
5440
03:58:57,480 --> 03:58:59,200
So prints out Happy New Year Joseph,
5441
03:58:59,200 --> 03:59:01,940
goes and i becomes one now,
5442
03:59:01,940 --> 03:59:03,960
and so it gives us Glenn, and that prints out.
5443
03:59:03,960 --> 03:59:04,840
And away you go.
5444
03:59:04,840 --> 03:59:06,600
So if you look at these two loops,
5445
03:59:07,920 --> 03:59:09,040
if you look at these two loops,
5446
03:59:09,040 --> 03:59:11,080
they really do the exact same thing.
5447
03:59:11,080 --> 03:59:12,320
The only difference is this,
5448
03:59:12,320 --> 03:59:14,140
we allowed the for to find its way
5449
03:59:14,140 --> 03:59:15,680
with the iteration variable through.
5450
03:59:15,680 --> 03:59:18,540
And here we created our own i variable
5451
03:59:18,540 --> 03:59:20,600
that went through the positions.
5452
03:59:20,600 --> 03:59:22,320
And they're dense, there's no gaps in here,
5453
03:59:22,320 --> 03:59:26,280
so it's zero through two that it goes through.
5454
03:59:26,280 --> 03:59:28,600
So these two are equivalent.
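The two equivalent loops can be sketched like this. Greetings are collected into lists (an addition for illustration) so the two results can be compared directly.

```python
friends = ["Joseph", "Glenn", "Sally"]

# Loop 1: let the for statement hand us each item in turn.
by_value = []
for friend in friends:
    by_value.append("Happy New Year: " + friend)

# Loop 2: loop over the positions 0, 1, 2 and look each item up.
positions = list(range(len(friends)))
by_position = []
for i in positions:
    by_position.append("Happy New Year: " + friends[i])

for greeting in by_position:
    print(greeting)
```

The second form is the one to reach for when the position itself matters, for example to number the output or to update the list in place.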
5455
03:59:28,600 --> 03:59:30,880
There'll be times when you'll want to use one and the other.
5456
03:59:30,880 --> 03:59:33,200
I tend to prefer the first one
5457
03:59:33,200 --> 03:59:37,420
because it's prettier as long as it works for me.
5458
03:59:38,320 --> 03:59:39,960
So that gets us started with lists.
5459
03:59:39,960 --> 03:59:41,400
We'll be back in just a bit.
5460
03:59:44,720 --> 03:59:46,800
Okay, so we've taken a look at lists,
5461
03:59:46,800 --> 03:59:49,560
and now we're gonna just take a little bit of a look
5462
03:59:49,560 --> 03:59:52,800
at some of the operations that you can do with lists.
5463
03:59:52,800 --> 03:59:54,740
Python has this, as we'll soon learn,
5464
03:59:54,740 --> 03:59:57,320
object-oriented approach to its operators.
5465
03:59:57,320 --> 04:00:00,720
And the plus can add strings, and it can add numbers.
5466
04:00:00,720 --> 04:00:03,080
Floating point numbers, integer numbers, strings.
5467
04:00:03,080 --> 04:00:03,960
Et cetera.
5468
04:00:03,960 --> 04:00:08,360
And so the plus similarly works this way with lists.
5469
04:00:08,360 --> 04:00:10,800
The plus looks to its left and looks to its right
5470
04:00:10,800 --> 04:00:12,560
and says, what am I adding?
5471
04:00:12,560 --> 04:00:14,600
And in the case that I'm adding the list one, two, three,
5472
04:00:14,600 --> 04:00:17,600
and the list four, five, six, it concatenates them together.
5473
04:00:17,600 --> 04:00:19,880
And this way it sort of functions like a string,
5474
04:00:19,880 --> 04:00:21,760
and so we get one, two, three, four, five, six.
5475
04:00:21,760 --> 04:00:25,260
It just concatenates this list to another list.
5476
04:00:25,260 --> 04:00:26,840
And it doesn't change A or B
5477
04:00:26,840 --> 04:00:29,560
just like in any kind of assignment statement.
5478
04:00:29,560 --> 04:00:32,560
Calculations on the right side don't change the variables
5479
04:00:32,560 --> 04:00:33,720
and then produce a new value
5480
04:00:33,720 --> 04:00:35,940
and then assign that into C.
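A minimal sketch of list concatenation with plus, using the two lists read out in the lecture.

```python
a = [1, 2, 3]
b = [4, 5, 6]

c = a + b     # builds a new list; a and b are left unchanged

print(c)
print(a)
```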
5481
04:00:37,440 --> 04:00:41,720
You can also use list slicing, and it's easy to remember.
5482
04:00:41,720 --> 04:00:43,080
If you remember how strings work,
5483
04:00:43,080 --> 04:00:45,400
lists work exactly the same way.
5484
04:00:45,400 --> 04:00:48,080
So of course it's a little tricky.
5485
04:00:48,080 --> 04:00:49,740
The first number's the starting position.
5486
04:00:49,740 --> 04:00:50,840
They start at zero.
5487
04:00:50,840 --> 04:00:52,880
So one is right there.
5488
04:00:52,880 --> 04:00:54,900
So it's the zero position, the one position.
5489
04:00:54,900 --> 04:00:56,880
Start at one, right?
5490
04:00:56,880 --> 04:00:58,920
But go up to, but not including three.
5491
04:00:58,920 --> 04:01:01,980
There's one, two, three.
5492
04:01:01,980 --> 04:01:04,480
So this goes up to, but not including three.
5493
04:01:04,480 --> 04:01:07,120
And that's why we get 41, 12 out of that.
5494
04:01:07,120 --> 04:01:08,720
So up to, but not including.
5495
04:01:08,720 --> 04:01:11,560
I'll just say that over and over and over again.
5496
04:01:11,560 --> 04:01:14,700
If we do, you can leave the first part out.
5497
04:01:14,700 --> 04:01:16,240
You can leave the first part out here,
5498
04:01:16,240 --> 04:01:18,760
and you can say, oh, up to, but not including, four.
5499
04:01:18,760 --> 04:01:19,980
So that starts at the beginning,
5500
04:01:19,980 --> 04:01:22,240
goes up to, but not including, four.
5501
04:01:22,240 --> 04:01:24,920
And so that's how we get that piece right there.
5502
04:01:24,920 --> 04:01:29,640
We can say start at the position three,
5503
04:01:29,640 --> 04:01:32,680
zero, one, two, three, start at position three,
5504
04:01:32,680 --> 04:01:34,360
and go to the end.
5505
04:01:34,360 --> 04:01:35,840
Now the fact that the number three is in here
5506
04:01:35,840 --> 04:01:37,860
is sort of irrelevant.
5507
04:01:37,860 --> 04:01:41,040
Three to the end is those three numbers.
5508
04:01:41,040 --> 04:01:43,420
And then you can do the whole list with slicing as well.
5509
04:01:43,420 --> 04:01:45,460
Again, these pretty much are the exact same examples
5510
04:01:45,460 --> 04:01:47,580
I used when I was doing strings.
5511
04:01:47,580 --> 04:01:49,080
They're pretty much the same.
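A short sketch of the slices walked through above. The list t is an assumption: the lecture only names the values 41, 12, and 3, so the other numbers are made up for illustration.

```python
# Assumed example list; only 41, 12 and 3 are mentioned in the lecture.
t = [9, 41, 12, 3, 74, 15]

print(t[1:3])   # start at position 1, go up to but not including 3
print(t[:4])    # first part left out: from the beginning, up to but not including 4
print(t[3:])    # second part left out: from position 3 to the end
print(t[:])     # both parts left out: the whole list
```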
5512
04:01:52,360 --> 04:01:54,040
There's a number of different methods,
5513
04:01:54,040 --> 04:01:56,360
and you can look up all the documentation in the list.
5514
04:01:56,360 --> 04:01:58,680
I often just use the dir command
5515
04:01:58,680 --> 04:02:00,320
to remind myself of them.
5516
04:02:00,320 --> 04:02:01,680
Append we'll look at.
5517
04:02:01,680 --> 04:02:04,520
Count looks for certain values in the list.
5518
04:02:04,520 --> 04:02:06,680
Extend adds things to the end of the list.
5519
04:02:06,680 --> 04:02:08,800
Index looks things up in the list.
5520
04:02:08,800 --> 04:02:12,620
Insert allows the list to sort of be expanded in the middle.
5521
04:02:12,620 --> 04:02:14,620
Pop pulls things off the top.
5522
04:02:14,620 --> 04:02:16,940
Remove removes an item in the middle.
5523
04:02:16,940 --> 04:02:19,520
Reverse flips the order of them, and sort
5524
04:02:19,520 --> 04:02:23,880
puts them in sorted order based on the values.
5525
04:02:23,880 --> 04:02:28,880
So let's look at a couple of these.
5526
04:02:31,200 --> 04:02:33,180
So if we build a list from scratch,
5527
04:02:33,180 --> 04:02:34,880
we have a way to ask for an empty list.
5528
04:02:34,880 --> 04:02:37,560
There are a couple different ways to ask for an empty list.
5529
04:02:37,560 --> 04:02:40,300
We could use just two square brackets next to each other.
5530
04:02:40,300 --> 04:02:42,380
But this is a form we call the constructor form
5531
04:02:42,380 --> 04:02:44,320
where we say, hey Python, make a list.
5532
04:02:44,320 --> 04:02:48,080
In this case, the word list is like a reserved word to Python.
5533
04:02:48,080 --> 04:02:52,000
It's really a reserved class, but say,
5534
04:02:52,000 --> 04:02:54,600
list parentheses says make me an empty list
5535
04:02:54,600 --> 04:02:57,000
and then assign that list into stuff.
5536
04:02:57,000 --> 04:02:59,920
So stuff is now, it's a list object,
5537
04:02:59,920 --> 04:03:03,040
it's a type list, but it has nothing in it.
5538
04:03:03,040 --> 04:03:04,640
And then we can call the append method,
5539
04:03:04,640 --> 04:03:07,120
stuff.append and stick book in.
5540
04:03:07,120 --> 04:03:09,320
And then we say, oh, and that knows how long,
5541
04:03:09,320 --> 04:03:11,000
the stuff knows how long it is,
5542
04:03:11,000 --> 04:03:12,920
where the end is and how to add something to it
5543
04:03:12,920 --> 04:03:15,320
and then add a 99 to it, and we print it out.
5544
04:03:15,320 --> 04:03:18,960
We got book and 99, reminding ourselves that lists,
5545
04:03:18,960 --> 04:03:21,560
while they're often the same types of variables,
5546
04:03:21,560 --> 04:03:23,920
the same types of values in the various positions
5547
04:03:23,920 --> 04:03:26,840
in the list, it doesn't always have to be that way.
5548
04:03:26,840 --> 04:03:28,920
Then we say, oh, well, stuff.append cookie,
5549
04:03:28,920 --> 04:03:30,880
you can keep on going, and then we end up
5550
04:03:30,880 --> 04:03:33,640
with three things and the cookie.
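A minimal sketch of building a list from scratch as described, using the name stuff and the values book, 99, and cookie from the narration:

```python
stuff = list()          # constructor form; [] would also give an empty list
print(len(stuff))       # the list exists but has nothing in it

stuff.append('book')
stuff.append(99)        # values in a list do not all have to be the same type
print(stuff)

stuff.append('cookie')
print(stuff)
```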
5551
04:03:35,040 --> 04:03:38,720
We have an in operator, works pretty much like
5552
04:03:38,720 --> 04:03:42,440
the in operator in a string, is nine in my list?
5553
04:03:42,440 --> 04:03:44,760
And that's pretty simple, and the answer of course is yes,
5554
04:03:44,760 --> 04:03:46,080
nine is in my list.
5555
04:03:46,080 --> 04:03:47,560
Is 15 in my list?
5556
04:03:47,560 --> 04:03:51,160
Looking through, no it's not, 15 is not in my list.
5557
04:03:51,160 --> 04:03:52,800
And then there's the not in operator,
5558
04:03:52,800 --> 04:03:54,800
think of that as kind of like one operator.
5559
04:03:54,800 --> 04:03:56,440
Is 20 not in the list?
5560
04:03:56,440 --> 04:03:58,640
And the answer is, since it's not there, is true.
5561
04:03:58,640 --> 04:04:00,760
And so that's a way to just, you know,
5562
04:04:00,760 --> 04:04:03,560
it's kind of like startswith or in for strings,
5563
04:04:03,560 --> 04:04:05,380
same kind of stuff.
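A small sketch of the in and not in checks. The list contents are an assumption; the lecture only says that 9 is in the list and that 15 and 20 are not.

```python
# Assumed list: it contains 9 but not 15 or 20, matching the narration.
some = [1, 9, 21, 10, 16]

print(9 in some)        # is nine in my list?
print(15 in some)       # is 15 in my list?
print(20 not in some)   # "not in" reads as a single operator
```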
5564
04:04:05,380 --> 04:04:07,880
Lists are in order, and they're sortable,
5565
04:04:07,880 --> 04:04:11,280
and so this is something that we take good advantage of.
5566
04:04:11,280 --> 04:04:13,760
A lot of what computers want to do is sort stuff,
5567
04:04:13,760 --> 04:04:15,320
you know, look all these things up,
5568
04:04:15,320 --> 04:04:17,480
append them, and then get them sorted.
5569
04:04:17,480 --> 04:04:22,000
And so there is this method inside of list,
5570
04:04:22,000 --> 04:04:23,080
that's just the sort method.
5571
04:04:23,080 --> 04:04:25,400
So here we, you know, put three values
5572
04:04:25,400 --> 04:04:27,360
in zero, one, two positions, zero, one, and two,
5573
04:04:27,360 --> 04:04:30,080
Joseph, Glenn, and Sally, and then we tell the list
5574
04:04:30,080 --> 04:04:32,040
to sort itself, and then we print it out.
5575
04:04:32,040 --> 04:04:34,280
Now this actually sorts the list in place,
5576
04:04:34,280 --> 04:04:36,320
which is different than upper and lower,
5577
04:04:36,320 --> 04:04:39,080
because if you remember, strings are not mutable,
5578
04:04:39,080 --> 04:04:41,200
but lists are mutable, and so you say,
5579
04:04:41,200 --> 04:04:43,760
hey, just sort yourself, okay?
5580
04:04:43,760 --> 04:04:45,960
And so just sort yourself, and then it sorts it,
5581
04:04:45,960 --> 04:04:48,000
and then it's in alphabetical order,
5582
04:04:48,000 --> 04:04:49,040
Glenn, Joseph, and Sally.
5583
04:04:49,040 --> 04:04:51,760
I happen to be clever, I only put strings in there,
5584
04:04:51,760 --> 04:04:53,280
and I put my upper case and lower case
5585
04:04:53,280 --> 04:04:56,320
in a very consistent pattern, but the list has changed,
5586
04:04:56,320 --> 04:05:00,080
and if I look at list sub one, that is the second item,
5587
04:05:00,080 --> 04:05:02,680
which is Joseph that prints out right down there.
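A short sketch of the in-place sort described above. The variable name friends is an assumption; the three names come from the narration.

```python
friends = ['Joseph', 'Glenn', 'Sally']

# sort() changes the list itself and returns None, unlike the string
# methods upper() and lower(), which return a new string.
result = friends.sort()

print(friends)
print(friends[1])   # the second item after sorting
print(result)
```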
5588
04:05:06,360 --> 04:05:07,840
There's a whole bunch of built-in functions
5589
04:05:07,840 --> 04:05:09,080
to help manipulate lists.
5590
04:05:09,080 --> 04:05:12,480
The other things I was showing was sort is a method
5591
04:05:12,480 --> 04:05:14,440
that's part of list, but there are other functions
5592
04:05:14,440 --> 04:05:17,400
that take lists as their arguments.
5593
04:05:17,400 --> 04:05:19,360
We already talked about the len function,
5594
04:05:19,360 --> 04:05:21,120
tells you how many items there are.
5595
04:05:21,120 --> 04:05:24,160
There is pretty obvious max, it says go through
5596
04:05:24,160 --> 04:05:28,600
and find the largest, min, go through and find the smallest,
5597
04:05:28,600 --> 04:05:31,920
sum goes through, adds them all up,
5598
04:05:31,920 --> 04:05:34,760
and we can say let's do average by taking the sum
5599
04:05:34,760 --> 04:05:38,120
of all of them and dividing it by the length,
5600
04:05:38,120 --> 04:05:40,340
and you might think to yourself, oh wow,
5601
04:05:40,340 --> 04:05:42,280
I wish we'd have known this a few chapters back
5602
04:05:42,280 --> 04:05:43,860
when we were having to write all those loops
5603
04:05:43,860 --> 04:05:47,620
to do max, min, sum, largest, smallest, et cetera.
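A quick sketch of the built-in functions just listed. The numbers in nums are made up for illustration and are not taken from the lecture.

```python
# Hypothetical sample data.
nums = [3, 41, 12, 9, 74, 15]

print(len(nums))               # how many items there are
print(max(nums))               # the largest
print(min(nums))               # the smallest
print(sum(nums))               # all of them added up
print(sum(nums) / len(nums))   # the average
```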
5604
04:05:47,620 --> 04:05:48,760
You can kind of think in your mind
5605
04:05:48,760 --> 04:05:50,760
that inside each one of these functions is a loop
5606
04:05:50,760 --> 04:05:52,920
that does pretty much what you did in those chapters,
5607
04:05:52,920 --> 04:05:55,400
and part of the reason we did that back then,
5608
04:05:55,400 --> 04:05:56,560
even though these things were here,
5609
04:05:56,560 --> 04:05:58,920
was they're kind of easy loops to understand,
5610
04:05:59,800 --> 04:06:02,400
and so those are there,
5611
04:06:02,400 --> 04:06:07,400
and basically this allows two different ways
5612
04:06:08,240 --> 04:06:10,520
of building loops to do the maximum and minimum.
5613
04:06:10,520 --> 04:06:13,760
Now it's not necessarily all that much easier
5614
04:06:13,760 --> 04:06:17,840
to do something using these
5615
04:06:17,840 --> 04:06:20,940
because you either can do them the old way,
5616
04:06:20,940 --> 04:06:24,400
or you can make a list and then use these functions.
5617
04:06:24,400 --> 04:06:26,720
So let's take a look, and I'll just say
5618
04:06:26,720 --> 04:06:30,280
that these two bits of code are doing the exact same thing,
5619
04:06:30,280 --> 04:06:32,400
and what they are is they're implementing a program
5620
04:06:32,400 --> 04:06:34,360
that's gonna repeatedly ask for numbers
5621
04:06:34,360 --> 04:06:36,160
until we type the word done,
5622
04:06:36,160 --> 04:06:37,680
and then it's gonna compute the average
5623
04:06:37,680 --> 04:06:39,000
and tell us what they are,
5624
04:06:39,000 --> 04:06:43,760
and so using sort of the stuff from the loop chapter,
5625
04:06:43,760 --> 04:06:45,840
we start with a total variable and a count variable,
5626
04:06:45,840 --> 04:06:49,160
set them to zero, and then we read a number,
5627
04:06:49,160 --> 04:06:51,280
we check for done to break out,
5628
04:06:51,280 --> 04:06:53,720
but then we convert it to a floating point value,
5629
04:06:53,720 --> 04:06:55,800
and then we say total equals total plus value,
5630
04:06:55,800 --> 04:06:56,960
and count equals count plus one,
5631
04:06:56,960 --> 04:06:59,600
and so this is gonna run over and over and over again,
5632
04:06:59,600 --> 04:07:01,400
however many times we're gonna do this,
5633
04:07:01,400 --> 04:07:04,060
and then it's gonna pop out, and when it's done,
5634
04:07:04,060 --> 04:07:05,680
it's gonna have this value of total,
5635
04:07:05,680 --> 04:07:08,280
the running total will become the overall total,
5636
04:07:08,280 --> 04:07:11,200
divided by count, and it'll print the average out, okay?
5637
04:07:11,200 --> 04:07:14,080
And so that's kinda how we would have done this
5638
04:07:14,080 --> 04:07:16,760
before we knew how to do this with lists.
5639
04:07:16,760 --> 04:07:19,080
Now, let's take a look at the other one.
5640
04:07:20,160 --> 04:07:23,000
In the other one, we say let's make an empty list,
5641
04:07:23,000 --> 04:07:25,120
remember this is that constructor syntax
5642
04:07:25,120 --> 04:07:27,600
that says to Python, make me an empty list,
5643
04:07:27,600 --> 04:07:29,600
and assign the empty list.
5644
04:07:29,600 --> 04:07:32,120
It has nothing in it, right, but it is a list,
5645
04:07:32,120 --> 04:07:34,640
has nothing in it, into the variable numlist.
5646
04:07:34,640 --> 04:07:36,520
Now we're gonna write another loop,
5647
04:07:36,520 --> 04:07:39,200
this part here is the same, these three lines,
5648
04:07:39,200 --> 04:07:42,400
read the number if it's done, quit, and convert it to value.
5649
04:07:42,400 --> 04:07:44,680
But instead of doing the actual calculation right now,
5650
04:07:44,680 --> 04:07:46,480
what we're gonna do is just append it to the list.
5651
04:07:46,480 --> 04:07:48,160
So the list will start out empty,
5652
04:07:48,160 --> 04:07:49,780
then the three will be in the list,
5653
04:07:49,780 --> 04:07:51,040
then the nine will be in the list,
5654
04:07:51,040 --> 04:07:52,600
then the five will be in the list.
5655
04:07:52,600 --> 04:07:54,840
So we're appending, each time through the loop,
5656
04:07:54,840 --> 04:07:56,480
we're appending into the list.
5657
04:07:56,480 --> 04:08:00,040
So we're just growing the list every time I read a value,
5658
04:08:00,040 --> 04:08:01,560
instead of actually computing something
5659
04:08:01,560 --> 04:08:03,000
with the value that we've got.
5660
04:08:03,000 --> 04:08:05,280
So in either case, we get value,
5661
04:08:05,280 --> 04:08:08,160
and in one case, we append it to the list.
5662
04:08:08,160 --> 04:08:10,480
And then finally, it finishes, the break happens,
5663
04:08:10,480 --> 04:08:12,380
and then we just say, oh, hey, Python,
5664
04:08:12,380 --> 04:08:13,580
sum up everything in the list,
5665
04:08:13,580 --> 04:08:14,880
add these three numbers together,
5666
04:08:14,880 --> 04:08:17,720
and then divide that by the length of all those things,
5667
04:08:17,720 --> 04:08:19,040
and you'll have the average.
5668
04:08:19,040 --> 04:08:24,040
And so these two things give us exactly the same output.
5669
04:08:24,360 --> 04:08:25,600
Now there is one difference,
5670
04:08:25,600 --> 04:08:29,920
if there was like one million or one billion numbers,
5671
04:08:29,920 --> 04:08:31,260
they actually have to all be stored
5672
04:08:31,260 --> 04:08:32,560
in the memory simultaneously.
5673
04:08:32,560 --> 04:08:35,160
Whereas here, it's actually doing the calculation,
5674
04:08:35,160 --> 04:08:38,160
of the billion numbers, and not using up so much memory.
5675
04:08:38,160 --> 04:08:40,280
For most of the things that you're gonna be doing,
5676
04:08:40,280 --> 04:08:42,720
the difference in memory, there is a difference in memory.
5677
04:08:42,720 --> 04:08:45,380
This uses, this one here uses more memory,
5678
04:08:46,560 --> 04:08:49,540
but I can't draw very well, more memory.
5679
04:08:51,200 --> 04:08:54,080
It uses more memory, but it doesn't really matter
5680
04:08:54,080 --> 04:08:55,440
by the time it's all said and done.
5681
04:08:55,440 --> 04:08:59,520
And so for you, the difference between these things
5682
04:08:59,520 --> 04:09:02,600
is not all that significant, but it's important to understand
5683
04:09:02,600 --> 04:09:03,800
that they're just two techniques
5684
04:09:03,800 --> 04:09:06,220
to accomplish the same thing with lists.
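The two techniques can be sketched side by side. To keep the example self-contained, the typed input is replaced by a list of strings called entries, and the two function names are hypothetical. The values 3, 9, and 5 follow the narration.

```python
def average_running(entries):
    # Loop-chapter style: keep a running total and count, store nothing else.
    total = 0
    count = 0
    for inp in entries:
        if inp == 'done':
            break
        value = float(inp)
        total = total + value
        count = count + 1
    return total / count   # assumes at least one number came before 'done'


def average_with_list(entries):
    # List style: remember every value, then compute at the end.
    numlist = list()
    for inp in entries:
        if inp == 'done':
            break
        value = float(inp)
        numlist.append(value)
    return sum(numlist) / len(numlist)   # same assumption as above


entries = ['3', '9', '5', 'done']   # stands in for what the user would type
print(average_running(entries))
print(average_with_list(entries))
```

The second version holds every number in memory at once, which is the difference the lecture points out.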
5685
04:09:09,760 --> 04:09:11,800
So now we're gonna wrap up and talk a little bit
5686
04:09:11,800 --> 04:09:13,680
about how strings and lists are related.
5687
04:09:13,680 --> 04:09:16,480
They're sort of related in that they both have zero-based
5688
04:09:16,480 --> 04:09:19,080
things and we use the square bracket operator
5689
04:09:19,080 --> 04:09:20,960
to do various things.
5690
04:09:20,960 --> 04:09:23,160
But there's a lot of situations where we're looking
5691
04:09:23,160 --> 04:09:26,520
at our data and we're combining the use of lists and strings.
5692
04:09:26,520 --> 04:09:28,080
So let me show you the first thing,
5693
04:09:28,080 --> 04:09:30,360
probably the coolest thing.
5694
04:09:30,360 --> 04:09:32,320
We're gonna use it a lot the rest of the class,
5695
04:09:32,320 --> 04:09:34,280
and that is the split function.
5696
04:09:34,280 --> 04:09:36,660
So let's take a string, we've got abc here,
5697
04:09:36,660 --> 04:09:37,760
it's "With three words".
5698
04:09:37,760 --> 04:09:40,080
What we're interested in is the fact that there are spaces
5699
04:09:40,080 --> 04:09:41,160
in this string.
5700
04:09:41,160 --> 04:09:42,800
And what split does is says, you know,
5701
04:09:42,800 --> 04:09:44,720
I'm gonna look through this thing, I'm gonna find this,
5702
04:09:44,720 --> 04:09:47,420
and I'm gonna break this into pieces,
5703
04:09:47,420 --> 04:09:49,320
and I'm gonna return you a list
5704
04:09:49,320 --> 04:09:51,000
of the separate individual pieces.
5705
04:09:51,000 --> 04:09:54,280
So look for blanks and break it in pieces
5706
04:09:54,280 --> 04:09:55,760
and give me back the pieces.
5707
04:09:55,760 --> 04:09:58,540
So I'll print these out and now you see that it's a list
5708
04:09:58,540 --> 04:10:00,520
with three items, with three words.
5709
04:10:00,520 --> 04:10:03,360
The spaces are gone, but it's given it to us.
5710
04:10:03,360 --> 04:10:05,220
So it's like, split this into words, please,
5711
04:10:05,220 --> 04:10:06,820
and give me the individual words,
5712
04:10:06,820 --> 04:10:08,620
give me a list of individual words,
5713
04:10:08,620 --> 04:10:11,160
rather than a big long string with spaces in the middle of it.
5714
04:10:11,160 --> 04:10:13,800
And that is a quick way to go from a line,
5715
04:10:13,800 --> 04:10:17,680
and it's really common, a lot of things we're going like,
5716
04:10:17,680 --> 04:10:20,280
go get the second thing, or the third thing, or whatever.
5717
04:10:20,280 --> 04:10:21,320
So the split's really nice,
5718
04:10:21,320 --> 04:10:23,160
because then you can just grab stuff.
5719
04:10:23,160 --> 04:10:25,000
And so you say, oh, how many things did I get?
5720
04:10:25,000 --> 04:10:27,600
Well, I got three, the len function tells us that.
5721
04:10:27,600 --> 04:10:30,080
And I can print the first word I got,
5722
04:10:30,080 --> 04:10:32,240
which is stuff sub zero,
5723
04:10:32,240 --> 04:10:34,800
and that'll be With, which will be the first word,
5724
04:10:34,800 --> 04:10:36,680
because that's the sub zero position.
5725
04:10:36,680 --> 04:10:39,520
So I read something, I split it,
5726
04:10:39,520 --> 04:10:41,060
I can say there's three things,
5727
04:10:41,060 --> 04:10:43,520
and I can look at the first word, basically,
5728
04:10:43,520 --> 04:10:44,960
without really knowing much.
5729
04:10:44,960 --> 04:10:47,840
Now, if you remember earlier, and we'll see this,
5730
04:10:47,840 --> 04:10:51,740
we used find and slicing to do a similar kind of thing,
5731
04:10:51,740 --> 04:10:55,440
but people tend to prefer the split.
5732
04:10:55,440 --> 04:11:00,280
And you can, you know, oops, go back.
5733
04:11:00,280 --> 04:11:04,320
You can also then loop through them,
5734
04:11:04,320 --> 04:11:07,160
so you can split these things into stuff as a word,
5735
04:11:07,160 --> 04:11:09,440
and then go through with w,
5736
04:11:09,440 --> 04:11:12,320
and then it's gonna go through,
5737
04:11:12,320 --> 04:11:15,200
w's gonna take on the successive values With, three, words.
5738
04:11:15,200 --> 04:11:18,280
And so you can make a loop by reading some data,
5739
04:11:18,280 --> 04:11:19,960
splitting it, then writing a for loop,
5740
04:11:19,960 --> 04:11:22,280
and then it's effectively going through the words
5741
04:11:22,280 --> 04:11:23,680
in that line of data.
5742
04:11:23,680 --> 04:11:26,840
And so that's a really powerful concept that we'll use
5743
04:11:26,840 --> 04:11:29,680
in a lot of the programs that we're going to write.
5744
04:11:29,680 --> 04:11:32,660
Just a couple of bits about this and how it works.
5745
04:11:33,560 --> 04:11:36,160
Split with no parameters here, it looks for spaces,
5746
04:11:36,160 --> 04:11:40,400
but it also treats a bunch of spaces as a single space.
5747
04:11:40,400 --> 04:11:42,400
And so it's pretty smart about that,
5748
04:11:42,400 --> 04:11:44,240
and so even though this has a lot of spaces
5749
04:11:44,240 --> 04:11:47,260
between lot and of, you only see lot of,
5750
04:11:47,260 --> 04:11:48,760
all the spaces are gone.
5751
04:11:48,760 --> 04:11:50,880
It does something special about spaces.
5752
04:11:50,880 --> 04:11:53,840
It's really white space, so tabs or new lines
5753
04:11:53,840 --> 04:11:58,840
or other characters would also qualify in split, basically.
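A minimal sketch of split with no parameters, using the names abc, stuff, and w from the narration. The second string, with a run of spaces plus a tab and a newline, is an assumed example.

```python
abc = 'With three words'
stuff = abc.split()      # break on whitespace, get back a list of words

print(stuff)
print(len(stuff))        # how many things did I get?
print(stuff[0])          # the sub zero position is the first word

for w in stuff:          # w takes on each word in turn
    print(w)

# Assumed example: runs of spaces, tabs and newlines all count as one separator.
line = 'A lot               of\tspaces\n'
etc = line.split()
print(etc)
```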
5754
04:11:58,880 --> 04:12:01,780
Now, you don't always have to split based on spaces,
5755
04:12:01,780 --> 04:12:03,860
and a lot of data that you're gonna run into,
5756
04:12:03,860 --> 04:12:05,440
you're gonna wanna split on something else.
5757
04:12:05,440 --> 04:12:08,320
And so here's some data that looks like we're using semicolons
5758
04:12:08,320 --> 04:12:11,140
to separate the first, second, and third piece.
5759
04:12:11,140 --> 04:12:14,520
Now, if you just call split, split's looking for spaces.
5760
04:12:14,520 --> 04:12:16,800
And so split gives you back a list
5761
04:12:16,800 --> 04:12:18,660
of the things broken apart with spaces,
5762
04:12:18,660 --> 04:12:20,840
but there's not a single space in that line,
5763
04:12:20,840 --> 04:12:23,560
and so we get a list, see, it's a list,
5764
04:12:23,560 --> 04:12:24,560
but there's only one item,
5765
04:12:24,560 --> 04:12:25,960
and the semicolons are sitting there.
5766
04:12:25,960 --> 04:12:26,800
Split doesn't go like,
5767
04:12:26,800 --> 04:12:29,200
whoa, this looks like it should be semicolons.
5768
04:12:29,200 --> 04:12:31,360
Split's job is to use spaces
5769
04:12:31,360 --> 04:12:34,800
and split the string based on spaces, okay?
5770
04:12:34,800 --> 04:12:38,240
But given that this is something we like to do,
5771
04:12:38,240 --> 04:12:39,840
you can tell split what character
5772
04:12:39,840 --> 04:12:41,920
you'd actually like to split on.
5773
04:12:41,920 --> 04:12:43,460
Now, it's not quite as clever
5774
04:12:43,460 --> 04:12:45,540
when splitting on something other than spaces.
5775
04:12:45,540 --> 04:12:47,160
It doesn't understand that, you know,
5776
04:12:47,160 --> 04:12:48,980
if there's a bunch of semicolons in a row,
5777
04:12:48,980 --> 04:12:52,560
it still thinks of those as splitting points to split,
5778
04:12:52,560 --> 04:12:55,480
but in this particular case where there's no spaces,
5779
04:12:55,480 --> 04:12:56,920
you know, and it's gonna split that.
5780
04:12:56,920 --> 04:12:59,920
So it says split this based on the semicolon
5781
04:12:59,920 --> 04:13:03,840
instead of being based on the space.
5782
04:13:03,840 --> 04:13:06,400
And so if you take a look at what comes out of this,
5783
04:13:06,400 --> 04:13:09,520
we split on semicolon, now we have a three-item list,
5784
04:13:09,520 --> 04:13:11,320
and we get first, second, and third.
5785
04:13:11,320 --> 04:13:14,220
And a lot of your data comes out of some logging system
5786
04:13:14,220 --> 04:13:17,400
or some router status updates,
5787
04:13:17,400 --> 04:13:18,960
who knows what you're looking at,
5788
04:13:18,960 --> 04:13:21,840
but the delimiter is often something other than space,
5789
04:13:21,840 --> 04:13:24,320
and you can do that with split.
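A short sketch of splitting on a delimiter other than whitespace. The variable names line, thing, and repeated are assumptions; the first, second, third data follows the narration.

```python
line = 'first;second;third'

# With no argument, split looks for whitespace. There is none here,
# so the result is a one-item list with the semicolons still in it.
thing = line.split()
print(thing)
print(len(thing))

# Tell split which character to split on.
thing = line.split(';')
print(thing)
print(len(thing))

# Unlike whitespace, repeated delimiters are not merged: each one is a
# splitting point, so an empty string shows up between them.
repeated = 'first;;third'.split(';')
print(repeated)
```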
5790
04:13:26,840 --> 04:13:28,840
So this is a useful thing
5791
04:13:28,840 --> 04:13:31,400
when parsing things like our email address, right?
5792
04:13:31,400 --> 04:13:33,860
We wanted to get things like the email address,
5793
04:13:33,860 --> 04:13:36,960
this second piece, off of the line.
5794
04:13:38,340 --> 04:13:43,080
And so we can use split to take advantage of this.
5795
04:13:43,080 --> 04:13:44,440
And so here's a little loop
5796
04:13:44,440 --> 04:13:47,080
that's just gonna print out not the email addresses,
5797
04:13:47,080 --> 04:13:49,440
but instead the day of the week.
5798
04:13:49,440 --> 04:13:51,000
We're gonna print the day of the week out
5799
04:13:51,000 --> 04:13:51,960
for all these things.
5800
04:13:51,960 --> 04:13:52,880
How do we do that?
5801
04:13:52,880 --> 04:13:55,680
Well, we can observe really quickly
5802
04:13:55,680 --> 04:13:58,680
that if we split based on spaces,
5803
04:14:02,440 --> 04:14:06,320
it's the zero, one, two, it's the two position.
5804
04:14:06,320 --> 04:14:09,160
So we can quickly write a bit of code
5805
04:14:09,160 --> 04:14:13,360
that opens the file, then loops through the lines,
5806
04:14:13,360 --> 04:14:15,160
we do this all the time now.
5807
04:14:15,160 --> 04:14:18,080
The strip takes the newline off the end of each line.
5808
04:14:18,080 --> 04:14:20,800
We can check to see if it starts with from space, right?
5809
04:14:20,800 --> 04:14:23,520
From space is our key, so we're ignoring,
5810
04:14:23,520 --> 04:14:24,960
we're ignoring all of the lines
5811
04:14:24,960 --> 04:14:26,200
that don't start with from space,
5812
04:14:26,200 --> 04:14:28,480
but then we find a line that starts with from space,
5813
04:14:28,480 --> 04:14:31,920
and we split it, and then we just print out the third word.
5814
04:14:31,920 --> 04:14:33,800
And so we get the third word of the lines
5815
04:14:33,800 --> 04:14:37,960
that start with from, and that's how this thing works.
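A sketch of the day-of-the-week loop. To keep it self-contained, it loops over a small list of made-up mailbox-style lines instead of opening a file. The sample lines and the days list are assumptions for illustration.

```python
# Hypothetical sample lines standing in for the contents of the mailbox file.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'Return-Path: <postmaster@collab.sakaiproject.org>\n',
    'From: stephen.marquard@uct.ac.za\n',
    'From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008\n',
]

days = []
for line in lines:
    line = line.rstrip()                # take the newline off the end
    if not line.startswith('From '):    # "From space" is the key
        continue
    words = line.split()
    days.append(words[2])               # position two is the day of the week
    print(words[2])
```

Note that the line beginning with "From:" is skipped, because the check includes the trailing space.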
5816
04:14:37,960 --> 04:14:42,960
Now, sometimes we want to dig into it deeper,
5817
04:14:45,120 --> 04:14:47,080
and we will take something, split it,
5818
04:14:47,080 --> 04:14:49,080
and then split another piece of it again
5819
04:14:49,080 --> 04:14:50,560
with a different delimiter.
5820
04:14:50,560 --> 04:14:53,040
So let's just say that the thing that we want to achieve
5821
04:14:53,040 --> 04:14:56,160
is getting the part after the at sign for email addresses.
5822
04:14:56,160 --> 04:14:59,160
And we did this with, again, find and pos
5823
04:14:59,160 --> 04:15:01,040
and stuff like that, but you can use split
5824
04:15:01,040 --> 04:15:02,480
to do this as well.
5825
04:15:02,480 --> 04:15:03,560
So the first thing we're gonna do
5826
04:15:03,560 --> 04:15:04,480
is we're gonna take this line,
5827
04:15:04,480 --> 04:15:06,520
we're gonna split it based on spaces, right?
5828
04:15:06,520 --> 04:15:09,480
Chop, chop, chop, chop, chop, chop,
5829
04:15:09,480 --> 04:15:11,240
and the fact that there's an extra space there,
5830
04:15:11,240 --> 04:15:14,040
doesn't matter, split happily just like zooms through that.
5831
04:15:14,040 --> 04:15:18,320
And then words sub one, zero, one, two,
5832
04:15:18,320 --> 04:15:20,200
words sub one is this email address,
5833
04:15:20,200 --> 04:15:22,320
so we'll put that in a variable called email,
5834
04:15:22,320 --> 04:15:25,160
and so email will be a string that's just this.
5835
04:15:25,160 --> 04:15:27,600
So in two lines, we've pulled out
5836
04:15:27,600 --> 04:15:29,840
the second address into a variable.
5837
04:15:29,840 --> 04:15:34,000
Then what we're going to do is we're going to re-split that.
5838
04:15:34,000 --> 04:15:36,120
We're gonna take this string we've got
5839
04:15:36,120 --> 04:15:37,720
and split it based on at sign,
5840
04:15:37,720 --> 04:15:39,400
because we know it's an email address.
5841
04:15:39,400 --> 04:15:41,160
So we get a new set of pieces,
5842
04:15:41,160 --> 04:15:43,040
the first part is the person's name,
5843
04:15:43,040 --> 04:15:46,720
and the second part is the host name
5844
04:15:46,720 --> 04:15:48,920
that their email is hosted on.
5845
04:15:48,920 --> 04:15:52,080
And then what we can do then is we just happen to know that,
5846
04:15:54,240 --> 04:15:57,000
we just happen to know that this is the zero item
5847
04:15:57,000 --> 04:15:59,360
and this is the one item, so we can get at that.
5848
04:15:59,360 --> 04:16:01,400
So the interesting thing going on here,
5849
04:16:01,400 --> 04:16:03,840
if you think back to how we did this before
5850
04:16:03,840 --> 04:16:06,640
with find and pos and all that stuff,
5851
04:16:06,640 --> 04:16:08,920
it's really a lot cleaner and we don't,
5852
04:16:08,920 --> 04:16:12,400
for me, I can look at this after you understand it
5853
04:16:12,400 --> 04:16:14,560
and it's easy for me to understand that it's correct,
5854
04:16:14,560 --> 04:16:17,400
whereas that pos stuff, you gotta add one
5855
04:16:17,400 --> 04:16:20,720
and start the second find after, just remember that.
5856
04:16:20,720 --> 04:16:22,200
And this is a lot cleaner way,
5857
04:16:22,200 --> 04:16:23,720
and this is a more typical way
5858
04:16:23,720 --> 04:16:27,240
of pulling this kind of information out of a line.
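The double split can be sketched as follows. The sample line is an assumed example in the mailbox format; the names words and email follow the narration, and pieces is a hypothetical name.

```python
# Assumed sample line; note the extra space before the 5.
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

words = line.split()         # first split: on whitespace
email = words[1]             # position one is the email address

pieces = email.split('@')    # second split: on the at sign
print(pieces)
print(pieces[1])             # the host name is the one item
```

This assumes the address contains exactly one at sign. A line without one would give a single-item list, and pieces[1] would fail.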
5859
04:16:28,320 --> 04:16:30,080
So in this chapter, we've talked about lists,
5860
04:16:30,080 --> 04:16:31,800
we've talked about the concept of collections,
5861
04:16:31,800 --> 04:16:33,120
that's our first data structure,
5862
04:16:33,120 --> 04:16:34,600
we're not just doing algorithms,
5863
04:16:34,600 --> 04:16:36,800
we kinda know algorithms now,
5864
04:16:36,800 --> 04:16:38,080
but now we're gonna do data structures.
5865
04:16:38,080 --> 04:16:40,360
And this chapter and the next two chapters
5866
04:16:40,360 --> 04:16:42,040
are our foundational data structures
5867
04:16:42,040 --> 04:16:43,520
and then we'll, like everything,
5868
04:16:43,520 --> 04:16:45,640
we'll make more complex data structures
5869
04:16:45,640 --> 04:16:48,280
by composing those data structures together.
5870
04:16:48,280 --> 04:16:51,280
We've looked at how strings and lists connect together
5871
04:16:51,280 --> 04:16:55,160
and how split works and these are all really powerful tools
5872
04:16:55,160 --> 04:16:57,000
that we're gonna use going forward.
5873
04:16:57,000 --> 04:17:02,000
Now we're gonna take a look at how we would write some code
5874
04:17:04,600 --> 04:17:08,240
to do some parsing, read some data.
5875
04:17:08,240 --> 04:17:09,280
As a matter of fact, we're gonna read
5876
04:17:09,280 --> 04:17:12,080
through our famous mailbox data,
5877
04:17:12,080 --> 04:17:15,800
look for lines that begin with from space
5878
04:17:15,800 --> 04:17:17,160
and extract the third word.
5879
04:17:17,160 --> 04:17:19,480
As a matter of fact, we already have some of this code
5880
04:17:19,480 --> 04:17:21,480
already written, we're gonna debug it.
5881
04:17:21,480 --> 04:17:23,960
We're gonna look at code and we're gonna debug it.
5882
04:17:23,960 --> 04:17:25,720
So here we go, here we have it
5883
04:17:25,720 --> 04:17:28,240
and it's a pretty basic program.
5884
04:17:28,240 --> 04:17:31,520
It opens a file, loops through the file,
5885
04:17:31,520 --> 04:17:34,760
throws away the white space, splits it into words
5886
04:17:34,760 --> 04:17:36,760
and checks to see if the zeroth word,
5887
04:17:36,760 --> 04:17:39,120
the first word is from and if it's not,
5888
04:17:39,120 --> 04:17:41,200
we skip and read the next line.
5889
04:17:41,200 --> 04:17:43,960
And otherwise, if we find a line that starts
5890
04:17:43,960 --> 04:17:46,880
with from space, then we print the third word,
5891
04:17:46,880 --> 04:17:48,720
which is word sub two.
5892
04:17:48,720 --> 04:17:50,160
Okay, so this is what we've got
5893
04:17:50,160 --> 04:17:52,900
and we carefully saved this file
5894
04:17:52,900 --> 04:17:57,340
into the same folder that we've got, EX08.
5895
04:17:57,340 --> 04:18:02,340
And so let's go ahead, cd, desktop, Python for everybody,
5896
04:18:03,280 --> 04:18:05,260
EX underscore 08.
5897
04:18:06,160 --> 04:18:11,160
And so this is some files, we got our day of the week,
5898
04:18:11,320 --> 04:18:15,180
Python and our mbox-short, so that's sitting there, okay?
5899
04:18:15,180 --> 04:18:16,700
And so let's run this program.
5900
04:18:16,700 --> 04:18:19,360
This is the program we've got right here,
5901
04:18:19,360 --> 04:18:24,360
python3 dow.py, and it doesn't work.
5902
04:18:26,960 --> 04:18:29,360
Now, by now you've seen a few tracebacks
5903
04:18:29,360 --> 04:18:30,500
and there you go.
5904
04:18:32,120 --> 04:18:36,040
So, you know, when you look at a traceback,
5905
04:18:36,040 --> 04:18:39,080
you think to yourself, well, I made a mistake
5906
04:18:39,080 --> 04:18:42,160
and you've gotten pretty good at looking at that line.
5907
04:18:42,160 --> 04:18:44,080
So there you are, you're like, this is the line,
5908
04:18:44,080 --> 04:18:46,520
there must be something wrong on this line
5909
04:18:46,520 --> 04:18:48,720
and you wanna change it.
5910
04:18:48,720 --> 04:18:50,880
But that line's not actually the problem
5911
04:18:50,880 --> 04:18:51,920
in this particular thing.
5912
04:18:51,920 --> 04:18:54,360
And so you gotta be careful sometimes.
5913
04:18:54,360 --> 04:18:56,480
And one of the things that you didn't notice
5914
04:18:56,480 --> 04:19:00,320
in this one right away is that it actually worked.
5915
04:19:00,320 --> 04:19:02,840
It printed the first line out.
5916
04:19:02,840 --> 04:19:05,620
So if we take a look at our data set,
5917
04:19:05,620 --> 04:19:08,120
it found the line that started with from space,
5918
04:19:08,120 --> 04:19:11,160
it split it and printed out the third word
5919
04:19:11,160 --> 04:19:13,200
and it blew up later.
5920
04:19:13,200 --> 04:19:16,080
And so part of the problem is that we don't know
5921
04:19:16,080 --> 04:19:19,560
what it was doing when it blew up.
5922
04:19:19,560 --> 04:19:21,600
And so the first thing I'd like to do
5923
04:19:21,600 --> 04:19:26,000
in this kind of a situation is find the line
5924
04:19:26,000 --> 04:19:29,400
and make sure there's a print statement right before it.
5925
04:19:29,400 --> 04:19:34,400
And so I'm gonna print words colon and then comma WDS.
5926
04:19:35,360 --> 04:19:38,580
I wanna print right before the line that blows up
5927
04:19:38,580 --> 04:19:41,720
so that I know really when this finally does blow up,
5928
04:19:41,720 --> 04:19:44,000
what was going on in that line.
5929
04:19:44,000 --> 04:19:46,720
So I'm gonna run it again.
5930
04:19:48,600 --> 04:19:51,000
And oop, did I forget to save it?
5931
04:19:51,000 --> 04:19:52,760
No, I forgot to save it, look at that.
5932
04:19:52,760 --> 04:19:54,960
See the little blue dot, forgot to save it.
5933
04:19:58,680 --> 04:20:00,480
So now we see a whole bunch of output.
5934
04:20:00,480 --> 04:20:02,360
And we see that it's actually doing a whole lot of work
5935
04:20:02,360 --> 04:20:03,960
before it's blowing up.
5936
04:20:03,960 --> 04:20:06,940
And so you see that it prints the words out
5937
04:20:06,940 --> 04:20:08,980
from that first line and prints out Saturday,
5938
04:20:08,980 --> 04:20:10,640
which is exactly what we expect.
5939
04:20:10,640 --> 04:20:12,280
It's the third word in the line.
5940
04:20:12,280 --> 04:20:13,680
And then reads a whole bunch of stuff
5941
04:20:13,680 --> 04:20:16,360
and it's actually, what it's doing now is ignoring.
5942
04:20:16,360 --> 04:20:18,520
Let me just put something here.
5943
04:20:18,520 --> 04:20:22,340
I'm gonna say print ignore.
5944
04:20:26,240 --> 04:20:29,120
So I can keep track of when these lines are being ignored.
5945
04:20:29,120 --> 04:20:32,360
So let's run it again and have the word ignore pop up.
5946
04:20:32,360 --> 04:20:35,000
Right, and so it's doing a lot of ignoring.
5947
04:20:35,000 --> 04:20:39,200
It finds these words, prints out Saturday,
5948
04:20:39,200 --> 04:20:40,760
reads this line and ignores it,
5949
04:20:40,760 --> 04:20:42,200
reads this line and ignores it,
5950
04:20:42,200 --> 04:20:43,360
reads this line and ignores it.
5951
04:20:43,360 --> 04:20:47,520
So a lot of stuff's going on here that you might not realize.
5952
04:20:47,520 --> 04:20:50,760
And so we have to take a look at what the problem is.
5953
04:20:50,760 --> 04:20:54,260
And so it is now blowing up on words sub zero.
5954
04:20:54,260 --> 04:20:56,240
And now we can scroll down and we can look
5955
04:20:56,240 --> 04:20:59,420
at exactly what happened right before the traceback.
5956
04:20:59,420 --> 04:21:02,480
So we really now know exactly what happened
5957
04:21:02,480 --> 04:21:03,400
before the traceback.
5958
04:21:03,400 --> 04:21:05,200
And the interesting thing is,
5959
04:21:05,200 --> 04:21:08,840
is that there is an empty, empty string.
5960
04:21:08,840 --> 04:21:10,660
I mean empty array.
5961
04:21:10,660 --> 04:21:12,560
There's an array with zero items.
5962
04:21:12,560 --> 04:21:14,700
So I'm gonna print the line out too.
5963
04:21:16,640 --> 04:21:19,920
Print line colon.
5964
04:21:19,920 --> 04:21:21,340
Now I haven't changed my program at all.
5965
04:21:21,340 --> 04:21:24,080
I'm just trying to figure out what's going on here.
5966
04:21:24,080 --> 04:21:26,280
So I'll save that and I'm gonna run it.
5967
04:21:27,280 --> 04:21:30,080
And we've got a lot of stuff and it's still working.
5968
04:21:30,080 --> 04:21:34,000
It reads a line, it reads a line, splits it into words,
5969
04:21:34,000 --> 04:21:35,240
and then prints out Saturday,
5970
04:21:35,240 --> 04:21:37,560
which is the third word on the line.
5971
04:21:37,560 --> 04:21:40,900
Now here it reads a line and this line is a blank line.
5972
04:21:40,900 --> 04:21:43,860
And it has, because it's a blank line,
5973
04:21:43,860 --> 04:21:47,900
the split returns no words and that's what blows up.
5974
04:21:47,900 --> 04:21:50,220
And the problem now is, oh, wait a sec,
5975
04:21:50,220 --> 04:21:52,140
list index out of range.
5976
04:21:52,140 --> 04:21:55,500
So word sub zero is not valid, which is the first word,
5977
04:21:55,500 --> 04:21:57,320
when there are no words.
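Here is a minimal sketch of that failure on its own, assuming a made-up blank line in place of the real file. The try and except is only there so the sketch can show the error message without stopping; it is not the fix.

```python
# A blank line, standing in for the blank line in the data file (assumption).
blank = ""

# split() with no arguments returns an empty list for a blank line.
wds = blank.split()
print("words:", wds)

# Asking for the zeroth word of an empty list raises IndexError.
try:
    first = wds[0]
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```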
5978
04:21:57,320 --> 04:22:01,020
So this is a statement that works most of the time.
5979
04:22:01,020 --> 04:22:02,860
Now you might think, oh, I wanna just put a try
5980
04:22:02,860 --> 04:22:04,300
and except in there.
5981
04:22:04,300 --> 04:22:08,140
Well, the right thing to do is to say to yourself,
5982
04:22:08,140 --> 04:22:09,920
oh, wait a second.
5983
04:22:09,920 --> 04:22:14,500
If the, I don't have enough words,
5984
04:22:14,500 --> 04:22:17,600
if the length of the words is less than one,
5985
04:22:20,660 --> 04:22:21,500
continue.
5986
04:22:23,500 --> 04:22:25,780
So basically it's gonna come through here,
5987
04:22:25,780 --> 04:22:28,240
it's gonna split it and if we don't have any words,
5988
04:22:28,240 --> 04:22:31,820
meaning it's a blank line, then we're gonna skip it.
5989
04:22:31,820 --> 04:22:32,860
So let's run that.
5990
04:22:35,160 --> 04:22:37,000
So now this ran all the way to the end.
5991
04:22:37,000 --> 04:22:39,220
It did a lot of stuff and it did not blow up
5992
04:22:39,220 --> 04:22:44,180
specifically, didn't have a traceback.
5993
04:22:44,180 --> 04:22:47,220
Another way to protect this would be to,
5994
04:22:48,100 --> 04:22:49,140
we'll take this part out.
5995
04:22:49,140 --> 04:22:50,860
This is called a guardian pattern.
5996
04:22:54,280 --> 04:22:55,420
Right, guardian pattern,
5997
04:22:55,420 --> 04:22:58,420
because this is dangerous.
5998
04:22:58,420 --> 04:23:02,100
This could blow up, but this, it won't blow up
5999
04:23:02,100 --> 04:23:06,020
if it makes it past here and it won't come through there
6000
04:23:06,020 --> 04:23:08,220
under the conditions that are causing it to blow up.
6001
04:23:08,220 --> 04:23:11,460
Another way to do this might be to protect it as follows.
6002
04:23:11,460 --> 04:23:12,940
To say, oh, wait a sec.
6003
04:23:14,260 --> 04:23:16,640
If the line is a blank line,
6004
04:23:18,780 --> 04:23:22,020
no, continue.
6005
04:23:22,020 --> 04:23:24,900
So now what we're gonna do is we're gonna skip blank lines.
6006
04:23:24,900 --> 04:23:29,900
I'll even say this, print skip blank.
6007
04:23:29,900 --> 04:23:34,900
So if it's a blank, we're gonna skip blank and keep going.
6008
04:23:38,340 --> 04:23:40,100
This will skip blank lines.
6009
04:23:40,100 --> 04:23:41,540
It'll come through here
6010
04:23:41,540 --> 04:23:43,860
and this will skip lines that don't have from,
6011
04:23:43,860 --> 04:23:46,300
but because we're not processing blank lines,
6012
04:23:46,300 --> 04:23:48,620
words of zero always works.
6013
04:23:48,620 --> 04:23:51,480
So I can run this code and it works again.
6014
04:23:51,480 --> 04:23:54,320
So here we have a blank line, we skipped it.
6015
04:23:54,320 --> 04:23:56,760
Here we have a blank line, we skipped it.
6016
04:23:56,760 --> 04:23:59,040
Now here we had a non-blank line, we parsed it,
6017
04:23:59,040 --> 04:24:00,660
but then we ignored it.
6018
04:24:00,660 --> 04:24:03,440
And then up here, we'll find a from somewhere.
6019
04:24:03,440 --> 04:24:05,120
Doo doo doo doo doo.
6020
04:24:05,120 --> 04:24:07,940
Let's find a from, here it comes.
6021
04:24:15,420 --> 04:24:17,640
Oh, no, there's ignore, ignore.
6022
04:24:17,640 --> 04:24:22,640
I got too much debug print, I can't find it.
6023
04:24:29,120 --> 04:24:31,180
Here, I'll just hunt for from with find.
6024
04:24:37,840 --> 04:24:39,360
Okay, so there we go.
6025
04:24:39,360 --> 04:24:41,920
There's a from and we print the thing out.
6026
04:24:41,920 --> 04:24:45,640
So we're getting a lot of extra stuff.
6027
04:24:45,640 --> 04:24:48,840
So I'm gonna comment out some of these debugs.
6028
04:24:51,560 --> 04:24:52,760
And I'm actually just gonna get rid
6029
04:24:52,760 --> 04:24:54,520
of this whole skipping of the blank line.
6030
04:24:54,520 --> 04:24:55,940
I'm gonna do it with the words.
6031
04:24:55,940 --> 04:24:59,180
I'm gonna go back to the guardian we had before.
6032
04:25:05,880 --> 04:25:09,200
If the number of words that we got,
6033
04:25:09,200 --> 04:25:16,200
len of words, is less than one, continue.
6034
04:25:16,680 --> 04:25:19,380
Okay, so now this is gonna be a working program.
6035
04:25:22,400 --> 04:25:26,160
Oops, I gotta take another print statement out.
6036
04:25:27,580 --> 04:25:29,440
Gotta take another print statement out.
6037
04:25:29,440 --> 04:25:31,280
We sort of know what we're doing here.
6038
04:25:34,000 --> 04:25:36,960
Okay, so this looks like a pretty safe thing.
6039
04:25:36,960 --> 04:25:40,560
This guardian is protecting this dangerous line.
6040
04:25:40,560 --> 04:25:42,200
I'll get rid of that one too.
6041
04:25:42,200 --> 04:25:45,380
This is the words line that had the traceback.
6042
04:25:45,380 --> 04:25:47,160
And nothing else in this thing changed
6043
04:25:47,160 --> 04:25:50,100
from when we started except we've added this little guardian.
6044
04:25:50,100 --> 04:25:52,400
Now the interesting thing is if it comes through here
6045
04:25:52,400 --> 04:25:55,280
and prints words of two, what happens if somehow
6046
04:25:55,280 --> 04:25:58,720
we find a line that has from as its first word
6047
04:25:59,960 --> 04:26:02,640
and there's only one word on, this is gonna blow up.
6048
04:26:02,640 --> 04:26:07,640
So we can make our guardian a little stronger.
6049
04:26:07,960 --> 04:26:11,120
And we can say, you know what, we're gonna skip this line
6050
04:26:11,120 --> 04:26:12,800
if it doesn't have the three words in it.
6051
04:26:12,800 --> 04:26:15,120
So it has to have at least three words.
6052
04:26:15,120 --> 04:26:17,360
And if we see less than three words, we're gonna skip it.
6053
04:26:17,360 --> 04:26:19,760
And that just makes the guardian a bit stronger.
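A rough sketch of the stronger guardian, assuming a handful of made-up lines in place of the real file. The sample lines and the days list are assumptions added for illustration; the wds name follows the lecture.

```python
# Made-up sample lines standing in for the data file (assumption).
sample_lines = [
    "From someone@example.com Sat Jan  5 09:14:16 2008",
    "",                               # blank line: zero words
    "Received: from host by other",   # enough words, but not a From line
    "From",                           # starts with From, but only one word
]

days = []  # collected only so the result is easy to inspect (assumption)
for line in sample_lines:
    line = line.rstrip()
    wds = line.split()
    # Guardian: skip any line that does not have at least three words.
    if len(wds) < 3:
        continue
    if wds[0] != "From":
        continue
    days.append(wds[2])
    print(wds[2])
```

With the guardian in place, neither wds[0] nor wds[2] is reached on a line that is too short.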
6054
04:26:21,980 --> 04:26:25,480
And so the program works safely and you see these things
6055
04:26:25,480 --> 04:26:29,080
where sometimes you wanna check to see
6056
04:26:29,080 --> 04:26:31,160
that your assumptions about the data are reasonable
6057
04:26:31,160 --> 04:26:34,240
and skip things where the data is not reasonable.
6058
04:26:35,120 --> 04:26:36,960
So that's one guardian pattern.
6059
04:26:36,960 --> 04:26:39,380
Let me show you a slightly different way to do this.
6060
04:26:39,380 --> 04:26:41,480
And this is with an or statement.
6061
04:26:41,480 --> 04:26:45,200
So I'm gonna take this code, copy that,
6062
04:26:45,200 --> 04:26:47,080
and put it here with or.
6063
04:26:47,960 --> 04:26:49,400
Get rid of all this stuff.
6064
04:26:50,600 --> 04:26:55,600
This is the guardian in a compound statement.
6065
04:26:55,600 --> 04:27:00,600
So what we're saying is if there are less than three words
6066
04:27:04,740 --> 04:27:09,740
on the line or if the first word is not from, continue.
6067
04:27:10,620 --> 04:27:13,860
Now we're doing this in order because the way it works
6068
04:27:13,860 --> 04:27:17,900
is or is true if either that's true or this is true.
6069
04:27:17,900 --> 04:27:21,180
But if it knows that this is true,
6070
04:27:21,180 --> 04:27:22,900
then it doesn't bother checking this.
6071
04:27:22,900 --> 04:27:25,120
And the checking of this is what blows up,
6072
04:27:25,120 --> 04:27:26,940
what causes the traceback.
6073
04:27:26,940 --> 04:27:29,220
So if we flip this order, it would fail.
6074
04:27:29,220 --> 04:27:31,860
If we do it in this order, it will work.
6075
04:27:31,860 --> 04:27:33,640
So let's do this one right.
6076
04:27:35,980 --> 04:27:37,140
It works.
6077
04:27:37,140 --> 04:27:39,560
But if I get this backwards,
6078
04:27:45,400 --> 04:27:48,760
it's gonna check this before it checks this.
6079
04:27:48,760 --> 04:27:53,060
And we're going to go back to failing again.
6080
04:27:53,060 --> 04:27:55,760
So you gotta get the order of these things right.
6081
04:27:55,760 --> 04:28:00,760
The guardian comes before in the or.
6082
04:28:02,900 --> 04:28:04,620
The guardian comes before.
6083
04:28:04,620 --> 04:28:07,240
And if this is true, then it doesn't check this.
6084
04:28:07,240 --> 04:28:09,800
This is called short circuit evaluation
6085
04:28:09,800 --> 04:28:11,920
where it knows that as long as this part's true,
6086
04:28:11,920 --> 04:28:15,020
it doesn't evaluate this second part.
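The ordering can be sketched like this. The two helper functions are hypothetical, written only to put the two orders side by side; the lecture's program uses continue inside a loop instead.

```python
def day_of_week(line):
    """Guardian first: the length check runs before wds[0] is touched."""
    wds = line.rstrip().split()
    if len(wds) < 3 or wds[0] != "From":
        return None
    return wds[2]


def day_of_week_flipped(line):
    """Flipped order: wds[0] is evaluated first, so a blank line fails."""
    wds = line.rstrip().split()
    if wds[0] != "From" or len(wds) < 3:
        return None
    return wds[2]


# Made-up sample line (assumption).
good = day_of_week("From someone@example.com Sat Jan  5 09:14:16 2008")
skipped = day_of_week("")
print(good, skipped)

flipped_failed = False
try:
    day_of_week_flipped("")
except IndexError:
    # The right-hand side of "or" never got a chance to protect anything.
    flipped_failed = True
print("flipped order failed:", flipped_failed)
```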
6087
04:28:15,020 --> 04:28:18,820
And so now we have a guardian in a compound statement.
6088
04:28:18,820 --> 04:28:21,180
You'll see this a lot.
6089
04:28:21,180 --> 04:28:22,700
Sometimes if it's more complex,
6090
04:28:22,700 --> 04:28:24,200
you do it in multiple statements,
6091
04:28:24,200 --> 04:28:27,820
or you fall through, check for sanity, check for sanity,
6092
04:28:27,820 --> 04:28:30,140
and only run the code.
6093
04:28:31,400 --> 04:28:35,360
So I hope that that was useful to you,
6094
04:28:35,360 --> 04:28:37,340
looking a little bit at how to debug
6095
04:28:37,340 --> 04:28:40,540
where you don't just start chopping on the line
6096
04:28:40,540 --> 04:28:41,400
that had the problem.
6097
04:28:41,400 --> 04:28:42,680
It's not always that line
6098
04:28:42,680 --> 04:28:44,700
because we never did change that line.
6099
04:28:44,700 --> 04:28:46,320
Although we did change it a little bit at the end,
6100
04:28:46,320 --> 04:28:47,800
we added this guardian here.
6101
04:28:47,800 --> 04:28:49,360
But we also fixed it without it.
6102
04:28:50,320 --> 04:28:52,160
Sometimes you add some print statements
6103
04:28:52,160 --> 04:28:53,300
to figure out what's going on
6104
04:28:53,300 --> 04:28:56,280
before you just start chopping on that line.
6105
04:28:56,280 --> 04:28:58,940
So again, I hope this helps.
6106
04:28:58,940 --> 04:28:59,780
Thanks.
6107
04:29:03,360 --> 04:29:04,880
Hello and welcome to chapter nine.
6108
04:29:04,880 --> 04:29:07,280
Now we're gonna talk about Python dictionaries.
6109
04:29:07,280 --> 04:29:11,260
Python dictionaries are probably the thing
6110
04:29:11,260 --> 04:29:15,120
that most programmers love the most about Python
6111
04:29:15,120 --> 04:29:16,260
because they're very powerful.
6112
04:29:16,260 --> 04:29:18,240
They're like a little in-memory database.
6113
04:29:18,240 --> 04:29:20,560
It's the second of our kinds of collections
6114
04:29:20,560 --> 04:29:22,840
and probably the best collection.
6115
04:29:24,000 --> 04:29:25,200
To review what a collection is,
6116
04:29:25,200 --> 04:29:27,840
it is a situation where we are going to have a variable,
6117
04:29:27,840 --> 04:29:29,640
like a list or a dictionary,
6118
04:29:29,640 --> 04:29:32,320
that we can put multiple pieces of information in
6119
04:29:32,320 --> 04:29:34,960
rather than a single piece of information.
6120
04:29:34,960 --> 04:29:36,680
And of course, prior to collections,
6121
04:29:36,680 --> 04:29:38,740
we would put something into X
6122
04:29:38,740 --> 04:29:40,360
and then we would put something else into X
6123
04:29:40,360 --> 04:29:42,160
and it would be overwritten.
6124
04:29:42,160 --> 04:29:46,320
And now with lists, we can append things on to the end.
6125
04:29:46,320 --> 04:29:50,280
And so if we compare lists and dictionaries,
6126
04:29:50,280 --> 04:29:54,520
the list is sort of the organized version of the collections.
6127
04:29:54,520 --> 04:29:56,000
Everything stays in order.
6128
04:29:56,000 --> 04:29:57,880
You add something, it always adds to the end.
6129
04:29:57,880 --> 04:30:00,240
You take something out, it sort of compacts itself.
6130
04:30:00,240 --> 04:30:02,760
It's zero through the n minus one,
6131
04:30:02,760 --> 04:30:04,440
where n is the number of items.
6132
04:30:04,440 --> 04:30:07,280
And so it's very organized, kind of like a Pringles can,
6133
04:30:07,280 --> 04:30:09,420
where the potato chips are nicely stacked.
6134
04:30:11,160 --> 04:30:13,440
Dictionaries are messier.
6135
04:30:13,440 --> 04:30:16,240
You can put things into dictionaries.
6136
04:30:16,240 --> 04:30:19,640
There's no real sense of order in dictionaries.
6137
04:30:19,640 --> 04:30:20,760
Everything has a key.
6138
04:30:20,760 --> 04:30:22,220
So you sort of throw things in
6139
04:30:22,220 --> 04:30:24,520
and they kind of mix around in there somehow.
6140
04:30:24,520 --> 04:30:26,420
And you pull things out based on the key.
6141
04:30:26,420 --> 04:30:29,560
It's like you sort of stick a label on it,
6142
04:30:30,880 --> 04:30:34,000
where you say, okay, I'm gonna take this thing
6143
04:30:36,120 --> 04:30:37,600
and I'm gonna put Chuck on it.
6144
04:30:39,000 --> 04:30:44,000
And I'm gonna take these sunglasses with the Chuck label
6145
04:30:46,120 --> 04:30:48,080
and I'm gonna throw it into the dictionary
6146
04:30:48,080 --> 04:30:50,000
and I'm like, hey, give me back Chuck.
6147
04:30:50,000 --> 04:30:51,480
I'm like, oh, here's your sunglasses
6148
04:30:51,480 --> 04:30:53,280
because you mark everything.
6149
04:30:53,280 --> 04:30:56,180
This is like the key.
6150
04:30:56,180 --> 04:30:57,120
This is the value.
6151
04:30:57,120 --> 04:30:59,440
I took a pair of sunglasses and I threw it in.
6152
04:30:59,440 --> 04:31:03,680
So it's kind of like a purse or it's sort of like a mess.
6153
04:31:03,680 --> 04:31:05,880
And so the idea is you have these labels
6154
04:31:05,880 --> 04:31:08,800
that you put on everything that you're gonna throw in.
6155
04:31:08,800 --> 04:31:11,840
Like I'm gonna put, so it won't stick to my keys.
6156
04:31:13,240 --> 04:31:15,400
You know, what else do I got here?
6157
04:31:15,400 --> 04:31:19,000
I'm gonna stick a label on my pen, a Chuck label,
6158
04:31:19,000 --> 04:31:21,400
and I'm gonna store a pen in my dictionary
6159
04:31:21,400 --> 04:31:22,540
with a Chuck label.
6160
04:31:23,660 --> 04:31:27,800
And so it's like having a purse or a bag or a backpack
6161
04:31:27,800 --> 04:31:31,820
where you have things labeled and you can throw things in
6162
04:31:31,820 --> 04:31:34,480
and label them and you can shout into your bag and say,
6163
04:31:34,480 --> 04:31:37,200
give me the calculator or give me the candy
6164
04:31:37,200 --> 04:31:39,560
or whatever it is that you have labeled them.
6165
04:31:39,560 --> 04:31:41,200
You have to come up with the labels
6166
04:31:41,200 --> 04:31:43,880
and then you can use the labels to get things back out.
6167
04:31:43,880 --> 04:31:46,800
And like I said, they're probably the most powerful thing.
6168
04:31:46,800 --> 04:31:48,960
And they're basically this concept
6169
04:31:48,960 --> 04:31:51,320
that's generally referred to as associative arrays,
6170
04:31:51,320 --> 04:31:54,520
which means they're like lists, but they have these keys.
6171
04:31:54,520 --> 04:31:57,280
And so the associative means the association
6172
04:31:57,280 --> 04:31:58,920
between a key and a value.
6173
04:31:58,920 --> 04:32:01,120
Whereas in a list, there's a position and a value
6174
04:32:01,120 --> 04:32:04,240
and the position is less powerful and less flexible.
6175
04:32:04,240 --> 04:32:06,840
Most modern programming languages have this notion
6176
04:32:06,840 --> 04:32:08,120
of associative arrays.
6177
04:32:08,120 --> 04:32:09,920
If they don't, they're sort of unpopular
6178
04:32:09,920 --> 04:32:12,400
because once you get used to them, you're like,
6179
04:32:12,400 --> 04:32:13,560
whoa, they're so powerful.
6180
04:32:13,560 --> 04:32:15,000
If you ever find yourself in a language
6181
04:32:15,000 --> 04:32:17,840
that doesn't have them, you'll freak out.
6182
04:32:17,840 --> 04:32:20,560
They have different names like property maps
6183
04:32:20,560 --> 04:32:22,920
or hash maps or property bags,
6184
04:32:22,920 --> 04:32:24,480
depending on the language you're using,
6185
04:32:24,480 --> 04:32:26,040
but they all are the same thing.
6186
04:32:26,040 --> 04:32:27,740
They're key value pairs.
6187
04:32:28,940 --> 04:32:31,880
So the idea of a dictionary is that,
6188
04:32:31,880 --> 04:32:32,840
or the idea of any collection
6189
04:32:32,840 --> 04:32:34,800
is putting more than one thing in.
6190
04:32:34,800 --> 04:32:36,000
And then the difference is,
6191
04:32:36,000 --> 04:32:39,360
is that you have ways of indexing it.
6192
04:32:39,360 --> 04:32:41,000
So this line basically says,
6193
04:32:41,000 --> 04:32:42,400
let's make ourselves a dictionary,
6194
04:32:42,400 --> 04:32:45,160
just like we constructed an empty list.
6195
04:32:45,160 --> 04:32:47,560
And I want to store 12 into this dictionary
6196
04:32:47,560 --> 04:32:49,520
and I want to label it money.
6197
04:32:49,520 --> 04:32:52,160
And so on the left-hand side, when we use this money,
6198
04:32:52,160 --> 04:32:54,420
that's the label that we're going to give it.
6199
04:32:54,420 --> 04:32:56,240
And so 12 is being placed in the dictionary.
6200
04:32:56,240 --> 04:32:58,000
That's like taking the 12,
6201
04:32:58,000 --> 04:33:00,080
throwing it in the dictionary with a label of money.
6202
04:33:00,080 --> 04:33:00,960
I can't, yeah.
6203
04:33:02,000 --> 04:33:03,640
Three's going in with a label of candy
6204
04:33:03,640 --> 04:33:05,480
and 75 is going in with tissues.
6205
04:33:05,480 --> 04:33:06,740
We say, what's in there?
6206
04:33:06,740 --> 04:33:08,000
And there's no order to it.
6207
04:33:08,000 --> 04:33:10,040
And sometimes the order can even change
6208
04:33:10,040 --> 04:33:11,540
inside of a dictionary.
6209
04:33:11,540 --> 04:33:13,560
Although there are more advanced versions of dictionaries
6210
04:33:13,560 --> 04:33:15,240
that maintain some kind of order,
6211
04:33:15,240 --> 04:33:18,480
but for now let's just not worry about the ordering of them.
6212
04:33:19,720 --> 04:33:20,640
If we say, what's in there?
6213
04:33:20,640 --> 04:33:22,080
You say, oh, there's three things in there.
6214
04:33:22,080 --> 04:33:24,520
There is 12, 75, and three,
6215
04:33:24,520 --> 04:33:26,800
and stored under the keys, money,
6216
04:33:26,800 --> 04:33:28,880
tissues, and candy, respectively.
6217
04:33:28,880 --> 04:33:32,439
We can ask, using the index operator,
6218
04:33:32,439 --> 04:33:33,519
what is purse of candy?
6219
04:33:33,520 --> 04:33:35,640
And that's like saying, hey, give me back candy.
6220
04:33:35,640 --> 04:33:39,840
And out comes the number three, which is that.
6221
04:33:39,840 --> 04:33:40,880
We can update stuff.
6222
04:33:40,880 --> 04:33:43,320
So we can say, go grab the candy version,
6223
04:33:43,320 --> 04:33:45,480
add two to it, make five,
6224
04:33:45,480 --> 04:33:46,919
and then store that back into candy.
6225
04:33:46,919 --> 04:33:51,919
And so now we see that candy has been set up to be five.
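The steps just described can be sketched as follows. The name purse comes from the phrase "purse of candy"; the exact spelling of the code is an assumption.

```python
# Make an empty dictionary, then store values under labels (keys).
purse = dict()
purse['money'] = 12
purse['candy'] = 3
purse['tissues'] = 75
print(purse)

# Look a value up by its key with the index operator.
print(purse['candy'])

# Update: grab the current value, add two, and store it back.
purse['candy'] = purse['candy'] + 2
print(purse)
```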
6226
04:33:55,279 --> 04:33:58,319
And so if you look at the difference
6227
04:33:58,320 --> 04:33:59,720
between lists and dictionaries,
6228
04:33:59,720 --> 04:34:02,599
they both can have new items added to them.
6229
04:34:02,599 --> 04:34:03,879
We haven't talked a lot about deleting,
6230
04:34:03,880 --> 04:34:05,919
but items can be deleted from them.
6231
04:34:05,919 --> 04:34:07,959
The difference is the indexing mechanism,
6232
04:34:07,960 --> 04:34:10,400
how we look things up, how we store things,
6233
04:34:10,400 --> 04:34:11,680
and how we look things up.
6234
04:34:11,680 --> 04:34:14,480
So we make an empty list, we make an empty dictionary.
6235
04:34:14,480 --> 04:34:16,840
We add 21 to the end, and we add 183 to the end,
6236
04:34:16,840 --> 04:34:18,599
and we ask it, and it says, oh,
6237
04:34:18,599 --> 04:34:21,679
position zero is 21, and position one is 183.
6238
04:34:21,680 --> 04:34:23,320
We don't see the positions when we print it out,
6239
04:34:23,320 --> 04:34:24,720
because it's sort of implicit.
6240
04:34:24,720 --> 04:34:27,020
Here we're gonna, and mark 21 with age,
6241
04:34:27,020 --> 04:34:29,400
and stick it in, and mark 182 with course,
6242
04:34:29,400 --> 04:34:31,279
and stick it in, and then we're gonna print it out,
6243
04:34:31,279 --> 04:34:34,199
and there we got course and age mapped.
6244
04:34:34,200 --> 04:34:37,520
And we can add 23 and stick it back in age,
6245
04:34:37,520 --> 04:34:40,860
and that overwrites, so the 21 becomes the 23.
6246
04:34:40,860 --> 04:34:42,200
We can do the same thing in a list,
6247
04:34:42,200 --> 04:34:43,800
except we say lists of zero,
6248
04:34:43,800 --> 04:34:46,520
because in lists, the indexing is position,
6249
04:34:46,520 --> 04:34:50,100
and so this 21 becomes 23.
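A side-by-side sketch of the same steps on a list and on a dictionary. The variable names lst and ddd are assumptions chosen for illustration.

```python
# List: values are found by position, which is assigned implicitly.
lst = list()
lst.append(21)
lst.append(183)
print(lst)
lst[0] = 23          # overwrite whatever is at position zero
print(lst)

# Dictionary: values are found by the key we chose.
ddd = dict()
ddd['age'] = 21
ddd['course'] = 182
print(ddd)
ddd['age'] = 23      # overwrite whatever is stored under 'age'
print(ddd)
```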
6250
04:34:51,919 --> 04:34:53,479
And again, you just look at them,
6251
04:34:53,480 --> 04:34:55,099
and you can think of each of these
6252
04:34:55,099 --> 04:34:57,999
as pretty much doing roughly the same thing,
6253
04:34:58,000 --> 04:35:00,119
except the indexing mechanism.
6254
04:35:00,119 --> 04:35:03,359
The values are the same, but the keys are different.
6255
04:35:03,360 --> 04:35:05,560
So in lists, the keys are always the position,
6256
04:35:05,560 --> 04:35:06,960
and you don't get to assign those
6257
04:35:06,960 --> 04:35:09,360
other than the fact that the order in which you put them in
6258
04:35:09,360 --> 04:35:11,279
implicitly assigns a position,
6259
04:35:11,279 --> 04:35:15,039
and in dictionaries, the key is a string.
6260
04:35:15,919 --> 04:35:17,479
You can actually use other things.
6261
04:35:17,480 --> 04:35:19,759
I use strings a lot in this lecture,
6262
04:35:19,759 --> 04:35:21,839
but that just kinda keeps things simple
6263
04:35:21,840 --> 04:35:23,720
until you get good at it.
6264
04:35:23,720 --> 04:35:27,080
You can actually use numbers as the dictionary index,
6265
04:35:27,080 --> 04:35:28,500
the dictionary keys if you want,
6266
04:35:28,500 --> 04:35:30,720
but the values are things you put in
6267
04:35:30,720 --> 04:35:33,599
and manage in those dictionaries.
6268
04:35:33,599 --> 04:35:37,319
So we can, just like lists, we have dictionary literals,
6269
04:35:37,320 --> 04:35:39,480
and what's nice about dictionary literals
6270
04:35:39,480 --> 04:35:43,080
is that they use the exact same syntax as the printout,
6271
04:35:43,080 --> 04:35:44,720
and so it starts with a curly brace,
6272
04:35:44,720 --> 04:35:46,080
ends with a curly brace,
6273
04:35:46,080 --> 04:35:48,400
and then has a series of key colon value,
6274
04:35:48,400 --> 04:35:50,960
key colon value, key colon value,
6275
04:35:50,960 --> 04:35:53,259
and this is sort of the associative array bit.
6276
04:35:53,259 --> 04:35:56,039
We are associating one with a key chuck.
6277
04:35:56,040 --> 04:35:58,200
We are associating 42 with a key fred,
6278
04:35:58,200 --> 04:36:00,880
we're associating jan and 100.
6279
04:36:00,880 --> 04:36:02,919
Then we print it out, it kinda looks exactly the same,
6280
04:36:02,919 --> 04:36:05,679
so the print statements in Python are nice
6281
04:36:05,680 --> 04:36:07,599
in that you ask what's in a thing,
6282
04:36:07,599 --> 04:36:10,119
you show the stuff, and it shows you in the syntax
6283
04:36:10,119 --> 04:36:11,839
that if you type that into Python,
6284
04:36:11,840 --> 04:36:16,279
that would be how you do a constant.
6285
04:36:16,279 --> 04:36:18,319
And if you just say empty array,
6286
04:36:18,320 --> 04:36:21,640
you see me also do D-I-C-T.
6287
04:36:21,640 --> 04:36:23,040
This is a constructor where you say
6288
04:36:23,040 --> 04:36:24,599
make a new empty dictionary.
6289
04:36:24,599 --> 04:36:26,659
This is an empty dictionary constant.
6290
04:36:26,660 --> 04:36:29,460
These two things are pretty much the exact same thing.
6291
04:36:29,460 --> 04:36:33,040
This is a shortcut to doing this.
6292
04:36:33,040 --> 04:36:37,480
The empty curly braces is a shortcut
6293
04:36:37,480 --> 04:36:41,919
to do the construction.
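A small sketch of a dictionary literal and the two ways to make an empty dictionary. The variable names, and the spelling of the keys as chuck, fred, and jan, are assumptions.

```python
# A dictionary literal: curly braces around key: value pairs.
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}
print(jjj)

# Two ways to get an empty dictionary.
ooo = {}             # empty dictionary constant
also_empty = dict()  # constructor call
print(ooo == also_empty)
```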
6294
04:36:41,919 --> 04:36:43,519
So up next, we're gonna talk about
6295
04:36:43,520 --> 04:36:46,279
sort of one of the really common applications
6296
04:36:46,279 --> 04:36:48,319
of dictionaries, and that is counting.
6297
04:36:52,279 --> 04:36:53,819
So now we're gonna talk to you about
6298
04:36:53,820 --> 04:36:56,400
one of the common applications of dictionaries,
6299
04:36:56,400 --> 04:36:58,400
and that is making histograms.
6300
04:36:58,400 --> 04:37:01,200
It's counting the frequency of things.
6301
04:37:01,200 --> 04:37:03,520
And so if you think of a histogram as,
6302
04:37:03,520 --> 04:37:07,160
it's a little graph, and there is A,
6303
04:37:07,160 --> 04:37:09,360
how many A's, how many B's, and how many C's,
6304
04:37:09,360 --> 04:37:10,400
and there's a histogram that says,
6305
04:37:10,400 --> 04:37:12,599
oh, there's this many of that, and this many of that,
6306
04:37:12,599 --> 04:37:14,959
and these are like buckets, these are frequencies,
6307
04:37:14,960 --> 04:37:17,480
and this is how many times it happens, so a histogram.
6308
04:37:17,480 --> 04:37:18,680
But we're gonna do this thing
6309
04:37:18,680 --> 04:37:20,840
where we're gonna count people's names,
6310
04:37:20,840 --> 04:37:23,200
and we're gonna kinda count how many that we see.
6311
04:37:23,200 --> 04:37:25,240
But the interesting thing that we're gonna solve,
6312
04:37:25,240 --> 04:37:27,400
just like many of the things in the computer,
6313
04:37:27,400 --> 04:37:28,840
is we can't just sort of look at the data,
6314
04:37:28,840 --> 04:37:30,540
we gotta look at the data iteratively,
6315
04:37:30,540 --> 04:37:33,320
one piece of data at a time.
6316
04:37:33,320 --> 04:37:35,919
So I'm gonna give you a little problem, okay?
6317
04:37:35,919 --> 04:37:38,939
I'm gonna show you a series of names, one at a time,
6318
04:37:38,939 --> 04:37:42,479
and I want you to count for each name,
6319
04:37:42,480 --> 04:37:44,360
make a little bucket, and then keep counting
6320
04:37:44,360 --> 04:37:46,439
how many things for each of the different names, okay?
6321
04:37:46,439 --> 04:37:48,879
You'll notice that you have to start with one,
6322
04:37:48,880 --> 04:37:51,800
and then you move across, so just watch this,
6323
04:37:51,800 --> 04:37:55,080
and tell me how many,
6324
04:37:55,080 --> 04:37:57,599
how many, what's the most common name
6325
04:37:57,599 --> 04:37:59,519
of the set of names I'm about to show you,
6326
04:37:59,520 --> 04:38:01,520
and how many do we see?
6327
04:38:01,520 --> 04:38:23,680
One, two, three, four, five.
6328
04:38:31,520 --> 04:38:36,520
So how many, what was the most common name
6329
04:38:38,240 --> 04:38:40,500
and how many times did you see it?
6330
04:38:40,500 --> 04:38:42,279
That's the question.
6331
04:38:42,279 --> 04:38:44,239
Now, here comes the review.
6332
04:38:44,240 --> 04:38:46,119
So for humans, it's so much easier for you
6333
04:38:46,119 --> 04:38:47,359
to just look at this and you think,
6334
04:38:47,360 --> 04:38:49,119
how did my brain look at that?
6335
04:38:49,119 --> 04:38:51,559
And you're like, okay, what is pretty common?
6336
04:38:51,560 --> 04:38:55,560
Oh, maybe, maybe Chen is common.
6337
04:38:55,560 --> 04:38:59,360
Oh, Chen, Chen, Chen, no.
6338
04:38:59,360 --> 04:39:04,360
Maybe Chen is common, one, two, three, four, yeah.
6339
04:39:04,480 --> 04:39:07,759
Anybody else have, Markov's got three, csev.
6340
04:39:07,759 --> 04:39:11,699
And so you'll notice how our minds, without computers,
6341
04:39:11,700 --> 04:39:15,360
we just sort of like bounce, branch and bound.
6342
04:39:15,360 --> 04:39:17,720
We have hypotheses and then we decide,
6343
04:39:17,720 --> 04:39:21,560
yep, it's Chen, that's it, and there's four of them.
6344
04:39:21,560 --> 04:39:24,279
Now, how did your brain think about this
6345
04:39:24,279 --> 04:39:27,279
as we were going through them one at a time?
6346
04:39:27,279 --> 04:39:30,239
Well, my guess is if you really had to do this a lot,
6347
04:39:30,240 --> 04:39:32,320
you would make a little picture like this.
6348
04:39:32,320 --> 04:39:36,119
And then what you would do is if you saw a new name,
6349
04:39:36,119 --> 04:39:38,120
XYZ, you'd add it to the list
6350
04:39:38,120 --> 04:39:39,880
and give it a tick mark of one.
6351
04:39:39,880 --> 04:39:42,980
And then if you saw csev again, you'd give that a tick mark.
6352
04:39:42,980 --> 04:39:45,480
And if you saw XYZ again, you'd make a tick mark.
6353
04:39:45,480 --> 04:39:48,480
And then you'd keep adding to these tick marks, right?
6354
04:39:48,480 --> 04:39:49,720
And that's how you would do it.
6355
04:39:49,720 --> 04:39:52,440
And you wouldn't, like many of the things we do in a loop,
6356
04:39:52,440 --> 04:39:54,360
you wouldn't really know what the most common one was
6357
04:39:54,360 --> 04:39:55,480
until the end.
6358
04:39:55,480 --> 04:39:57,400
And then you'd sort of take a look at these numbers
6359
04:39:57,400 --> 04:40:00,840
and you'd say, okay, that's the most common number.
6360
04:40:00,840 --> 04:40:03,040
And then you'd be done.
6361
04:40:03,040 --> 04:40:05,200
But you have to watch them one at a time.
6362
04:40:05,200 --> 04:40:06,740
You can't just bounce around.
6363
04:40:08,200 --> 04:40:12,180
And so that's how we're gonna use dictionaries
6364
04:40:12,180 --> 04:40:13,680
to achieve that.
6365
04:40:13,680 --> 04:40:16,000
Again, instinctively as humans, we just look at the stuff.
6366
04:40:16,000 --> 04:40:17,340
But if you had a million things,
6367
04:40:17,340 --> 04:40:18,680
you probably wanna write a Python program
6368
04:40:18,680 --> 04:40:19,760
and use dictionaries.
6369
04:40:19,760 --> 04:40:21,120
And so this is the idea.
6370
04:40:21,120 --> 04:40:22,600
And there's two basic things that happen.
6371
04:40:22,600 --> 04:40:24,520
One is the first time you see a name.
6372
04:40:24,520 --> 04:40:27,160
Like I say, is this name there already?
6373
04:40:27,160 --> 04:40:28,160
If it's there already,
6374
04:40:28,160 --> 04:40:30,320
you really just wanna add one to it, right?
6375
04:40:30,320 --> 04:40:31,520
That's the adding of a tick.
6376
04:40:31,520 --> 04:40:34,520
Or, if you're seeing it for the first time,
6377
04:40:34,520 --> 04:40:36,600
you know, blah, blah, blah, blah, blah, and give it a one.
6378
04:40:36,600 --> 04:40:41,040
And so you can use the name as the key.
6379
04:40:41,040 --> 04:40:42,260
And then one is the value.
6380
04:40:42,260 --> 04:40:44,840
And then first time you see Chen, you stick one in there.
6381
04:40:44,840 --> 04:40:47,240
And so at this point inside the dictionary,
6382
04:40:47,240 --> 04:40:50,280
sort of dynamically adding as soon as it sees a new name,
6383
04:40:50,280 --> 04:40:51,720
it adds another slot in here.
6384
04:40:52,600 --> 04:40:54,120
But then if you see the same name again,
6385
04:40:54,120 --> 04:40:56,520
like Chen again, then you end up with a one,
6386
04:40:56,520 --> 04:40:58,320
add one to it, and so it's two.
6387
04:40:58,320 --> 04:40:59,600
And so at that point, Chen is two.
6388
04:40:59,600 --> 04:41:03,400
And so you can see how you can both extend the dictionary
6389
04:41:03,400 --> 04:41:08,080
by encountering a new name or adding when you see a name
6390
04:41:08,080 --> 04:41:09,860
that you've already seen before.
6391
04:41:11,080 --> 04:41:14,160
The problem with dictionaries is like everything in Python,
6392
04:41:14,160 --> 04:41:16,600
there are rules about what you can and can't do.
6393
04:41:16,600 --> 04:41:17,680
And one of the, I think,
6394
04:41:17,680 --> 04:41:19,820
kind of frustrating things about dictionaries
6395
04:41:19,820 --> 04:41:23,160
is that you can't just look for a key that doesn't exist.
6396
04:41:23,160 --> 04:41:24,720
So this is a fresh brand new dictionary,
6397
04:41:24,720 --> 04:41:27,720
we do a constructor there, and we print out sub csev,
6398
04:41:27,720 --> 04:41:30,680
and boom, it blows up, and that's bad.
6399
04:41:30,680 --> 04:41:33,040
But we can solve this by the in operator.
6400
04:41:33,040 --> 04:41:34,680
The in operator we've used in the for loops.
6401
04:41:34,680 --> 04:41:37,140
We've used it in lists, we've used it in strings.
6402
04:41:37,140 --> 04:41:41,080
So that is a question, it's saying, is csev in ccc?
6403
04:41:41,080 --> 04:41:44,680
Well, this is this empty one, and so it is no, it is not.
6404
04:41:44,680 --> 04:41:46,240
csev is not in ccc.
6405
04:41:46,240 --> 04:41:49,780
And so using this in operator, we can avoid the traceback.
6406
04:41:49,780 --> 04:41:52,640
We can say, if it's not there, put it in.
6407
04:41:52,640 --> 04:41:54,400
If it is there, add one to it.
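Here is a minimal sketch of that idea in Python. The dictionary name ccc and the key 'csev' follow what is said above; the try/except is only there so the sketch runs to the end instead of stopping at the traceback.

```python
ccc = dict()          # a fresh, brand new, empty dictionary

# Looking up a key that does not exist raises KeyError (the "blows up" case).
try:
    print(ccc['csev'])
except KeyError:
    print('csev is not a key yet')

# The in operator asks the question without any traceback.
found = 'csev' in ccc
print(found)

# If it's not there, put it in. If it is there, add one to it.
if 'csev' not in ccc:
    ccc['csev'] = 1
else:
    ccc['csev'] = ccc['csev'] + 1
print(ccc)
```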
6408
04:41:54,400 --> 04:41:58,160
And that leads us to this bit of code.
6409
04:41:58,160 --> 04:42:00,600
Okay, and that is the kind of code
6410
04:42:00,600 --> 04:42:01,600
that we're gonna build a histogram,
6411
04:42:01,600 --> 04:42:04,400
this is gonna be histogram code, okay?
6412
04:42:04,400 --> 04:42:07,600
And so this is gonna have name as our iteration variable through names.
6413
04:42:07,600 --> 04:42:10,480
Sorry, I made them singular and plural, that's nice,
6414
04:42:10,480 --> 04:42:13,480
but so name is gonna be csev, cwen, csev, zqian, cwen.
6415
04:42:13,480 --> 04:42:15,480
Now normally, we'll be reading this from a file,
6416
04:42:15,480 --> 04:42:17,800
but for now, keep it easy.
6417
04:42:17,800 --> 04:42:19,020
We're gonna go through this.
6418
04:42:19,020 --> 04:42:20,860
And we're gonna have counts as our dictionary.
6419
04:42:20,860 --> 04:42:22,380
So that starts out empty.
6420
04:42:22,380 --> 04:42:24,180
And we're gonna do a simple if then else
6421
04:42:24,180 --> 04:42:25,600
every time through the loop.
6422
04:42:25,600 --> 04:42:28,160
If the name we're looking at is not in the dictionary
6423
04:42:28,160 --> 04:42:31,460
already as a key, then set it to be one.
6424
04:42:31,460 --> 04:42:36,260
If it is, go get the old value, counts sub name,
6425
04:42:36,260 --> 04:42:38,560
and then add one to it and stick it back in.
6426
04:42:38,560 --> 04:42:42,920
So this line right here is new, adding a new thing.
6427
04:42:42,920 --> 04:42:45,120
And this line right here is adding
6428
04:42:45,120 --> 04:42:46,960
some things to existing things.
6429
04:42:46,960 --> 04:42:49,280
And you do this long enough, you start with an empty one,
6430
04:42:49,280 --> 04:42:52,080
and you do this long enough, at the very end,
6431
04:42:52,080 --> 04:42:55,880
it will print out the histogram that you're looking for,
6432
04:42:55,880 --> 04:42:57,280
the histogram you're looking for.
6433
04:42:57,280 --> 04:42:59,360
And so you say, oh, we've seen csev twice,
6434
04:42:59,360 --> 04:43:01,080
zqian once, and cwen twice.
6435
04:43:01,080 --> 04:43:02,500
And so that's the idea.
6436
04:43:02,500 --> 04:43:05,380
And so this can run a million times if you want.
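The loop just described might look roughly like this. It is a sketch, not necessarily the exact code on the slide; the five names in the list are assumed sample data.

```python
counts = dict()                                     # starts out empty
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']   # assumed sample names

for name in names:
    if name not in counts:
        counts[name] = 1                   # new name: add a new entry
    else:
        counts[name] = counts[name] + 1    # seen before: add one to it

print(counts)
```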
6437
04:43:09,440 --> 04:43:14,120
Now, this notion of checking to see if a key exists
6438
04:43:14,120 --> 04:43:16,040
and doing one thing if it doesn't exist
6439
04:43:16,040 --> 04:43:18,320
and doing another thing if it does exist
6440
04:43:18,320 --> 04:43:24,560
is such a common practice that the dictionary object has
6441
04:43:24,560 --> 04:43:29,560
this method called get that collapses these four lines
6442
04:43:29,560 --> 04:43:30,800
into one line.
6443
04:43:30,800 --> 04:43:33,920
And so the idea is you're going to do one thing if it's in there
6444
04:43:33,920 --> 04:43:35,840
and you're going to retrieve the current thing.
6445
04:43:35,840 --> 04:43:37,920
Otherwise, you're going to pick a default value.
6446
04:43:37,920 --> 04:43:39,440
In this case, we'll pick one.
6447
04:43:39,440 --> 04:43:40,600
I mean, you pick zero.
6448
04:43:40,600 --> 04:43:44,740
This is like the default, meaning what is not there.
6449
04:43:44,740 --> 04:43:48,040
And if you say counts, now counts is a dictionary, dot get.
6450
04:43:48,040 --> 04:43:49,240
That's like string dot upper.
6451
04:43:49,240 --> 04:43:50,380
That's a method.
6452
04:43:50,380 --> 04:43:53,440
You give it a key and then a default.
6453
04:43:53,440 --> 04:43:55,680
And if the key exists, you get back what's in the key.
6454
04:43:55,680 --> 04:43:59,480
If the key doesn't exist, you get the default.
6455
04:43:59,480 --> 04:44:02,000
And with no traceback, this works.
6456
04:44:02,000 --> 04:44:03,480
So the best way to think about this
6457
04:44:03,480 --> 04:44:08,960
is those four lines are equal to that one line.
6458
04:44:08,960 --> 04:44:11,400
Because x is either going to be whatever was in there before
6459
04:44:11,400 --> 04:44:14,120
if it exists or it's going to be zero.
6460
04:44:14,120 --> 04:44:16,200
Now, the nice thing about zero is the next thing we're
6461
04:44:16,200 --> 04:44:17,360
going to do is we're going to add one to it.
6462
04:44:17,360 --> 04:44:18,840
So that's going to get us to one.
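To see why the four lines and the one line are the same, here is a small sketch. The function names lookup_long and lookup_short are made up for this illustration; the get method with a key and a default is the real dictionary method.

```python
def lookup_long(counts, name):
    # The four-line version: check with in, then pick the value or the default.
    if name in counts:
        x = counts[name]
    else:
        x = 0
    return x

def lookup_short(counts, name):
    # The one-line version: get returns the value, or the default if the key is missing.
    return counts.get(name, 0)

counts = {'csev': 2}
print(lookup_long(counts, 'csev'), lookup_short(counts, 'csev'))
print(lookup_long(counts, 'cwen'), lookup_short(counts, 'cwen'))
```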
6463
04:44:18,840 --> 04:44:25,720
So collapsing that loop that we saw before,
6464
04:44:25,720 --> 04:44:28,920
collapsing that loop, we can make it just a one line loop.
6465
04:44:28,920 --> 04:44:31,280
And this will become an idiom.
6466
04:44:31,280 --> 04:44:33,720
This will become something that you will get used to.
6467
04:44:33,720 --> 04:44:36,200
And you will use over and over and over again.
6468
04:44:36,200 --> 04:44:38,440
And after a while, right now, you're looking at it, boy,
6469
04:44:38,440 --> 04:44:41,880
boy, that's a lot of syntax and semicolons and whatever.
6470
04:44:41,880 --> 04:44:44,240
After a while, you just type this and not even think about it.
6471
04:44:44,240 --> 04:44:45,840
It's an idiom.
6472
04:44:45,840 --> 04:44:48,120
It's basically included in this idiom
6473
04:44:48,120 --> 04:44:50,880
is how to both create new entries in dictionaries
6474
04:44:50,880 --> 04:44:54,400
and update existing entries by adding one to them.
6475
04:44:54,400 --> 04:44:56,960
So everything else in this is the same.
6476
04:44:56,960 --> 04:44:59,120
Name is going to go through these five values.
6477
04:44:59,120 --> 04:45:01,400
And we're going to say counts of name equals
6478
04:45:01,400 --> 04:45:04,960
counts.get name comma zero plus one.
6479
04:45:04,960 --> 04:45:07,960
And so if, for example, this already has a one in it,
6480
04:45:07,960 --> 04:45:11,000
then this is going to be one plus one becomes two.
6481
04:45:11,000 --> 04:45:14,000
If it's not, it's going to be zero plus one equals one.
6482
04:45:14,000 --> 04:45:18,160
And so this is the idea of if new set it to one, not zero,
6483
04:45:18,160 --> 04:45:20,800
set it to one because the first time you see something,
6484
04:45:20,800 --> 04:45:22,680
the count should be one, not zero.
6485
04:45:22,680 --> 04:45:24,400
So that's why we make this default.
6486
04:45:24,400 --> 04:45:26,480
Now the get can be used for anything.
6487
04:45:26,480 --> 04:45:29,240
It just so happens that zero is a common default
6488
04:45:29,240 --> 04:45:31,480
because it's really common that we're using this
6489
04:45:31,480 --> 04:45:33,600
to basically make a histogram, right?
6490
04:45:33,600 --> 04:45:36,000
Little histogram of a, b, c, right?
6491
04:45:36,000 --> 04:45:38,160
And so we need to make a d,
6492
04:45:38,160 --> 04:45:39,960
but then the histogram has to start at one.
6493
04:45:39,960 --> 04:45:43,760
So that's basically the simplified counting
6494
04:45:43,760 --> 04:45:44,600
with get.
6495
04:45:44,600 --> 04:45:47,680
And there's a lot of things that we're going to do
6496
04:45:47,680 --> 04:45:52,400
inside of Python that do have to do with frequencies
6497
04:45:52,400 --> 04:45:54,680
and how many times certain things happened.
6498
04:45:54,680 --> 04:45:57,880
And this pattern is a really good pattern
6499
04:45:57,880 --> 04:45:59,200
to absolutely know.
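As a sketch, the simplified counting with get could be written like this. The list of names is assumed sample data.

```python
counts = dict()
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']   # assumed sample names

for name in names:
    # Either creates the entry (0 + 1) or adds one to the existing count.
    counts[name] = counts.get(name, 0) + 1

print(counts)
```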
6500
04:46:02,920 --> 04:46:05,640
So now what we're going to do is we're going to switch
6501
04:46:05,640 --> 04:46:07,440
from just looping through strings,
6502
04:46:07,440 --> 04:46:08,720
instead loop through files.
6503
04:46:08,720 --> 04:46:10,680
And it's going to take a little bit of work
6504
04:46:10,680 --> 04:46:11,860
because we have to open the file
6505
04:46:11,860 --> 04:46:14,360
and we'll bring a lot of things together at this point.
6506
04:46:14,360 --> 04:46:16,240
So here would be another task
6507
04:46:16,240 --> 04:46:19,260
and that is here's a bunch of text from the book
6508
04:46:19,260 --> 04:46:22,360
and you can just split this into words
6509
04:46:22,360 --> 04:46:25,720
and count and find out what the most common word is
6510
04:46:25,720 --> 04:46:28,640
and how many times it occurs.
6511
04:46:28,640 --> 04:46:30,360
So go ahead and try to do this for a second.
6512
04:46:30,360 --> 04:46:31,960
Feel free to pause.
6513
04:46:31,960 --> 04:46:33,120
Actually don't bother pausing.
6514
04:46:33,120 --> 04:46:33,960
This is too hard.
6515
04:46:33,960 --> 04:46:35,160
We should write a program for this.
6516
04:46:35,160 --> 04:46:36,640
It's not easy.
6517
04:46:36,640 --> 04:46:37,720
Humans don't like this.
6518
04:46:37,720 --> 04:46:39,040
It makes you concentrate.
6519
04:46:39,040 --> 04:46:42,560
And so here is a counting pattern
6520
04:46:42,560 --> 04:46:44,120
where we're going to take a line
6521
04:46:44,120 --> 04:46:46,840
and then later we'll read this in a file.
6522
04:46:46,840 --> 04:46:51,120
And so this is just an adaptation improvement
6523
04:46:51,120 --> 04:46:52,020
of the previous thing.
6524
04:46:52,020 --> 04:46:53,940
So we're going to start with an empty dictionary.
6525
04:46:53,940 --> 04:46:56,280
We're going to ask for a line of text and read it in.
6526
04:46:56,280 --> 04:46:57,640
And then we're going to use split.
6527
04:46:57,640 --> 04:46:58,920
So remember the list of words?
6528
04:46:58,920 --> 04:47:02,040
Well, what we're going to get here is a list of words.
6529
04:47:02,040 --> 04:47:04,160
We'll print it out and we'll run this counting.
6530
04:47:04,160 --> 04:47:06,440
This is the little loop.
6531
04:47:06,440 --> 04:47:08,600
For every word in whatever this was,
6532
04:47:08,600 --> 04:47:13,040
we're going to do this idiom of either adding a new entry
6533
04:47:13,040 --> 04:47:15,000
or adding one to an existing entry
6534
04:47:15,000 --> 04:47:16,360
and then printing that out.
6535
04:47:16,360 --> 04:47:18,800
So let's take a look at what we get there.
6536
04:47:18,800 --> 04:47:22,520
So if we run this, we can give it some text
6537
04:47:22,520 --> 04:47:24,800
and I've got this, this will be all one line.
6538
04:47:24,800 --> 04:47:26,340
And then it splits it into words
6539
04:47:26,340 --> 04:47:29,040
and you see that these words here are split, split,
6540
04:47:29,040 --> 04:47:29,880
split, split.
6541
04:47:29,880 --> 04:47:31,580
I mean that's strings and splits.
6542
04:47:31,580 --> 04:47:34,360
Remember strings and lists and split.
6543
04:47:34,360 --> 04:47:37,440
And so now the counting is gonna go through this list.
6544
04:47:37,440 --> 04:47:39,560
The clown ran after the,
6545
04:47:39,560 --> 04:47:41,200
and it's gonna build a histogram.
6546
04:47:41,200 --> 04:47:44,400
The clown, you know, one clown,
6547
04:47:44,400 --> 04:47:47,860
the up, up, up of these things are gonna go up, right?
6548
04:47:47,860 --> 04:47:49,360
That's this histogram.
6549
04:47:49,360 --> 04:47:50,880
And then when it's all said and done,
6550
04:47:50,880 --> 04:47:52,600
we end up with the histogram.
6551
04:47:52,600 --> 04:47:54,440
And so counts is the dictionary
6552
04:47:54,440 --> 04:47:55,600
that ends up with a histogram.
6553
04:47:55,600 --> 04:47:57,680
And we can start by inspection, see,
6554
04:47:57,680 --> 04:48:00,320
oh, the is the most common word.
6555
04:48:00,320 --> 04:48:02,360
And there are seven of those, right?
6556
04:48:02,360 --> 04:48:04,260
So if we sort of take a look at this,
6557
04:48:04,260 --> 04:48:05,920
we start out, we make a dictionary,
6558
04:48:05,920 --> 04:48:08,960
we read in a line of text, the text goes in.
6559
04:48:08,960 --> 04:48:13,320
We, and then we split that and we print the words out.
6560
04:48:13,320 --> 04:48:15,240
So these are the words, right?
6561
04:48:15,240 --> 04:48:16,080
Then we have a for loop
6562
04:48:16,080 --> 04:48:18,000
that's gonna loop through all those things
6563
04:48:18,000 --> 04:48:19,840
and then produce a dictionary.
6564
04:48:19,840 --> 04:48:21,760
And when we print the dictionary out,
6565
04:48:21,760 --> 04:48:23,180
that's what we're gonna get.
6566
04:48:23,180 --> 04:48:25,440
And the seven, okay?
6567
04:48:25,440 --> 04:48:27,660
So that's one line of text.
6568
04:48:27,660 --> 04:48:31,300
That's how you walk across the words in a line of text
6569
04:48:31,300 --> 04:48:34,440
after you split the line into separate words.
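Here is a sketch of that counting pattern for one line of text. To keep it self-contained, the line is a fixed string rather than something typed in with input(), and the clown sentence itself is an assumed sample.

```python
counts = dict()

# Assumed sample text, all on one line.
line = ('the clown ran after the car and the car ran into the tent '
        'and the tent fell down on the clown and the car')

words = line.split()          # split the line into a list of words
print('Words:', words)

for word in words:
    counts[word] = counts.get(word, 0) + 1   # the counting idiom

print('Counts', counts)
```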
6570
04:48:34,440 --> 04:48:35,880
So now we're gonna look at ways
6571
04:48:35,880 --> 04:48:37,360
that you can loop through dictionaries.
6572
04:48:37,360 --> 04:48:39,840
We just produced a loop that can build a dictionary,
6573
04:48:39,840 --> 04:48:42,200
but now we're gonna look at a dictionary.
6574
04:48:42,200 --> 04:48:44,320
And so we'll start with a very, very simple example
6575
04:48:44,320 --> 04:48:46,880
and then we'll work to a slightly more complex example.
6576
04:48:46,880 --> 04:48:48,920
So here's a dictionary, just the constant,
6577
04:48:48,920 --> 04:48:51,960
Chuck is one, Fred's 42, and Jan's 100.
6578
04:48:51,960 --> 04:48:55,520
And so we're gonna use a definite loop with a for,
6579
04:48:55,520 --> 04:48:56,760
for key in counts.
6580
04:48:56,760 --> 04:49:00,400
Now it doesn't have to be a key, but key is a good name
6581
04:49:00,400 --> 04:49:04,560
because these are keys and values, K, V, K, V,
6582
04:49:04,560 --> 04:49:05,400
keys and values.
6583
04:49:05,400 --> 04:49:06,760
I just mentally think of this
6584
04:49:06,760 --> 04:49:08,880
as keys and values and keys and values.
6585
04:49:08,880 --> 04:49:12,700
So this iteration variable is gonna walk the keys.
6586
04:49:12,700 --> 04:49:16,160
It's not gonna walk the values, it's gonna walk the keys.
6587
04:49:16,160 --> 04:49:19,180
Chuck, Fred, Jan, not necessarily in that particular order.
6588
04:49:19,180 --> 04:49:20,880
As you see, it goes Jan, Chuck, Fred,
6589
04:49:20,880 --> 04:49:23,060
because just because I typed it in in this order,
6590
04:49:23,060 --> 04:49:25,200
it's not like a list, it doesn't stay in that order.
6591
04:49:25,200 --> 04:49:27,820
It might move around a little bit as we add data to it
6592
04:49:27,820 --> 04:49:30,440
or as we set the data up.
6593
04:49:30,440 --> 04:49:32,760
And so you can, in the loop, you can get the key,
6594
04:49:32,760 --> 04:49:35,160
and so that's what prints out the Jan, Chuck, Fred,
6595
04:49:35,160 --> 04:49:37,520
but then you can also get the corresponding count
6596
04:49:37,520 --> 04:49:41,620
for each one of these by just pulling it out of the array.
6597
04:49:41,620 --> 04:49:43,320
I mean, pulling it out of the dictionary, right?
6598
04:49:43,320 --> 04:49:45,800
And so we can pull out the corresponding value,
6599
04:49:45,800 --> 04:49:48,160
and so we print out Jan 100, Chuck 1, Fred 42,
6600
04:49:48,160 --> 04:49:50,540
and that runs this loop three times.
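A small sketch of that loop follows. The list named seen is only there to record what the loop visited. One thing worth knowing: in Python 3.7 and later a dictionary keeps the order in which keys were inserted, so on a current Python you would likely see chuck, fred, jan rather than a shuffled order.

```python
counts = {'chuck': 1, 'fred': 42, 'jan': 100}

seen = []
for key in counts:                 # the iteration variable walks the keys
    print(key, counts[key])        # pull the corresponding value out of the dictionary
    seen.append((key, counts[key]))
```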
6601
04:49:50,540 --> 04:49:53,660
So if you just use the in and you give a dictionary here,
6602
04:49:53,660 --> 04:49:56,000
remember all the different things we've been able to put there
6603
04:49:56,000 --> 04:50:00,160
on the end of a for loop and dictionary's another thing
6604
04:50:00,160 --> 04:50:03,780
we can put on and we get a list of keys.
6605
04:50:03,780 --> 04:50:05,540
Now there's a couple of methods that allow us
6606
04:50:05,540 --> 04:50:09,500
to get the keys and so we have, you know,
6607
04:50:09,500 --> 04:50:11,300
we can say turn this into a list
6608
04:50:11,300 --> 04:50:12,720
and we get a list of the keys.
6609
04:50:12,720 --> 04:50:15,240
So this is a dictionary, the same dictionary.
6610
04:50:15,240 --> 04:50:16,480
We get a list of the keys.
6611
04:50:16,480 --> 04:50:19,480
You can also get a list of the keys by using the keys method.
6612
04:50:19,480 --> 04:50:21,720
So that's take this dictionary, jjj,
6613
04:50:21,720 --> 04:50:23,600
and give me all the keys, which gives me a list,
6614
04:50:23,600 --> 04:50:25,240
which is kind of the same thing.
6615
04:50:25,240 --> 04:50:27,120
And then we can ask for the values
6616
04:50:27,120 --> 04:50:28,780
and that gives me just the values
6617
04:50:28,780 --> 04:50:30,440
extracted out of this dictionary.
6618
04:50:30,440 --> 04:50:31,500
So that's nice.
6619
04:50:32,440 --> 04:50:34,560
Now the one thing is that while I said
6620
04:50:34,560 --> 04:50:37,120
you can't predict the order, if in two statements
6621
04:50:37,120 --> 04:50:39,320
you ask for the keys and then the values,
6622
04:50:39,320 --> 04:50:41,320
they at least come out in the same order,
6623
04:50:41,320 --> 04:50:43,020
even though you can't necessarily predict the order
6624
04:50:43,020 --> 04:50:45,800
that they come out in the same order.
6625
04:50:45,800 --> 04:50:48,960
And then there is a third thing that we can do
6626
04:50:48,960 --> 04:50:52,720
and that is list, ask for the items.
6627
04:50:52,720 --> 04:50:54,960
We can say give me the items.
6628
04:50:54,960 --> 04:50:57,780
And that gives us a list.
6629
04:50:57,780 --> 04:51:02,080
This is our first really kind of composite
6630
04:51:02,080 --> 04:51:05,120
combined data structure where it is a list,
6631
04:51:05,120 --> 04:51:08,680
a three item list, zero, one, two.
6632
04:51:08,680 --> 04:51:11,840
And inside that there is what are called two tuples.
6633
04:51:11,840 --> 04:51:15,880
Jan maps to 100, Chuck maps to one, Fred maps to 42.
6634
04:51:15,880 --> 04:51:18,000
Coming up next we're gonna have a whole chapter on that
6635
04:51:18,000 --> 04:51:21,040
and so just take a look at that for the moment
6636
04:51:21,040 --> 04:51:25,560
and we will come back to that in some detail later.
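As a sketch, these are the three ways of pulling things out of the dictionary. In Python 3 the keys, values and items methods return view objects, so list() is wrapped around them here to get real lists to print.

```python
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}

key_list = list(jjj)               # turning the dictionary into a list gives the keys
same_keys = list(jjj.keys())       # the keys method gives the same keys
value_list = list(jjj.values())    # just the values, in the matching order
item_list = list(jjj.items())      # a list of (key, value) tuples

print(key_list)
print(same_keys)
print(value_list)
print(item_list)
```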
6637
04:51:25,560 --> 04:51:28,200
This whole items idea that gives us back
6638
04:51:28,200 --> 04:51:30,160
a list of key value pairs,
6639
04:51:30,160 --> 04:51:32,200
because it's not just a list of keys or a list of values,
6640
04:51:32,200 --> 04:51:33,840
it's actually a list of key value pairs,
6641
04:51:33,840 --> 04:51:37,680
allows us to write in Python a very clever and elegant loop.
6642
04:51:38,920 --> 04:51:41,960
What we can do is actually this items gives us back
6643
04:51:41,960 --> 04:51:45,840
each item in the list has a key and a value
6644
04:51:45,840 --> 04:51:47,800
and we can actually take two iteration variables.
6645
04:51:47,800 --> 04:51:51,000
For aaa comma bbb, this is two iteration variables
6646
04:51:51,000 --> 04:51:53,000
and if you're coming from another programming language,
6647
04:51:53,000 --> 04:51:55,360
this is super cool and it's a Python only feature.
6648
04:51:55,360 --> 04:51:57,480
I never have seen another language that's capable
6649
04:51:57,480 --> 04:52:00,760
of doing something this simple and that elegantly.
6650
04:52:00,760 --> 04:52:02,800
So what this basically does is says
6651
04:52:02,800 --> 04:52:04,360
we're gonna simultaneously advance
6652
04:52:04,360 --> 04:52:05,600
these two iteration variables.
6653
04:52:05,600 --> 04:52:07,720
So this is gonna be the key and the value,
6654
04:52:07,720 --> 04:52:09,160
the K and the V.
6655
04:52:09,160 --> 04:52:11,160
Key and the value is gonna be Chuck one,
6656
04:52:11,160 --> 04:52:16,000
then they're both gonna advance, Fred 42, Jan 100.
6657
04:52:16,000 --> 04:52:17,720
And so that means in this simple loop
6658
04:52:17,720 --> 04:52:18,720
if we just print them out
6659
04:52:18,720 --> 04:52:20,100
we're gonna get the key value pairs.
6660
04:52:20,100 --> 04:52:21,400
Of course in the order.
6661
04:52:21,400 --> 04:52:23,680
And so it's sort of aaa and bbb
6662
04:52:23,680 --> 04:52:27,320
simultaneously walk down these key value pairs.
6663
04:52:27,320 --> 04:52:29,160
And so that's really pretty
6664
04:52:29,160 --> 04:52:31,080
and it makes for a very succinct loop.
6665
04:52:31,080 --> 04:52:33,640
It's the syntax is a little sort of disquieting
6666
04:52:33,640 --> 04:52:36,440
when you first see it, but it's a super elegant thing
6667
04:52:36,440 --> 04:52:38,600
and you just have to say items.
6668
04:52:38,600 --> 04:52:41,800
If you don't say items, you just get the keys.
6669
04:52:41,800 --> 04:52:43,640
If you say items, you get the key value pairs
6670
04:52:43,640 --> 04:52:46,160
and you have to have two iteration variables.
6671
04:52:46,160 --> 04:52:47,520
If you don't have two iteration variables
6672
04:52:47,520 --> 04:52:48,920
and use items, it'll complain and say,
6673
04:52:48,920 --> 04:52:50,080
what are you doing?
6674
04:52:50,080 --> 04:52:51,200
I'm giving you two things
6675
04:52:51,200 --> 04:52:53,000
and you don't have two variables to receive them.
6676
04:52:53,000 --> 04:52:57,380
So two iteration variables and items are basically related.
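Here is a sketch of the loop with two iteration variables. The names jjj, aaa and bbb follow what is said above; the list named pairs is only there to record what the loop produced.

```python
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}

pairs = []
for aaa, bbb in jjj.items():   # aaa takes the key, bbb takes the value
    print(aaa, bbb)
    pairs.append((aaa, bbb))
```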
6677
04:52:58,560 --> 04:53:01,720
Now we're going to take a look
6678
04:53:01,720 --> 04:53:05,900
and this is code that I showed you perhaps many weeks ago
6679
04:53:05,900 --> 04:53:08,080
about, I said this is a little story
6680
04:53:08,080 --> 04:53:11,560
about how to read a file and count all the words in the file.
6681
04:53:11,560 --> 04:53:12,780
And now we're back to it.
6682
04:53:12,780 --> 04:53:14,560
And at this point you should understand
6683
04:53:14,560 --> 04:53:16,680
every single character of this program,
6684
04:53:16,680 --> 04:53:19,000
every single concept of the program.
6685
04:53:19,000 --> 04:53:20,480
You should literally stare at this
6686
04:53:20,480 --> 04:53:22,480
and look at it, code it, play with it
6687
04:53:22,480 --> 04:53:24,880
until you absolutely understand it.
6688
04:53:24,880 --> 04:53:26,560
So let's take a look.
6689
04:53:27,840 --> 04:53:29,680
Again, I showed you this weeks ago.
6690
04:53:30,640 --> 04:53:33,240
So we're going to ask for a file name.
6691
04:53:33,240 --> 04:53:35,360
Then we're going to open the file name.
6692
04:53:35,360 --> 04:53:37,200
Then we're going to make an empty dictionary.
6693
04:53:37,200 --> 04:53:39,360
Again, this is all stuff you've done before.
6694
04:53:39,360 --> 04:53:40,960
And then we're going to have an iteration variable
6695
04:53:40,960 --> 04:53:44,340
that's going to go through the lines in the file, right?
6696
04:53:44,340 --> 04:53:47,260
So line is going to go line, line, line.
6697
04:53:47,260 --> 04:53:49,320
Then we are going to split that line,
6698
04:53:49,320 --> 04:53:52,320
each line into words, chop, chop, chop, chop.
6699
04:53:53,160 --> 04:53:57,000
So that's words is the list of the words in one line.
6700
04:53:57,000 --> 04:53:57,840
We're inside of a loop
6701
04:53:57,840 --> 04:53:59,560
that's going to go through all the lines.
6702
04:53:59,560 --> 04:54:01,080
And then what we're going to do
6703
04:54:01,080 --> 04:54:04,840
is we're going to have the word iteration variable
6704
04:54:04,840 --> 04:54:07,160
iterate through each word in the line.
6705
04:54:07,160 --> 04:54:08,000
And then what we're going to do
6706
04:54:08,000 --> 04:54:10,400
is take each word in the line.
6707
04:54:10,400 --> 04:54:12,800
I'm going to do this histogram, right?
6708
04:54:12,800 --> 04:54:16,000
So this is going to run not only just for every line,
6709
04:54:16,000 --> 04:54:17,400
but for every word in every line.
6710
04:54:17,400 --> 04:54:19,840
So we have a nested loop for every line.
6711
04:54:19,840 --> 04:54:21,940
Then we split it and then we go across the line.
6712
04:54:21,940 --> 04:54:22,960
So it's almost like a typewriter.
6713
04:54:22,960 --> 04:54:27,000
We go, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch.
6714
04:54:27,000 --> 04:54:27,960
And that's what we're doing.
6715
04:54:27,960 --> 04:54:32,840
Tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch.
6716
04:54:32,840 --> 04:54:36,520
So it's like the outer loop is going down, down, down the lines.
6717
04:54:36,520 --> 04:54:39,320
And the inner loop is going across, across, across the words.
6718
04:54:39,320 --> 04:54:41,080
And eventually we are going to see
6719
04:54:41,080 --> 04:54:43,040
in this middle, in this last line,
6720
04:54:43,040 --> 04:54:44,960
every single word in the file.
6721
04:54:44,960 --> 04:54:47,760
And we're going to do the counts.get word plus one,
6722
04:54:47,760 --> 04:54:50,800
which is our magic histogram making line
6723
04:54:50,800 --> 04:54:52,400
that if you don't remember what that is,
6724
04:54:52,400 --> 04:54:54,800
go back a couple of slides, I just talked about it.
6725
04:54:54,800 --> 04:54:56,200
At this point in the code,
6726
04:54:56,200 --> 04:54:57,840
and it's important to be able to draw these lines,
6727
04:54:57,840 --> 04:55:00,640
at this point in the code, you have the histogram
6728
04:55:00,640 --> 04:55:02,840
and it's in the variable counts.
6729
04:55:02,840 --> 04:55:06,320
Now, we want to find the largest one.
6730
04:55:06,320 --> 04:55:08,520
Now we have written loops
6731
04:55:08,520 --> 04:55:10,640
that can find the largest in a list,
6732
04:55:10,640 --> 04:55:13,080
but now we want to find the largest value
6733
04:55:13,080 --> 04:55:15,360
in the key value pairs of a dictionary.
6734
04:55:16,440 --> 04:55:18,200
So we're going to start with,
6735
04:55:18,200 --> 04:55:20,600
we're going to know what the largest count is
6736
04:55:20,600 --> 04:55:22,720
and the word that has that count.
6737
04:55:22,720 --> 04:55:24,200
And we're going to set them both to None
6738
04:55:24,200 --> 04:55:25,280
because we're going to prime our loop.
6739
04:55:25,280 --> 04:55:27,360
We have to prime our loop and we're going to set them to None.
6740
04:55:27,360 --> 04:55:29,560
And so then we're going to write one of these cool things
6741
04:55:29,560 --> 04:55:31,480
that says for word, comma, count.
6742
04:55:31,480 --> 04:55:32,920
So word and count are going to go through
6743
04:55:32,920 --> 04:55:35,880
the key value pairs because we've got items here.
6744
04:55:35,880 --> 04:55:37,320
So it's going to go through the key value pairs,
6745
04:55:37,320 --> 04:55:40,200
loop through each key, whatever it was.
6746
04:55:40,200 --> 04:55:42,120
There could be a million words in here.
6747
04:55:42,120 --> 04:55:43,440
We're going to go through every one.
6748
04:55:43,440 --> 04:55:45,960
And what we're going to do is we're going to make sure
6749
04:55:45,960 --> 04:55:49,280
that bigcount is the current largest count
6750
04:55:49,280 --> 04:55:50,560
we've seen so far.
6751
04:55:50,560 --> 04:55:54,160
And if it's None, well, then we haven't seen anything
6752
04:55:54,160 --> 04:55:56,360
or the current, the count we just read
6753
04:55:56,360 --> 04:55:58,760
is greater than the big count so far,
6754
04:55:58,760 --> 04:56:01,000
we are going to jump in and this is sort of like,
6755
04:56:01,000 --> 04:56:03,800
oh, this is a new personal best count
6756
04:56:03,800 --> 04:56:05,800
for this particular dataset.
6757
04:56:05,800 --> 04:56:08,160
And so we're going to remember the word in big word
6758
04:56:08,160 --> 04:56:10,200
and we're going to remember the count in big count.
6759
04:56:10,200 --> 04:56:11,720
So this is just a max loop.
6760
04:56:12,720 --> 04:56:14,760
It's a maximum loop with the extra thing
6761
04:56:14,760 --> 04:56:18,360
that we're recording in addition to what count
6762
04:56:18,360 --> 04:56:21,120
is the largest, what the word that was associated
6763
04:56:21,120 --> 04:56:22,560
with that count, we're recording it.
6764
04:56:22,560 --> 04:56:24,200
So again, this is a starting part of the loop.
6765
04:56:24,200 --> 04:56:25,120
We're going to do some work.
6766
04:56:25,120 --> 04:56:27,400
And then when we exit the bottom of this,
6767
04:56:27,400 --> 04:56:30,200
big word is going to be the word that is the most common
6768
04:56:30,200 --> 04:56:32,400
and big count is the number of times.
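A minimal sketch of the maximum loop just described. The names bigcount and bigword follow the narration; the contents of counts are made-up sample numbers, not the real histogram from the lecture.

```python
# Hypothetical histogram; in the lecture this is built from a file.
counts = {"to": 16, "the": 12, "and": 9}

# Prime the loop: we have not seen any word or count yet.
bigcount = None
bigword = None

for word, count in counts.items():
    # First pair seen, or a new "personal best" count.
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)
```

Priming with None means the loop makes no assumption about how small the counts can be.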
6769
04:56:32,400 --> 04:56:35,920
And so if we run a file, we say, oh, in that file,
6770
04:56:35,920 --> 04:56:37,520
"to" is the most common word.
6771
04:56:37,520 --> 04:56:38,720
And it's 16 times.
6772
04:56:38,720 --> 04:56:41,040
If we run the clown file, well,
6773
04:56:41,040 --> 04:56:42,960
"the" is the most common word, at seven.
6774
04:56:42,960 --> 04:56:46,600
And so this now is, and this could have a very large file
6775
04:56:46,600 --> 04:56:49,800
and give you the most common word.
6776
04:56:49,800 --> 04:56:54,800
And so that is sort of a really good application
6777
04:56:55,140 --> 04:56:56,480
of dictionaries.
6778
04:56:56,480 --> 04:56:58,520
So dictionaries are the most powerful,
6779
04:56:58,520 --> 04:57:02,040
well, they're the most powerful collection
6780
04:57:02,040 --> 04:57:03,240
we've seen so far.
6781
04:57:04,320 --> 04:57:06,300
It is good to see both lists and dictionaries
6782
04:57:06,300 --> 04:57:08,840
to understand what collections are.
6783
04:57:08,840 --> 04:57:12,240
They are things inside of Python that can handle
6784
04:57:12,240 --> 04:57:13,940
more than one item inside of it.
6785
04:57:13,940 --> 04:57:15,560
And we'll learn about another collection
6786
04:57:15,560 --> 04:57:16,900
called tuples in a second.
6787
04:57:18,240 --> 04:57:20,200
Just understand the get method
6788
04:57:20,200 --> 04:57:23,280
because that leads to very compact code,
6789
04:57:23,280 --> 04:57:24,720
and understand that there are various ways
6790
04:57:24,720 --> 04:57:25,920
to iterate through dictionaries.
6791
04:57:25,920 --> 04:57:29,520
And so we've learned a lot, but in the next section,
6792
04:57:29,520 --> 04:57:32,160
we will learn even more and put these together
6793
04:57:32,160 --> 04:57:34,320
and do some sorting and do some other stuff
6794
04:57:34,320 --> 04:57:38,140
and really start to see the real power of dictionaries.
6795
04:57:42,320 --> 04:57:44,480
This is, I'm gonna do some coding.
6796
04:57:44,480 --> 04:57:48,660
It's related to the dictionaries chapter, chapter nine.
6797
04:57:48,660 --> 04:57:51,040
And we're gonna do some word counting.
6798
04:57:51,040 --> 04:57:55,720
That's basically right out of the slides for,
6799
04:57:55,720 --> 04:57:58,600
but I'm gonna just write the code in front of you
6800
04:57:58,600 --> 04:58:00,980
rather than have you look at it in the book.
6801
04:58:00,980 --> 04:58:04,180
So what we're gonna do is I've got my text editor
6802
04:58:04,180 --> 04:58:09,180
up here and let me start by making a new folder.
6803
04:58:09,240 --> 04:58:14,240
New folder for my chapter nine exercise.
6804
04:58:14,760 --> 04:58:17,440
And then I'm gonna go and make an untitled file.
6805
04:58:17,440 --> 04:58:19,280
That was from the previous one.
6806
04:58:19,280 --> 04:58:24,280
And I'll do what I always do, print hello and save it.
6807
04:58:28,040 --> 04:58:31,480
And save it here into exercise 09
6808
04:58:31,480 --> 04:58:35,880
and ex09.py.
6809
04:58:35,880 --> 04:58:40,040
So now I have a folder that's in my py4e folder
6810
04:58:40,040 --> 04:58:44,080
and that happens to be in my desktop.
6811
04:58:44,080 --> 04:58:47,900
Py4e is my folder on my desktop.
6812
04:58:47,900 --> 04:58:50,900
And now I have all of these subfolders, cd ex08.
6813
04:58:53,520 --> 04:58:55,520
ls is dir on windows.
6814
04:58:57,040 --> 04:59:01,440
ls, oops, I gotta go up one.
6815
04:59:01,440 --> 04:59:05,760
cd ex09 ls, so I've got that file right there.
6816
04:59:05,760 --> 04:59:07,800
Now I'm gonna wanna read some files
6817
04:59:07,800 --> 04:59:10,880
and so I'm gonna bring some files down,
6818
04:59:10,880 --> 04:59:14,960
a couple of files, Python for everybody,
6819
04:59:14,960 --> 04:59:18,720
code3, intro.txt, so I've got this URL
6820
04:59:18,720 --> 04:59:21,440
and I'm gonna save it, save page as.
6821
04:59:21,440 --> 04:59:23,360
And it's really important that I save it
6822
04:59:23,360 --> 04:59:26,160
in the same folder as I'm gonna write my code
6823
04:59:26,160 --> 04:59:28,800
just so that when I open this file it knows where it's at.
6824
04:59:28,800 --> 04:59:30,400
So I've saved that one.
6825
04:59:30,400 --> 04:59:33,840
And I'm gonna also take this clown text.
6826
04:59:33,840 --> 04:59:37,040
I'll use this to make my life simple
6827
04:59:37,040 --> 04:59:39,040
so I have a real short thing that I can show you
6828
04:59:39,040 --> 04:59:42,480
how it works and so now if I go back to my terminal,
6829
04:59:44,320 --> 04:59:48,400
I see I've got exercise 09 Python, intro.txt
6830
04:59:48,400 --> 04:59:51,160
and clown.txt, okay?
6831
04:59:51,160 --> 04:59:55,020
So let's go back to my text editor and get started.
6832
04:59:55,020 --> 05:00:00,020
I will prompt for the file name input enter file colon space.
6833
05:00:08,420 --> 05:00:09,540
Now I'm gonna do something.
6834
05:00:09,540 --> 05:00:14,540
If the length of the fname that I just read
6835
05:00:16,760 --> 05:00:21,760
is less than one, I'm gonna say fname equals clown.txt.
6836
05:00:21,760 --> 05:00:26,760
I do this so that I can just hit enter
6837
05:00:28,640 --> 05:00:30,400
and it defaults to clown.txt.
6838
05:00:30,400 --> 05:00:32,720
If I want to give it a different name, I can.
6839
05:00:32,720 --> 05:00:35,660
So if I just hit enter at this prompt,
6840
05:00:35,660 --> 05:00:38,260
then this will give me a string that's zero length.
6841
05:00:38,260 --> 05:00:40,840
So if it's less than one, I'll just assume that.
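As a rough sketch, the default-file-name trick looks like this. The helper function choose_filename is hypothetical; it is only there so the logic can be shown without waiting at a prompt.

```python
def choose_filename(entered, default="clown.txt"):
    """Return the typed name, or the default if the user just hit enter."""
    # Hitting enter at the prompt gives a zero-length string.
    if len(entered) < 1:
        return default
    return entered

# In the real script the argument would come from the prompt, e.g.:
#   fname = choose_filename(input("Enter file: "))
print(choose_filename(""))
print(choose_filename("intro.txt"))
```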
6842
05:00:40,840 --> 05:00:42,300
So let me open that.
6843
05:00:43,480 --> 05:00:47,640
handle equals open(fname).
6844
05:00:47,640 --> 05:00:52,640
And let's read through it for line in handle.
6845
05:00:58,540 --> 05:01:02,300
We'll strip it, line equals line.rstrip
6846
05:01:02,300 --> 05:01:05,180
to take the white space off the right hand side
6847
05:01:05,180 --> 05:01:07,500
and then we're gonna say print line.
6848
05:01:07,500 --> 05:01:09,580
Again, I'm not just doing this.
6849
05:01:09,580 --> 05:01:12,660
I really, when I write code, I just saved it.
6850
05:01:12,660 --> 05:01:16,340
When I write code, I do these kind of stuff all the time
6851
05:01:16,340 --> 05:01:18,900
just for my own sanity checking.
6852
05:01:18,900 --> 05:01:22,900
And so now I'm gonna run python3 ex09.py
6853
05:01:25,700 --> 05:01:27,420
just to test that.
6854
05:01:27,420 --> 05:01:29,260
I'm gonna hit enter now and it's gonna assume,
6855
05:01:29,260 --> 05:01:31,420
hopefully, clown.txt if it all goes well.
6856
05:01:31,420 --> 05:01:35,140
And yep, it read one line, okay?
6857
05:01:35,140 --> 05:01:36,460
So that part's working.
6858
05:01:36,460 --> 05:01:38,580
I'll just leave that print statement in.
6859
05:01:38,580 --> 05:01:41,980
The next thing I wanna do is kind of a classic thing
6860
05:01:41,980 --> 05:01:44,060
where we're gonna go read a bunch of lines
6861
05:01:44,060 --> 05:01:46,780
and then go horizontally across those lines in words.
6862
05:01:46,780 --> 05:01:47,940
So I'm gonna split that.
6863
05:01:47,940 --> 05:01:52,380
wds equals line.split()
6864
05:01:54,500 --> 05:01:57,540
and print wds.
6865
05:01:57,540 --> 05:02:01,500
So I'll print that and I'm gonna save it and test it.
6866
05:02:01,500 --> 05:02:04,700
I really love to test things over and over.
6867
05:02:04,700 --> 05:02:05,740
There's the actual line.
6868
05:02:05,740 --> 05:02:08,860
This file clown.txt only has one line
6869
05:02:08,860 --> 05:02:11,940
and it breaks it into words and so I have those words.
6870
05:02:11,940 --> 05:02:16,940
Let's just run it again with intro.txt.
6871
05:02:17,540 --> 05:02:18,820
So this will have a lot of lines.
6872
05:02:18,820 --> 05:02:21,180
Line, line, line, line, line, lots of lines.
6873
05:02:21,180 --> 05:02:23,020
For every line, it prints out the line
6874
05:02:23,020 --> 05:02:26,500
and then prints out the words that we split it into, okay?
6875
05:02:26,500 --> 05:02:29,580
So now I kinda, one of the things that I do here
6876
05:02:29,580 --> 05:02:32,400
is I wanna believe, now I sort of can believe
6877
05:02:32,400 --> 05:02:34,180
everything from here up.
6878
05:02:34,180 --> 05:02:35,740
Like, oh, it's gonna open the file.
6879
05:02:35,740 --> 05:02:36,700
It's gonna read through the lines
6880
05:02:36,700 --> 05:02:38,260
and I'm gonna split them into words.
6881
05:02:38,260 --> 05:02:40,180
And so then I'll just kind of behind it,
6882
05:02:40,180 --> 05:02:44,140
I'll just say, okay, I'll just comment that out.
6883
05:02:44,140 --> 05:02:49,140
Now I need another for loop: for w in wds.
6884
05:02:51,060 --> 05:02:54,740
Now wds is a Python list and has some number of words
6885
05:02:54,740 --> 05:02:57,980
in it, zero or 12 or whatever was on the line.
6886
05:02:57,980 --> 05:03:00,980
And now I'm gonna print out the word, okay?
6887
05:03:06,380 --> 05:03:09,500
And so now it will go through that horizontally.
6888
05:03:09,500 --> 05:03:11,740
Now I'll just do clown.txt.
6889
05:03:11,740 --> 05:03:15,300
So that you see, I'm not printing the line out.
6890
05:03:15,300 --> 05:03:17,340
That's the words that have been parsed from the,
6891
05:03:17,340 --> 05:03:18,700
split from the line.
6892
05:03:18,700 --> 05:03:20,340
And now we got this loop.
6893
05:03:20,340 --> 05:03:22,160
Now, one of the things that's interesting
6894
05:03:22,160 --> 05:03:25,600
is just to make sure that you're going through all the words.
6895
05:03:25,600 --> 05:03:28,260
And I like a print statement here
6896
05:03:28,260 --> 05:03:32,900
to know that W is going to successfully take on literally
6897
05:03:32,900 --> 05:03:34,220
all the words of this file.
6898
05:03:34,220 --> 05:03:37,060
So if I comment this print statement out
6899
05:03:37,060 --> 05:03:42,060
and I run it again, clown.txt, that for loop starting
6900
05:03:42,780 --> 05:03:44,860
from here is every word in that file
6901
05:03:44,860 --> 05:03:46,700
which happens to only be one line.
6902
05:03:46,700 --> 05:03:49,160
But now if I do the same thing for intro.txt,
6903
05:03:53,980 --> 05:03:55,380
it's just gonna go through the words.
6904
05:03:55,380 --> 05:03:57,660
And in a sense, by nesting these two loops,
6905
05:03:57,660 --> 05:03:59,000
we're gonna hit all the lines.
6906
05:03:59,000 --> 05:04:02,600
And that's a lot of stuff, but it hit all of the lines,
6907
05:04:02,600 --> 05:04:04,340
all the words, and away we go.
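The two nested loops can be sketched like this. To keep the example self-contained, an io.StringIO object with two made-up lines stands in for the opened file handle; that substitution is an assumption for illustration only.

```python
import io

# Stand-in for handle = open(fname); the two lines are sample text.
handle = io.StringIO("the clown ran\nafter the car\n")

seen = []
for line in handle:            # outer loop: one line at a time
    line = line.rstrip()       # take whitespace off the right-hand side
    wds = line.split()         # break the line into a list of words
    for w in wds:              # inner loop: go across the line
        seen.append(w)

print(seen)
```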
6908
05:04:04,340 --> 05:04:07,540
Okay, so here's where a dictionary comes in.
6909
05:04:10,020 --> 05:04:14,620
I'm gonna make a variable called di for dictionary.
6910
05:04:14,620 --> 05:04:16,720
And I'm gonna say, give me a dictionary.
6911
05:04:16,720 --> 05:04:19,180
Now, dict is not something you can choose.
6912
05:04:19,180 --> 05:04:22,780
That's the built-in dictionary type; it says make a dictionary.
6913
05:04:22,780 --> 05:04:25,100
di is a variable that I chose.
6914
05:04:25,100 --> 05:04:28,220
Okay, so the key thing to this dictionary
6915
05:04:28,220 --> 05:04:29,940
is we're gonna make a counter.
6916
05:04:29,940 --> 05:04:33,740
And we're gonna use W, the word absorb,
6917
05:04:33,740 --> 05:04:37,460
elegant, whatever, and we're gonna use that as the index.
6918
05:04:37,460 --> 05:04:42,460
So the simple thing to do is to say if w is in di,
6919
05:04:45,820 --> 05:04:50,820
then we can say the dictionary sub the word,
6920
05:04:51,380 --> 05:04:55,580
which is our key and the key value store of the dictionary
6921
05:04:55,580 --> 05:04:58,940
is equal to the value that we had before in that area,
6922
05:04:58,940 --> 05:05:01,260
di sub w plus one.
6923
05:05:01,260 --> 05:05:06,260
And if it's not in there, else di sub w equals one.
6924
05:05:14,920 --> 05:05:17,920
And I'm gonna print, print new.
6925
05:05:24,760 --> 05:05:28,720
So every time we see a new word, it's gonna say new.
6926
05:05:28,720 --> 05:05:33,480
And I'm going to also then print W and the current value
6927
05:05:33,480 --> 05:05:36,380
of the counter for W as it's going through.
6928
05:05:36,380 --> 05:05:38,240
Now notice how far in I'm indented.
6929
05:05:38,240 --> 05:05:40,600
This is all part of this inner loop.
6930
05:05:40,600 --> 05:05:44,240
So this is the loop that's gonna run every single word.
6931
05:05:44,240 --> 05:05:47,760
Okay, and I'm gonna run this first with clown.
6932
05:05:49,120 --> 05:05:54,120
So it runs slowly, okay, so we saw the was new
6933
05:05:54,120 --> 05:05:59,120
and the count is one, clown is new, count is one,
6934
05:06:00,040 --> 05:06:02,120
ran is new, the count is one.
6935
05:06:02,120 --> 05:06:04,080
After is new, the count is one.
6936
05:06:04,080 --> 05:06:07,460
Now we saw the again, but now we made the count be two.
6937
05:06:09,980 --> 05:06:10,940
Let's print here.
6938
05:06:19,360 --> 05:06:20,520
I'll say existing.
6939
05:06:20,520 --> 05:06:24,160
So you can kind of see it.
6940
05:06:24,160 --> 05:06:25,880
Now in the print, I'm printing this,
6941
05:06:25,880 --> 05:06:27,640
let's make it even a little more verbose.
6942
05:06:27,640 --> 05:06:32,640
Print W and then I will make it so it prints the,
6943
05:06:34,880 --> 05:06:38,200
it prints the word before and the count after
6944
05:06:38,200 --> 05:06:40,200
and then whether it's existing or new.
6945
05:06:40,200 --> 05:06:42,040
So we'll put a lot of print statements in.
6946
05:06:42,040 --> 05:06:45,760
Print statements are cheap, okay?
6947
05:06:45,760 --> 05:06:49,400
So now we see the word the, it's the first time we see it
6948
05:06:49,400 --> 05:06:50,640
and we set it to one.
6949
05:06:50,640 --> 05:06:52,720
We see the clown, it's the first time we see it,
6950
05:06:52,720 --> 05:06:53,680
we set it to one.
6951
05:06:53,680 --> 05:06:55,680
We see ran, new, one.
6952
05:06:56,680 --> 05:07:00,640
Later on, we see the, it's already in.
6953
05:07:00,640 --> 05:07:03,920
So existing means it was already in the dictionary.
6954
05:07:03,920 --> 05:07:08,200
W as a key was already in the dictionary, okay?
6955
05:07:08,200 --> 05:07:10,000
And so that's why we added one to it.
6956
05:07:10,000 --> 05:07:14,280
So the old value was one, and then we added one:
6957
05:07:14,280 --> 05:07:18,780
di sub "the" equals di sub "the" plus one.
6958
05:07:18,780 --> 05:07:21,360
w is the string "the", T-H-E.
6959
05:07:21,360 --> 05:07:24,960
That's what that string is, okay?
6960
05:07:24,960 --> 05:07:27,280
And so we've made it all the way through
6961
05:07:27,280 --> 05:07:29,760
and you see "the" in this one line occurred
6962
05:07:29,760 --> 05:07:31,420
ultimately seven times.
6963
05:07:31,420 --> 05:07:34,880
So now I want to print out the contents of this dictionary
6964
05:07:35,880 --> 05:07:37,440
at the very end of both loops.
6965
05:07:37,440 --> 05:07:39,480
So I've got to de-indent twice
6966
05:07:41,520 --> 05:07:43,520
and so that will give us the counts.
6967
05:07:43,520 --> 05:07:44,360
Okay?
6968
05:07:48,240 --> 05:07:51,000
And so this is what we get when it's all said and done.
6969
05:07:51,000 --> 05:07:54,120
You know, "the" happened seven times,
6970
05:07:54,120 --> 05:07:57,480
but it just worked its way through, okay?
6971
05:07:58,560 --> 05:08:00,080
So you got that.
6972
05:08:00,080 --> 05:08:03,120
Now, this is a pretty verbose way of doing this
6973
05:08:03,120 --> 05:08:05,000
but I did it sort of the slow way to show
6974
05:08:05,000 --> 05:08:06,680
that there are two situations.
6975
05:08:06,680 --> 05:08:08,640
If it's already there, you increment it
6976
05:08:08,640 --> 05:08:10,400
and if it's not there, you set it to one,
6977
05:08:10,400 --> 05:08:12,040
effectively inserting it, right?
6978
05:08:12,040 --> 05:08:14,160
So you insert it and set it to one
6979
05:08:14,160 --> 05:08:18,220
with this D-I sub the equals one, okay?
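Here is a sketch of that slow, two-situation version of the counter. The sentence in text is assumed sample content standing in for clown.txt; the variable names di and w follow the narration.

```python
# Assumed sample sentence standing in for the one line of clown.txt.
text = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

di = dict()
for w in text.split():
    if w in di:
        # Already there: increment the existing count.
        di[w] = di[w] + 1
    else:
        # Not there yet: insert it with a count of one.
        di[w] = 1

print(di)
```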
6980
05:08:18,220 --> 05:08:20,520
But let's get a little less verbose here,
6981
05:08:20,520 --> 05:08:21,920
get rid of some of these print statements
6982
05:08:21,920 --> 05:08:23,720
because we kind of covered all that.
6983
05:08:29,240 --> 05:08:33,160
Get rid of this line and go back to printing W and D-I-W
6984
05:08:33,160 --> 05:08:34,960
at the end, we'll leave that one in.
6985
05:08:34,960 --> 05:08:36,960
So what I want to do is I want to look
6986
05:08:36,960 --> 05:08:41,960
at this bit of code right here, this if w in di, else.
6987
05:08:45,880 --> 05:08:47,560
We do this so much with dictionaries
6988
05:08:47,560 --> 05:08:52,120
that there is an easy mechanism to do this
6989
05:08:52,120 --> 05:08:53,400
that combines these four lines
6990
05:08:53,400 --> 05:08:55,680
into a single kind of contraction.
6991
05:08:55,680 --> 05:08:58,340
And so I'm gonna do this, I'm gonna print,
6992
05:09:00,100 --> 05:09:03,000
let's put two stars out, then the word
6993
05:09:03,000 --> 05:09:08,000
and D-I dot get of the word comma negative 99.
6994
05:09:13,360 --> 05:09:17,680
Okay, and so this D-I dot get of the word
6995
05:09:17,680 --> 05:09:19,080
is the important part.
6996
05:09:19,080 --> 05:09:21,600
The way it is, is this is a dictionary,
6997
05:09:21,600 --> 05:09:24,520
dot get takes as its first parameter
6998
05:09:24,520 --> 05:09:26,720
the key to look up, which is a word like "the"
6999
05:09:26,720 --> 05:09:28,880
or fell or clown or whatever,
7000
05:09:28,880 --> 05:09:31,800
and negative 99 is the default value that we get
7001
05:09:31,800 --> 05:09:33,480
if the key doesn't exist.
7002
05:09:33,480 --> 05:09:38,040
So this is, in effect, an if then else, right?
7003
05:09:38,040 --> 05:09:42,680
This little D-I dot get W negative 99 is,
7004
05:09:44,540 --> 05:09:45,960
if it's in there, do one thing,
7005
05:09:45,960 --> 05:09:48,080
if it's not in there, do something else, okay?
7006
05:09:48,080 --> 05:09:50,420
So let me show you how this works
7007
05:09:50,420 --> 05:09:53,900
and you'll see that the negative 99 will happen when,
7008
05:09:53,900 --> 05:09:58,900
okay, so the first time we see "the", get returns negative 99,
7009
05:10:05,880 --> 05:10:07,560
right, so let's move it over here.
7010
05:10:07,560 --> 05:10:11,520
The first time we see the, the is not in the dictionary.
7011
05:10:11,520 --> 05:10:15,960
So this D-I dot get of the word the in the dictionary
7012
05:10:15,960 --> 05:10:19,940
gives us back the negative 99, okay?
7013
05:10:19,940 --> 05:10:22,880
And this still is working and so the is one,
7014
05:10:22,880 --> 05:10:27,160
clown is whatever, but away we go, okay?
7015
05:10:27,160 --> 05:10:29,040
Let's do it this way, let me comment this out.
7016
05:10:29,040 --> 05:10:31,240
Let me comment this one out and run it again
7017
05:10:31,240 --> 05:10:34,120
so it's a little clearer what's going on.
7018
05:10:34,120 --> 05:10:38,400
Okay, so the first time we see the,
7019
05:10:38,400 --> 05:10:39,680
the is not in the dictionary.
7020
05:10:39,680 --> 05:10:40,800
The first time we see clown
7021
05:10:40,800 --> 05:10:43,600
and we know it's negative 99,
7022
05:10:43,600 --> 05:10:47,060
but here we asked for it and the is one
7023
05:10:47,060 --> 05:10:50,120
because we've seen it before.
7024
05:10:50,120 --> 05:10:53,480
And so that's just this get mechanism
7025
05:10:53,480 --> 05:10:58,480
allows us to get the new value
7026
05:10:58,840 --> 05:11:00,880
or get a value out if the key exists
7027
05:11:00,880 --> 05:11:05,880
and specify a default if it's not there.
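A small sketch of what get does with a default. The starting dictionary is a made-up state in which only "the" has been counted so far.

```python
# Hypothetical state: only "the" has been counted so far.
di = {"the": 1}

found = di.get("the", -99)      # key exists, so we get its value
missing = di.get("clown", -99)  # key does not exist, so we get the default

print(found, missing)

# Note that get only looks things up; it does not insert the key.
print("clown" in di)
```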
7028
05:11:06,120 --> 05:11:11,120
So I'm gonna go old count equals D-I dot get
7029
05:11:13,320 --> 05:11:16,400
W comma zero.
7030
05:11:16,400 --> 05:11:18,040
So instead of using 99 here,
7031
05:11:18,040 --> 05:11:22,760
I'm gonna just get rid of all this is what I'm saying
7032
05:11:22,760 --> 05:11:24,800
is look up in this dictionary.
7033
05:11:24,800 --> 05:11:28,760
Get is a function that's part of all dictionaries.
7034
05:11:28,760 --> 05:11:31,160
Look up using the key W which is the,
7035
05:11:31,160 --> 05:11:36,160
and if I don't get it, give me back zero.
7036
05:11:36,160 --> 05:11:49,160
And so I'm gonna say print word comma old comma old count.
7037
05:11:49,660 --> 05:11:53,300
And now what I can say, whatever the old count is,
7038
05:11:53,300 --> 05:11:55,500
it's either the value that was in there or zero.
7039
05:11:55,500 --> 05:12:07,140
And now I can say new count equals old count plus one.
7040
05:12:07,140 --> 05:12:08,820
And now, let's see new count,
7041
05:12:08,820 --> 05:12:15,820
and I can say dictionary sub word is equal to new count.
7042
05:12:18,940 --> 05:12:21,900
So instead I'm gonna get rid of this if then else.
7043
05:12:21,900 --> 05:12:26,380
This is basically saying, look up the old count that we have.
7044
05:12:26,380 --> 05:12:27,960
If you don't find one, use a zero.
7045
05:12:27,960 --> 05:12:29,300
We'll print that out.
7046
05:12:29,300 --> 05:12:32,340
And then I'm gonna say afterwards,
7047
05:12:41,340 --> 05:12:42,900
I'm gonna print the new count.
7048
05:12:44,780 --> 05:12:46,380
Now, and so,
7049
05:12:46,380 --> 05:12:51,380
we'll print the old count.
7050
05:12:53,700 --> 05:12:55,580
Here are some of these blanks.
7051
05:12:55,580 --> 05:12:56,660
Print the old count.
7052
05:13:02,780 --> 05:13:04,700
And you can see the old count with the,
7053
05:13:04,700 --> 05:13:07,740
because the doesn't exist, was zero, the new one's one.
7054
05:13:07,740 --> 05:13:09,940
Clown's old is zero, new is one.
7055
05:13:09,940 --> 05:13:12,700
Ran's old is zero.
7056
05:13:12,700 --> 05:13:15,280
But now we get to the, its old count was one,
7057
05:13:15,280 --> 05:13:18,640
and now its new count is two, okay?
7058
05:13:18,640 --> 05:13:22,620
So by using this get and saying if we don't find it,
7059
05:13:22,620 --> 05:13:23,820
we'll assume the count is zero.
7060
05:13:23,820 --> 05:13:25,840
That makes a lot of sense, right?
7061
05:13:30,420 --> 05:13:35,420
If not there, the count is zero.
7062
05:13:38,100 --> 05:13:43,100
If the key is not there, the count is zero, okay?
7063
05:13:43,100 --> 05:13:46,660
So that's what this line does.
7064
05:13:46,660 --> 05:13:51,300
Get the value under the key, associated with the key,
7065
05:13:51,300 --> 05:13:53,620
or give me zero back.
7066
05:13:53,620 --> 05:13:55,300
And then I can take that old number
7067
05:13:55,300 --> 05:13:57,900
and just add one to it and then stick it back in.
7068
05:13:57,900 --> 05:14:01,940
Now this is ultimately not how we tend to do it, okay?
7069
05:14:01,940 --> 05:14:05,620
We tend to blend this all into one big long statement.
7070
05:14:05,620 --> 05:14:10,380
Di sub w equals this part
7071
05:14:10,380 --> 05:14:15,020
plus one, okay?
7072
05:14:15,020 --> 05:14:19,500
So that says get the old value from this key or zero
7073
05:14:19,500 --> 05:14:20,820
and then add one to it,
7074
05:14:20,820 --> 05:14:24,140
because that really combines all of these lines
7075
05:14:24,140 --> 05:14:26,480
into a single line, okay?
7076
05:14:27,500 --> 05:14:28,940
So I'm gonna delete them now.
7077
05:14:34,460 --> 05:14:36,640
And now we've combined this all into one,
7078
05:14:36,640 --> 05:14:41,640
what effectively is an idiom.
7079
05:14:42,560 --> 05:14:47,560
Retrieve, create, update, counter, all in one line.
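The one-line idiom can be sketched like this. As before, text is an assumed sample sentence standing in for clown.txt, and di and w follow the names used in the narration.

```python
# Assumed sample sentence standing in for the one line of clown.txt.
text = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

di = dict()
for w in text.split():
    # Retrieve the old count (or 0 if the key is missing), add one,
    # and store it back: retrieve/create/update in a single line.
    di[w] = di.get(w, 0) + 1

print(di)
```

This gives the same counts as the four-line if/else version, which is why the longer form can be deleted once the idiom is understood.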
7080
05:14:54,260 --> 05:14:57,140
I'll still print out, in this case I'll just say
7081
05:14:57,140 --> 05:15:02,140
di sub w and then we'll see the counter, okay?
7082
05:15:02,140 --> 05:15:07,140
And so now I'll run this, we don't, we have a new,
7083
05:15:11,020 --> 05:15:13,020
but now we see at the second time it's two
7084
05:15:13,020 --> 05:15:15,540
and so we see car the first time,
7085
05:15:15,540 --> 05:15:17,500
we see "the" the second time, we see car,
7086
05:15:17,500 --> 05:15:20,160
I mean "the" the third time, we see car the second time
7087
05:15:20,160 --> 05:15:22,220
and away we go, okay?
7088
05:15:22,220 --> 05:15:24,460
And so that's pretty straightforward
7089
05:15:24,460 --> 05:15:27,460
and so it really kind of typo there.
7090
05:15:27,460 --> 05:15:32,460
So let's just get rid of that and run it with a clown stuff
7091
05:15:36,100 --> 05:15:38,460
and we get the right data there
7092
05:15:38,460 --> 05:15:43,460
and let's run it with intro dot txt and there we go, okay?
7093
05:15:48,500 --> 05:15:51,500
And so it's tearing out a bunch of words
7094
05:15:51,500 --> 05:15:53,220
and giving us a dictionary.
7095
05:15:53,220 --> 05:15:58,060
So that was a lot of work
7096
05:15:58,060 --> 05:16:01,300
to get to this line 16 that has the dictionary in it.
7097
05:16:01,300 --> 05:16:04,300
Now we wanna find the most common word.
7098
05:16:15,660 --> 05:16:17,740
And so we're gonna loop through this dictionary
7099
05:16:17,740 --> 05:16:20,580
and part of it is like once we printed this dictionary out
7100
05:16:20,580 --> 05:16:22,580
and we verified that it's right,
7101
05:16:22,580 --> 05:16:25,220
don't worry too much about the code up here, right?
7102
05:16:25,220 --> 05:16:28,620
Matter of fact, I can take out some of these print statements
7103
05:16:28,620 --> 05:16:31,180
and we can kinda trust all this
7104
05:16:31,180 --> 05:16:33,780
and so now we're gonna work on this, okay?
7105
05:16:33,780 --> 05:16:35,900
Now we wanna find the most common word.
7106
05:16:35,900 --> 05:16:37,360
Now this is like a maximum loop.
7107
05:16:37,360 --> 05:16:42,360
So if you recall, we have a whole set of key value pairs,
7108
05:16:42,540 --> 05:16:46,240
communicate goes to two, is to two, skills is three.
7109
05:16:46,240 --> 05:16:48,340
So we have these key value pairs
7110
05:16:48,340 --> 05:16:50,340
and we're gonna loop through and look for the maximum.
7111
05:16:50,340 --> 05:16:54,860
Now in a dictionary, we can loop through the key value pairs
7112
05:16:54,860 --> 05:16:59,860
with the following syntax for, you know,
7113
05:17:00,260 --> 05:17:04,060
I would call these variables K and V for key and value,
7114
05:17:04,060 --> 05:17:09,060
but yeah, in the dictionary's name dot items,
7115
05:17:10,140 --> 05:17:13,420
and items is a method inside of all dictionaries
7116
05:17:13,420 --> 05:17:16,660
that says give me the key value pairs
7117
05:17:16,660 --> 05:17:18,380
and we need two iteration variables.
7118
05:17:18,380 --> 05:17:20,220
So this is like an assignment statement
7119
05:17:20,220 --> 05:17:21,060
for K and V.
7120
05:17:21,060 --> 05:17:23,660
K and V take on the successive values
7121
05:17:23,660 --> 05:17:27,140
for the keys, the key and the value, okay?
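A short sketch of looping with two iteration variables. The dictionary contents here are sample counts chosen for illustration.

```python
# Sample counts for illustration only.
di = {"the": 7, "clown": 2, "car": 3}

pairs = []
for k, v in di.items():
    # On each pass, k is the key and v is the value of one pair.
    print(k, v)
    pairs.append((k, v))
```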
7122
05:17:27,140 --> 05:17:30,740
So if I just now print K comma V
7123
05:17:32,820 --> 05:17:34,740
and I'll take this print statement out
7124
05:17:35,820 --> 05:17:40,220
and then run the code on, oops, what I forgot,
7125
05:17:40,220 --> 05:17:43,260
oh, I fell back into my Python 2 days.
7126
05:17:44,140 --> 05:17:45,680
Need parentheses for my print.
7127
05:17:46,800 --> 05:17:49,300
So there's clown and it just prints it out
7128
05:17:49,300 --> 05:17:51,220
and it's kind of the same thing except it's prettier,
7129
05:17:51,220 --> 05:17:55,220
where we're putting each one on a line, okay?
7130
05:17:55,220 --> 05:17:57,500
So the k is the key, the v is the value.
7131
05:17:57,500 --> 05:17:59,860
So we're looking for the largest value, oops.
7132
05:18:02,200 --> 05:18:05,040
So the thing is we know that the values
7133
05:18:05,040 --> 05:18:08,220
are always numbers that are at least one.
7134
05:18:08,220 --> 05:18:13,220
So I'm gonna do kind of a quickie maximum loop.
7135
05:18:14,300 --> 05:18:17,820
Largest equals negative one.
7136
05:18:17,820 --> 05:18:19,620
Now in previous times, we've seen that this
7137
05:18:19,620 --> 05:18:21,380
is a bad assumption, but because we know
7138
05:18:21,380 --> 05:18:23,320
these are counters that are always positive,
7139
05:18:23,320 --> 05:18:26,300
it turns out this is not a bad idea.
7140
05:18:26,300 --> 05:18:30,620
And so I can say if the value is greater
7141
05:18:30,620 --> 05:18:32,700
than the largest we've seen so far,
7142
05:18:38,400 --> 05:18:41,540
largest equals the value.
7143
05:18:44,620 --> 05:18:46,820
Okay, and when that loop is all done,
7144
05:18:46,820 --> 05:18:48,980
we can print the largest.
7145
05:18:57,860 --> 05:19:00,180
Okay, and so this is just a max loop
7146
05:19:00,180 --> 05:19:01,780
and we're using this value.
7147
05:19:01,780 --> 05:19:04,180
That's the number, the value is the second thing.
7148
05:19:05,340 --> 05:19:06,180
Oops.
7149
05:19:06,180 --> 05:19:11,180
Ah, can't type Python.
7150
05:19:20,540 --> 05:19:21,960
Oh, it's a typo.
7151
05:19:22,940 --> 05:19:25,700
Yeah, I'm not using value, I'm using V.
7152
05:19:25,700 --> 05:19:28,240
So largest equals V, let's try it again.
7153
05:19:30,860 --> 05:19:32,300
Okay, so we're all done with seven.
7154
05:19:32,300 --> 05:19:34,700
So these were the things that we were looking for
7155
05:19:34,700 --> 05:19:37,220
and it was looking for the maximum
7156
05:19:37,220 --> 05:19:40,280
and it just dutifully found seven was the largest.
7157
05:19:40,280 --> 05:19:42,740
But we also wanna know what the word is.
7158
05:19:42,740 --> 05:19:47,740
And so what we can say here is we can say the word is None,
7159
05:19:48,920 --> 05:19:52,260
meaning it's just like we don't know what the word is.
7160
05:19:52,260 --> 05:19:56,060
And then whenever we catch this new largest number,
7161
05:19:56,060 --> 05:19:59,340
we say the word equals W.
7162
05:19:59,340 --> 05:20:04,340
So I like to think of this as capture,
7163
05:20:04,340 --> 05:20:09,340
remember the word that was largest.
7164
05:20:13,580 --> 05:20:15,140
Right, that's what I'm doing.
7165
05:20:15,140 --> 05:20:19,540
R, E, M, E, M, R, M, E, M,
7166
05:20:22,140 --> 05:20:24,860
remember, right, that's tough.
7167
05:20:24,860 --> 05:20:26,940
R, E, M, E, M, B, E, R, there we go.
7168
05:20:28,420 --> 05:20:31,700
So we're gonna, this trick here is,
7169
05:20:31,700 --> 05:20:33,660
not only knowing what the largest number was,
7170
05:20:33,660 --> 05:20:36,620
but the word that was associated with the largest number.
7171
05:20:36,620 --> 05:20:40,820
So now I can print out at the end the word and the largest
7172
05:20:40,820 --> 05:20:41,860
and that's the count.
7173
05:20:46,540 --> 05:20:48,180
Okay, and so now we know that,
7174
05:20:50,740 --> 05:20:52,540
oops, did we make a mistake here?
7175
05:20:58,740 --> 05:21:00,700
Okay, that does not look good
7176
05:21:00,700 --> 05:21:04,060
because it says car and seven.
7177
05:21:05,820 --> 05:21:08,040
If V is greater than the largest,
7178
05:21:09,060 --> 05:21:11,380
oh, it's not W.
7179
05:21:12,900 --> 05:21:14,780
I used a really bad variable.
7180
05:21:14,780 --> 05:21:17,020
See, that's the whole value there.
7181
05:21:17,020 --> 05:21:17,920
There we go.
7182
05:21:17,920 --> 05:21:19,720
It's K, which is the key.
7183
05:21:21,500 --> 05:21:22,340
Key.
7184
05:21:23,820 --> 05:21:26,060
I was gonna say that was quite the bug.
7185
05:21:26,060 --> 05:21:26,900
See what happened there?
7186
05:21:26,900 --> 05:21:29,620
I had this as W and it just happened to be,
7187
05:21:29,620 --> 05:21:32,200
it was the last word on the file.
7188
05:21:35,700 --> 05:21:38,140
Car, the last word in the file
7189
05:21:38,140 --> 05:21:40,240
because I used a wrong variable.
7190
05:21:44,140 --> 05:21:46,140
No, little mistakes, little mistakes.
7191
05:21:48,940 --> 05:21:51,040
The and seven.
7192
05:21:51,040 --> 05:21:53,420
Okay, so let's get rid of this print statement
7193
05:21:53,420 --> 05:21:56,100
because we kind of know what's going on here
7194
05:21:56,100 --> 05:21:59,260
and away we go and this should now work.
7195
05:21:59,260 --> 05:22:00,100
If we run it.
7196
05:22:05,180 --> 05:22:07,180
I can even get rid of the word done here.
7197
05:22:11,780 --> 05:22:12,820
There we go, the seven.
7198
05:22:12,820 --> 05:22:15,260
Now, the cool thing about this is this code runs
7199
05:22:15,260 --> 05:22:17,540
just as easily with one line of code
7200
05:22:17,540 --> 05:22:20,860
or the intro of the book, intro.txt,
7201
05:22:20,860 --> 05:22:24,260
and not surprisingly, it's still the most common word
7202
05:22:24,260 --> 05:22:27,540
in the introduction.txt, I seem to like that word,
7203
05:22:27,540 --> 05:22:29,900
and it's 226 times.
7204
05:22:29,900 --> 05:22:34,900
Okay, and so that is the basic pattern of reading some,
7205
05:22:35,420 --> 05:22:37,800
this is just a word loop now sometimes,
7206
05:22:37,800 --> 05:22:39,540
there would be some, you know,
7207
05:22:39,540 --> 05:22:41,900
checking to see if the line is the one you're interested in,
7208
05:22:41,900 --> 05:22:43,180
maybe tearing apart the line,
7209
05:22:43,180 --> 05:22:44,780
but it's at the end of the day,
7210
05:22:44,780 --> 05:22:48,060
this idiom of starting a dictionary.
7211
05:22:48,060 --> 05:22:50,060
Now, it's a common problem to know
7212
05:22:50,060 --> 05:22:51,140
where to start the dictionary.
7213
05:22:51,140 --> 05:22:54,260
You want to accumulate the numbers for the whole file,
7214
05:22:54,260 --> 05:22:58,420
so you don't want to put it in between line six and line seven.
7215
05:22:58,420 --> 05:23:03,260
Okay, so I hope that particular thing helps a little bit,
7216
05:23:03,260 --> 05:23:05,320
helps you understand dictionaries.
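A minimal sketch of the pattern just walked through, assuming a short sample string in place of the file and assuming the variable names counts, largest, and theword (the exact code on screen is not in the transcript):

```python
# Sample text stands in for the file used in the lecture (an assumption).
text = "the clown ran after the car and the car ran into the tent"

counts = dict()                  # start the dictionary before the loop
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

largest = -1
theword = None                   # None means "we don't know the word yet"
for k, v in counts.items():
    if v > largest:
        largest = v
        theword = k              # remember the key, not a leftover variable

print(theword, largest)
```

Using k here, rather than a variable left over from the earlier word loop, is what avoids the bug described above, where the last word in the file was printed.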
7217
05:23:09,500 --> 05:23:11,020
Hello and welcome to chapter 10.
7218
05:23:11,020 --> 05:23:12,980
Now, we're gonna talk about our third kind of collection
7219
05:23:12,980 --> 05:23:15,880
called tuples, but tuples are really a lot like lists,
7220
05:23:15,880 --> 05:23:18,200
there's not too much to them,
7221
05:23:18,200 --> 05:23:21,820
they're really kind of a reductionist version of lists.
7222
05:23:21,820 --> 05:23:24,100
So they function very much like lists,
7223
05:23:24,100 --> 05:23:28,260
in that, you know, they have things,
7224
05:23:28,260 --> 05:23:30,300
and the difference is there are no square braces,
7225
05:23:30,300 --> 05:23:33,860
there is a parenthesis, round brace or whatever,
7226
05:23:33,860 --> 05:23:36,300
and they have positions zero, one, and two,
7227
05:23:36,300 --> 05:23:39,900
just like a list, and you can look things up, X sub two,
7228
05:23:39,900 --> 05:23:43,020
so X sub two is actually the third element here,
7229
05:23:43,020 --> 05:23:45,220
and so that prints out Joseph.
7230
05:23:45,220 --> 05:23:47,820
You can assign, you know, make a tuple here,
7231
05:23:47,820 --> 05:23:50,260
this is the constant syntax for a tuple,
7232
05:23:50,260 --> 05:23:52,460
and print that out, and the print statement shows you
7233
05:23:52,460 --> 05:23:54,180
that this is a tuple, not a list,
7234
05:23:54,180 --> 05:23:55,740
by showing you round parentheses,
7235
05:23:55,740 --> 05:23:58,220
and a whole bunch of functions that work with lists
7236
05:23:58,220 --> 05:24:00,500
work the same way with tuples.
7237
05:24:00,500 --> 05:24:03,240
You can put a tuple at the end of an in statement
7238
05:24:03,240 --> 05:24:04,780
in a for, as you might expect,
7239
05:24:04,780 --> 05:24:06,420
and then it iterates through the tuples,
7240
05:24:06,420 --> 05:24:09,600
tuples maintain order, so it prints out one, nine, and two.
7241
05:24:09,600 --> 05:24:14,080
So, literally this bit of code here could be identical,
7242
05:24:14,080 --> 05:24:16,300
whether it was a list or a tuple,
7243
05:24:16,300 --> 05:24:19,040
it really would do the exact same thing.
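A small sketch of the tuple basics just described. Only 'Joseph' and the values one, nine, two come from the lecture; the other names in x are assumptions for illustration:

```python
# A tuple uses round parentheses instead of square brackets.
x = ('Glenn', 'Sally', 'Joseph')
print(x[2])            # position two is the third element

y = (1, 9, 2)
print(y)               # printed with parentheses, so we can see it is a tuple
print(max(y))          # functions such as max work the same as with lists

# A tuple can follow "in" in a for loop; order is maintained.
for item in y:
    print(item)
```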
7244
05:24:19,040 --> 05:24:21,940
The difference with tuples is that they are immutable:
7245
05:24:21,940 --> 05:24:23,460
once you create the tuple,
7246
05:24:23,460 --> 05:24:25,820
you can only sort of assign a tuple,
7247
05:24:25,820 --> 05:24:28,380
but you can't modify it, you can modify a list.
7248
05:24:28,380 --> 05:24:30,320
So if we take a look at a list here,
7249
05:24:30,320 --> 05:24:32,400
we make a list that's nine, eight, seven,
7250
05:24:32,400 --> 05:24:34,180
and we say X sub two equals six,
7251
05:24:34,180 --> 05:24:36,980
well, that just means this seven becomes a six,
7252
05:24:36,980 --> 05:24:41,180
and that's just natural, meaning we can reassign slots,
7253
05:24:41,180 --> 05:24:43,420
we can delete things, we can insert things,
7254
05:24:43,420 --> 05:24:46,260
we can mutate them, we can change them,
7255
05:24:46,260 --> 05:24:49,700
so they're changeable, right, they're changeable.
7256
05:24:49,700 --> 05:24:53,460
But, if we try to do that same thing with a string,
7257
05:24:53,460 --> 05:24:54,820
so we say Y equals ABC,
7258
05:24:54,820 --> 05:24:56,860
and we know that this is position zero, one, and two,
7259
05:24:56,860 --> 05:25:00,060
but if we try to say, let's change the C to a D,
7260
05:25:00,060 --> 05:25:03,620
by saying Y sub two equals D, that is not allowed.
7261
05:25:03,620 --> 05:25:05,940
And it says it doesn't support item assignment,
7262
05:25:05,940 --> 05:25:09,580
and this little bracket, you know, X sub two,
7263
05:25:09,580 --> 05:25:13,060
is what they call item assignment inside of Python.
7264
05:25:13,060 --> 05:25:14,840
And so if we do the same thing then
7265
05:25:14,840 --> 05:25:19,060
with a three element tuple, put that in Z,
7266
05:25:19,060 --> 05:25:22,140
and we try to change this slot to be a zero,
7267
05:25:22,140 --> 05:25:24,580
it's gonna blow up, because it's the exact same thing.
7268
05:25:24,580 --> 05:25:26,140
And that has to do with the fact that,
7269
05:25:26,140 --> 05:25:30,000
once this assignment is made, this is not modifiable.
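A sketch of the three cases just compared. The tuple values in z are an assumption; the errors are caught here only so that the example runs to the end:

```python
# Lists are mutable: item assignment works.
x = [9, 8, 7]
x[2] = 6
print(x)

# Strings do not support item assignment.
y = 'ABC'
try:
    y[2] = 'D'
except TypeError as err:
    print('string:', err)

# Tuples behave the same way as strings here.
z = (5, 4, 3)
try:
    z[2] = 0
except TypeError as err:
    print('tuple:', err)
```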
7270
05:25:30,000 --> 05:25:32,660
Now, it turns out that the reason it's not modifiable
7271
05:25:32,660 --> 05:25:34,380
is for efficiency.
7272
05:25:35,780 --> 05:25:40,780
They take up less storage, they are quicker to access,
7273
05:25:40,840 --> 05:25:43,740
and they're really designed internally behind the scenes
7274
05:25:43,740 --> 05:25:46,280
in ways we don't really need to understand.
7275
05:25:46,280 --> 05:25:49,360
They're just more efficient than lists.
7276
05:25:49,360 --> 05:25:50,840
If all you wanna do is store a list,
7277
05:25:50,840 --> 05:25:52,360
and look at it, and then throw it away,
7278
05:25:52,360 --> 05:25:54,420
you probably should use a tuple instead.
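One rough way to see the storage difference, assuming the standard CPython interpreter (sys.getsizeof reports only the container object itself, and the exact numbers vary by version):

```python
import sys

as_list = [3, 2, 1]
as_tuple = (3, 2, 1)

# In CPython the tuple object is typically smaller than the equivalent list.
print(sys.getsizeof(as_list), sys.getsizeof(as_tuple))
```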
7279
05:25:54,420 --> 05:25:56,600
So there's a lot of things that you can do with lists
7280
05:25:56,600 --> 05:25:57,960
that you can't do with tuples,
7281
05:25:57,960 --> 05:25:59,960
but they're really just a corollary
7282
05:25:59,960 --> 05:26:02,480
of this notion of non-mutability.
7283
05:26:02,480 --> 05:26:04,120
And so, like, you can sort a list,
7284
05:26:04,120 --> 05:26:05,520
but you can't sort tuples.
7285
05:26:05,520 --> 05:26:08,400
You can add a five to the end of three, two, one.
7286
05:26:08,400 --> 05:26:10,880
Can't do that in a tuple, but you can in a list.
7287
05:26:10,880 --> 05:26:13,560
And flip the order, dot, dot, dot, dot, dot, dot.
7288
05:26:13,560 --> 05:26:18,480
So anything that you can do to a list that modifies the list,
7289
05:26:18,480 --> 05:26:20,120
not allowed for tuples.
7290
05:26:20,120 --> 05:26:23,040
And so you can take a look at the kinds of things
7291
05:26:23,040 --> 05:26:27,600
that are inside the methods that are part of each list,
7292
05:26:27,600 --> 05:26:30,200
append, count, extend, index, insert, pop,
7293
05:26:30,200 --> 05:26:32,480
all of these, many of these are modifying,
7294
05:26:32,480 --> 05:26:34,120
and then count and index are the only ones
7295
05:26:34,120 --> 05:26:36,520
that work for tuples.
7296
05:26:36,520 --> 05:26:40,000
And so tuples are limited lists.
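A quick sketch for checking this yourself, using dir to list the public methods of each type (the variable names are assumptions):

```python
t = (3, 2, 1)
lst = [3, 2, 1]

# Keep only the public method names, skipping the double-underscore ones.
tuple_methods = [name for name in dir(t) if not name.startswith('_')]
list_methods = [name for name in dir(lst) if not name.startswith('_')]

print(tuple_methods)                               # only count and index
print(list_methods)                                # includes append, sort, reverse, ...
print(hasattr(t, 'sort'), hasattr(t, 'append'))    # neither exists on a tuple
```

Note that the built-in sorted function still accepts a tuple, because it builds and returns a new list rather than modifying the tuple.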
7297
05:26:40,000 --> 05:26:41,800
Now, at some point, there's gonna be a but here
7298
05:26:41,800 --> 05:26:44,040
to say, why do we like them?
7299
05:26:44,040 --> 05:26:46,760
And the reason that we like them is that
7300
05:26:46,760 --> 05:26:47,960
they're just more efficient.
7301
05:26:47,960 --> 05:26:50,480
They don't have to build in Python
7302
05:26:50,480 --> 05:26:53,960
in its own internal organization of these objects.
7303
05:26:53,960 --> 05:26:56,440
It knows that they'll never be modified,
7304
05:26:56,440 --> 05:26:58,360
because when you make a tuple, you as the programmer
7305
05:26:58,360 --> 05:27:00,200
are saying, I'm never gonna modify this,
7306
05:27:00,200 --> 05:27:02,600
and Python won't let you do it.
7307
05:27:02,600 --> 05:27:05,280
So it's higher performance, better memory use,
7308
05:27:05,280 --> 05:27:06,960
and you know, to a beginning programmer,
7309
05:27:06,960 --> 05:27:09,600
that doesn't really matter, but that's the reason.
7310
05:27:09,600 --> 05:27:13,280
And so we tend to use tuples in situations
7311
05:27:13,280 --> 05:27:14,600
where we're gonna make a temporary variable
7312
05:27:14,600 --> 05:27:17,160
and then temporarily use it just a little bit
7313
05:27:17,160 --> 05:27:19,080
and then throw it away without really messing with it.
7314
05:27:19,080 --> 05:27:21,280
And we tend to use lists to build things up,
7315
05:27:21,280 --> 05:27:22,720
et cetera, et cetera, et cetera.
7316
05:27:24,640 --> 05:27:27,760
So the other thing that's interesting about tuples,
7317
05:27:27,760 --> 05:27:29,440
and we've actually sort of seen this,
7318
05:27:29,440 --> 05:27:33,200
is that you can put a tuple that includes variables
7319
05:27:33,200 --> 05:27:35,120
on the left side of the assignment.
7320
05:27:35,120 --> 05:27:37,880
And this takes a little getting used to,
7321
05:27:37,880 --> 05:27:40,400
but it's really cool, and no other language
7322
05:27:40,400 --> 05:27:41,560
that I know of does this.
7323
05:27:41,560 --> 05:27:44,520
So if we say x comma y, that's a two tuple.
7324
05:27:44,520 --> 05:27:45,560
Both have two variables.
7325
05:27:45,560 --> 05:27:47,520
You can't put constants on this side.
7326
05:27:47,520 --> 05:27:49,960
You know, it's like saying x equals four,
7327
05:27:49,960 --> 05:27:53,160
y equals Fred, right?
7328
05:27:53,160 --> 05:27:56,000
So what happens is, is you can put a tuple
7329
05:27:56,000 --> 05:27:57,760
on the far side of an assignment statement,
7330
05:27:57,760 --> 05:28:00,960
and the four goes to x, and the Fred goes to y.
7331
05:28:00,960 --> 05:28:01,920
And you say, what's in y?
7332
05:28:01,920 --> 05:28:03,120
Well, y is indeed Fred.
7333
05:28:03,120 --> 05:28:05,460
And so this is like two assignment statements.
7334
05:28:05,460 --> 05:28:07,240
Now, the way I've got this syntax,
7335
05:28:07,240 --> 05:28:09,440
I would probably do two separate statements,
7336
05:28:09,440 --> 05:28:12,040
just not to show off that I know how to do tuples.
7337
05:28:14,200 --> 05:28:15,840
And so you can, here's another one,
7338
05:28:15,840 --> 05:28:17,760
and they just move correspondingly.
7339
05:28:17,760 --> 05:28:20,240
If you don't have two here, and you do have two here,
7340
05:28:21,240 --> 05:28:24,440
well, if you have three here, or two here, and three here,
7341
05:28:24,440 --> 05:28:25,840
and you don't match the number there,
7342
05:28:25,840 --> 05:28:26,780
you get in some trouble.
7343
05:28:26,780 --> 05:28:29,480
Now, if you just say x equals tuple,
7344
05:28:29,480 --> 05:28:31,180
then that is the tuple in the list.
7345
05:28:31,180 --> 05:28:35,440
But this is just a simple straight 99 value going into a.
7346
05:28:35,440 --> 05:28:38,760
So you can put tuples as the left-hand side.
7347
05:28:38,760 --> 05:28:41,680
And you can even do things like return a tuple from functions.
7348
05:28:41,680 --> 05:28:45,140
That's a real nice Python feature that I like a lot.
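A minimal sketch of tuple assignment and of returning a tuple from a function. The helper min_max and the value 98 are hypothetical, made up for illustration:

```python
# A tuple of variables on the left, a tuple of values on the right.
(x, y) = (4, 'fred')
print(y)

# The parentheses on the left are optional.
a, b = (99, 98)
print(a)

# Hypothetical helper: a function can return a tuple.
def min_max(values):
    return min(values), max(values)

low, high = min_max([9, 8, 7])
print(low, high)

# If the two sides do not have the same number of items, Python complains.
try:
    p, q = (1, 2, 3)
except ValueError as err:
    print(err)
```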
7349
05:28:45,140 --> 05:28:47,160
Tuples are also related to dictionaries,
7350
05:28:47,160 --> 05:28:49,200
as we've seen in the previous chapter.
7351
05:28:49,200 --> 05:28:50,620
So here we make a little dictionary.
7352
05:28:50,620 --> 05:28:52,720
We make an empty dictionary by constructing
7353
05:28:52,720 --> 05:28:54,240
an empty dictionary, stick it in d.
7354
05:28:54,240 --> 05:28:56,160
So d is sort of like this place
7355
05:28:56,160 --> 05:28:58,080
that can hold key value pairs.
7356
05:28:58,080 --> 05:29:01,080
And we put csev, and there's a two in there,
7357
05:29:01,080 --> 05:29:03,180
and chen1, and there's a four in there.
7358
05:29:03,180 --> 05:29:05,200
So we have this associative mapping
7359
05:29:05,200 --> 05:29:08,960
between csev and two, and chen1 and four, all stuff we know.
7360
05:29:08,960 --> 05:29:11,440
And now we say, hey, we're gonna loop
7361
05:29:11,440 --> 05:29:13,160
through the key value pairs here,
7362
05:29:13,160 --> 05:29:16,600
and we've seen this syntax before, k,v.
7363
05:29:16,600 --> 05:29:17,960
So this is a tuple.
7364
05:29:17,960 --> 05:29:20,120
So you can think of this as each one of these things
7365
05:29:20,120 --> 05:29:22,080
is going to get assigned into this tuple,
7366
05:29:22,080 --> 05:29:23,680
which means the key ends up in,
7367
05:29:23,680 --> 05:29:25,760
and the first one's the key, and the second one's the value.
7368
05:29:25,760 --> 05:29:29,520
I use the variables k, v all the time in code that I write,
7369
05:29:29,520 --> 05:29:30,860
just for my own sanity.
7370
05:29:30,860 --> 05:29:33,280
So k, v are gonna iterate successively
7371
05:29:33,280 --> 05:29:36,920
through the successive keys and values in them.
7372
05:29:36,920 --> 05:29:38,720
So this is gonna run twice,
7373
05:29:38,720 --> 05:29:41,680
and k, v is gonna be csev, two, and chen1, four.
7374
05:29:41,680 --> 05:29:44,380
The order just happened to stay the same.
7375
05:29:45,280 --> 05:29:49,680
And so if you say, what is in one of these things,
7376
05:29:49,680 --> 05:29:51,280
you can actually take d items,
7377
05:29:51,280 --> 05:29:53,560
the items method within that dictionary,
7378
05:29:53,560 --> 05:29:56,640
and say, hey, give me back, give that to me back,
7379
05:29:56,640 --> 05:29:57,680
and then print tups.
7380
05:29:57,680 --> 05:30:00,280
And this is, it's a special kind of a class,
7381
05:30:00,280 --> 05:30:03,840
but really ultimately it is a list of tuples.
7382
05:30:03,840 --> 05:30:07,040
This is two, this is the zero, and this is the two,
7383
05:30:07,040 --> 05:30:08,960
the one, the first and the second,
7384
05:30:08,960 --> 05:30:12,000
and then within each thing you get, you have a two tuple.
7385
05:30:12,000 --> 05:30:16,600
And so in a sense, this k and v are iterating
7386
05:30:16,600 --> 05:30:20,040
through those things when we're putting d items here
7387
05:30:20,040 --> 05:30:21,440
and d items there.
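A sketch of the loop and of the items method described above. The keys are spelled as they appear in this transcript, and the variable name tups is an assumption:

```python
d = dict()
d['csev'] = 2
d['chen1'] = 4

# Each (key, value) pair is assigned into the tuple k, v.
for (k, v) in d.items():
    print(k, v)

# items() returns a special view object that behaves like a list of two-tuples.
tups = d.items()
print(tups)
```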
7388
05:30:23,200 --> 05:30:25,560
One nice thing about tuples is that they're comparable.
7389
05:30:25,560 --> 05:30:26,960
They're comparable in the same way
7390
05:30:26,960 --> 05:30:28,080
that strings are comparable,
7391
05:30:28,080 --> 05:30:30,240
meaning that they're compared from left to right
7392
05:30:30,240 --> 05:30:33,880
with the leftmost, or zeroth, element being the most significant.
7393
05:30:33,880 --> 05:30:36,440
And it doesn't compare any further than it has to
7394
05:30:36,440 --> 05:30:39,300
if it's asking less than.
7395
05:30:39,300 --> 05:30:41,560
So if it's looking at, say, this first tuple,
7396
05:30:41,560 --> 05:30:43,880
it starts at the left and says, okay,
7397
05:30:43,880 --> 05:30:46,160
ask the question, tell me true or false.
7398
05:30:46,160 --> 05:30:47,720
Is zero less than five?
7399
05:30:47,720 --> 05:30:48,920
The answer is true.
7400
05:30:48,920 --> 05:30:52,000
And so the answer to this overall expression is true,
7401
05:30:52,000 --> 05:30:54,640
and it doesn't even compare those two numbers,
7402
05:30:54,640 --> 05:30:57,920
those second and third number, they don't compare them.
7403
05:30:57,920 --> 05:31:01,780
If, on the other hand, we're asking is this less than that,
7404
05:31:01,780 --> 05:31:03,360
it only looks at the first one
7405
05:31:03,360 --> 05:31:05,240
and asks if it can answer the question.
7406
05:31:05,240 --> 05:31:07,280
The answer is, well, they're both zero,
7407
05:31:07,280 --> 05:31:08,820
and so I can't answer the question,
7408
05:31:08,820 --> 05:31:11,160
so I have to go to the second one, second pair,
7409
05:31:11,160 --> 05:31:14,520
and one is less than three, and so that means this is true.
7410
05:31:14,520 --> 05:31:16,720
And it does not check this.
7411
05:31:16,720 --> 05:31:19,720
Even though 20 million is bigger than four,
7412
05:31:19,720 --> 05:31:23,600
it doesn't matter because these are the numbers
7413
05:31:23,600 --> 05:31:26,440
that cause the true to happen.
7414
05:31:26,440 --> 05:31:31,440
And the same is true if you do this with strings.
7415
05:31:31,640 --> 05:31:33,360
Again, we start the first one.
7416
05:31:33,360 --> 05:31:36,160
So Jones, Sally, well, that's the same,
7417
05:31:36,160 --> 05:31:37,320
so we don't know the answer yet,
7418
05:31:37,320 --> 05:31:40,960
and so Sally, Sam, well, okay, S, S,
7419
05:31:40,960 --> 05:31:43,960
well, they're the same, A, A, they're the same,
7420
05:31:43,960 --> 05:31:46,820
O, L, and M.
7421
05:31:46,820 --> 05:31:50,600
L is less than M, so the actual letter
7422
05:31:50,600 --> 05:31:52,920
that makes the difference here is the L and the M
7423
05:31:52,920 --> 05:31:55,000
and leads to us being true.
7424
05:31:55,000 --> 05:31:56,840
And so it goes left to right,
7425
05:31:56,840 --> 05:31:58,280
but then even when it's doing strings,
7426
05:31:58,280 --> 05:31:59,120
it's going left to right.
7427
05:31:59,120 --> 05:32:02,360
That's just how string comparison works.
7428
05:32:02,360 --> 05:32:07,360
And if we say, is Jones Sally greater than Adam, Sam?
7429
05:32:09,560 --> 05:32:10,800
Well, we checked the first one,
7430
05:32:10,800 --> 05:32:12,480
and we checked the J and the A.
7431
05:32:12,480 --> 05:32:14,520
Well, J is greater than A,
7432
05:32:14,520 --> 05:32:16,000
and so we don't have to look at anything else.
7433
05:32:16,000 --> 05:32:18,360
We don't have to look at any more of these characters.
7434
05:32:18,360 --> 05:32:20,440
We don't have to look at the second thing in the tuple.
7435
05:32:20,440 --> 05:32:22,920
Just looking at that is enough to be true.
7436
05:32:22,920 --> 05:32:26,360
So it only scans until it has a definitive answer.
7437
05:32:26,360 --> 05:32:28,200
It doesn't scan any further.
7438
05:32:29,640 --> 05:32:30,760
So now what we're going to do
7439
05:32:30,760 --> 05:32:32,920
is use this comparable capability
7440
05:32:32,920 --> 05:32:34,320
to sort these lists of tuples
7441
05:32:34,320 --> 05:32:36,200
and then bring this all back
7442
05:32:36,200 --> 05:32:38,000
and connect it more to dictionaries.
7443
05:32:41,920 --> 05:32:43,680
So now we can take advantage of the notion
7444
05:32:43,680 --> 05:32:46,280
of comparing tuples and use sorting.
7445
05:32:46,280 --> 05:32:49,480
And so what we're going to produce is a list of tuples,
7446
05:32:49,480 --> 05:32:51,800
and then we're going to sort them, right?
7447
05:32:51,800 --> 05:32:54,280
And so we can get a list of tuples from a dictionary
7448
05:32:54,280 --> 05:32:56,200
and then we can sort that list of tuples,
7449
05:32:56,200 --> 05:32:58,560
and then we can end up sorting dictionary items
7450
05:32:58,560 --> 05:33:00,160
by taking this two-step process.
7451
05:33:00,160 --> 05:33:02,600
Convert dictionary to a list, sort the list,
7452
05:33:02,600 --> 05:33:06,360
and then we can have sorted dictionary values, okay?
7453
05:33:06,360 --> 05:33:08,360
And so we'll do this a couple of different times.
7454
05:33:08,360 --> 05:33:10,440
So if we take a look at this code right here,
7455
05:33:10,440 --> 05:33:12,000
we have our happy little dictionary,
7456
05:33:12,000 --> 05:33:15,960
A, B, C, A maps to 10, B maps to one, C maps to 20.
7457
05:33:15,960 --> 05:33:17,120
Like what are we going to get here?
7458
05:33:17,120 --> 05:33:19,980
Well, it comes out, the mapping is the right way,
7459
05:33:19,980 --> 05:33:21,720
but the order is whatever.
7460
05:33:21,720 --> 05:33:24,200
And now we say this function called sorted,
7461
05:33:24,200 --> 05:33:26,440
which takes inside a sequence
7462
05:33:26,440 --> 05:33:29,320
and then returns us a sorted version of that,
7463
05:33:29,320 --> 05:33:30,760
a list that's sorted.
7464
05:33:30,760 --> 05:33:32,640
And so it says sort D items.
7465
05:33:32,640 --> 05:33:35,120
So it's basically going to take this list
7466
05:33:35,120 --> 05:33:37,320
and compare the A's and the C's and the B's,
7467
05:33:37,320 --> 05:33:40,000
and because it's a dictionary and all the keys are unique,
7468
05:33:40,000 --> 05:33:41,120
there's never going to be equality.
7469
05:33:41,120 --> 05:33:43,320
So it really is going to just sort this by keys
7470
05:33:43,320 --> 05:33:45,360
and never get to looking at the values.
7471
05:33:45,360 --> 05:33:48,720
You could construct a list that had duplicate,
7472
05:33:48,720 --> 05:33:50,160
you could make a list of tuples
7473
05:33:50,160 --> 05:33:53,280
that had duplicates in the first like we did before,
7474
05:33:53,280 --> 05:33:56,120
but given that this is coming from a dictionary,
7475
05:33:56,120 --> 05:33:58,840
the first thing is going to always be unique and distinct.
7476
05:33:58,840 --> 05:34:00,760
And so if we say sorted of d.items(),
7477
05:34:00,760 --> 05:34:04,080
that we're passing this stuff into sorted,
7478
05:34:04,080 --> 05:34:07,240
sorted is going to go around, move stuff around,
7479
05:34:07,240 --> 05:34:10,600
and then give us back a sorted version,
7480
05:34:10,600 --> 05:34:13,240
sorted in ascending order based on key
7481
05:34:13,240 --> 05:34:14,920
without looking at the value.
7482
05:34:14,920 --> 05:34:19,600
And so that's a way to see dictionaries
7483
05:34:19,600 --> 05:34:22,720
sorted by key: just say sorted of d.items().
7484
05:34:22,720 --> 05:34:25,560
And sorted is a function, and so it just picks stuff.
7485
05:34:25,560 --> 05:34:27,360
And so this is the kind of loop
7486
05:34:27,360 --> 05:34:29,960
that you're going to write to do that.
7487
05:34:29,960 --> 05:34:32,560
You know, we did this before, we took sorted,
7488
05:34:32,560 --> 05:34:34,640
and we got these sorted by keys.
7489
05:34:34,640 --> 05:34:38,040
And so you can just make this nice and simple for key value.
7490
05:34:38,040 --> 05:34:40,280
By the way, you can eliminate the parentheses here,
7491
05:34:40,280 --> 05:34:42,280
and I think it's prettier if you eliminate the parentheses,
7492
05:34:42,280 --> 05:34:43,720
but you could put parentheses.
7493
05:34:43,720 --> 05:34:46,320
This is still a tuple without the parentheses
7494
05:34:46,320 --> 05:34:49,420
for key and value in sorted,
7495
05:34:49,420 --> 05:34:50,880
so that says go through D items,
7496
05:34:50,880 --> 05:34:53,240
but before I go through them, please sort them.
7497
05:34:53,240 --> 05:34:55,800
So that means K is going to go through A, B, and C
7498
05:34:55,800 --> 05:34:58,800
deterministically every single time it's going to go.
7499
05:34:58,800 --> 05:35:00,360
And of course, value is going to go
7500
05:35:00,360 --> 05:35:01,600
through the corresponding value,
7501
05:35:01,600 --> 05:35:06,600
so now we can print this out nicely sorted by key.
7502
05:35:06,640 --> 05:35:11,640
And that's a real nice succinct little way to say that.
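A minimal sketch of sorting a dictionary by key in this way. The sample values are illustrative assumptions:

```python
d = {'a': 10, 'c': 22, 'b': 1}

# sorted returns a new list of (key, value) tuples in ascending key order.
by_key = sorted(d.items())
print(by_key)

# The parentheses around k, v are optional; it is still a tuple.
for k, v in sorted(d.items()):
    print(k, v)
```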
7503
05:35:12,000 --> 05:35:15,000
I mean, again, this is one of the kinds of things
7504
05:35:15,000 --> 05:35:17,000
that people really like about Python
7505
05:35:17,000 --> 05:35:19,040
is that you can do pretty powerful things
7506
05:35:19,040 --> 05:35:21,280
with easy to under, I mean, you know,
7507
05:35:21,280 --> 05:35:22,480
you might have seen this for the first time,
7508
05:35:22,480 --> 05:35:24,360
but ultimately you look at that, eventually you'll be like,
7509
05:35:24,360 --> 05:35:26,560
oh yeah, I see exactly what that's doing.
7510
05:35:26,560 --> 05:35:28,420
Easy, not hard at all.
7511
05:35:29,560 --> 05:35:32,760
So, but let's say we're looking for the most common word,
7512
05:35:32,760 --> 05:35:35,400
which we have been for weeks and weeks and weeks now.
7513
05:35:36,640 --> 05:35:39,760
And so we want to sort by values, not key.
7514
05:35:39,760 --> 05:35:42,880
So this is an example of where we're going to construct
7515
05:35:42,880 --> 05:35:45,920
a data structure, we're going to imagine a data structure,
7516
05:35:45,920 --> 05:35:47,240
and then we're going to write code
7517
05:35:47,240 --> 05:35:48,320
to construct the data structure,
7518
05:35:48,320 --> 05:35:50,080
and then that's going to make our problem easy.
7519
05:35:50,080 --> 05:35:52,000
So this is an example of using
7520
05:35:52,000 --> 05:35:54,920
cleverly constructed data structures to do this.
7521
05:35:54,920 --> 05:35:57,500
And the data structure that we're going to create
7522
05:35:57,500 --> 05:36:00,880
is a list of tuples where the value is first
7523
05:36:00,880 --> 05:36:02,080
and the key is second.
7524
05:36:02,080 --> 05:36:05,240
So you can just with items get key value.
7525
05:36:05,240 --> 05:36:06,640
I want value key.
7526
05:36:06,640 --> 05:36:09,060
So let's take a look at this code.
7527
05:36:09,060 --> 05:36:10,920
Take your time and get it right.
7528
05:36:10,920 --> 05:36:13,180
So k, v goes through c.items().
7529
05:36:13,180 --> 05:36:15,800
Well, that is unsorted and going to have,
7530
05:36:15,800 --> 05:36:18,640
go through whatever A, B, and C, in whatever order.
7531
05:36:18,640 --> 05:36:20,280
And we're going to make a new list.
7532
05:36:20,280 --> 05:36:23,480
So this is a data structure that we're creating temporarily.
7533
05:36:23,480 --> 05:36:26,040
And what we're going to do is this is a list.
7534
05:36:28,920 --> 05:36:32,920
And we are going to append to that list a tuple.
7535
05:36:32,920 --> 05:36:35,980
So this is going to be a list of tuples.
7536
05:36:37,440 --> 05:36:41,920
Except we're not going to append them in key value order.
7537
05:36:41,920 --> 05:36:44,000
We're going to flip them and append the first part
7538
05:36:44,000 --> 05:36:45,400
of the tuple is going to be the value
7539
05:36:45,400 --> 05:36:47,680
and the second part is going to be the key.
7540
05:36:47,680 --> 05:36:49,160
So we end up with this.
7541
05:36:49,160 --> 05:36:50,800
This is sort of our temporary data structure
7542
05:36:50,800 --> 05:36:54,760
that we have constructed to make our job really easy.
7543
05:36:54,760 --> 05:36:59,720
So this ends up being 10, A; 22, C; 1, B.
7544
05:36:59,720 --> 05:37:00,880
Now we just kind of flipped them.
7545
05:37:00,880 --> 05:37:03,960
We took this order and then we flipped them around.
7546
05:37:03,960 --> 05:37:05,480
And so now we have this nice little list
7547
05:37:05,480 --> 05:37:07,160
sitting in memory in a variable.
7548
05:37:07,160 --> 05:37:08,720
And that's really simple.
7549
05:37:08,720 --> 05:37:12,200
We can say, oh, look, we can use sorted.
7550
05:37:12,200 --> 05:37:14,900
And we can sort by now the values
7551
05:37:14,900 --> 05:37:16,520
because they're the first thing.
7552
05:37:16,520 --> 05:37:19,380
The sorted doesn't know how we produce this list.
7553
05:37:19,380 --> 05:37:21,780
It just looks at that and says, oh, that's a list of tuples.
7554
05:37:21,780 --> 05:37:24,520
I'm going to always sort by looking at the first item
7555
05:37:24,520 --> 05:37:25,660
in any tuple.
7556
05:37:25,660 --> 05:37:27,440
And I'm going to add reverse equals true
7557
05:37:27,440 --> 05:37:28,760
so I get a descending sort.
7558
05:37:28,760 --> 05:37:33,760
So I see that the value that is highest ends up being first.
7559
05:37:34,440 --> 05:37:37,280
And so that changes this, and I just sort it
7560
05:37:37,280 --> 05:37:39,720
and then reassign it back into temp.
7561
05:37:39,720 --> 05:37:40,980
And I'll print this out.
7562
05:37:40,980 --> 05:37:44,040
And so now you see it's sorted in descending order of value.
7563
05:37:44,040 --> 05:37:48,960
So it's value key, value key, value key,
7564
05:37:48,960 --> 05:37:51,800
but it's sorted in descending order, okay?
7565
05:37:51,800 --> 05:37:55,200
And so that's an example sort of of just like,
7566
05:37:55,200 --> 05:37:56,920
you know, if I just made a data structure,
7567
05:37:56,920 --> 05:37:58,080
I flipped those things around,
7568
05:37:58,080 --> 05:38:00,800
I could use sorted to sort these things.
7569
05:38:00,800 --> 05:38:02,120
There's many other ways you could do it,
7570
05:38:02,120 --> 05:38:04,520
but this is sort of like the more elegant way of doing it.
7571
05:38:04,520 --> 05:38:08,160
And the clever bit here is like make a new list
7572
05:38:08,160 --> 05:38:10,360
and make it be a little bit different, okay?
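A sketch of the flipped-tuple technique just described, assuming the variable names c and tmp and illustrative sample values:

```python
c = {'a': 10, 'c': 22, 'b': 1}

# Build a temporary list of (value, key) tuples, flipped from items().
tmp = list()
for k, v in c.items():
    tmp.append((v, k))
print(tmp)

# sorted compares the first item of each tuple, which is now the value.
# reverse=True gives a descending sort, so the highest value comes first.
tmp = sorted(tmp, reverse=True)
print(tmp)
```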
7573
05:38:10,360 --> 05:38:13,960
So here we're going to print out the top 10
7574
05:38:13,960 --> 05:38:16,280
most common words in a file.
7575
05:38:16,280 --> 05:38:18,120
And most of this code is review.
7576
05:38:18,120 --> 05:38:22,720
So if we take a look at it, we're gonna open a file.
7577
05:38:22,720 --> 05:38:25,360
We're gonna start a dictionary for our counting.
7578
05:38:25,360 --> 05:38:27,560
We're going to, you know,
7579
05:38:27,560 --> 05:38:32,560
there's gonna be words and lines, right?
7580
05:38:32,680 --> 05:38:34,240
And so we're gonna have a for loop.
7581
05:38:34,240 --> 05:38:36,920
This for loop is gonna go through each line.
7582
05:38:36,920 --> 05:38:38,240
And then of course we're gonna split them
7583
05:38:38,240 --> 05:38:40,040
which is busting them into pieces.
7584
05:38:40,040 --> 05:38:42,120
And then we have a for loop within that.
7585
05:38:42,120 --> 05:38:45,000
And this for loop is gonna go through each word.
7586
05:38:45,000 --> 05:38:48,160
And so that means that by nesting these loops,
7587
05:38:48,160 --> 05:38:49,920
we're going through each line
7588
05:38:49,920 --> 05:38:51,440
and then within the line we're going through a word.
7589
05:38:51,440 --> 05:38:53,440
Then we go to the next line and go through the words.
7590
05:38:53,440 --> 05:38:55,280
And eventually this line of code,
7591
05:38:55,280 --> 05:38:58,560
counts sub word equals counts dot get of word, zero, plus one,
7592
05:38:58,560 --> 05:39:03,080
is our idiom for making a histogram, right?
7593
05:39:03,080 --> 05:39:05,040
This line right here is an idiom.
7594
05:39:05,040 --> 05:39:06,920
If you don't know already what that is,
7595
05:39:06,920 --> 05:39:09,360
go back to the previous dictionary lecture
7596
05:39:09,360 --> 05:39:11,040
and understand it, understand it,
7597
05:39:11,040 --> 05:39:14,000
because you're just gonna use it over and over again.
7598
05:39:14,000 --> 05:39:15,040
So now at this point,
7599
05:39:15,040 --> 05:39:17,360
and I always like drawing horizontal lines in code
7600
05:39:17,360 --> 05:39:18,200
when we write it.
7601
05:39:18,200 --> 05:39:20,280
At this point, coming through at this point,
7602
05:39:20,280 --> 05:39:21,480
counts is right.
7603
05:39:21,480 --> 05:39:22,960
Counts is the histogram.
7604
05:39:22,960 --> 05:39:24,000
It's not sorted.
7605
05:39:24,000 --> 05:39:25,440
So now we wanna sort it.
7606
05:39:25,440 --> 05:39:27,840
So we're going to make a new list.
7607
05:39:27,840 --> 05:39:29,840
We're gonna loop through key value.
7608
05:39:29,840 --> 05:39:31,280
And then we're gonna make a tuple.
7609
05:39:31,280 --> 05:39:32,920
I'm making this be two lines
7610
05:39:32,920 --> 05:39:34,440
to make it a little easier, value key.
7611
05:39:34,440 --> 05:39:35,440
So I'm flipping it, right?
7612
05:39:35,440 --> 05:39:37,800
So I'm flipping the order of these things.
7613
05:39:37,800 --> 05:39:38,800
That's making a tuple.
7614
05:39:38,800 --> 05:39:42,280
And then I'm appending that tuple to the list, okay?
7615
05:39:42,280 --> 05:39:44,600
So at the end of this,
7616
05:39:44,600 --> 05:39:46,640
we have a list of tuples
7617
05:39:48,720 --> 05:39:53,720
in value key order, vk, vk, right?
7618
05:39:54,400 --> 05:39:56,120
So at this point, coming through here,
7619
05:39:56,120 --> 05:39:58,480
I've got in my lst variable,
7620
05:39:58,480 --> 05:40:01,080
I've got this really useful bit of code,
7621
05:40:01,080 --> 05:40:03,080
or useful bit of data that I produced.
7622
05:40:03,080 --> 05:40:05,600
And then I'm like, oh, now it's ready to be sorted.
7623
05:40:05,600 --> 05:40:07,000
Poof, sort.
7624
05:40:07,000 --> 05:40:10,160
So take lst, sort it backwards, in descending order,
7625
05:40:10,160 --> 05:40:11,840
and then stick that back in lst.
7626
05:40:11,840 --> 05:40:13,440
Now we wanna print it out,
7627
05:40:13,440 --> 05:40:15,040
but we don't wanna print it out.
7628
05:40:16,960 --> 05:40:19,320
So we got a nice sorted list coming down here.
7629
05:40:19,320 --> 05:40:21,880
We don't wanna print it out in value key,
7630
05:40:21,880 --> 05:40:22,760
because that's what it is.
7631
05:40:22,760 --> 05:40:26,800
It's in (v, k) order, but it's sorted.
7632
05:40:26,800 --> 05:40:31,800
And we know that the highest value is here on down.
7633
05:40:31,800 --> 05:40:33,840
And so we're gonna say, we're gonna run through,
7634
05:40:33,840 --> 05:40:36,100
and now we're gonna go through this new list,
7635
05:40:36,100 --> 05:40:38,800
only the first 10, start at the beginning up to,
7636
05:40:38,800 --> 05:40:42,260
but not including number 10, which is the first 10,
7637
05:40:42,260 --> 05:40:44,760
for value key in, and so value is good.
7638
05:40:44,760 --> 05:40:46,240
So this is the iteration variable
7639
05:40:46,240 --> 05:40:49,280
that's gonna go through each of these things, on and down,
7640
05:40:49,280 --> 05:40:51,100
and then we're just gonna print it out flipping it.
7641
05:40:51,100 --> 05:40:54,980
So we reflip it, flip, flip, we print it out key value,
7642
05:40:54,980 --> 05:40:56,820
and it's going to work.
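Here is a minimal sketch of the program just described. It is not the exact slide code: the sample text is an assumption standing in for the file, and it is read from a string so the sketch runs on its own. The names counts, lst, and newtup follow the narration.

```python
# Sample text (an assumption) stands in for the file opened in the lecture.
text = """the clown ran after the car
and the car ran into the tent
and the tent fell down on the clown and the car"""

counts = dict()
for line in text.splitlines():        # outer loop: each line
    words = line.split()              # bust the line into words
    for word in words:                # inner loop: each word in the line
        counts[word] = counts.get(word, 0) + 1   # the histogram idiom

# ---- at this point counts is the histogram, not sorted ----

lst = list()
for key, val in counts.items():
    newtup = (val, key)               # flip to (value, key)
    lst.append(newtup)

lst = sorted(lst, reverse=True)       # highest count first

for val, key in lst[:10]:             # the first 10, up to but not including 10
    print(key, val)                   # flip back to key, value for printing
```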
7643
05:40:58,540 --> 05:41:03,280
Okay, so that is one way of doing this.
7644
05:41:03,280 --> 05:41:04,420
And this slide right here,
7645
05:41:04,420 --> 05:41:06,920
you absolutely do not need to figure out,
7646
05:41:06,920 --> 05:41:10,040
but some of you will look at this slide and you're like,
7647
05:41:10,040 --> 05:41:11,560
why didn't you show us that in the beginning?
7648
05:41:11,560 --> 05:41:14,000
And others of you will be like, no, no, no, no, no,
7649
05:41:14,000 --> 05:41:16,000
keep telling me this stuff here.
7650
05:41:16,000 --> 05:41:19,300
So I don't know exactly the term for this,
7651
05:41:19,300 --> 05:41:22,080
but this is very procedural.
7652
05:41:22,080 --> 05:41:24,820
This is a classic algorithms and data structures approach
7653
05:41:24,820 --> 05:41:26,160
to solving this problem.
7654
05:41:27,600 --> 05:41:31,120
This next thing uses what are called lambdas,
7655
05:41:31,120 --> 05:41:33,320
and they kind of create what's called,
7656
05:41:33,320 --> 05:41:34,720
what I call a closed form,
7657
05:41:34,720 --> 05:41:36,280
where you kind of do it in all one statement,
7658
05:41:36,280 --> 05:41:38,360
and there's all this implicit stuff going on.
7659
05:41:38,360 --> 05:41:39,760
So if you don't get this right away,
7660
05:41:39,760 --> 05:41:41,820
don't worry too much about that.
7661
05:41:41,820 --> 05:41:46,820
But roughly, this single line does everything
7662
05:41:47,600 --> 05:41:50,000
that bottom half of that program does.
7663
05:41:50,000 --> 05:41:52,620
I mean, if you go back, if we go back to here,
7664
05:41:52,620 --> 05:41:55,760
it's pretty much this line does everything,
7665
05:41:55,760 --> 05:41:58,240
does that in one line, okay?
7666
05:41:58,240 --> 05:41:59,800
It doesn't create the counts,
7667
05:41:59,800 --> 05:42:01,680
and it doesn't print out the top 10,
7668
05:42:01,680 --> 05:42:04,680
but it does everything in that middle bit.
7669
05:42:04,680 --> 05:42:06,520
So let's take a look at this.
7670
05:42:06,520 --> 05:42:08,600
So we all are gonna collapse this down.
7671
05:42:08,600 --> 05:42:12,080
So we have a print, and that parenthesis is the end of the print.
7672
05:42:12,080 --> 05:42:13,440
And then we have sorted,
7673
05:42:13,440 --> 05:42:17,960
and remember that sorted takes as input a list.
7674
05:42:17,960 --> 05:42:20,080
And so that's not too bad, and returns us a list.
7675
05:42:20,080 --> 05:42:22,280
And so we'll print the return from sorted.
7676
05:42:24,120 --> 05:42:26,160
And then this is the funny part,
7677
05:42:26,160 --> 05:42:27,720
the fun part, funny part.
7678
05:42:27,720 --> 05:42:32,100
This is called list comprehension.
7679
05:42:32,100 --> 05:42:33,720
And we have square brackets,
7680
05:42:33,720 --> 05:42:36,640
and we say to Python, this is a list.
7681
05:42:36,640 --> 05:42:39,120
But instead of listing the things,
7682
05:42:39,120 --> 05:42:41,920
or having a constant one comma two comma three,
7683
05:42:41,920 --> 05:42:43,640
or append, append, append,
7684
05:42:43,640 --> 05:42:45,620
we are going to create an expression
7685
05:42:45,620 --> 05:42:48,760
that will act as a generator for all the elements.
7686
05:42:48,760 --> 05:42:50,640
And so this basically says,
7687
05:42:50,640 --> 05:42:53,960
this is a list of two tuples, V and K,
7688
05:42:53,960 --> 05:42:56,400
and then this is sort of implied.
7689
05:42:56,400 --> 05:42:59,480
For all k, v in c dot items.
7690
05:42:59,480 --> 05:43:01,960
And so this is like a for loop,
7691
05:43:01,960 --> 05:43:03,760
that is sort of driving this,
7692
05:43:03,760 --> 05:43:07,000
think of this as like stamp, stamp, stamp, stamp, stamp.
7693
05:43:07,000 --> 05:43:09,320
However many times it has to make a stamp.
7694
05:43:09,320 --> 05:43:11,480
And so that's producing a list.
7695
05:43:11,480 --> 05:43:13,380
Ch-ch-ch-ch-ch, right?
7696
05:43:13,380 --> 05:43:15,500
It just manufactures this list.
7697
05:43:15,500 --> 05:43:18,460
And then that list is sort of manufactured in the moment.
7698
05:43:18,460 --> 05:43:20,520
There's no stock, it's not put in a variable.
7699
05:43:20,520 --> 05:43:24,720
Python makes that list according to the stamping pattern
7700
05:43:24,720 --> 05:43:26,720
that you've told it to stamp out this list.
7701
05:43:26,720 --> 05:43:28,880
And then it passes that stamped out list
7702
05:43:28,880 --> 05:43:30,520
without even storing it in a variable,
7703
05:43:30,520 --> 05:43:33,280
into sorted, sorted moves the list around,
7704
05:43:33,280 --> 05:43:35,120
because it is just a list of tuples,
7705
05:43:35,120 --> 05:43:37,040
and then gives us back the sorted list.
7706
05:43:37,040 --> 05:43:40,560
And so I didn't put reverse equals true on here,
7707
05:43:40,560 --> 05:43:43,920
but you see that this is sorted in ascending order now
7708
05:43:43,920 --> 05:43:44,960
by value.
7709
05:43:44,960 --> 05:43:48,360
And I did that all in one little statement.
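A small sketch of that one statement, next to the step-by-step version it replaces. The dictionary c and its contents are assumptions chosen for illustration.

```python
c = {'a': 10, 'b': 1, 'c': 22}   # sample dictionary, assumed for illustration

# One statement: stamp out (value, key) tuples, hand the list to sorted(), print it
print(sorted([(v, k) for k, v in c.items()]))   # [(1, 'b'), (10, 'a'), (22, 'c')]

# The same work spelled out in steps
tmp = list()
for k, v in c.items():
    tmp.append((v, k))
print(sorted(tmp))                              # same list as above
```

Because each tuple has the value first, the result comes out in ascending order by value.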
7710
05:43:49,360 --> 05:43:51,360
So look at this,
7711
05:43:51,360 --> 05:43:54,280
this is also one of the beautiful things about Python
7712
05:43:54,280 --> 05:43:55,640
that you can build these things,
7713
05:43:55,640 --> 05:43:57,680
and you can build more complex versions of this,
7714
05:43:57,680 --> 05:44:00,000
and there's a lot of real elegant things
7715
05:44:00,000 --> 05:44:02,920
that you can do in Python that are really succinct.
7716
05:44:02,920 --> 05:44:06,000
You should be careful, because in the beginning
7717
05:44:06,000 --> 05:44:07,800
I think this is easier to understand,
7718
05:44:07,800 --> 05:44:09,200
even though after a while you're like,
7719
05:44:09,200 --> 05:44:12,160
wait a sec, why am I putting all these extra lines in?
7720
05:44:12,160 --> 05:44:14,360
Because this is not so hard to understand,
7721
05:44:14,360 --> 05:44:17,680
but at some point you will want to master
7722
05:44:17,680 --> 05:44:21,200
this more powerful and more succinct version of Python
7723
05:44:21,200 --> 05:44:24,160
that expresses it in terms of the data you wanna see
7724
05:44:24,160 --> 05:44:26,000
rather than the steps you wanna take.
7725
05:44:26,960 --> 05:44:28,960
So this sort of finishes up tuples.
7726
05:44:28,960 --> 05:44:30,840
We've done a bunch of stuff.
7727
05:44:30,840 --> 05:44:33,440
I mean, really, they're simple and elegant.
7728
05:44:33,440 --> 05:44:36,280
Tuples, lists, and dictionaries are all related.
7729
05:44:36,280 --> 05:44:38,040
They're really three different,
7730
05:44:38,040 --> 05:44:40,360
kind of three foundational data structures,
7731
05:44:40,360 --> 05:44:42,880
three foundational collections of Python.
7732
05:44:42,880 --> 05:44:45,960
And we combine those in a lot of different ways.
7733
05:44:49,680 --> 05:44:51,200
And now in this little bit of lesson,
7734
05:44:51,200 --> 05:44:53,600
we are going to talk about some tuples,
7735
05:44:53,600 --> 05:44:57,040
and we're going to create a list of the most common words
7736
05:44:57,040 --> 05:45:00,600
and find out how to sort a dictionary by the values
7737
05:45:00,600 --> 05:45:02,400
instead of by the key.
7738
05:45:02,400 --> 05:45:04,920
We're gonna use the clown.txt file
7739
05:45:04,920 --> 05:45:07,080
and the intro.txt file.
7740
05:45:07,080 --> 05:45:10,600
And I'm gonna start with the code from exercise nine
7741
05:45:10,600 --> 05:45:13,480
that I just did from chapter nine.
7742
05:45:13,480 --> 05:45:15,040
It's not exactly one of the exercises,
7743
05:45:15,040 --> 05:45:16,600
but it's very similar to them.
7744
05:45:16,600 --> 05:45:18,400
And I'm going to make a copy,
7745
05:45:18,400 --> 05:45:19,840
and I'm gonna keep it in the same folder.
7746
05:45:19,840 --> 05:45:21,920
I'm gonna keep it in the ex09 folder
7747
05:45:21,920 --> 05:45:26,520
and just call it ex10 because this code
7748
05:45:26,520 --> 05:45:30,240
is going to do much of the same stuff,
7749
05:45:30,240 --> 05:45:31,520
and it's gonna read these same files.
7750
05:45:31,520 --> 05:45:33,720
And so I've got myself exercise 10.
7751
05:45:33,720 --> 05:45:34,920
Exercise nine is still here.
7752
05:45:34,920 --> 05:45:37,960
Exercise 10 is now what I'm editing, exercise 10.
7753
05:45:37,960 --> 05:45:39,760
But I'm in the exercise nine folder.
7754
05:45:40,880 --> 05:45:45,800
So in exercise nine, we look for the most common word,
7755
05:45:45,800 --> 05:45:47,800
but we wanna find the five most common words,
7756
05:45:47,800 --> 05:45:49,840
which is gonna require us to sort.
7757
05:45:49,840 --> 05:45:51,360
So I'm gonna get rid of that code
7758
05:45:51,360 --> 05:45:52,920
because it's not really how we're gonna do it.
7759
05:45:52,920 --> 05:45:56,400
There we manually loop through it and found the maximum.
7760
05:45:56,400 --> 05:45:58,880
And so I'm gonna just run this.
7761
05:45:58,880 --> 05:46:03,880
CD, desktop, Python for everybody, ex09.
7762
05:46:05,320 --> 05:46:06,960
Now if I do an ls, you see that I've got
7763
05:46:06,960 --> 05:46:09,760
ex09.py intro.txt.
7764
05:46:09,760 --> 05:46:14,760
So I'll run python3 ex10.py and run the clown data.
7765
05:46:17,160 --> 05:46:19,400
And we see that the dictionary
7766
05:46:19,400 --> 05:46:22,120
is properly being made by this code right here.
7767
05:46:22,120 --> 05:46:23,040
That doesn't change.
7768
05:46:23,040 --> 05:46:25,480
It reads the file, reads all the lines,
7769
05:46:25,480 --> 05:46:27,040
goes through and splits it into words,
7770
05:46:27,040 --> 05:46:28,160
and then goes through the words
7771
05:46:28,160 --> 05:46:31,720
and does the idiom of using dictionary get
7772
05:46:31,720 --> 05:46:33,360
to maintain the counters,
7773
05:46:33,360 --> 05:46:34,680
and we print it out at the very end.
7774
05:46:34,680 --> 05:46:38,240
So the new code we're going to write is down here.
7775
05:46:40,240 --> 05:46:42,440
So let's first do a few things.
7776
05:46:43,920 --> 05:46:48,920
I can say x is equal to the dictionary,
7777
05:46:48,920 --> 05:46:53,920
dot items, and this gives us basically a list, print x.
7778
05:46:55,440 --> 05:46:58,480
This gives us a list of the key value pairs.
7779
05:46:58,480 --> 05:46:59,640
This prints out the dictionary,
7780
05:46:59,640 --> 05:47:02,120
but if we do it this way and use items,
7781
05:47:02,120 --> 05:47:04,840
it gives us the key value pairs.
7782
05:47:04,840 --> 05:47:07,120
Okay, and so that's what we got.
7783
05:47:07,120 --> 05:47:07,960
Key value pairs.
7784
05:47:07,960 --> 05:47:11,680
Now we can sort this based on the value
7785
05:47:11,680 --> 05:47:13,480
because tuples can be compared.
7786
05:47:13,480 --> 05:47:15,680
This can be compared with this.
7787
05:47:15,680 --> 05:47:20,120
And because d is lower than r, then this one is lower.
7788
05:47:20,120 --> 05:47:23,920
This whole, this ran tuple comes after the down tuple.
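To make the comparison rule concrete, here is a tiny sketch. The tuples are sample values assumed for illustration; Python compares tuples item by item from the left and only looks at a later item when the earlier ones tie.

```python
down = ('down', 1)
ran = ('ran', 2)

print(down < ran)                     # True: 'd' sorts before 'r'
print((1, 'after') < (1, 'down'))     # True: first items tie, so 'after' vs 'down' decides
print((0, 5, 9) < (0, 6, 1))          # True: decided at the second item, third never checked
```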
7789
05:47:23,920 --> 05:47:25,840
So we can sort this whole thing.
7790
05:47:25,840 --> 05:47:29,840
And I'll do this by just putting the word sorted here
7791
05:47:29,840 --> 05:47:31,760
and say, give me a sorted version of that.
7792
05:47:31,760 --> 05:47:35,000
Now it's going to do it based on the order of the tuples.
7793
05:47:35,000 --> 05:47:37,960
This is going to be more, higher precedence than this.
7794
05:47:37,960 --> 05:47:39,800
So if I print it this way,
7795
05:47:42,080 --> 05:47:45,600
run it again, you'll see that it's sorted.
7796
05:47:45,600 --> 05:47:48,800
And now it's after, and, car,
7797
05:47:48,800 --> 05:47:51,160
it's in alphabetical order by key.
7798
05:47:51,160 --> 05:47:55,040
And so we could actually print the first five
7799
05:47:55,040 --> 05:47:56,880
up to, but not including five
7800
05:47:56,880 --> 05:48:00,960
by adding a list slice here.
7801
05:48:00,960 --> 05:48:05,680
And so that will show you only the first five, right?
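A rough sketch of those steps. The counts dictionary is typed in by hand here as an assumption about what the clown data produces, so the sketch runs without the file.

```python
# Word counts assumed to match the clown data; hard-coded so this runs on its own.
counts = {'the': 7, 'clown': 2, 'ran': 2, 'after': 1, 'car': 3, 'and': 3,
          'into': 1, 'tent': 2, 'fell': 1, 'down': 1, 'on': 1}

x = counts.items()
print(x)                         # the (key, value) pairs, in dictionary order

by_key = sorted(counts.items())  # a real list, sorted by key because the key comes first
print(by_key[:5])                # first five, up to but not including 5
```

This gives the first five words alphabetically, which is not yet the five most common.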
7802
05:48:05,680 --> 05:48:07,160
Except that that's not what we're trying to do.
7803
05:48:07,160 --> 05:48:10,880
We really want to sort by this, okay?
7804
05:48:10,880 --> 05:48:15,880
So we have this mechanism that can take a list
7805
05:48:16,800 --> 05:48:19,320
and sort it based on the tuple values.
7806
05:48:19,320 --> 05:48:23,000
If we could create a list where it was one comma after
7807
05:48:23,000 --> 05:48:26,640
instead of after comma one and make it the exact same thing,
7808
05:48:26,640 --> 05:48:29,760
then we could actually then sort it and it would be fine.
7809
05:48:29,760 --> 05:48:30,760
Okay?
7810
05:48:30,760 --> 05:48:32,920
So let me show you a couple of ways,
7811
05:48:32,920 --> 05:48:35,160
at least one way to do that, okay?
7812
05:48:37,560 --> 05:48:38,720
Get rid of this.
7813
05:48:38,720 --> 05:48:43,000
We're gonna hand construct a list
7814
05:48:43,000 --> 05:48:46,760
and just call it temp equals, give me a new list.
7815
05:48:47,840 --> 05:48:49,640
Temp equals new list.
7816
05:48:49,640 --> 05:48:54,640
And then for k comma v in the dictionary dot items.
7817
05:48:58,880 --> 05:49:02,480
And I'll just start by printing K comma V.
7818
05:49:02,480 --> 05:49:05,880
So we see, and this is where it's really nice
7819
05:49:05,880 --> 05:49:08,280
to do these with the clown code first
7820
05:49:08,280 --> 05:49:11,560
and then only do your test on the bigger file later.
7821
05:49:11,560 --> 05:49:13,680
And so it's pretty much the same thing
7822
05:49:13,680 --> 05:49:16,520
we are going through in key value order,
7823
05:49:16,520 --> 05:49:19,520
which is dictionary order, which is not sorted at all.
7824
05:49:19,520 --> 05:49:20,440
Okay?
7825
05:49:20,440 --> 05:49:22,720
Now, instead of printing this out,
7826
05:49:22,720 --> 05:49:26,560
we are going to, let me do this in a couple of steps.
7827
05:49:26,560 --> 05:49:31,560
Make a new tuple and I'll just call it newt
7828
05:49:32,760 --> 05:49:36,600
equals parenthesis V comma K.
7829
05:49:36,600 --> 05:49:40,120
Okay, so this is, I'm saying make a new tuple.
7830
05:49:40,120 --> 05:49:42,960
This is like a new tuple with two items in it
7831
05:49:42,960 --> 05:49:47,120
and I'm gonna make the value and the key.
7832
05:49:47,120 --> 05:49:52,120
Okay, so then I'm going to say temp.append newt, newtuple.
7833
05:49:56,080 --> 05:49:59,760
So I'm gonna end up with a list of tuples.
7834
05:49:59,760 --> 05:50:01,400
Let me comment this one out
7835
05:50:01,400 --> 05:50:03,160
and I'm gonna then, when I'm done here,
7836
05:50:03,160 --> 05:50:07,000
I'm gonna print temp.
7837
05:50:11,200 --> 05:50:15,440
So if I run clown.txt, you see what happens in temp.
7838
05:50:15,440 --> 05:50:18,440
It's still, well, let's print temp twice.
7839
05:50:23,640 --> 05:50:26,360
I mean, it's not sorted, it's flipped.
7840
05:50:28,280 --> 05:50:32,280
Let's print it, that's okay.
7841
05:50:32,280 --> 05:50:34,040
We'll just, that's the flipped one.
7842
05:50:36,280 --> 05:50:38,440
Okay, so it's flipped and all we did is we made it,
7843
05:50:38,440 --> 05:50:41,800
instead of car comma three, it's three comma car.
7844
05:50:41,800 --> 05:50:43,520
But now we have a list.
7845
05:50:44,800 --> 05:50:45,960
Okay?
7846
05:50:45,960 --> 05:50:49,160
So now it's flipped and now we can sort that.
7847
05:50:49,160 --> 05:50:54,160
We can say temp equals sorted temp.
7848
05:50:55,440 --> 05:50:59,000
So it says, take temp, sort it, and give it back to me
7849
05:50:59,000 --> 05:51:04,000
and now I'm gonna say print sorted comma temp.
7850
05:51:10,680 --> 05:51:12,680
Okay, so here's the first print.
7851
05:51:13,680 --> 05:51:15,720
When we flipped it, we've got two comma tent,
7852
05:51:17,040 --> 05:51:18,320
but it's not sorted at all.
7853
05:51:18,320 --> 05:51:20,680
But after we sorted it, it's sorted by tuple
7854
05:51:20,680 --> 05:51:22,880
and the lowest is one after.
7855
05:51:22,880 --> 05:51:26,160
So you'll notice that one is the same as one,
7856
05:51:26,160 --> 05:51:28,080
so it checked the second item in the tuple.
7857
05:51:28,080 --> 05:51:32,920
So down comes after after, fell comes after down.
7858
05:51:32,920 --> 05:51:36,000
Into, on, in alphabetical order, but now we get the twos.
7859
05:51:36,000 --> 05:51:41,000
So all the ones sort there and then the twos come here,
7860
05:51:42,200 --> 05:51:44,360
but then within the twos, it's sorted in alphabetical order
7861
05:51:44,360 --> 05:51:48,980
because like a string, if the first character matches,
7862
05:51:48,980 --> 05:51:50,760
then it looks to the second character.
7863
05:51:50,760 --> 05:51:54,120
And then we see, oh, here we go, the threes
7864
05:51:54,120 --> 05:51:55,840
and then the one we actually wanted,
7865
05:51:55,840 --> 05:51:58,880
the highest one is the seven.
7866
05:51:58,880 --> 05:52:01,600
And so one of the things we can do is we can say,
7867
05:52:01,600 --> 05:52:03,360
you'll notice that we want the highest one,
7868
05:52:03,360 --> 05:52:04,640
not the lowest one.
7869
05:52:04,640 --> 05:52:07,840
So we can just tell this with this parameter,
7870
05:52:07,840 --> 05:52:09,940
reverse equals true.
7871
05:52:13,320 --> 05:52:15,560
And we just say, hey, sorted, do this backwards,
7872
05:52:15,560 --> 05:52:19,080
do it from highest to lowest rather than lowest to highest.
7873
05:52:19,080 --> 05:52:24,080
And now our sorted one says seven, the, et cetera.
7874
05:52:24,080 --> 05:52:25,600
Okay.
7875
05:52:25,600 --> 05:52:27,720
And so we want the first five,
7876
05:52:30,400 --> 05:52:35,400
we can say up to, but not including five.
7877
05:52:37,920 --> 05:52:40,780
So this is now the top five.
7878
05:52:42,820 --> 05:52:46,640
So the sorted one is, that's the top five.
7879
05:52:46,640 --> 05:52:48,560
If there's a tie, we're gonna go
7880
05:52:48,560 --> 05:52:49,960
in reverse alphabetical order,
7881
05:52:49,960 --> 05:52:52,720
but let's not worry about that too much for now.
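Here is a sketch of the flipped list with the reverse sort, using hand-typed counts that are assumed to match the clown data. It also shows the tie behaviour just mentioned: with reverse=True, words with the same count come out in reverse alphabetical order.

```python
# Counts assumed to match the clown data; hard-coded so this runs on its own.
counts = {'the': 7, 'clown': 2, 'ran': 2, 'after': 1, 'car': 3, 'and': 3,
          'into': 1, 'tent': 2, 'fell': 1, 'down': 1, 'on': 1}

temp = list()
for k, v in counts.items():
    newt = (v, k)                 # new tuple: value first, key second
    temp.append(newt)

temp = sorted(temp, reverse=True) # highest to lowest
print(temp[:5])
# [(7, 'the'), (3, 'car'), (3, 'and'), (2, 'tent'), (2, 'ran')]
```

Notice car lands ahead of and, and tent ahead of ran, because the whole tuple is compared in reverse.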
7882
05:52:52,720 --> 05:52:56,720
So it makes a flipped list, then it sorts the flipped list.
7883
05:52:59,060 --> 05:53:02,880
Now, if I just wanted to print it out nicer,
7884
05:53:02,880 --> 05:53:04,840
I could loop through this new list.
7885
05:53:04,840 --> 05:53:09,480
I could say for V comma K, remember this is a flipped list.
7886
05:53:09,480 --> 05:53:11,560
So the sensible thing is what's coming out,
7887
05:53:11,560 --> 05:53:13,200
I mean, coming out of this list,
7888
05:53:13,200 --> 05:53:17,200
each tuple is value comma key in temp.
7889
05:53:17,200 --> 05:53:19,080
And I'm only gonna go up to,
7890
05:53:19,080 --> 05:53:21,840
but not including five, so the first five.
7891
05:53:21,840 --> 05:53:24,520
And so I'm pulling them back out as value key
7892
05:53:24,520 --> 05:53:26,360
because that's what they are.
7893
05:53:26,360 --> 05:53:31,360
They're value key, see value key, value key, value key.
7894
05:53:31,700 --> 05:53:33,340
So V is gonna go through these
7895
05:53:33,340 --> 05:53:34,920
and K is gonna go through these.
7896
05:53:34,920 --> 05:53:38,520
And then I'm just gonna print K comma V.
7897
05:53:38,520 --> 05:53:40,400
So this is kind of my flipping backwards
7898
05:53:40,400 --> 05:53:42,880
because I wanna see them this way.
7899
05:53:44,800 --> 05:53:47,160
And the is the most common one, then car three.
7900
05:53:47,160 --> 05:53:50,120
And so it's just going through this up through the fifth one
7901
05:53:50,120 --> 05:53:51,720
and then printing them out.
7902
05:53:51,720 --> 05:53:55,320
Okay, so let me comment this out.
7903
05:53:55,320 --> 05:53:56,980
Let me comment that out.
7904
05:54:01,280 --> 05:54:03,460
Let me just delete this.
7905
05:54:03,460 --> 05:54:05,240
So we have a dictionary.
7906
05:54:05,240 --> 05:54:07,640
Let me comment out the dictionary.
7907
05:54:07,640 --> 05:54:09,720
We have a dictionary, we make a list
7908
05:54:09,720 --> 05:54:12,600
and we make these reversed tuples
7909
05:54:12,600 --> 05:54:14,440
where we have the value first and the key second.
7910
05:54:14,440 --> 05:54:17,680
We're setting it up so the sort's gonna work.
7911
05:54:17,680 --> 05:54:19,760
And then once it's sorted, we have to flip them back.
7912
05:54:19,760 --> 05:54:22,880
So we flip them for sorting from key value
7913
05:54:22,880 --> 05:54:25,160
to value key for sorting.
7914
05:54:25,160 --> 05:54:28,280
We do the sort, then we flip them back with key value
7915
05:54:28,280 --> 05:54:29,220
and print them out.
7916
05:54:34,640 --> 05:54:35,800
And it works fine.
7917
05:54:35,800 --> 05:54:40,800
So let's try our big file, intro.txt and there you go.
7918
05:54:41,180 --> 05:54:46,180
Those are the five most common words in intro.txt.
7919
05:54:46,580 --> 05:54:49,240
So you might ask yourself, why did we use tuples?
7920
05:54:49,240 --> 05:54:52,040
We probably, we could have really used lists for this
7921
05:54:52,040 --> 05:54:53,860
but tuples are more efficient than lists
7922
05:54:53,860 --> 05:54:56,380
and you notice that we weren't gonna modify.
7923
05:54:56,380 --> 05:54:59,000
We did modify the temp list, it's a list of tuples
7924
05:54:59,000 --> 05:55:02,760
but the tuples within the list, we weren't gonna modify.
7925
05:55:02,760 --> 05:55:05,720
And so we tend not to make lists
7926
05:55:05,720 --> 05:55:07,760
if we can get away with using tuples.
7927
05:55:07,760 --> 05:55:11,880
And so that's why we made this flipped tuple thing.
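A short sketch of that distinction, with sample values assumed for illustration: the list can be appended to and reordered, while the tuples inside it cannot be changed.

```python
temp = []
temp.append((7, 'the'))       # the list itself can grow
temp.append((3, 'car'))
temp.sort()                   # and be reordered in place

try:
    temp[0][0] = 99           # tuples do not support item assignment
except TypeError as err:
    print('Cannot modify a tuple:', err)

print(temp)                   # [(3, 'car'), (7, 'the')]
```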
7928
05:55:11,880 --> 05:55:15,680
Okay, so I hope that was useful to you.
7929
05:55:15,680 --> 05:55:20,680
Hope to see you on the net.
7930
05:55:20,840 --> 05:55:24,080
Hello and welcome to chapter 11, regular expressions.
7931
05:55:24,080 --> 05:55:25,800
The fun thing about this chapter is
7932
05:55:25,800 --> 05:55:27,360
unlike all the rest of the chapters,
7933
05:55:27,360 --> 05:55:30,640
you sort of had to really understand every single thing
7934
05:55:30,640 --> 05:55:33,240
in chapters one through 11 built on one another,
7935
05:55:33,240 --> 05:55:35,500
one through 10 built on one another.
7936
05:55:35,500 --> 05:55:38,520
But you can really get along without using chapter 11.
7937
05:55:38,520 --> 05:55:41,000
It's not a really required topic
7938
05:55:41,000 --> 05:55:43,200
but it's a fun topic and an interesting topic.
7939
05:55:43,200 --> 05:55:46,880
So you can relax a little bit and realize
7940
05:55:46,880 --> 05:55:48,680
that you may or may not like regular expressions
7941
05:55:48,680 --> 05:55:49,960
and if you don't like them, that's okay.
7942
05:55:49,960 --> 05:55:50,880
You don't have to use them.
7943
05:55:50,880 --> 05:55:52,440
You can go for your whole life
7944
05:55:52,440 --> 05:55:54,500
without using regular expressions.
7945
05:55:54,500 --> 05:55:56,800
The idea of a regular expression is that
7946
05:55:56,800 --> 05:55:59,960
you come up with a language.
7947
05:55:59,960 --> 05:56:02,040
It's a little character based programming language
7948
05:56:02,040 --> 05:56:06,080
where you can do smart searching basically.
7949
05:56:06,080 --> 05:56:07,800
Smart searching and, as you'll see in a bit,
7950
05:56:07,800 --> 05:56:11,020
with smart extraction.
7951
05:56:11,020 --> 05:56:14,560
And it's really almost programmable wild card expressions.
7952
05:56:14,560 --> 05:56:16,280
There's no looping but there is looping
7953
05:56:16,280 --> 05:56:17,720
and there's all this implicit thing
7954
05:56:17,720 --> 05:56:19,640
and you say look for patterns that look like this
7955
05:56:19,640 --> 05:56:22,920
and then you give back things that match those patterns.
7956
05:56:22,920 --> 05:56:24,760
We do searching for everything.
7957
05:56:24,760 --> 05:56:26,920
We're looking through large blocks of text.
7958
05:56:26,920 --> 05:56:29,400
Say go find me everything that has the word Python in it
7959
05:56:29,400 --> 05:56:30,320
or something like that.
7960
05:56:30,320 --> 05:56:32,120
So that's just such a common thing to do
7961
05:56:32,120 --> 05:56:35,000
and regular expressions are a very structured way
7962
05:56:35,000 --> 05:56:37,220
to go about searching for information.
7963
05:56:37,220 --> 05:56:39,200
They're very powerful but they're also very cryptic
7964
05:56:39,200 --> 05:56:40,360
and you may not like them
7965
05:56:40,360 --> 05:56:42,600
but they're a lot of fun actually once you understand them.
7966
05:56:42,600 --> 05:56:45,360
Learning how to program them takes a while.
7967
05:56:45,360 --> 05:56:47,240
Writing good regular expression programs
7968
05:56:47,240 --> 05:56:50,240
requires some try it, play with it, check it,
7969
05:56:50,240 --> 05:56:51,520
try it, check it, try it, check it.
7970
05:56:51,520 --> 05:56:54,240
But once you get them they're really quite cool.
7971
05:56:54,240 --> 05:56:56,500
It's a very old programming language.
7972
05:56:57,800 --> 05:57:00,000
It comes almost from the 1960s.
7973
05:57:00,000 --> 05:57:02,420
The concept of it comes from the theory of computing
7974
05:57:02,420 --> 05:57:03,260
where they were trying to come up
7975
05:57:03,260 --> 05:57:05,200
with theory of languages and regular expressions
7976
05:57:05,200 --> 05:57:09,400
was one form of languages that computers could understand.
7977
05:57:09,400 --> 05:57:11,960
And so it has some fun old words.
7978
05:57:11,960 --> 05:57:16,320
And one of the advantages of knowing regular expressions
7979
05:57:16,320 --> 05:57:18,720
is that you're kind of a cool person.
7980
05:57:18,720 --> 05:57:22,040
You can take a quick look at this XKCD
7981
05:57:22,040 --> 05:57:25,360
that sort of captures the devil may care,
7982
05:57:25,360 --> 05:57:28,880
awesome power that regular expressions do.
7983
05:57:28,880 --> 05:57:32,600
And while we're at it, you know,
7984
05:57:32,600 --> 05:57:33,720
while we're talking about awesome,
7985
05:57:33,720 --> 05:57:35,120
I do want to take this moment
7986
05:57:35,120 --> 05:57:36,800
and show you my awesome tattoos.
7987
05:57:36,800 --> 05:57:39,080
And so you may not know this
7988
05:57:39,080 --> 05:57:40,480
but I got a couple tattoos here.
7989
05:57:40,480 --> 05:57:42,000
Here's the first tattoo.
7990
05:57:42,000 --> 05:57:44,000
This is where I went to, got my PhD
7991
05:57:44,000 --> 05:57:46,520
and this is my University of Michigan faculty
7992
05:57:46,520 --> 05:57:47,360
member position.
7993
05:57:47,360 --> 05:57:50,560
I got PhD in engineering and I teach in a school
7994
05:57:50,560 --> 05:57:52,160
of information and library science.
7995
05:57:52,160 --> 05:57:53,880
And then I have this other tattoo
7996
05:57:53,880 --> 05:57:57,840
and this tattoo is what I call the ring of compliance.
7997
05:57:57,840 --> 05:57:59,400
I work on learning management systems
7998
05:57:59,400 --> 05:58:01,600
and educational technology and standards.
7999
05:58:01,600 --> 05:58:02,640
And there's this standard called
8000
05:58:02,640 --> 05:58:03,840
learning tools interoperability,
8001
05:58:03,840 --> 05:58:06,320
which if you're using this course
8002
05:58:06,320 --> 05:58:07,280
and doing the auto grader,
8003
05:58:07,280 --> 05:58:09,520
it uses learning tools interoperability to integrate
8004
05:58:09,520 --> 05:58:11,000
into whatever learning management system
8005
05:58:11,000 --> 05:58:12,520
you happen to be using.
8006
05:58:12,520 --> 05:58:14,280
And one of those learning management systems
8007
05:58:14,280 --> 05:58:15,920
is the open source learning management system
8008
05:58:15,920 --> 05:58:17,600
that I helped write called Sakai.
8009
05:58:17,600 --> 05:58:20,120
And these are the rest of the major vendors.
8010
05:58:20,120 --> 05:58:22,800
And the idea of that tattoo was
8011
05:58:22,800 --> 05:58:26,160
that I would put the tattoo of every vendor
8012
05:58:26,160 --> 05:58:28,120
that would comply with learning tools interoperability.
8013
05:58:28,120 --> 05:58:29,080
So you'll notice Coursera,
8014
05:58:29,080 --> 05:58:31,680
I helped Coursera put learning tools interoperability in.
8015
05:58:31,680 --> 05:58:33,920
And so the auto graders integrate into Coursera,
8016
05:58:33,920 --> 05:58:36,360
Blackboard or Canvas or Sakai or Moodle
8017
05:58:36,360 --> 05:58:37,400
or often those other things.
8018
05:58:37,400 --> 05:58:41,200
So it's just like a cool techno thing,
8019
05:58:41,200 --> 05:58:43,520
just like regular expressions.
8020
05:58:43,520 --> 05:58:46,640
So I've got a URL here for regular expression quick guide.
8021
05:58:46,640 --> 05:58:48,440
You might wanna print this out
8022
05:58:48,440 --> 05:58:50,880
so that you can look at it
8023
05:58:50,880 --> 05:58:53,280
even while you're watching this lecture
8024
05:58:53,280 --> 05:58:55,240
because it's a little programming language
8025
05:58:55,240 --> 05:58:56,480
except that it's character based,
8026
05:58:56,480 --> 05:58:58,240
not line based and not keyword based.
8027
05:58:58,240 --> 05:59:00,480
It has certain active characters
8028
05:59:00,480 --> 05:59:02,960
that the character means something
8029
05:59:02,960 --> 05:59:06,840
versus the character represents the character itself.
8030
05:59:06,840 --> 05:59:08,200
And so the regular expressions
8031
05:59:08,200 --> 05:59:09,760
is not part of the base Python
8032
05:59:09,760 --> 05:59:11,080
but it's distributed with Python.
8033
05:59:11,080 --> 05:59:13,680
So you have to put an import re at the top
8034
05:59:13,680 --> 05:59:15,000
to say that's really saying
8035
05:59:15,000 --> 05:59:17,040
pull in the regular expression library.
8036
05:59:17,040 --> 05:59:20,680
And there are a couple of functions inside that: re.search,
8037
05:59:20,680 --> 05:59:22,620
which is kind of like a really smart version
8038
05:59:22,620 --> 05:59:25,120
of the find method inside of strings.
8039
05:59:25,120 --> 05:59:28,800
And re.findall which is kind of like
8040
05:59:28,800 --> 05:59:31,160
taking and stamping your way through a loop
8041
05:59:31,160 --> 05:59:33,960
through a string and finding all of the things
8042
05:59:33,960 --> 05:59:38,040
that match a particular pattern and then extracting those.
8043
05:59:38,040 --> 05:59:41,220
And we'll talk about both of these in this lecture.
8044
05:59:42,220 --> 05:59:44,700
So here's a really simple piece of code
8045
05:59:44,700 --> 05:59:46,160
where I'm just gonna sort of show you
8046
05:59:46,160 --> 05:59:47,760
sort of before and after.
8047
05:59:47,760 --> 05:59:50,620
So here's a thing where we're looking for lines
8048
05:59:50,620 --> 05:59:52,200
that begin with from colon.
8049
05:59:52,200 --> 05:59:55,040
And so we open a file, we loop through the whole file,
8050
05:59:55,040 --> 05:59:57,020
we strip off the lines text
8051
05:59:57,020 --> 05:59:59,720
and then we say if line.find from
8052
05:59:59,720 --> 06:00:02,280
is greater than or equal to zero, then we print it.
8053
06:00:02,280 --> 06:00:03,840
It gives you negative one if it's not found.
8054
06:00:03,840 --> 06:00:05,920
And so reads all the lines and once in a while
8055
06:00:05,920 --> 06:00:07,000
it'll print it out, reads all the lines
8056
06:00:07,000 --> 06:00:08,120
once in a while print it out.
8057
06:00:08,120 --> 06:00:10,920
So that's kind of like a needle in a haystack.
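A minimal sketch of the find-based loop just described. The sample lines stand in for the mail file and are made up for illustration; the lecture reads them from a file instead.

```python
# Sample lines standing in for the file (assumed data, not the course file).
lines = [
    "From: alice@example.com\n",
    "Subject: hello\n",
    "Reply sent From: the office\n",
]

matches = []
for line in lines:
    line = line.rstrip()  # drop the trailing newline
    # find() returns the position of the substring, or -1 if it is not found.
    if line.find("From:") >= 0:
        matches.append(line)
        print(line)
```

Note that find matches "From:" anywhere in the line, so the third sample line is printed too.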
8058
06:00:10,920 --> 06:00:12,360
Use regular expressions to do that.
8059
06:00:12,360 --> 06:00:14,700
We have to import the regular expression library.
8060
06:00:14,700 --> 06:00:16,360
These lines are the same, we're gonna loop through,
8061
06:00:16,360 --> 06:00:17,480
we're gonna strip.
8062
06:00:17,480 --> 06:00:20,120
And now we're gonna say if re.search,
8063
06:00:20,120 --> 06:00:22,120
the way to say this is within the library
8064
06:00:22,120 --> 06:00:25,360
regular expressions, go find the search function
8065
06:00:25,360 --> 06:00:30,320
and search for the string from in the string line.
8066
06:00:30,320 --> 06:00:32,560
So this is the line to search whereas here
8067
06:00:32,560 --> 06:00:34,920
it was more object-oriented where we say line.find
8068
06:00:34,920 --> 06:00:37,800
and we say re.search and we pass in line as parameter.
8069
06:00:37,800 --> 06:00:38,880
These two things are equivalent
8070
06:00:38,880 --> 06:00:40,180
which means most of the time it's gonna run
8071
06:00:40,180 --> 06:00:42,500
and once in a while hit a line and it'll print that out
8072
06:00:42,500 --> 06:00:44,080
and then it'll finish the whole thing.
8073
06:00:44,080 --> 06:00:49,080
So that is doing what we would do with the find operation
8074
06:00:49,960 --> 06:00:50,980
with regular expressions.
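The same loop can be sketched with re.search. The sample lines are again made up for illustration. Strictly speaking, re.search returns a match object or None rather than True or False, but an if statement treats those as true and false.

```python
import re

# Sample lines standing in for the file (assumed data, not the course file).
lines = [
    "From: alice@example.com\n",
    "Subject: hello\n",
    "Reply sent From: the office\n",
]

matches = []
for line in lines:
    line = line.rstrip()
    # The pattern comes first, the string to search comes second.
    if re.search("From:", line):
        matches.append(line)
        print(line)
```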
8075
06:00:50,980 --> 06:00:55,000
Now, searching with regular expressions
8076
06:00:55,000 --> 06:00:57,000
has these special characters and so here we have
8077
06:00:57,000 --> 06:00:59,440
the same basic code except now we're saying
8078
06:00:59,440 --> 06:01:02,440
if line.startswith from, so we're not using find anymore
8079
06:01:03,680 --> 06:01:06,080
and that way we're only gonna get that thing
8080
06:01:06,080 --> 06:01:08,480
in the first position not like blah blah blah blah
8081
06:01:08,480 --> 06:01:11,600
from colon, we don't want that to match,
8082
06:01:11,600 --> 06:01:13,760
we only want it to match here at the beginning of the line.
8083
06:01:13,760 --> 06:01:15,400
And so that's why we use line.startswith.
8084
06:01:15,400 --> 06:01:17,280
So it's gonna do the same thing and find lines
8085
06:01:17,280 --> 06:01:20,560
that have the prefix and print those out and then be done.
8086
06:01:20,560 --> 06:01:22,900
Now in regular expression search we don't in a sense
8087
06:01:22,900 --> 06:01:25,560
change the method, we have a certain number of things
8088
06:01:25,560 --> 06:01:28,040
we can do with strings based on what they built in.
8089
06:01:28,040 --> 06:01:30,260
But in regular expression we actually can turn
8090
06:01:30,260 --> 06:01:33,120
this first parameter into code.
8091
06:01:33,120 --> 06:01:36,760
And so what's happening here is the caret,
8092
06:01:36,760 --> 06:01:38,140
if you go back to the little cheat sheet,
8093
06:01:38,140 --> 06:01:40,800
caret means this is the beginning of line.
8094
06:01:40,800 --> 06:01:42,880
It's a virtual character that matches the beginning line.
8095
06:01:42,880 --> 06:01:44,840
It's like from that starts at the beginning.
8096
06:01:44,840 --> 06:01:47,780
So from at the beginning does match
8097
06:01:47,780 --> 06:01:49,480
and from in the middle does not match
8098
06:01:49,480 --> 06:01:51,240
by putting that little caret there.
8099
06:01:51,240 --> 06:01:53,160
Same thing, line is what we're searching
8100
06:01:53,160 --> 06:01:55,160
and then caret from is the pattern.
8101
06:01:55,160 --> 06:01:58,160
Line from at the beginning is what we're looking for.
8102
06:01:58,160 --> 06:02:00,160
And so again it does the exact same thing.
8103
06:02:00,160 --> 06:02:01,720
Only prints lines that have from colon
8104
06:02:01,720 --> 06:02:03,980
as the first characters in the line.
8105
06:02:03,980 --> 06:02:06,480
So the difference is we look for a method
8106
06:02:06,480 --> 06:02:09,360
and the other one is we program the regular expression.
8107
06:02:09,360 --> 06:02:13,480
So we're gonna run out of methods in the string class
8108
06:02:13,480 --> 06:02:15,680
long before we run out of things that we can do
8109
06:02:15,680 --> 06:02:17,400
with regular expressions.
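A small side-by-side sketch of the two approaches, the string method and the caret in the pattern. The two sample lines are assumptions chosen so that one has "From:" at the start and the other has it in the middle.

```python
import re

# Made-up sample lines: one starts with "From:", one has it in the middle.
lines = [
    "From: alice@example.com",
    "Reply sent From: the office",
]

# String method: the behavior is fixed by which method we pick.
by_method = [ln for ln in lines if ln.startswith("From:")]

# Regular expression: the caret in the pattern anchors the match
# to the beginning of the string.
by_regex = [ln for ln in lines if re.search("^From:", ln)]

print(by_method)
print(by_regex)
```

Both lists contain only the first line. To change the rule with the regular expression, we edit the pattern rather than look for a different method.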
8110
06:02:18,520 --> 06:02:20,720
And so a couple other special characters
8111
06:02:20,720 --> 06:02:23,640
that caret matches the beginning of the line.
8112
06:02:23,640 --> 06:02:25,600
So caret matches the beginning of the line.
8113
06:02:25,600 --> 06:02:28,120
This capital X matches itself.
8114
06:02:28,120 --> 06:02:31,080
Dot is a wildcard that matches any character
8115
06:02:31,080 --> 06:02:33,440
and then some of the characters in regular expressions
8116
06:02:33,440 --> 06:02:35,760
modify the immediately preceding character.
8117
06:02:35,760 --> 06:02:39,400
And so that says look for a line that starts with X
8118
06:02:39,400 --> 06:02:42,840
and then has many characters, that's these two things.
8119
06:02:42,840 --> 06:02:45,840
Zero or more characters followed by a colon.
8120
06:02:45,840 --> 06:02:47,640
And so you can see that it's sort of,
8121
06:02:47,640 --> 06:02:49,660
it's this sort of like expanding stamp.
8122
06:02:49,660 --> 06:02:51,160
It's like oh there's an X at the beginning of the line,
8123
06:02:51,160 --> 06:02:52,520
that line, it looks good.
8124
06:02:52,520 --> 06:02:54,160
I got some characters here and then I got a colon,
8125
06:02:54,160 --> 06:02:55,000
that's good.
8126
06:02:55,000 --> 06:02:59,920
So this is an X, some characters, and a colon, check.
8127
06:02:59,920 --> 06:03:02,400
X, some characters, and a colon, check.
8128
06:03:02,400 --> 06:03:04,680
X, and these things, away we go.
8129
06:03:04,680 --> 06:03:06,820
And so you can, that's what's gonna match.
8130
06:03:06,820 --> 06:03:09,440
And so you can see how some of these characters are special.
8131
06:03:09,440 --> 06:03:11,000
Again, go back to your cheat sheet.
8132
06:03:11,000 --> 06:03:12,000
Some of them are special
8133
06:03:12,000 --> 06:03:13,640
and some of them are actual characters.
8134
06:03:13,640 --> 06:03:16,300
And this colon and X are just, they're not special,
8135
06:03:16,300 --> 06:03:19,160
they're just the characters, okay?
8136
06:03:19,160 --> 06:03:22,920
Now, sometimes you wanna be a little more clear
8137
06:03:22,920 --> 06:03:23,760
on your match.
8138
06:03:23,760 --> 06:03:25,680
So, let's take a look at these lines
8139
06:03:25,680 --> 06:03:28,720
that match that particular thing that we just did.
8140
06:03:28,720 --> 06:03:31,720
So we have these two, X dash Sieve colon,
8141
06:03:31,720 --> 06:03:33,280
X dash DSPAM dash Result,
8142
06:03:33,280 --> 06:03:34,840
like these are from mail messages.
8143
06:03:34,840 --> 06:03:36,800
And then one of the mail messages has a line in it
8144
06:03:36,800 --> 06:03:39,740
that says X dash Plane is behind schedule.
8145
06:03:39,740 --> 06:03:41,880
And this matches.
8146
06:03:41,880 --> 06:03:43,480
Is that what you really wanted?
8147
06:03:43,480 --> 06:03:46,080
And so what we can basically say is,
8148
06:03:46,080 --> 06:03:48,780
because this is an X, this is some number of characters,
8149
06:03:48,780 --> 06:03:50,740
and that's a colon, it matches.
8150
06:03:50,740 --> 06:03:51,580
It has to match.
8151
06:03:51,580 --> 06:03:55,600
That's this rule applied to this line results in a yes.
8152
06:03:55,600 --> 06:03:56,600
It does.
8153
06:03:56,600 --> 06:04:00,040
And so how can you be a little more clear
8154
06:04:00,040 --> 06:04:01,440
as to what you want to match
8155
06:04:01,440 --> 06:04:03,440
and what you don't want to match?
8156
06:04:03,440 --> 06:04:06,360
So we can write code.
8157
06:04:06,360 --> 06:04:09,200
So now what we're going to say is,
8158
06:04:10,760 --> 06:04:12,340
we wanna match the beginning of the line
8159
06:04:12,340 --> 06:04:15,080
and we wanna capital X and we wanna dash.
8160
06:04:15,080 --> 06:04:16,940
So now we're gonna match those first two characters,
8161
06:04:16,940 --> 06:04:19,360
X dash at the beginning of the line.
8162
06:04:19,360 --> 06:04:21,400
Caret X dash says first two characters of the line
8163
06:04:21,400 --> 06:04:22,640
must be X dash.
8164
06:04:22,640 --> 06:04:24,160
Now we have another special character.
8165
06:04:24,160 --> 06:04:25,980
Again, refer to your cheat sheet.
8166
06:04:25,980 --> 06:04:30,980
Backslash capital S means a non-whitespace character, right?
8167
06:04:31,240 --> 06:04:33,200
Any character other than whitespace.
8168
06:04:33,200 --> 06:04:35,640
And then plus means one or more times,
8169
06:04:35,640 --> 06:04:38,040
one or more non-whitespace characters.
8170
06:04:38,040 --> 06:04:39,860
That's what this whole thing says.
8171
06:04:39,860 --> 06:04:41,720
One or more non-whitespace characters
8172
06:04:41,720 --> 06:04:43,800
and followed by a colon, which is just a character.
8173
06:04:43,800 --> 06:04:46,200
So now we have X dash followed by one or more
8174
06:04:46,200 --> 06:04:48,480
non-whitespace characters followed by a colon.
8175
06:04:48,480 --> 06:04:50,920
X dash followed by one or more non-whitespace characters
8176
06:04:50,920 --> 06:04:52,040
followed by a colon.
8177
06:04:52,040 --> 06:04:54,160
Here we have X dash followed by one or more,
8178
06:04:54,160 --> 06:04:55,960
whoops, there's a space there.
8179
06:04:55,960 --> 06:04:57,260
And so this doesn't match.
8180
06:04:57,260 --> 06:04:58,760
Even though there's a colon there,
8181
06:04:58,760 --> 06:05:01,160
it means that between the dash and the colon,
8182
06:05:01,160 --> 06:05:02,600
you can only have some number
8183
06:05:02,600 --> 06:05:04,140
of non-whitespace characters.
8184
06:05:04,140 --> 06:05:07,560
So this is a no, it does not match.
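Here is a sketch comparing the loose pattern with the more precise one. The three sample lines are modeled on the mail headers discussed here; the text after each colon is an assumption added for illustration.

```python
import re

# Sample lines modeled on the ones in the lecture (values are assumed).
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
]

# Loose: X at the start, any characters, then a colon.
loose = [ln for ln in lines if re.search(r"^X.*:", ln)]

# Precise: X dash at the start, one or more non-whitespace
# characters, then a colon.
strict = [ln for ln in lines if re.search(r"^X-\S+:", ln)]

print(len(loose))   # all three lines match
print(strict)       # the "X-Plane is behind schedule" line is gone
```

The third line fails the precise pattern because there is a space between the dash and the colon.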
8185
06:05:07,560 --> 06:05:11,040
And so you just can, if you didn't wanna match this,
8186
06:05:11,040 --> 06:05:15,520
you then sort of create a more precise,
8187
06:05:15,520 --> 06:05:17,320
you know, we could even have a thing that said,
8188
06:05:17,320 --> 06:05:20,080
I want X dash with an uppercase character,
8189
06:05:20,080 --> 06:05:21,840
uppercase letter, if you wanted to.
8190
06:05:21,840 --> 06:05:23,760
And so there's all kind of fine tuning
8191
06:05:23,760 --> 06:05:28,120
if you sort of learn the structure that you've got to do.
8192
06:05:28,120 --> 06:05:29,960
And so that's kind of the matching
8193
06:05:29,960 --> 06:05:32,720
where you're taking a whole line and taking this template
8194
06:05:32,720 --> 06:05:36,040
and deciding if the template anywhere in that line matches.
8195
06:05:36,040 --> 06:05:37,640
And now what we're gonna do is use this
8196
06:05:37,640 --> 06:05:40,840
to actually pull data out of strings
8197
06:05:40,840 --> 06:05:45,840
using the regular expression library.
8198
06:05:46,640 --> 06:05:48,600
So now we're going to move from merely matching
8199
06:05:48,600 --> 06:05:49,840
to matching and extracting.
8200
06:05:49,840 --> 06:05:51,220
So we're going to say, hey,
8201
06:05:51,220 --> 06:05:53,560
I would like to not only have you take this template,
8202
06:05:53,560 --> 06:05:55,240
this little pattern, the string pattern,
8203
06:05:55,240 --> 06:05:58,160
regular expression pattern, run it across the line,
8204
06:05:58,160 --> 06:06:00,680
I want you to give me all the ones that match
8205
06:06:00,680 --> 06:06:01,960
and I want a list of those.
8206
06:06:01,960 --> 06:06:04,020
And that's what we're going to use the find all.
8207
06:06:04,020 --> 06:06:06,080
So search gives a true false,
8208
06:06:06,080 --> 06:06:09,080
find all gives a list of all the strings that match.
8209
06:06:09,080 --> 06:06:10,040
So if there's four of them,
8210
06:06:10,040 --> 06:06:11,400
you'll get four things in the list.
8211
06:06:11,400 --> 06:06:14,000
If there's nothing that matches, you'll get an empty list.
8212
06:06:14,000 --> 06:06:17,140
So let's take a look at what we got going here.
8213
06:06:17,140 --> 06:06:20,320
So instead of calling search, we call find all.
8214
06:06:20,320 --> 06:06:22,680
We still pass in the string that we're looking through.
8215
06:06:22,680 --> 06:06:25,840
And then we have our little template pattern.
8216
06:06:25,840 --> 06:06:29,480
And this is a new bit of regular expression.
8217
06:06:29,480 --> 06:06:32,600
Any little bracket operation, square bracket
8218
06:06:32,600 --> 06:06:35,400
is one character, that's just a character,
8219
06:06:35,400 --> 06:06:39,840
but then there in between here is a set of allowed characters.
8220
06:06:39,840 --> 06:06:43,120
So zero dash nine means a single digit.
8221
06:06:43,120 --> 06:06:45,440
Zero, one, two, three, four, five, six, seven, eight,
8222
06:06:45,440 --> 06:06:48,560
or nine, but that's really one character.
8223
06:06:48,560 --> 06:06:51,040
And then we have, so that's one character.
8224
06:06:52,000 --> 06:06:53,560
And then when the plus applies to that,
8225
06:06:53,560 --> 06:06:56,800
which means if we look at this whole thing,
8226
06:06:56,800 --> 06:06:59,620
this whole thing says one or more digits.
8227
06:06:59,620 --> 06:07:02,280
That's the code we write in a regular expression
8228
06:07:02,280 --> 06:07:03,720
that says one or more digits.
8229
06:07:03,720 --> 06:07:04,660
And we're just gonna use that
8230
06:07:04,660 --> 06:07:06,540
in our regular expression by itself.
8231
06:07:06,540 --> 06:07:09,400
So we're going to look for any string
8232
06:07:09,400 --> 06:07:11,960
that's one or more digits and pull it out
8233
06:07:11,960 --> 06:07:12,800
and give it back to me.
8234
06:07:12,800 --> 06:07:14,920
So it's gonna look, so that's my little template,
8235
06:07:14,920 --> 06:07:17,120
stamp, stamp, stamp, stamp, oop, got it.
8236
06:07:17,120 --> 06:07:19,160
Stamp, stamp, stamp, stamp, stamp, stamp, stamp, stamp,
8237
06:07:19,160 --> 06:07:22,260
oop, got it, stamp, stamp, stamp, stamp, got it.
8238
06:07:22,260 --> 06:07:25,000
So what we get back after we ask find all,
8239
06:07:25,000 --> 06:07:27,960
to find all of the one or more digit strings
8240
06:07:27,960 --> 06:07:30,840
is two, 19, and 42.
8241
06:07:30,840 --> 06:07:32,480
So it actually parsed it, it split it,
8242
06:07:32,480 --> 06:07:33,760
it found all these things and said,
8243
06:07:33,760 --> 06:07:38,080
I found them all for you and here they are, two, 19, and 42.
8244
06:07:38,080 --> 06:07:40,800
So it's a list of three strings
8245
06:07:40,800 --> 06:07:41,720
because that's how many it found.
8246
06:07:41,720 --> 06:07:43,000
Now it might have found none
8247
06:07:43,000 --> 06:07:45,680
and we would have got an empty list at that point.
8248
06:07:45,680 --> 06:07:48,640
But it found some, okay?
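A short sketch of the findall call described above. The sample sentence is an assumption, picked so that it contains the numbers 2, 19, and 42.

```python
import re

# Assumed sample string containing the three numbers from the lecture.
x = "My 2 favorite numbers are 19 and 42"

# [0-9] is one digit; the plus makes it one or more digits.
numbers = re.findall("[0-9]+", x)
print(numbers)  # ['2', '19', '42']
```

The items in the list are strings, not integers, so they would need int() before doing arithmetic with them.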
8249
06:07:48,640 --> 06:07:51,600
So just as an example, we did this thing,
8250
06:07:51,600 --> 06:07:54,760
we get two, 19, and 42, but if I said this,
8251
06:07:54,760 --> 06:07:59,760
that basically is an uppercase vowel, A, E, I, O, or U.
8252
06:08:00,240 --> 06:08:04,440
So that's one letter and that's one or more.
8253
06:08:04,440 --> 06:08:08,040
So it's saying something like A, A would match,
8254
06:08:08,040 --> 06:08:12,060
E, I would match, O, O would match.
8255
06:08:12,060 --> 06:08:14,000
But if you look now, it's saying, okay,
8256
06:08:14,000 --> 06:08:18,400
I'm looking for one or more, minimum one or more uppercase,
8257
06:08:18,400 --> 06:08:20,360
A, E, I, O, U is a set of characters,
8258
06:08:20,360 --> 06:08:22,680
one or more uppercase letters.
8259
06:08:22,680 --> 06:08:25,360
And so it says, look, do you find, oh, there's an uppercase,
8260
06:08:25,360 --> 06:08:27,760
but it's an M, no, no, no, no uppercase, no uppercase,
8261
06:08:27,760 --> 06:08:30,880
no uppercase, no uppercase, found nothing,
8262
06:08:30,880 --> 06:08:35,140
did not find anything and so it gives us back an empty list.
8263
06:08:35,140 --> 06:08:37,940
And so it's like, find all the things that match this
8264
06:08:37,940 --> 06:08:41,400
and the answer is, none match, here's your list of nothing.
8265
06:08:41,400 --> 06:08:43,400
Okay, and so you have to check,
8266
06:08:43,400 --> 06:08:45,240
that's how you have to check even if you got something
8267
06:08:45,240 --> 06:08:47,080
because it's not gonna return you false,
8268
06:08:47,080 --> 06:08:49,540
it returns you a list with no items in it.
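A sketch of the no-match case. The sample sentence is an assumption; its only uppercase letter is an M, so the vowel pattern finds nothing.

```python
import re

# Assumed sample string; the only uppercase letter in it is M.
x = "My 2 favorite numbers are 19 and 42"

# One or more uppercase vowels in a row.
vowels = re.findall("[AEIOU]+", x)
print(vowels)  # []

# findall never returns False or None; check the length of the list.
if len(vowels) == 0:
    print("nothing matched")
```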
8269
06:08:50,480 --> 06:08:53,800
Now, the way it works, like I said,
8270
06:08:53,800 --> 06:08:55,480
it sort of is taking this template
8271
06:08:55,480 --> 06:08:58,040
and stamping it across the line,
8272
06:08:58,040 --> 06:08:59,400
stamping across the characters.
8273
06:08:59,400 --> 06:09:03,280
Now, there's a behavior that might not be intuitive,
8274
06:09:03,280 --> 06:09:05,360
intuitive to you at the very beginning,
8275
06:09:05,360 --> 06:09:08,160
and that's the notion of what we call greedy matching.
8276
06:09:08,160 --> 06:09:10,320
And that is, when it can match
8277
06:09:10,320 --> 06:09:13,720
more than one possible string, overlapping string,
8278
06:09:13,720 --> 06:09:16,640
it chooses the largest of the overlapping strings.
8279
06:09:16,640 --> 06:09:19,360
And so the easiest way to show this is with an example,
8280
06:09:19,360 --> 06:09:21,880
and we're saying, I want something that starts with an F
8281
06:09:21,880 --> 06:09:25,440
with one or more characters and ends with a colon.
8282
06:09:25,440 --> 06:09:28,880
So that's my little stamp, that's my template.
8283
06:09:28,880 --> 06:09:31,680
So starts with an F, good, that's good.
8284
06:09:31,680 --> 06:09:34,440
One or more characters, da da da da da, have a colon.
8285
06:09:34,440 --> 06:09:38,000
So that could be from colon, that would match.
8286
06:09:38,000 --> 06:09:39,400
But look, I've got another colon here,
8287
06:09:39,400 --> 06:09:40,800
and this is just continuing on
8288
06:09:40,800 --> 06:09:42,960
with one or more characters and this.
8289
06:09:42,960 --> 06:09:45,300
So the question is, do we get this
8290
06:09:45,300 --> 06:09:47,380
or do we get this part, right?
8291
06:09:47,380 --> 06:09:49,740
And the answer is, with greedy matching,
8292
06:09:49,740 --> 06:09:52,680
is we get the larger of the two, okay?
8293
06:09:52,680 --> 06:09:56,160
And so what you get back is somewhat counterintuitive.
8294
06:09:56,160 --> 06:09:58,640
You get the whole thing as the match, from colon,
8295
06:09:58,640 --> 06:10:01,640
using the colon. We could have got just from colon,
8296
06:10:01,640 --> 06:10:03,800
but the reason it picks this is this one's longer.
8297
06:10:03,800 --> 06:10:06,720
So any time it has a choice, it picks the longer one,
8298
06:10:06,720 --> 06:10:08,360
and that's what greedy is, meaning,
8299
06:10:08,360 --> 06:10:10,760
it's probably better described as larger
8300
06:10:10,760 --> 06:10:15,760
or tending toward the longest string or something like that.
8301
06:10:16,200 --> 06:10:18,260
So you can, of course, suppress this behavior,
8302
06:10:18,260 --> 06:10:20,820
like everything, in programming regular expressions,
8303
06:10:20,820 --> 06:10:23,520
you simply add another character.
8304
06:10:23,520 --> 06:10:26,560
And so now, it's going to say,
8305
06:10:26,560 --> 06:10:29,120
I would like to start with letter F,
8306
06:10:29,120 --> 06:10:31,080
any character, one or more times,
8307
06:10:31,080 --> 06:10:33,360
and then this question mark, this is still one,
8308
06:10:33,360 --> 06:10:37,320
you know, one little thing, non-greedy, okay?
8309
06:10:38,440 --> 06:10:41,160
And so that just says, do it not greedy,
8310
06:10:41,160 --> 06:10:43,480
which just means that it prefers
8311
06:10:43,480 --> 06:10:44,760
the shorter of the strings.
8312
06:10:44,760 --> 06:10:48,480
And so now, it could still match this string or this string,
8313
06:10:48,480 --> 06:10:50,680
but because it's been told to not be greedy,
8314
06:10:50,680 --> 06:10:52,120
it chooses this string instead,
8315
06:10:52,120 --> 06:10:53,360
and that's the string that we get.
8316
06:10:53,360 --> 06:10:54,880
And so that's the not greedy,
8317
06:10:54,880 --> 06:10:57,400
and you just add the question mark after the asterisk.
8318
06:10:57,400 --> 06:10:59,760
So it's usually an asterisk question mark
8319
06:10:59,760 --> 06:11:02,600
or a plus question mark, though that's a two thing,
8320
06:11:02,600 --> 06:11:04,820
that's zero or more characters, non-greedy,
8321
06:11:04,820 --> 06:11:07,480
and that's one or more characters, non-greedy.
8322
06:11:07,480 --> 06:11:08,800
Actually, most of the time,
8323
06:11:10,160 --> 06:11:12,360
it seems to me that the non-greedy
8324
06:11:12,360 --> 06:11:13,600
would be the more reasonable default,
8325
06:11:13,600 --> 06:11:14,800
but that's not how it is.
8326
06:11:14,800 --> 06:11:17,760
Greedy is the default, and non-greedy is optional.
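A sketch of greedy versus non-greedy matching on one string. The sample string is an assumption with two colons in it, so both a short and a long match are possible.

```python
import re

# Assumed sample string with two colons.
x = "From: Using the : character"

# Greedy: .+ pushes as far as it can, so the match ends at the last colon.
greedy = re.findall("^F.+:", x)
print(greedy)      # ['From: Using the :']

# Non-greedy: .+? stops as soon as it can, so the match ends at the first colon.
non_greedy = re.findall("^F.+?:", x)
print(non_greedy)  # ['From:']
```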
8327
06:11:17,760 --> 06:11:21,920
Now, we can play some more with this stuff, okay?
8328
06:11:21,920 --> 06:11:25,640
And so let's take a look at this little example
8329
06:11:25,640 --> 06:11:30,280
where we have non-blank characters, backslash capital S,
8330
06:11:30,280 --> 06:11:32,840
one or more of those non-blank characters,
8331
06:11:32,840 --> 06:11:34,160
followed by an at sign,
8332
06:11:34,160 --> 06:11:36,480
and then again, one or more non-blank characters.
8333
06:11:36,480 --> 06:11:38,560
So this is looking for strings that have an at sign
8334
06:11:38,560 --> 06:11:40,780
with non-blank characters on both sides.
8335
06:11:40,780 --> 06:11:43,960
This is an example of where it sort of comes to this at,
8336
06:11:43,960 --> 06:11:46,440
and it goes this way, and it does it in a greedy manner.
8337
06:11:46,440 --> 06:11:48,320
If you told it to not be greedy,
8338
06:11:48,320 --> 06:11:50,920
it would give you this, these three characters,
8339
06:11:50,920 --> 06:11:52,280
but we're telling it to go greedy,
8340
06:11:52,280 --> 06:11:53,760
so it goes all the way to here,
8341
06:11:53,760 --> 06:11:57,280
and stops at this blank, and then stops at this blank.
8342
06:11:57,280 --> 06:11:58,680
And so that's a nice little thing.
8343
06:11:58,680 --> 06:12:03,360
Find the at signs, go to the first blank, blank,
8344
06:12:03,360 --> 06:12:04,440
and pull that stuff out.
8345
06:12:04,440 --> 06:12:06,520
And so that, with one little match,
8346
06:12:06,520 --> 06:12:07,620
you've pulled this thing out.
8347
06:12:07,620 --> 06:12:10,780
Now, of course, we've done that before with other techniques.
8348
06:12:11,880 --> 06:12:15,560
So that's just another way to pull stuff out.
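A sketch of pulling out things that look like email addresses. The sentence and the addresses are made up for illustration.

```python
import re

# Made-up sentence with two addresses and a stray at sign.
x = "A message from ann@example.org to bob@example.com about meeting @2PM"

# One or more non-blank characters, an at sign, one or more non-blank characters.
addresses = re.findall(r"\S+@\S+", x)
print(addresses)  # ['ann@example.org', 'bob@example.com']
```

The "@2PM" at the end is not returned because there is a blank, not a non-blank character, in front of its at sign.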
8349
06:12:16,680 --> 06:12:21,160
Now, if we, we get this whole thing,
8350
06:12:21,160 --> 06:12:23,680
but what if that's not exactly what we wanted?
8351
06:12:23,680 --> 06:12:28,680
We can tell, we can give it a matching string
8352
06:12:30,200 --> 06:12:32,000
that's different than the extracting string
8353
06:12:32,000 --> 06:12:35,000
by adding parentheses, and so here's another example
8354
06:12:35,000 --> 06:12:38,640
where we basically say, this is our string,
8355
06:12:38,640 --> 06:12:43,640
we wanna match from at the beginning, followed by a space,
8356
06:12:43,960 --> 06:12:47,160
followed by, ignore the parentheses for the minute,
8357
06:12:47,160 --> 06:12:48,460
one or more non-blank characters,
8358
06:12:48,460 --> 06:12:49,640
followed by an at sign,
8359
06:12:49,640 --> 06:12:51,560
followed by one or more non-blank characters.
8360
06:12:51,560 --> 06:12:54,040
So this is also going to, if there's no from,
8361
06:12:54,040 --> 06:12:56,040
it's not going to be looking for that, right?
8362
06:12:56,040 --> 06:12:58,800
So it demands the from is here, so it matches that,
8363
06:12:58,800 --> 06:13:00,840
and the space is demanded as well.
8364
06:13:00,840 --> 06:13:03,640
And then it says, oh, non-blank characters, great.
8365
06:13:03,640 --> 06:13:05,040
I got an at sign, great.
8366
06:13:05,040 --> 06:13:06,960
Non-blank characters, oops, stop there.
8367
06:13:06,960 --> 06:13:09,560
And so this is what's going to match.
8368
06:13:09,560 --> 06:13:11,900
Now, the key is that we don't actually want that back
8369
06:13:11,900 --> 06:13:13,120
in our extraction.
8370
06:13:13,120 --> 06:13:15,220
What we really want back in our extraction
8371
06:13:15,220 --> 06:13:17,000
is this part right here.
8372
06:13:17,000 --> 06:13:19,480
So what we do is we put parentheses in.
8373
06:13:19,480 --> 06:13:22,080
Parentheses don't, are a code,
8374
06:13:22,080 --> 06:13:24,720
they're code in the regular expression world.
8375
06:13:24,720 --> 06:13:26,620
Parentheses say, start your extraction
8376
06:13:26,620 --> 06:13:27,880
and end your extraction.
8377
06:13:27,880 --> 06:13:31,120
And so when you do this with a parenthesis,
8378
06:13:31,120 --> 06:13:33,840
when you do it without a parenthesis,
8379
06:13:33,840 --> 06:13:36,360
you get the whole from, right?
8380
06:13:36,360 --> 06:13:37,620
Without a parenthesis.
8381
06:13:39,440 --> 06:13:42,440
Oh wait, no, okay, that doesn't have the from in it, so.
8382
06:13:43,360 --> 06:13:46,360
But if you do that with the parenthesis,
8383
06:13:46,360 --> 06:13:50,640
you match the from but you only get this bit
8384
06:13:50,640 --> 06:13:51,840
to come out as well.
8385
06:13:51,840 --> 06:13:55,960
So you can add this to make the matching part more precise
8386
06:13:55,960 --> 06:13:58,160
but without changing what you get returned
8387
06:13:58,160 --> 06:14:00,380
and you specify what you want to get returned
8388
06:14:00,380 --> 06:14:01,900
with the parentheses.
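A sketch of how parentheses change what findall gives back without changing what has to match. The sample line and its address are assumptions.

```python
import re

# Made-up From line in the style of a mail log.
x = "From jane.doe@example.org Sat Jan  5 09:14:16 2008"

# No parentheses: the whole matched text comes back, "From " included.
whole = re.findall(r"^From \S+@\S+", x)
print(whole)         # ['From jane.doe@example.org']

# Parentheses: "From " must still match, but only the part
# inside the parentheses is extracted.
only_address = re.findall(r"^From (\S+@\S+)", x)
print(only_address)  # ['jane.doe@example.org']
```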
8389
06:14:03,020 --> 06:14:06,520
So next I want to show you just a couple of different ways
8390
06:14:06,520 --> 06:14:08,240
to use these newfound skills.
8391
06:14:12,440 --> 06:14:14,040
So now what we want to do is use some of these
8392
06:14:14,040 --> 06:14:16,940
newfound skills in some more practical applications
8393
06:14:16,940 --> 06:14:18,780
of regular expressions.
8394
06:14:18,780 --> 06:14:22,580
So let's go back to the way we first tore apart strings
8395
06:14:22,580 --> 06:14:26,080
and look at the situation where if you recall,
8396
06:14:26,080 --> 06:14:28,040
we just wanted the host name, right?
8397
06:14:28,040 --> 06:14:29,880
This is an email address and we're interested
8398
06:14:29,880 --> 06:14:30,800
in the host name.
8399
06:14:30,800 --> 06:14:34,320
So we have this string and we go find the at, right?
8400
06:14:34,320 --> 06:14:38,220
The find looks up and tells us the at is at position 21.
8401
06:14:38,220 --> 06:14:39,600
And then what we do is we say, okay,
8402
06:14:39,600 --> 06:14:42,040
let's look beyond there to the space
8403
06:14:42,040 --> 06:14:45,680
and that tells us the space is in position 31.
8404
06:14:45,680 --> 06:14:48,840
And then we're saying we can extract starting
8405
06:14:48,840 --> 06:14:51,680
beyond the at sign up to but not including the space
8406
06:14:51,680 --> 06:14:55,540
by saying atpos plus one colon space position.
8407
06:14:55,540 --> 06:14:58,280
And when we get that, now we have to have a thing
8408
06:14:58,280 --> 06:15:00,860
that decides to only look at this on from lines
8409
06:15:00,860 --> 06:15:02,540
but then it can print out the host
8410
06:15:02,540 --> 06:15:04,980
that is extracting of this information.
8411
06:15:04,980 --> 06:15:07,600
So that was one way that we did that, right?
8412
06:15:07,600 --> 06:15:08,600
One way we did it.
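A sketch of the find and slice approach. The sample line is an assumption, so the positions it prints differ from the 21 and 31 in the lecture's example; the variable names atpos and sppos are also chosen here for illustration.

```python
# Made-up From line; positions depend on this particular string.
data = "From jane.doe@example.org Sat Jan  5 09:14:16 2008"

atpos = data.find("@")         # position of the at sign
sppos = data.find(" ", atpos)  # first space after the at sign

# Start one past the at sign, go up to but not including the space.
host = data[atpos + 1 : sppos]
print(atpos, sppos)
print(host)  # example.org
```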
8413
06:15:08,600 --> 06:15:12,440
The next way we did this was the double split pattern,
8414
06:15:13,320 --> 06:15:14,160
right?
8415
06:15:14,160 --> 06:15:15,760
So we said, okay, let's take this line,
8416
06:15:15,760 --> 06:15:19,400
let's break it into words based on spaces.
8417
06:15:19,400 --> 06:15:20,320
That's what words is.
8418
06:15:20,320 --> 06:15:25,040
So that's zero, one, two, three, four, five, six.
8419
06:15:25,040 --> 06:15:26,680
And then we know that the email address
8420
06:15:26,680 --> 06:15:30,040
on lines that start with from space is the second one.
8421
06:15:30,040 --> 06:15:31,480
So we pull out email address,
8422
06:15:31,480 --> 06:15:34,880
which pulls this bit out into email.
8423
06:15:34,880 --> 06:15:37,520
And then we're gonna split that again
8424
06:15:37,520 --> 06:15:38,800
based on the at sign.
8425
06:15:40,160 --> 06:15:42,440
So we're gonna split this part again based on the at sign.
8426
06:15:42,440 --> 06:15:43,520
So it splits right there
8427
06:15:43,520 --> 06:15:46,320
and then this becomes the zero and one in pieces.
8428
06:15:46,320 --> 06:15:49,200
And then pieces sub one is that host.
8429
06:15:49,200 --> 06:15:51,820
And if we print that out, we get the host.
8430
06:15:51,820 --> 06:15:53,320
So that's the double split pattern.
8431
06:15:53,320 --> 06:15:55,280
Nice thing about that is you don't have to keep track.
8432
06:15:55,280 --> 06:15:57,640
The little plus one's kind of annoying
8433
06:15:57,640 --> 06:15:59,840
to use the space position.
8434
06:15:59,840 --> 06:16:02,360
That previous one, that's just hard to remember.
8435
06:16:02,360 --> 06:16:06,240
It's just, I've written this code way too many times
8436
06:16:06,240 --> 06:16:08,440
in my career and I've made mistakes
8437
06:16:08,440 --> 06:16:10,040
and I have to debug it every single time.
8438
06:16:10,040 --> 06:16:11,240
And I print all these numbers out.
8439
06:16:11,240 --> 06:16:13,200
I'm like, did I get it right?
8440
06:16:13,200 --> 06:16:14,240
Oh, I did it in Python.
8441
06:16:14,240 --> 06:16:15,080
I did it in Java.
8442
06:16:15,080 --> 06:16:15,960
I did it in C.
8443
06:16:15,960 --> 06:16:17,360
Wait a second, did it differently?
8444
06:16:17,360 --> 06:16:20,760
And so it's, so this is a lot cleaner.
8445
06:16:20,760 --> 06:16:22,400
I mean, I can write this every time
8446
06:16:22,400 --> 06:16:23,680
and I know it's gonna work every time.
8447
06:16:23,680 --> 06:16:25,200
I barely even need to test this code
8448
06:16:25,200 --> 06:16:27,200
because it's so obvious.
8449
06:16:27,200 --> 06:16:29,400
So double split is another way of extracting stuff.
8450
06:16:29,400 --> 06:16:31,960
But if we look at this thing with the regular expression,
8451
06:16:31,960 --> 06:16:33,680
we can say, oh, okay,
8452
06:16:33,680 --> 06:16:37,480
let's use a regular expression to do this.
8453
06:16:37,480 --> 06:16:39,560
So we'll start looking through the string.
8454
06:16:39,560 --> 06:16:40,880
We'll start by saying, hey,
8455
06:16:40,880 --> 06:16:43,120
let's look until we find an at sign.
8456
06:16:44,320 --> 06:16:47,960
Then let's start extracting with the parentheses.
8457
06:16:47,960 --> 06:16:51,520
And then once we have found the at sign,
8458
06:16:51,520 --> 06:16:54,000
let's look for non-blank characters.
8459
06:16:54,000 --> 06:16:56,680
This is a set of characters.
8460
06:16:56,680 --> 06:17:01,680
This caret as the first one means not a blank.
8461
06:17:01,680 --> 06:17:03,840
So that's another way to do non-blank,
8462
06:17:03,840 --> 06:17:07,200
not a set of characters which are everything but blank.
8463
06:17:07,200 --> 06:17:09,280
That's what this little bit is saying.
8464
06:17:09,280 --> 06:17:12,080
Star means zero or more times,
8465
06:17:12,080 --> 06:17:13,720
which means it's gonna run, run, run, run, run
8466
06:17:13,720 --> 06:17:15,980
until it finds a blank which is gonna stop it.
8467
06:17:15,980 --> 06:17:18,200
The greediness is what keeps pushing it, right?
8468
06:17:18,200 --> 06:17:19,720
It's, this is a greedy match.
8469
06:17:19,720 --> 06:17:20,840
That asterisk is greedy
8470
06:17:20,840 --> 06:17:22,520
because there's no question mark after it.
8471
06:17:22,520 --> 06:17:26,920
And so that does go and starts at the at sign
8472
06:17:28,040 --> 06:17:31,200
with the parentheses, goes to the space,
8473
06:17:31,200 --> 06:17:33,640
and that's the end parentheses and that's what prints out.
8474
06:17:33,640 --> 06:17:37,160
Now, Y is gonna be a list that's a one item list
8475
06:17:37,160 --> 06:17:39,080
that has the string in it that we're looking for,
8476
06:17:39,080 --> 06:17:40,840
but you can just go sub-zero
8477
06:17:40,840 --> 06:17:43,180
to get that guy right out of there, okay?
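The regular expression version can be sketched like this. The sample line is again an assumption, and y is the variable name used in the narration.

```python
import re

# Sample line (assumed for illustration).
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Find an at sign, then extract (parentheses) zero or more non-blank characters.
# [^ ] is a set meaning "anything but a blank"; the star is greedy, so it keeps
# going until it hits the blank.
y = re.findall('@([^ ]*)', line)

print(y)      # a one-item list
print(y[0])   # the string itself
```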
8478
06:17:43,180 --> 06:17:45,680
So that's sort of the regular expression version of it.
8479
06:17:45,680 --> 06:17:49,720
But we can make this a more fine-tuned thing.
8480
06:17:49,720 --> 06:17:53,320
So we can say, look, we also wanna pick the line
8481
06:17:53,320 --> 06:17:55,560
and we wanna know if there are,
8482
06:17:55,560 --> 06:17:58,440
if we don't get that line, we wanna skip it.
8483
06:17:58,440 --> 06:18:00,320
If we do get the line, we wanna extract the data.
8484
06:18:00,320 --> 06:18:03,000
And we can do this all in a single regular expression.
8485
06:18:03,000 --> 06:18:06,380
So again, we say start from the beginning of the line.
8486
06:18:06,380 --> 06:18:08,800
And if it's gotta be a from, followed by a space,
8487
06:18:08,800 --> 06:18:11,800
and then followed by any number of characters,
8488
06:18:11,800 --> 06:18:14,200
dot star, followed by an at sign.
8489
06:18:14,200 --> 06:18:17,440
So this has to match, we see a space,
8490
06:18:17,440 --> 06:18:19,040
then we're gonna have any number of characters,
8491
06:18:19,040 --> 06:18:20,720
and then we're gonna see an at sign.
8492
06:18:20,720 --> 06:18:23,200
And then we're going to start extracting,
8493
06:18:23,200 --> 06:18:24,600
and then we're gonna go non-blank, non-blank,
8494
06:18:24,600 --> 06:18:27,120
non-blank, non-blank, non-blank, up to the blank,
8495
06:18:27,120 --> 06:18:29,720
and extracting, and out that comes.
8496
06:18:29,720 --> 06:18:31,960
And this has the advantage over the previous one
8497
06:18:31,960 --> 06:18:34,680
in that that makes it much more precise.
8498
06:18:34,680 --> 06:18:38,120
If we look at the previous one, while it works on good lines,
8499
06:18:38,120 --> 06:18:39,760
it might actually trigger on lines
8500
06:18:39,760 --> 06:18:41,400
that we actually don't wanna see.
8501
06:18:41,400 --> 06:18:43,720
So this allows us to refine it
8502
06:18:43,720 --> 06:18:46,280
so it only actually does this to lines that we care about.
8503
06:18:46,280 --> 06:18:49,560
So it's sort of both an if statement
8504
06:18:49,560 --> 06:18:53,080
and a splitting, extracting, going on all at the same time
8505
06:18:53,080 --> 06:18:55,500
by having a bigger string that we're matching
8506
06:18:55,500 --> 06:18:56,560
than we're extracting.
8507
06:18:56,560 --> 06:19:00,400
So it's a way to kind of clean up your data.
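To see the difference between the loose pattern and the refined one, here is a small sketch. Both sample lines are assumptions made up for illustration; the second one has an at sign in it but is not a "From " line.

```python
import re

# Sample lines (assumed for illustration).
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    'Reply-To: someone@example.org please',
]

loose = '@([^ ]*)'            # triggers on any line with an at sign
strict = '^From .*@([^ ]*)'   # must start with "From ", then any characters, then @

loose_hits = [re.findall(loose, line) for line in lines]
strict_hits = [re.findall(strict, line) for line in lines]

print(loose_hits)
print(strict_hits)
```

The strict pattern matches more of the line than it extracts, so it acts as the if statement and the extraction at the same time: on the second line it simply gives back an empty list.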
8508
06:19:00,400 --> 06:19:02,600
So here is a simple program
8509
06:19:02,600 --> 06:19:05,080
that we're going to just put all this together
8510
06:19:05,080 --> 06:19:07,040
and actually accomplish something.
8511
06:19:07,040 --> 06:19:09,080
And so we're gonna read through
8512
06:19:09,080 --> 06:19:12,720
and look for lines in a file that have this form.
8513
06:19:12,720 --> 06:19:14,780
And we're gonna extract this number,
8514
06:19:14,780 --> 06:19:19,780
and then we're going to compute the maximum of this.
8515
06:19:20,520 --> 06:19:21,640
So we're gonna extract this number
8516
06:19:21,640 --> 06:19:23,720
and then convert it to a float and compute the maximum.
8517
06:19:23,720 --> 06:19:26,520
So we're gonna open a file,
8518
06:19:26,520 --> 06:19:29,360
we're gonna write a for loop, we're gonna strip.
8519
06:19:29,360 --> 06:19:30,840
So we're gonna do this for every line in the file,
8520
06:19:30,840 --> 06:19:33,480
but the first thing we wanna do is not get line,
8521
06:19:33,480 --> 06:19:36,640
we wanna discard all the lines except ones that have this.
8522
06:19:36,640 --> 06:19:39,360
So our regular expression is look for lines
8523
06:19:39,360 --> 06:19:42,460
that start with X-DSPAM-Confidence, colon.
8524
06:19:42,460 --> 06:19:43,940
So that's a pretty strong match.
8525
06:19:43,940 --> 06:19:46,600
If that's not there, we're not gonna get anything.
8526
06:19:46,600 --> 06:19:49,320
And then there's a space, there's a space,
8527
06:19:49,320 --> 06:19:51,540
and then start extracting,
8528
06:19:51,540 --> 06:19:56,440
and then go as long, one or more digits and dots,
8529
06:19:56,440 --> 06:19:58,840
that's a single character, and that's one or more,
8530
06:19:58,840 --> 06:19:59,960
and then stop extracting.
8531
06:19:59,960 --> 06:20:01,840
So that says start extracting,
8532
06:20:01,840 --> 06:20:04,800
da da da da, greedy, greedy, greedy, greedy, stop extracting.
8533
06:20:04,800 --> 06:20:06,800
And so that's what we're going to get.
8534
06:20:06,800 --> 06:20:09,520
Now, if the line doesn't have this,
8535
06:20:09,520 --> 06:20:12,540
if it's missing in some way,
8536
06:20:12,540 --> 06:20:14,000
whether it's this prefix or this number.
8537
06:20:14,000 --> 06:20:16,200
If the number's missing, it's gonna fail too.
8538
06:20:16,200 --> 06:20:19,640
We're going to get back a list, an empty list.
8539
06:20:19,640 --> 06:20:21,640
So the first thing you have to do is check to see
8540
06:20:21,640 --> 06:20:23,440
if you actually got a match.
8541
06:20:23,440 --> 06:20:25,720
So you say if the number of items in the list,
8542
06:20:25,720 --> 06:20:28,640
len of stuff, is not equal to one, continue.
8543
06:20:28,640 --> 06:20:33,160
And so this is the skip all the lines that don't match.
8544
06:20:33,160 --> 06:20:35,040
Skip, skip, skip, skip, skip, skip.
8545
06:20:35,040 --> 06:20:37,240
So there could be thousands of lines that don't match.
8546
06:20:37,240 --> 06:20:39,640
But then, when this match hits,
8547
06:20:39,640 --> 06:20:42,520
it's gonna come down and fall through, right?
8548
06:20:42,520 --> 06:20:46,000
So that, most of the lines will skip up,
8549
06:20:46,000 --> 06:20:47,880
but then when we actually get one,
8550
06:20:47,880 --> 06:20:52,160
and we know instantly that we've got one and stuff sub zero
8551
06:20:52,160 --> 06:20:55,460
because that's what we extracted, is this number.
8552
06:20:55,460 --> 06:20:57,220
And we can take the floating point of it.
8553
06:20:57,220 --> 06:20:58,520
We append it to our list.
8554
06:20:58,520 --> 06:21:00,240
We made a list to store them.
8555
06:21:00,240 --> 06:21:01,640
That runs.
8556
06:21:01,640 --> 06:21:02,580
The list grows.
8557
06:21:03,800 --> 06:21:06,080
And then we just say, what was the largest one?
8558
06:21:06,080 --> 06:21:09,520
And so you can run this and see that.
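A minimal sketch of that program follows. To keep it self-contained it loops over a short list of made-up sample lines instead of opening a file, so the data and the names sample_lines and numlist are assumptions; with a real mail file you would loop over the file handle the same way.

```python
import re

# Made-up sample lines standing in for the lines of a mail file (assumption).
sample_lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.9907',
    'X-DSPAM-Confidence:',          # prefix is there but the number is missing
]

numlist = []
for line in sample_lines:
    line = line.rstrip()
    # Prefix, a space, then extract one or more digits or dots.
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue                     # skip every line that does not match
    numlist.append(float(stuff[0]))  # stuff[0] is the extracted number

print('Maximum:', max(numlist))
```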
8559
06:21:09,520 --> 06:21:11,000
We have an escape character.
8560
06:21:11,000 --> 06:21:12,900
And the whole idea is that sometimes
8561
06:21:12,900 --> 06:21:14,080
all these little special characters
8562
06:21:14,080 --> 06:21:15,720
that make a lot of sense to us,
8563
06:21:15,720 --> 06:21:16,880
we actually want to search for it.
8564
06:21:16,880 --> 06:21:19,220
So what if we want to search for a dollar sign?
8565
06:21:19,220 --> 06:21:23,120
Well, we just prefix it with the backslash.
8566
06:21:23,120 --> 06:21:25,160
And that just means this is a real dollar sign.
8567
06:21:25,160 --> 06:21:27,440
So backslash dollar is a real dollar sign.
8568
06:21:27,440 --> 06:21:30,960
So this says, I would like a dollar sign
8569
06:21:30,960 --> 06:21:34,440
followed by one or more digits or dots.
8570
06:21:34,440 --> 06:21:36,880
And so that's going to match a dollar sign
8571
06:21:36,880 --> 06:21:38,160
followed by one or more digits.
8572
06:21:38,160 --> 06:21:39,080
Dots are okay.
8573
06:21:39,080 --> 06:21:40,640
This is a set, remember.
8574
06:21:40,640 --> 06:21:42,640
Zero dash nine or dot.
8575
06:21:42,640 --> 06:21:44,780
That's a set of the list of legit characters.
8576
06:21:44,780 --> 06:21:47,480
This is a range of characters that's a shortcut
8577
06:21:47,480 --> 06:21:48,440
to how to make the set.
8578
06:21:48,440 --> 06:21:50,560
You could make it be zero, one, two, three, four, five,
8579
06:21:50,560 --> 06:21:51,760
six, seven, eight, nine, dot.
8580
06:21:51,760 --> 06:21:53,920
Or zero dash nine and it assumes that.
8581
06:21:53,920 --> 06:21:55,000
And that's one or more.
8582
06:21:55,000 --> 06:21:57,440
So then it stops because this is a space.
8583
06:21:57,440 --> 06:21:59,160
It's greedy matching.
8584
06:21:59,160 --> 06:22:00,480
Then it pulls this out.
8585
06:22:00,480 --> 06:22:03,000
So that's kind of why greedy has to be the default.
8586
06:22:03,000 --> 06:22:05,960
Because otherwise, if it wasn't doing greedy matching,
8587
06:22:05,960 --> 06:22:07,540
oops, come back, come back.
8588
06:22:08,900 --> 06:22:11,200
If it wasn't doing greedy matching,
8589
06:22:11,200 --> 06:22:14,520
it would, if it wasn't doing greedy matching,
8590
06:22:14,520 --> 06:22:16,780
it would stop here because it would find a dollar sign.
8591
06:22:16,780 --> 06:22:19,660
Non greedy would find a dollar sign and one character
8592
06:22:19,660 --> 06:22:23,360
and then it would give us dollar one rather than dollar 10.
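Here is a small sketch of the escaped dollar sign, with greedy and non-greedy side by side. The sample sentence is an assumption made up for illustration.

```python
import re

# Sample text (assumed for illustration).
x = 'We just received $10.00 for cookies.'

# Backslash dollar is a real dollar sign; [0-9.]+ is one or more digits or dots.
greedy = re.findall(r'\$[0-9.]+', x)

# Adding a question mark after the plus makes it non-greedy:
# it stops as soon as it has one character.
lazy = re.findall(r'\$[0-9.]+?', x)

print(greedy)
print(lazy)
```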
8593
06:22:23,360 --> 06:22:26,560
So, in summary, regular expressions
8594
06:22:26,560 --> 06:22:28,700
are a cryptic but powerful language
8595
06:22:28,700 --> 06:22:31,560
and they're an acquired taste.
8596
06:22:31,560 --> 06:22:35,320
I think that, I bet eventually you'll find them fun
8597
06:22:35,320 --> 06:22:37,540
even though on your first impression
8598
06:22:37,540 --> 06:22:39,540
you might not think that they're so fun.
8599
06:22:42,960 --> 06:22:44,920
Welcome to network programs.
8600
06:22:44,920 --> 06:22:45,760
This is chapter 12.
8601
06:22:45,760 --> 06:22:47,960
Now we're going to learn a little bit
8602
06:22:47,960 --> 06:22:52,960
about how we talk to resources on the network using Python.
8603
06:22:52,960 --> 06:22:56,160
Now, this is a really quick introduction
8604
06:22:56,160 --> 06:22:57,400
to how the network really works.
8605
06:22:57,400 --> 06:22:59,680
I have a whole book that I wrote.
8606
06:22:59,680 --> 06:23:01,800
It's also translated into Spanish
8607
06:23:01,800 --> 06:23:04,800
on how the network works starting at the very lowest
8608
06:23:04,800 --> 06:23:07,680
layer packets and everything right on up.
8609
06:23:07,680 --> 06:23:09,160
And it's actually really easy to read.
8610
06:23:09,160 --> 06:23:11,680
I wrote it for a high school audience.
8611
06:23:11,680 --> 06:23:14,040
It's a short book and pretty easy to read.
8612
06:23:15,080 --> 06:23:16,920
So if you read that book, you will understand
8613
06:23:16,920 --> 06:23:20,040
that there is this layered architecture,
8614
06:23:20,040 --> 06:23:23,080
the TCP/IP architecture that sort of runs our network
8615
06:23:23,080 --> 06:23:24,200
at the lowest layer.
8616
06:23:24,200 --> 06:23:26,360
On one side here, this is your computer
8617
06:23:26,360 --> 06:23:28,080
and this is a server computer.
8618
06:23:28,080 --> 06:23:30,200
And if you sort of want a webpage,
8619
06:23:30,200 --> 06:23:31,520
it goes across the network,
8620
06:23:31,520 --> 06:23:33,760
does this like 15 or 20 times,
8621
06:23:33,760 --> 06:23:36,760
then it goes up into the server, reads the data
8622
06:23:36,760 --> 06:23:41,520
and then the data comes back 15, 20 hops for the packets
8623
06:23:41,520 --> 06:23:44,800
and then it's shown to you as what you see.
8624
06:23:46,500 --> 06:23:48,720
And so, that's how it works.
8625
06:23:48,720 --> 06:23:51,200
And there's these layers that we're not gonna talk about
8626
06:23:51,200 --> 06:23:53,280
in this section but I talk about in that book.
8627
06:23:54,400 --> 06:23:56,360
The layers are the link layer, which talks about
8628
06:23:56,360 --> 06:23:58,640
how to get over one hop, the internet layer
8629
06:23:58,640 --> 06:24:03,640
which talks about how to construct say 15 or so hops
8630
06:24:03,640 --> 06:24:06,000
to get packets back and forth,
8631
06:24:06,000 --> 06:24:08,800
that's the sort of lower level bits.
8632
06:24:08,800 --> 06:24:10,880
We're gonna start at what we call the transport layer
8633
06:24:10,880 --> 06:24:14,320
and that's the layer where your computer sort of assumes
8634
06:24:14,320 --> 06:24:17,120
that it can make a phone call to another computer,
8635
06:24:17,120 --> 06:24:20,120
another process running on a program on this computer,
8636
06:24:20,120 --> 06:24:22,000
talks to a program on this computer
8637
06:24:22,000 --> 06:24:24,120
and then it kind of comes back, okay?
8638
06:24:24,120 --> 06:24:26,880
And so, we're gonna leave this alone,
8639
06:24:26,880 --> 06:24:28,800
we're gonna ignore it, we're gonna assume
8640
06:24:28,800 --> 06:24:30,400
that there's this nice reliable pipe
8641
06:24:30,400 --> 06:24:32,840
that's going from point A to point B
8642
06:24:32,840 --> 06:24:34,160
and what are we gonna do with the pipe?
8643
06:24:34,160 --> 06:24:36,960
But if you're interested, take a look at the book.
8644
06:24:36,960 --> 06:24:39,240
So, we're just gonna start with a pipe,
8645
06:24:39,240 --> 06:24:42,880
some kind of a connection, we have two processes,
8646
06:24:42,880 --> 06:24:46,760
process, process and we have some kind of a connection
8647
06:24:46,760 --> 06:24:48,880
between them and it is a connection
8648
06:24:48,880 --> 06:24:53,440
that we can both use to talk and to listen.
8649
06:24:53,440 --> 06:24:56,160
In nerd terms, we call these things sockets
8650
06:24:56,160 --> 06:24:58,680
and that is one process running on one computer,
8651
06:24:58,680 --> 06:25:01,760
another process running on another second computer
8652
06:25:01,760 --> 06:25:03,880
connected through the internet somehow
8653
06:25:03,880 --> 06:25:07,960
and one computer speaks into that socket and it comes out
8654
06:25:07,960 --> 06:25:10,960
and the other computer returns something and it comes back.
8655
06:25:10,960 --> 06:25:14,400
And so, this is a bi-directional protocol of data
8656
06:25:14,400 --> 06:25:17,360
which is a series of, in effect, data phone calls
8657
06:25:17,360 --> 06:25:18,960
between applications.
8658
06:25:18,960 --> 06:25:21,480
So, the application might be, on your side,
8659
06:25:21,480 --> 06:25:22,920
this might be your browser.
8660
06:25:23,960 --> 06:25:27,320
Chrome, Firefox, Internet Explorer.
8661
06:25:27,320 --> 06:25:29,240
On the other side, this is a web server.
8662
06:25:30,120 --> 06:25:33,120
Might be internet IIS, internet something something
8663
06:25:33,120 --> 06:25:37,080
from Microsoft or Apache or Java Tomcat.
8664
06:25:37,080 --> 06:25:40,160
There's another program and you are making phone calls
8665
06:25:40,160 --> 06:25:41,360
between these programs.
8666
06:25:41,360 --> 06:25:45,480
Now, in general, these servers here stay up all the time
8667
06:25:45,480 --> 06:25:47,640
and you sort of just can make a request
8668
06:25:47,640 --> 06:25:50,000
when you feel like it in your program
8669
06:25:50,000 --> 06:25:51,280
but that's what we're going to do
8670
06:25:51,280 --> 06:25:53,680
and this is what we call a socket.
8671
06:25:53,680 --> 06:25:55,840
So, that little connection, that phone call,
8672
06:25:55,840 --> 06:25:58,660
that data phone call is what we call a socket.
8673
06:25:59,520 --> 06:26:03,160
Now, you have to decide which of the systems
8674
06:26:03,160 --> 06:26:05,520
you're gonna talk to and then which of the services
8675
06:26:05,520 --> 06:26:07,840
on those systems or which process.
8676
06:26:07,840 --> 06:26:10,640
And so, we have this concept called port numbers
8677
06:26:10,640 --> 06:26:13,080
and they're best thought of like extensions on phones.
8678
06:26:13,080 --> 06:26:15,520
So, one organization has one phone number
8679
06:26:15,520 --> 06:26:17,120
and it says, please enter the extension
8680
06:26:17,120 --> 06:26:19,040
of the party you'd like to talk to.
8681
06:26:19,040 --> 06:26:20,680
Well, that's kind of what ports are.
8682
06:26:20,680 --> 06:26:22,760
They're like, here is, I'm a server
8683
06:26:22,760 --> 06:26:23,920
and I'm connected to the internet.
8684
06:26:23,920 --> 06:26:26,220
Please enter the extension of the process
8685
06:26:26,220 --> 06:26:28,120
that you would like to talk to.
8686
06:26:28,120 --> 06:26:30,920
And so, for example, there might be processes
8687
06:26:30,920 --> 06:26:33,200
running on various computers
8688
06:26:33,200 --> 06:26:36,800
and so the email is known to hang out on port 25
8689
06:26:36,800 --> 06:26:38,240
or extension 25.
8690
06:26:38,240 --> 06:26:41,680
Login, insecure login lives on port 23.
8691
06:26:41,680 --> 06:26:45,280
Insecure web lives on 80 and secure web lives on 443
8692
06:26:45,280 --> 06:26:47,120
and there's a couple of different protocols.
8693
06:26:47,120 --> 06:26:49,640
Say if you have your mail stored on Gmail
8694
06:26:49,640 --> 06:26:51,840
and you have a local mail client,
8695
06:26:51,840 --> 06:26:53,900
say like Thunderbird or Apple Mail,
8696
06:26:53,900 --> 06:26:56,660
that talks a protocol to pull that mail across
8697
06:26:56,660 --> 06:26:58,360
and those live on various ports.
8698
06:26:58,360 --> 06:27:01,320
So, these ports are those extensions
8699
06:27:01,320 --> 06:27:05,320
and by convention, we have standards
8700
06:27:05,320 --> 06:27:08,600
that tell us what to roughly expect at those ports.
8701
06:27:08,600 --> 06:27:10,820
So, when you're talking to port 80,
8702
06:27:10,820 --> 06:27:15,560
you expect to talk to a web server or an HTTP server.
8703
06:27:15,560 --> 06:27:17,040
If you're talking on port 23,
8704
06:27:17,040 --> 06:27:19,000
you expect to talk to a telnet server
8705
06:27:19,000 --> 06:27:20,440
and on and on and on and on.
8706
06:27:20,440 --> 06:27:22,100
And so, these are the extensions,
8707
06:27:22,100 --> 06:27:26,040
the typical commonly used default extensions
8708
06:27:26,040 --> 06:27:29,100
for various network application processes
8709
06:27:29,100 --> 06:27:31,140
that are serving us data.
8710
06:27:31,140 --> 06:27:32,840
Now, sometimes you'll go to a URL
8711
06:27:32,840 --> 06:27:34,560
and you'll see in that URL,
8712
06:27:34,560 --> 06:27:35,800
there's a colon and a number,
8713
06:27:35,800 --> 06:27:39,160
that means it's a web server that's running on a port
8714
06:27:39,160 --> 06:27:42,180
other than the official 80 or 443 port.
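As a rough sketch of the idea, here is how you might work out which port a URL is pointing at. The example URLs, the DEFAULT_PORTS table, and the port_for helper are all hypothetical names made up for illustration; urlsplit comes from Python's standard urllib.parse library.

```python
from urllib.parse import urlsplit

# The default "extensions" for the two web protocols mentioned above.
DEFAULT_PORTS = {'http': 80, 'https': 443}

def port_for(url):
    """Return the port written in the URL, or the default for its protocol."""
    parts = urlsplit(url)
    if parts.port is not None:      # a colon and a number were in the URL
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(port_for('http://www.example.com/page.htm'))
print(port_for('https://www.example.com/'))
print(port_for('http://www.example.com:8080/page.htm'))
```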
8715
06:27:43,400 --> 06:27:47,260
Now, in Python, we can talk to these sockets, right?
8716
06:27:47,260 --> 06:27:49,980
We can just talk to them and it's really easy,
8717
06:27:49,980 --> 06:27:51,560
surprisingly easy.
8718
06:27:52,480 --> 06:27:55,080
We have to import socket because that's a library.
8719
06:27:55,080 --> 06:27:58,800
It comes with Python, but you can't use it
8720
06:27:58,800 --> 06:28:01,880
in your program until you import it.
8721
06:28:01,880 --> 06:28:04,240
And then you, basically in the socket library,
8722
06:28:04,240 --> 06:28:07,360
call socket function, that's what that syntax is saying.
8723
06:28:08,800 --> 06:28:10,040
You're making a socket.
8724
06:28:10,040 --> 06:28:11,040
Now, the key to a socket,
8725
06:28:11,040 --> 06:28:14,560
it's sort of like an unopened file handle.
8726
06:28:14,560 --> 06:28:16,280
It's half of a file handle.
8727
06:28:16,280 --> 06:28:19,020
It's an outward looking thing that's not yet connected.
8728
06:28:19,020 --> 06:28:21,360
These parameters, you're just gonna type them in.
8729
06:28:21,360 --> 06:28:22,800
This says we're gonna make a socket
8730
06:28:22,800 --> 06:28:23,920
that goes across the internet
8731
06:28:23,920 --> 06:28:25,240
and it's a stream socket,
8732
06:28:25,240 --> 06:28:27,400
which means that it's a series of characters
8733
06:28:27,400 --> 06:28:28,840
that come one after another
8734
06:28:28,840 --> 06:28:30,680
rather than a series of blocks of text.
8735
06:28:30,680 --> 06:28:32,920
There's another kind that's harder to deal with,
8736
06:28:32,920 --> 06:28:34,000
but we're gonna do this.
8737
06:28:34,000 --> 06:28:36,200
So this, don't worry about this line.
8738
06:28:36,200 --> 06:28:38,040
Just know that this creates a socket,
8739
06:28:38,040 --> 06:28:40,160
but does not associate it.
8740
06:28:40,160 --> 06:28:43,520
The very next line, we get back a socket object
8741
06:28:43,520 --> 06:28:46,040
in this variable that I'm storing in the variable mySock.
8742
06:28:46,040 --> 06:28:48,120
And then when you wanna make a connection
8743
06:28:48,120 --> 06:28:50,660
across the internet to the far end,
8744
06:28:50,660 --> 06:28:52,800
you say, oh, hey, dear socket,
8745
06:28:52,800 --> 06:28:54,560
extend yourself across the internet.
8746
06:28:54,560 --> 06:28:59,320
Make the phone call to this host, data.pr4e.org,
8747
06:28:59,320 --> 06:29:00,880
and on that port 80.
8748
06:29:00,880 --> 06:29:02,320
So that's making the phone call.
8749
06:29:02,320 --> 06:29:03,740
This is like the phone number
8750
06:29:03,740 --> 06:29:05,440
and this is like the phone extension.
8751
06:29:05,440 --> 06:29:07,500
So that's, we haven't sent any data yet.
8752
06:29:07,500 --> 06:29:11,300
We have simply rung the phone of a process,
8753
06:29:11,300 --> 06:29:12,940
hopefully living on port 80.
8754
06:29:12,940 --> 06:29:14,180
If it's there, great.
8755
06:29:14,180 --> 06:29:15,240
This might blow up.
8756
06:29:15,240 --> 06:29:16,360
This one here won't blow up,
8757
06:29:16,360 --> 06:29:18,360
but this line here could blow up.
8758
06:29:18,360 --> 06:29:19,960
If there's nothing sitting on that port,
8759
06:29:19,960 --> 06:29:20,920
it would come back and say,
8760
06:29:20,920 --> 06:29:23,120
oh, you try to call, you got no answer.
8761
06:29:23,120 --> 06:29:24,720
That's a legitimate thing to happen.
8762
06:29:24,720 --> 06:29:26,580
Maybe you don't have a network connection
8763
06:29:26,580 --> 06:29:29,160
or maybe that service is down on that server
8764
06:29:29,160 --> 06:29:30,760
or the whole server is down.
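Here is a minimal sketch of making the phone call and coping with no answer. The try_connect helper and the timeout value are assumptions added for illustration, not part of the three lines described above; the socket calls themselves are from Python's standard socket library.

```python
import socket

def try_connect(host, port, timeout=5.0):
    """Return a connected socket, or None if nobody answers the call."""
    # Make the socket: across the internet (AF_INET), stream style (SOCK_STREAM).
    # This line creates the socket but does not connect it to anything yet.
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.settimeout(timeout)
    try:
        # Make the phone call: host is the phone number, port is the extension.
        mysock.connect((host, port))
    except OSError:
        # No network, service down, server down, or nothing on that port.
        mysock.close()
        return None
    return mysock
```

Calling try_connect('data.pr4e.org', 80) would attempt the connection described above; if it hands back None, the call did not go through.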
8765
06:29:30,760 --> 06:29:35,760
But, so I just, it's kind of amazing
8766
06:29:35,960 --> 06:29:37,740
that we're sitting here in Python
8767
06:29:37,740 --> 06:29:41,640
and in three lines we have probably
8768
06:29:41,640 --> 06:29:43,800
a half a million engineers who built this thing
8769
06:29:43,800 --> 06:29:45,420
called the internet, all these protocols
8770
06:29:45,420 --> 06:29:46,760
and all this software.
8771
06:29:46,760 --> 06:29:50,600
And we just made use of it in three lines of Python.
8772
06:29:50,600 --> 06:29:52,560
And in case, this is one of the reasons
8773
06:29:52,560 --> 06:29:55,040
that people absolutely love Python,
8774
06:29:55,040 --> 06:29:56,540
absolutely love Python.
8775
06:29:57,560 --> 06:29:59,260
So now that we have a socket,
8776
06:29:59,260 --> 06:30:01,760
we have to ask ourselves what kind of data
8777
06:30:01,760 --> 06:30:04,080
are we going to send and then what kind of data
8778
06:30:04,080 --> 06:30:07,420
are we going to expect to receive across that socket?
8779
06:30:11,680 --> 06:30:13,000
So now we have a socket.
8780
06:30:13,000 --> 06:30:15,520
We are going to talk about what we're going to do with it.
8781
06:30:15,520 --> 06:30:17,760
So the socket basically functions at this level.
8782
06:30:17,760 --> 06:30:20,160
Your application is saying, make me a socket,
8783
06:30:20,160 --> 06:30:21,720
which is sort of this end point.
8784
06:30:21,720 --> 06:30:23,360
And then the connect actually connects
8785
06:30:23,360 --> 06:30:25,440
to an application on the far side.
8786
06:30:25,440 --> 06:30:28,160
And there's a port involved, so that might be port 80.
8787
06:30:28,160 --> 06:30:30,680
And this is the far host and that could be
8788
06:30:30,680 --> 06:30:35,680
www.py4e.org or data.py4e.org.
8789
06:30:36,080 --> 06:30:39,080
Okay, and so the socket is solving this.
8790
06:30:39,080 --> 06:30:43,720
And the question then is what are we going to send
8791
06:30:43,720 --> 06:30:45,340
and what are we going to expect to get back?
8792
06:30:45,340 --> 06:30:47,640
And that's what we call the application protocol.
8793
06:30:47,640 --> 06:30:50,040
So we know that these two have made a phone call.
8794
06:30:50,040 --> 06:30:51,880
It's no different than making the phone call
8795
06:30:51,880 --> 06:30:54,880
and saying, you know, hello, right?
8796
06:30:54,880 --> 06:30:57,760
And everyone knows that when the phone rings
8797
06:30:57,760 --> 06:31:00,040
and you pick it up, you're supposed to say hello.
8798
06:31:00,040 --> 06:31:01,440
And that's part of our protocol.
8799
06:31:01,440 --> 06:31:03,320
So who talks first, right?
8800
06:31:03,320 --> 06:31:07,320
So the dominant protocol that we use in this section
8801
06:31:07,320 --> 06:31:08,680
is the HTTP protocol.
8802
06:31:08,680 --> 06:31:11,720
HTTP stands for Hypertext Transfer Protocol.
8803
06:31:11,720 --> 06:31:13,520
It's dominant, it's really easy to use.
8804
06:31:13,520 --> 06:31:15,000
That's why I use it as an example.
8805
06:31:15,000 --> 06:31:16,640
But realize that there are many others,
8806
06:31:16,640 --> 06:31:19,880
like mail and file transfer and remote login
8807
06:31:19,880 --> 06:31:21,700
and all kinds of other protocols.
8808
06:31:21,700 --> 06:31:23,600
Each is a different application protocol.
8809
06:31:23,600 --> 06:31:26,320
They all use sort of sockets at their lower level.
8810
06:31:26,320 --> 06:31:29,480
But then on top of that, they layer the rules of the road
8811
06:31:29,480 --> 06:31:33,000
for retrieving hypertext web pages.
8812
06:31:33,000 --> 06:31:36,960
And we have used these for all kinds of other things.
8813
06:31:36,960 --> 06:31:38,240
So the protocol, like I said,
8814
06:31:38,240 --> 06:31:39,800
is like who answers the phone first?
8815
06:31:39,800 --> 06:31:40,920
What do they say?
8816
06:31:40,920 --> 06:31:43,000
What happens if the person doesn't answer right?
8817
06:31:43,000 --> 06:31:44,240
Can you hear me now?
8818
06:31:44,240 --> 06:31:45,560
Those kinds of things.
8819
06:31:45,560 --> 06:31:47,240
And it's a real simple thing.
8820
06:31:47,240 --> 06:31:48,360
And all you really need to do
8821
06:31:48,360 --> 06:31:50,480
is so that both sides can agree,
8822
06:31:50,480 --> 06:31:51,640
you have to write a thing
8823
06:31:51,640 --> 06:31:53,280
that's like the rules in the middle
8824
06:31:53,280 --> 06:31:56,120
and say, okay, everybody, as long as we all do this,
8825
06:31:56,120 --> 06:31:57,560
we'll be fine.
8826
06:31:57,560 --> 06:31:59,440
It's as simple as picking on which side of the road
8827
06:31:59,440 --> 06:32:00,640
the cars can drive on.
8828
06:32:00,640 --> 06:32:02,580
It works fine no matter which side.
8829
06:32:04,880 --> 06:32:06,160
But if each car randomly picked,
8830
06:32:06,160 --> 06:32:07,900
it would be really kind of a mess.
8831
06:32:09,060 --> 06:32:10,800
So if you look at the typical URL,
8832
06:32:10,800 --> 06:32:12,100
and this is one of the things
8833
06:32:12,100 --> 06:32:15,280
that the web innovators in 1990
8834
06:32:15,280 --> 06:32:16,920
really invented that was wonderful.
8835
06:32:16,920 --> 06:32:19,100
And it seems second nature today,
8836
06:32:19,100 --> 06:32:21,280
but in 1990, it was rather revolutionary.
8837
06:32:21,280 --> 06:32:23,960
And that these uniform resource locators
8838
06:32:23,960 --> 06:32:27,200
included in themselves a protocol,
8839
06:32:27,200 --> 06:32:29,820
the host to connect to, and the document to retrieve.
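You can see those three pieces by pulling a URL apart. This is just a sketch; the URL is an assumed example, and urlsplit comes from Python's standard urllib.parse library.

```python
from urllib.parse import urlsplit

# Example URL (assumed for illustration).
url = 'http://www.dr-chuck.com/page1.htm'

parts = urlsplit(url)
protocol = parts.scheme     # which protocol to speak
host = parts.hostname       # which host to connect to
document = parts.path       # which document to retrieve

print(protocol, host, document)
```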
8840
06:32:29,820 --> 06:32:33,440
So this is one of the clever, clever ideas
8841
06:32:33,440 --> 06:32:35,160
that the web came up with,
8842
06:32:35,160 --> 06:32:37,880
because we used to have to pick a program
8843
06:32:37,880 --> 06:32:42,480
like FTP or Telnet or whatever, SMTP.
8844
06:32:42,480 --> 06:32:44,280
Then we had to go to the right host,
8845
06:32:44,280 --> 06:32:47,520
and then we had to talk to that host a certain way.
8846
06:32:47,520 --> 06:32:50,520
So in HTTP, it's a really simple protocol
8847
06:32:50,520 --> 06:32:54,480
invented in 1989 and 1990 by Tim Berners-Lee
8848
06:32:54,480 --> 06:32:58,760
and Robert Cailliau at CERN.
8849
06:32:59,840 --> 06:33:03,640
And they created a protocol that we have grown to know
8850
06:33:03,640 --> 06:33:06,700
and love and use for way more than retrieving documents,
8851
06:33:06,700 --> 06:33:09,340
as we'll see in the upcoming chapters.
8852
06:33:09,340 --> 06:33:11,160
So we're gonna talk a little bit about what happens
8853
06:33:11,160 --> 06:33:13,280
when you click on a page that has a link.
8854
06:33:13,280 --> 06:33:15,440
Now, there's all kind of fancy stuff that can go on,
8855
06:33:15,440 --> 06:33:17,000
but this is the basics.
8856
06:33:17,000 --> 06:33:19,040
And so let's just imagine for the moment
8857
06:33:19,040 --> 06:33:21,320
you start sitting looking at a web page,
8858
06:33:21,320 --> 06:33:22,960
drchuck.com slash page one,
8859
06:33:22,960 --> 06:33:25,400
and inside that there is a hyperlink.
8860
06:33:25,400 --> 06:33:29,600
It is an indication that says when you click on this page,
8861
06:33:29,600 --> 06:33:31,160
go to a different page.
8862
06:33:31,160 --> 06:33:34,600
And in that, you see the name of the page
8863
06:33:34,600 --> 06:33:36,160
that you're supposed to go to.
8864
06:33:37,240 --> 06:33:41,520
So we click on this link, and that is a browser.
8865
06:33:41,520 --> 06:33:44,780
This is an application, this is a process,
8866
06:33:46,440 --> 06:33:49,000
or an app that's running on your computer.
8867
06:33:49,000 --> 06:33:50,560
This is the browser, okay?
8868
06:33:50,560 --> 06:33:53,880
And when the browser sees the click inside your computer,
8869
06:33:53,880 --> 06:33:56,560
then the browser makes a connection
8870
06:33:56,560 --> 06:34:00,200
to port 80 on the web server, drchuck.com,
8871
06:34:00,200 --> 06:34:02,520
and sends the request.
8872
06:34:02,520 --> 06:34:06,660
This request that it sends is precisely specified
8873
06:34:06,660 --> 06:34:09,040
by a standard, which we will see in a second.
8874
06:34:09,040 --> 06:34:12,440
Then the web server does some magic work.
8875
06:34:12,440 --> 06:34:14,320
Oops, let's go back.
8876
06:34:14,320 --> 06:34:16,400
Then the web server does some magic work in here,
8877
06:34:16,400 --> 06:34:19,080
reads some files, runs some code, does whatever,
8878
06:34:19,080 --> 06:34:23,680
constructs an answer to our phone call, and sends it back.
8879
06:34:23,680 --> 06:34:26,240
And it sends, in this case, back a web page
8880
06:34:26,240 --> 06:34:30,000
in the format of HTML, the Hypertext Markup Language,
8881
06:34:30,000 --> 06:34:32,020
which is different than HTTP,
8882
06:34:32,020 --> 06:34:33,960
which is the protocol that we're exchanging.
8883
06:34:33,960 --> 06:34:36,340
HTML is the format of the document we're getting back.
8884
06:34:36,340 --> 06:34:38,480
And in this has an anchor tag,
8885
06:34:38,480 --> 06:34:41,440
href and an end of anchor tag, and some highlighted text.
8886
06:34:41,440 --> 06:34:44,240
And now your browser gets this back
8887
06:34:44,240 --> 06:34:47,120
and then renders it according to the rules of HTML
8888
06:34:47,120 --> 06:34:50,120
and CSS and JavaScript, et cetera, parses it,
8889
06:34:50,120 --> 06:34:51,600
and then makes a pretty web page.
8890
06:34:51,600 --> 06:34:53,520
And this web page happens to have a link
8891
06:34:53,520 --> 06:34:55,480
back to the first page, and if you click there,
8892
06:34:55,480 --> 06:34:57,900
it will do this over and over and over again.
8893
06:34:57,900 --> 06:35:00,380
And that is the request response cycle.
8894
06:35:00,380 --> 06:35:03,680
And that's governed by a series of internet standards.
8895
06:35:03,680 --> 06:35:05,640
These are standards that were built
8896
06:35:05,640 --> 06:35:08,000
from the 60s, 70s, 80s, and 90s,
8897
06:35:08,000 --> 06:35:09,840
and continue to this day,
8898
06:35:09,840 --> 06:35:12,140
by a group called the Internet Engineering Task Force,
8899
06:35:12,140 --> 06:35:14,120
or IETF.
8900
06:35:14,120 --> 06:35:17,000
The documents they produce are called RFCs,
8901
06:35:17,000 --> 06:35:19,280
which stands for Request for Comments.
8902
06:35:20,640 --> 06:35:24,760
The RFC, the word RFC is kind of like a sort of joke,
8903
06:35:24,760 --> 06:35:25,600
as it were.
8904
06:35:28,600 --> 06:35:30,900
They're trying to be kind of funny in that,
8905
06:35:30,900 --> 06:35:32,600
funny is not the right word.
8906
06:35:32,600 --> 06:35:34,520
It's ironic in that they're trying to say,
8907
06:35:34,520 --> 06:35:36,400
even so in the protocols of the internet
8908
06:35:36,400 --> 06:35:38,980
that we've used for several decades,
8909
06:35:38,980 --> 06:35:40,920
they're always interested in improvements.
8910
06:35:40,920 --> 06:35:42,400
And that's what the RFC stands for.
8911
06:35:42,400 --> 06:35:44,720
And they're all named RFC-whatever.
8912
06:35:45,560 --> 06:35:47,080
And if we were gonna cruise around,
8913
06:35:47,080 --> 06:35:49,240
we could find some various RFCs.
8914
06:35:49,240 --> 06:35:52,000
And this is RFC 2616.
8915
06:35:52,000 --> 06:35:53,800
It might have been revised since then.
8916
06:35:53,800 --> 06:35:55,440
But this is like a document,
8917
06:35:55,440 --> 06:35:56,560
and this is what they look like.
8918
06:35:56,560 --> 06:35:59,240
Hypertext Transfer Protocol, HTTP/1.1.
8919
06:35:59,240 --> 06:36:00,320
And so you're reading this document,
8920
06:36:00,320 --> 06:36:01,400
you're gonna write a browser,
8921
06:36:01,400 --> 06:36:04,640
and you wanna talk the application protocol that is HTTP.
8922
06:36:04,640 --> 06:36:06,440
This is one of many documents
8923
06:36:06,440 --> 06:36:09,440
that helps define what HTTP is.
8924
06:36:09,440 --> 06:36:11,000
So if you look down and you look down and say,
8925
06:36:11,000 --> 06:36:12,560
oh, here's what a request looks like.
8926
06:36:12,560 --> 06:36:15,120
This is how I'm gonna get a document from the server.
8927
06:36:15,120 --> 06:36:17,040
And you keep reading, and you keep reading,
8928
06:36:17,040 --> 06:36:21,320
and it says, you're supposed to have the request method
8929
06:36:21,320 --> 06:36:23,560
with a space, with the request URL,
8930
06:36:23,560 --> 06:36:25,560
the request method with a space,
8931
06:36:25,560 --> 06:36:27,880
with a URI with a space, the HTTP version,
8932
06:36:27,880 --> 06:36:29,200
and the carriage return line feed.
8933
06:36:29,200 --> 06:36:30,600
That's what it's saying.
8934
06:36:30,600 --> 06:36:32,960
And so it looks kind of like this, right?
8935
06:36:32,960 --> 06:36:36,120
We say get the document followed by a space.
8936
06:36:36,120 --> 06:36:37,120
There's gotta be one space.
8937
06:36:37,120 --> 06:36:38,560
You do two spaces,
8938
06:36:38,560 --> 06:36:41,560
and it's going to be quite frustrating, okay?
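As a rough sketch of that request line in Python: the helper name build_request_line and the HTTP/1.0 version string are assumptions for illustration, not something taken from the RFC text shown on screen.

```python
def build_request_line(method, uri, version="HTTP/1.0"):
    # Exactly one space between the three parts, then carriage return + line feed.
    return f"{method} {uri} {version}\r\n"

# A second CRLF (a blank line) tells the server the request is finished.
request = build_request_line("GET", "http://data.pr4e.org/romeo.txt") + "\r\n"
print(repr(request))
```

Printing with repr() makes the spaces and the \r\n pairs visible, which helps when a stray extra space is the problem.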
8939
06:36:41,560 --> 06:36:45,080
And so this is an example that you can run
8940
06:36:50,080 --> 06:36:51,640
on Linux operating systems
8941
06:36:51,640 --> 06:36:54,520
and Macintosh operating systems with no changes.
8942
06:36:54,520 --> 06:36:57,040
If you install Telnet on your Windows box,
8943
06:36:57,040 --> 06:36:59,560
you should be able to run something like this as well.
8944
06:36:59,560 --> 06:37:04,320
So Telnet is a program that we used in the old days.
8945
06:37:04,320 --> 06:37:05,920
It used to be how we logged into servers,
8946
06:37:05,920 --> 06:37:08,040
but because it doesn't encrypt your data back and forth,
8947
06:37:08,040 --> 06:37:09,360
we don't use it anymore,
8948
06:37:09,360 --> 06:37:13,480
but it basically is a program that can open a socket
8949
06:37:13,480 --> 06:37:15,800
to a host on a port.
8950
06:37:15,800 --> 06:37:18,600
And I'm saying Telnet to this host on port 80.
8951
06:37:18,600 --> 06:37:20,240
And at this point, I am connected,
8952
06:37:20,240 --> 06:37:21,800
and whatever I type on my keyboard
8953
06:37:21,800 --> 06:37:23,560
is gonna be sent to that server.
8954
06:37:23,560 --> 06:37:24,400
Now if you're doing this,
8955
06:37:24,400 --> 06:37:27,280
you probably wanna cut and paste this really fast,
8956
06:37:27,280 --> 06:37:28,800
because if you take too long,
8957
06:37:28,800 --> 06:37:30,800
most web servers will be like, you're a human.
8958
06:37:30,800 --> 06:37:31,760
I don't wanna talk to humans.
8959
06:37:31,760 --> 06:37:32,880
I wanna talk to programs.
8960
06:37:32,880 --> 06:37:35,200
So remember to type this fast enough,
8961
06:37:35,200 --> 06:37:38,000
and then you have to hit Enter twice.
8962
06:37:38,000 --> 06:37:39,800
So you have to have a blank line here.
8963
06:37:39,800 --> 06:37:42,040
Just type this exactly as it's shown,
8964
06:37:42,040 --> 06:37:44,560
and then you will get back a response from the server.
8965
06:37:44,560 --> 06:37:45,400
If you do it right,
8966
06:37:45,400 --> 06:37:47,960
and the server is properly configured,
8967
06:37:47,960 --> 06:37:50,080
The server will give you back some headers,
8968
06:37:51,600 --> 06:37:53,800
and this is metadata about the document you're going to get.
8969
06:37:53,800 --> 06:37:56,680
For example, it's saying it's got text/html,
8970
06:37:56,680 --> 06:37:58,120
which means that the remaining stuff
8971
06:37:58,120 --> 06:38:00,520
is gonna be in HTML, Hypertext Markup Language.
8972
06:38:00,520 --> 06:38:03,480
It has a blank line, and then the actual document,
8973
06:38:03,480 --> 06:38:05,200
and then the connection is closed.
8974
06:38:05,200 --> 06:38:08,320
And so if you do this, you can set this up in a way
8975
06:38:08,320 --> 06:38:11,040
that you can run this on your own computer,
8976
06:38:11,040 --> 06:38:15,200
and in effect, hack through the back door a web server.
8977
06:38:15,200 --> 06:38:18,160
Now you can't hack the secure web servers,
8978
06:38:18,160 --> 06:38:20,680
and mail servers used to be easy to hack,
8979
06:38:20,680 --> 06:38:21,900
but they're harder to hack now
8980
06:38:21,900 --> 06:38:24,080
because they challenge you for information.
8981
06:38:24,080 --> 06:38:28,080
But part of the reason I'm so obsessed with the command line
8982
06:38:28,080 --> 06:38:29,680
is this is how real hackers work,
8983
06:38:29,680 --> 06:38:32,480
and they know how to talk some of these protocols
8984
06:38:32,480 --> 06:38:33,320
more directly.
8985
06:38:33,320 --> 06:38:36,600
And so we think of this beautiful sophisticated application
8986
06:38:36,600 --> 06:38:39,040
talking to some other thing, and it's all pretty,
8987
06:38:39,040 --> 06:38:42,040
and we got wonderful clicky buttons and nice usability,
8988
06:38:42,040 --> 06:38:46,160
but the reality is, like in the Matrix Reloaded here,
8989
06:38:46,160 --> 06:38:49,240
the kinds of things that really talented hackers are doing
8990
06:38:49,240 --> 06:38:52,480
use command lines, and they really know what's going on,
8991
06:38:52,480 --> 06:38:53,320
and that's how they do it.
8992
06:38:53,320 --> 06:38:56,240
They understand what's going on better than the developers
8993
06:38:56,240 --> 06:38:58,560
of the computers that are trying to be resistant
8994
06:38:58,560 --> 06:38:59,400
to the hacking.
8995
06:38:59,400 --> 06:39:02,360
So I come from a long line of using the command line,
8996
06:39:02,360 --> 06:39:05,160
and that's why I encourage you to use the command line
8997
06:39:05,160 --> 06:39:07,120
in this course.
8998
06:39:07,120 --> 06:39:08,240
So the next thing we're going to do
8999
06:39:08,240 --> 06:39:10,640
is we're going to go up into the application layer,
9000
06:39:10,640 --> 06:39:12,960
and instead of typing those commands by hand,
9001
06:39:12,960 --> 06:39:15,980
we're going to actually send them from Python
9002
06:39:15,980 --> 06:39:19,180
and write a very simple Python web browser.
9003
06:39:22,920 --> 06:39:25,040
In this section, we're going to write a web browser
9004
06:39:25,040 --> 06:39:27,080
using Python, so we've already got a socket.
9005
06:39:27,080 --> 06:39:28,520
We know how to write a socket.
9006
06:39:28,520 --> 06:39:31,880
In the previous section, we played with the protocol,
9007
06:39:31,880 --> 06:39:34,000
and used Telnet to do it by hand,
9008
06:39:34,000 --> 06:39:35,560
and now we're going to do it in Python.
9009
06:39:35,560 --> 06:39:38,600
And what you're going to find is it's not that hard.
9010
06:39:40,520 --> 06:39:41,700
So here we go.
9011
06:39:41,700 --> 06:39:45,440
So the first three lines of this program, import socket,
9012
06:39:45,440 --> 06:39:46,280
make the socket.
9013
06:39:46,280 --> 06:39:48,960
Remember, the socket hasn't really got the connection,
9014
06:39:48,960 --> 06:39:51,580
so when you make the socket, again,
9015
06:39:51,580 --> 06:39:53,200
we're going to make a stream-based socket,
9016
06:39:53,200 --> 06:39:55,300
and it's suitable for going across the internet.
9017
06:39:55,300 --> 06:39:58,480
The connection, it's like ring, phone call,
9018
06:39:58,480 --> 06:40:02,640
connect to data.pr4e.org and port 80,
9019
06:40:02,640 --> 06:40:06,160
and so that basically says extend the socket across
9020
06:40:06,160 --> 06:40:08,060
and connect to a web server,
9021
06:40:08,060 --> 06:40:10,400
and so there's got to be a piece of software running,
9022
06:40:10,400 --> 06:40:13,920
and this will blow up if the software is not running, okay?
9023
06:40:14,900 --> 06:40:18,640
So then, now we've got a phone, we've made a phone call.
9024
06:40:18,640 --> 06:40:22,480
Now, whether or not the remote side says hello or not
9025
06:40:22,480 --> 06:40:24,240
is up to the application protocol,
9026
06:40:24,240 --> 06:40:26,520
and in this case, the web servers say nothing,
9027
06:40:26,520 --> 06:40:27,880
and they wait for you to talk first,
9028
06:40:27,880 --> 06:40:30,080
so we're the web browser in this case,
9029
06:40:30,080 --> 06:40:31,760
and so we're going to talk first,
9030
06:40:31,760 --> 06:40:34,280
and we know what, because we read the documentation,
9031
06:40:34,280 --> 06:40:35,400
we know that we're going to send GET,
9032
06:40:35,400 --> 06:40:37,480
blah, blah, blah, blah, blah, blah, blah, blah, blah,
9033
06:40:37,480 --> 06:40:38,760
space, blah, blah, blah, blah, blah,
9034
06:40:38,760 --> 06:40:41,080
HTTP/1.0, and then two newlines.
9035
06:40:41,080 --> 06:40:43,880
Return, return, remember you had to have a blank line.
9036
06:40:43,880 --> 06:40:45,500
We'll talk a little bit about this encode,
9037
06:40:45,500 --> 06:40:48,120
it's preparing the data to go across the internet,
9038
06:40:48,120 --> 06:40:49,800
and then we say send it,
9039
06:40:49,800 --> 06:40:52,040
and so this basically takes that little string
9040
06:40:52,040 --> 06:40:54,080
and sends it across the network,
9041
06:40:54,080 --> 06:40:57,180
and then this piece of software is waiting for it,
9042
06:40:57,180 --> 06:40:59,580
and then the software goes and reads a file
9043
06:40:59,580 --> 06:41:00,500
or does some other stuff,
9044
06:41:00,500 --> 06:41:02,760
and then it starts sending us data back,
9045
06:41:02,760 --> 06:41:05,740
which we can then choose to receive.
9046
06:41:05,740 --> 06:41:07,740
So now we write a real simple loop.
9047
06:41:07,740 --> 06:41:08,780
We're going to receive the first,
9048
06:41:08,780 --> 06:41:09,840
we're going to receive these things
9049
06:41:09,840 --> 06:41:11,760
512 characters at a time,
9050
06:41:11,760 --> 06:41:15,140
so we're going to loop through 512 each time,
9051
06:41:15,140 --> 06:41:18,020
and if we get zero characters,
9052
06:41:18,020 --> 06:41:20,840
that means it's end of the stream, the stream is closed,
9053
06:41:20,840 --> 06:41:23,280
and if you look at the little example from the previous one,
9054
06:41:23,280 --> 06:41:24,880
you saw a connection closed.
9055
06:41:24,880 --> 06:41:27,260
When the connection is closed, we get an indication
9056
06:41:27,260 --> 06:41:29,400
that it is because we ask for some data
9057
06:41:29,400 --> 06:41:30,800
and we get zero data.
9058
06:41:30,800 --> 06:41:33,520
Otherwise, if there might be more data, this'll wait.
9059
06:41:33,520 --> 06:41:34,960
If the network is slow, you'll see,
9060
06:41:34,960 --> 06:41:36,640
if you do a print statement in here,
9061
06:41:36,640 --> 06:41:38,680
you will see that this will pause from time to time
9062
06:41:38,680 --> 06:41:40,160
on a really slow network.
9063
06:41:40,160 --> 06:41:41,880
If your network is fast, it'll just go blank
9064
06:41:41,880 --> 06:41:44,280
and it'll be so fast it won't matter.
9065
06:41:44,280 --> 06:41:45,360
But this is how we go.
9066
06:41:45,360 --> 06:41:48,680
So this is basically until the entire socket,
9067
06:41:48,680 --> 06:41:50,920
until the socket is closed,
9068
06:41:50,920 --> 06:41:52,600
we are going to read this data,
9069
06:41:52,600 --> 06:41:55,120
and because this data's coming from the outside world,
9070
06:41:55,120 --> 06:41:57,720
we have to decode it before we print it,
9071
06:41:57,720 --> 06:41:59,640
and then when we're all done, we break out of here
9072
06:41:59,640 --> 06:42:00,980
and we close the socket.
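Here is a minimal sketch of the program just described. It is not the exact course file: the function names read_until_closed and fetch are made up for illustration, and the HTTP/1.0 request string is an assumption based on the description above.

```python
import socket

def read_until_closed(sock, chunk_size=512):
    """Collect data from sock until the other side closes the connection."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)   # waits here if more data may still arrive
        if len(data) < 1:              # zero bytes means the stream is closed
            break
        chunks.append(data)
    return b"".join(chunks)

def fetch(host, url, port=80):
    """Send one GET request and return the raw response bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as mysock:
        mysock.connect((host, port))   # raises an exception if nothing is listening
        cmd = f"GET {url} HTTP/1.0\r\n\r\n".encode()
        mysock.sendall(cmd)
        return read_until_closed(mysock)
```

To try it against the server used in this section, something like print(fetch("data.pr4e.org", "http://data.pr4e.org/romeo.txt").decode()) should print the headers, a blank line, and then the document, assuming the server is reachable.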
9073
06:42:00,980 --> 06:42:05,980
So literally, that is an entire web browser
9074
06:42:07,200 --> 06:42:11,180
written in 10 lines of Python,
9075
06:42:11,180 --> 06:42:14,360
and again, this is why everybody loves Python.
9076
06:42:14,360 --> 06:42:17,160
So this is what this program will show if you run.
9077
06:42:18,360 --> 06:42:22,360
The GET is sent, it looks exactly like doing it by hand.
9078
06:42:22,360 --> 06:42:25,120
You get some headers, again, this is metadata
9079
06:42:25,120 --> 06:42:27,120
that tells you something about the file.
9080
06:42:27,120 --> 06:42:28,520
In this case, one of the important things
9081
06:42:28,520 --> 06:42:30,100
is what kind of thing is coming next.
9082
06:42:30,100 --> 06:42:31,520
There's always a blank line,
9083
06:42:31,520 --> 06:42:34,720
there's a break between the headers and the actual data,
9084
06:42:34,720 --> 06:42:36,320
the metadata and the data,
9085
06:42:36,320 --> 06:42:41,240
and then here is the actual text of that romeo.txt file,
9086
06:42:41,240 --> 06:42:43,860
and then it's gonna run this, gonna print data.decode(),
9087
06:42:43,860 --> 06:42:45,900
all this is coming from the print statement.
9088
06:42:45,900 --> 06:42:48,080
If you were gonna parse this, you have to know
9089
06:42:48,080 --> 06:42:51,240
that you're gonna read the headers up to a little blank line.
9090
06:42:51,240 --> 06:42:54,480
The blank line is your indication as a software developer
9091
06:42:54,480 --> 06:42:56,960
that the headers have stopped and the actual text begins,
9092
06:42:56,960 --> 06:42:58,200
and you know the syntax.
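A rough sketch of that parsing step, splitting the headers from the body at the first blank line. The function name split_response and the sample bytes are made up for illustration and are not real server output.

```python
def split_response(raw):
    # Headers end at the first blank line, which is two CRLFs in a row.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Hypothetical response bytes, only to exercise the function.
sample = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello world\n"
status, headers, body = split_response(sample)
print(status)
print(headers["content-type"])
print(body)
```

The body is left as bytes on purpose, since it might be text or it might be an image, and the Content-Type header is what tells you which.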
9093
06:42:58,200 --> 06:43:02,120
This actually could be a JPEG or PNG
9094
06:43:02,120 --> 06:43:03,600
or some kind of image, right?
9095
06:43:03,600 --> 06:43:05,760
And this data would here look like, blah, blah, blah.
9096
06:43:05,760 --> 06:43:07,940
So if you type this and you change that code
9097
06:43:07,940 --> 06:43:10,880
to actually go retrieve a JPEG URL,
9098
06:43:10,880 --> 06:43:12,820
gibberish will come out, okay?
9099
06:43:13,760 --> 06:43:15,920
And so that's exactly what you will see,
9100
06:43:15,920 --> 06:43:20,700
and so now you have built a very simple web browser.
9101
06:43:20,700 --> 06:43:23,680
Next, I wanna talk a little bit about what happens
9102
06:43:23,680 --> 06:43:28,600
when characters transition outside your computer,
9103
06:43:28,600 --> 06:43:30,760
I mean from inside the computer in strings,
9104
06:43:30,760 --> 06:43:33,900
out across these sockets to servers and then back.
9105
06:43:40,140 --> 06:43:43,560
Hello, everybody, and welcome to some worked sample code.
9106
06:43:43,560 --> 06:43:45,060
If you are interested in the source code,
9107
06:43:45,060 --> 06:43:49,320
you go to materials and download this sample code.zip.
9108
06:43:49,320 --> 06:43:51,800
I have this downloaded.
9109
06:43:51,800 --> 06:43:54,800
It'll be in a folder called code3 on my computer.
9110
06:43:54,800 --> 06:43:57,160
This is where I'm at, I'm in the code3 folder,
9111
06:43:57,160 --> 06:44:00,800
and this has a ton of bits of code here.
9112
06:44:00,800 --> 06:44:04,480
So if I do an ls, you'll see I got all these files here,
9113
06:44:04,480 --> 06:44:08,120
and so we'll just leave those there.
9114
06:44:08,120 --> 06:44:10,880
And so this is the one I wanna work through right now
9115
06:44:10,880 --> 06:44:13,440
is this socket1.py code.
9116
06:44:13,440 --> 06:44:17,240
And basically what we're doing here is we're simulating
9117
06:44:17,240 --> 06:44:20,080
what is gonna happen in a web browser.
9118
06:44:20,080 --> 06:44:24,480
And the cool thing about the HTML, the HTTP protocol,
9119
06:44:24,480 --> 06:44:26,800
is that we can do this by hand,
9120
06:44:26,800 --> 06:44:29,880
and I'm actually gonna hack this HTTP protocol.
9121
06:44:29,880 --> 06:44:34,720
This is gonna go to data.pr4e.org and retrieve a document.
9122
06:44:36,340 --> 06:44:40,880
And so I'm gonna do telnet to,
9123
06:44:40,880 --> 06:44:43,040
now you can do this on a Mac and Linux,
9124
06:44:43,040 --> 06:44:45,920
and if you put telnet on a Windows box, you can do it here,
9125
06:44:45,920 --> 06:44:50,680
data.pr4e.org, and I wanna talk to port 80,
9126
06:44:50,680 --> 06:44:52,320
and the port 80 is a different port,
9127
06:44:52,320 --> 06:44:54,560
it's not the standard Telnet port, but what we're doing here
9128
06:44:54,560 --> 06:44:57,880
is talking to the HTTP port.
9129
06:44:57,880 --> 06:45:02,760
And so I'm going to be able to hand send commands
9130
06:45:02,760 --> 06:45:05,280
to the web server and retrieve a document.
9131
06:45:05,280 --> 06:45:09,440
So I'm gonna cut, I've already copied this string,
9132
06:45:09,440 --> 06:45:14,440
this get HTTP romeo.txt, I'm copying that into my buffer
9133
06:45:14,440 --> 06:45:17,640
because if I wait too long, this won't work.
9134
06:45:17,640 --> 06:45:20,440
So here I go, and now I'm gonna type that,
9135
06:45:20,440 --> 06:45:22,240
and I have to hit enter twice,
9136
06:45:22,240 --> 06:45:25,440
and that literally was the HTTP protocol.
9137
06:45:25,440 --> 06:45:27,760
What I typed there was the HTTP protocol,
9138
06:45:27,760 --> 06:45:30,240
and the web server responds with some metadata
9139
06:45:30,240 --> 06:45:33,720
about the document, how much data there is,
9140
06:45:33,720 --> 06:45:35,520
the kind of data is there.
9141
06:45:36,680 --> 06:45:40,160
A blank line separates the header information
9142
06:45:40,160 --> 06:45:42,680
from the body of the document.
9143
06:45:42,680 --> 06:45:45,600
If I was to go to this in a browser, right there,
9144
06:45:45,600 --> 06:45:50,600
you would see, and if I turned on developer console,
9145
06:45:55,160 --> 06:45:57,280
and I went to the network, let's make this
9146
06:45:57,280 --> 06:46:02,280
a little bit bigger, you would see that
9147
06:46:04,960 --> 06:46:07,640
it retrieves this file romeo.txt,
9148
06:46:07,640 --> 06:46:10,600
and it gets back, it tells us, it shows us the headers,
9149
06:46:10,600 --> 06:46:11,880
and it shows us the response.
9150
06:46:11,880 --> 06:46:15,480
And so this is all the same way of doing the same thing,
9151
06:46:15,480 --> 06:46:20,000
and that is how to do the HTTP protocol, okay?
9152
06:46:20,000 --> 06:46:21,880
But now we're gonna do this in Python,
9153
06:46:21,880 --> 06:46:24,080
and so here's the code we're gonna write.
9154
06:46:24,080 --> 06:46:26,280
So we're gonna import the socket library,
9155
06:46:26,280 --> 06:46:27,640
and we're gonna make a socket.
9156
06:46:27,640 --> 06:46:29,400
Now this doesn't actually make a connection,
9157
06:46:29,400 --> 06:46:31,560
think of a socket as a file handle
9158
06:46:31,560 --> 06:46:34,360
that doesn't have any data associated with it yet.
9159
06:46:34,360 --> 06:46:36,000
And then what we're going to do is we're going to
9160
06:46:36,000 --> 06:46:40,720
reach out and connect that socket to a destination
9161
06:46:40,720 --> 06:46:42,640
across the internet with the domain name
9162
06:46:42,640 --> 06:46:45,720
of data.pr4e.org, and the second parameter
9163
06:46:45,720 --> 06:46:48,400
in this tuple, this is a function call
9164
06:46:48,400 --> 06:46:50,640
with a single tuple as a parameter,
9165
06:46:50,640 --> 06:46:53,280
and so tuple sub zero is data.pr4e.org,
9166
06:46:53,280 --> 06:46:55,120
and tuple sub one is the 80, which says
9167
06:46:55,120 --> 06:46:56,480
I wanna talk to port 80.
9168
06:46:57,560 --> 06:47:01,360
That could fail, it will make the connection,
9169
06:47:01,360 --> 06:47:05,680
and if the port 80 is there, away it goes.
9170
06:47:05,680 --> 06:47:08,220
And then we're gonna actually send the HTTP command,
9171
06:47:08,220 --> 06:47:10,680
so get, this is the HTTP rules,
9172
06:47:10,680 --> 06:47:14,080
followed by an end of line, followed by a blank line.
9173
06:47:14,080 --> 06:47:17,160
So you saw me do this, this was what I typed here,
9174
06:47:17,160 --> 06:47:18,480
and then I had to type a blank line.
9175
06:47:18,480 --> 06:47:21,600
Now if you wanna go read the RFCs for how to do this,
9176
06:47:21,600 --> 06:47:22,960
you can figure this out.
9177
06:47:22,960 --> 06:47:25,120
So the only other thing that's kinda weird here
9178
06:47:25,120 --> 06:47:28,440
is we have to add this .encode(),
9179
06:47:29,600 --> 06:47:32,900
and that's because there are strings inside of Python
9180
06:47:32,900 --> 06:47:35,840
that are in Unicode, and we have to send them out
9181
06:47:35,840 --> 06:47:38,480
as what's called UTF-8, and encode()
9182
06:47:38,480 --> 06:47:41,440
converts from Unicode internally to UTF-8.
9183
06:47:41,440 --> 06:47:45,480
So this command is a set of UTF-8 bytes
9184
06:47:45,480 --> 06:47:46,760
that we're then going to send.
9185
06:47:46,760 --> 06:47:49,080
It still has that same set of characters in it,
9186
06:47:50,000 --> 06:47:51,000
and now we're gonna send it.
9187
06:47:51,000 --> 06:47:54,400
And that's, after we've made the connection,
9188
06:47:54,400 --> 06:47:55,880
we're gonna send these two things,
9189
06:47:55,880 --> 06:47:57,960
and then we're going to wait.
9190
06:47:57,960 --> 06:48:00,920
And mysock is like a file handle at that point,
9191
06:48:00,920 --> 06:48:03,160
because it's been opened and we've sent data.
9192
06:48:03,160 --> 06:48:06,120
The HTTP protocol told us what we had to send
9193
06:48:06,120 --> 06:48:07,880
and the fact that we did have to send it.
9194
06:48:07,880 --> 06:48:10,200
So now I have just a simple while loop,
9195
06:48:10,200 --> 06:48:14,060
and I'm going to ask for up to 512 characters,
9196
06:48:14,060 --> 06:48:18,240
and receive up to 512 characters and get that back.
9197
06:48:18,240 --> 06:48:21,200
I will know that this is the end of file
9198
06:48:21,200 --> 06:48:24,080
if I got no data back, so if the length of the data,
9199
06:48:24,080 --> 06:48:26,840
the byte array that I got back is less than one,
9200
06:48:26,840 --> 06:48:28,080
then I'm gonna quit.
9201
06:48:28,080 --> 06:48:29,420
Otherwise, I'm gonna print the data,
9202
06:48:29,420 --> 06:48:30,620
and I'm gonna use this decode,
9203
06:48:30,620 --> 06:48:32,860
which is kinda the opposite of this encode.
9204
06:48:32,860 --> 06:48:37,320
What I'm getting is UTF-8 encoded data, most likely,
9205
06:48:37,320 --> 06:48:40,400
and decode basically converts it to the internal format
9206
06:48:40,400 --> 06:48:43,000
called Unicode that runs inside.
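A small sketch of that round trip, using a made-up example string with one non-ASCII character so the difference between characters and bytes shows up:

```python
text = "caf\u00e9"             # a Python str holds Unicode characters (here: cafe with an accented e)
wire = text.encode()           # str -> bytes, UTF-8 by default
print(wire)                    # b'caf\xc3\xa9'
print(len(text), len(wire))    # 4 5
back = wire.decode()           # bytes -> str, UTF-8 by default
print(back == text)            # True
```

Four characters become five bytes, because the accented letter takes two bytes in UTF-8 while the plain ASCII letters take one each.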
9207
06:48:43,000 --> 06:48:44,560
So this is gonna run a bunch of times,
9208
06:48:44,560 --> 06:48:47,120
pulling in the blocks, basically 512,
9209
06:48:47,120 --> 06:48:50,080
up to 512 characters at a time, printing it out,
9210
06:48:50,080 --> 06:48:51,360
and then when it's all said and done,
9211
06:48:51,360 --> 06:48:53,160
we will close that connection.
9212
06:48:53,160 --> 06:48:55,880
And so, it's not too exciting.
9213
06:48:55,880 --> 06:49:00,880
python3 socket1.py.
9214
06:49:00,880 --> 06:49:03,040
And you'll see that it's just gonna,
9215
06:49:03,040 --> 06:49:05,900
Python is now gonna do what I did by hand.
9216
06:49:05,900 --> 06:49:07,400
Now, of course, the interesting thing is
9217
06:49:07,400 --> 06:49:09,040
these are all in strings, right?
9218
06:49:09,040 --> 06:49:12,480
And so, you know, this way we could write code
9219
06:49:12,480 --> 06:49:13,520
that does stuff with this.
9220
06:49:13,520 --> 06:49:15,200
But all we're really trying to do
9221
06:49:15,200 --> 06:49:19,400
in this particular situation is show how you open a socket,
9222
06:49:19,400 --> 06:49:22,480
send a command, and then retrieve the data.
9223
06:49:26,800 --> 06:49:30,560
Okay, so now it's time to teach you a bit of complexity
9224
06:49:30,560 --> 06:49:32,200
about text processing.
9225
06:49:32,200 --> 06:49:35,040
Up till now, we've kind of been ignoring
9226
06:49:35,040 --> 06:49:36,900
the complexity of text processing.
9227
06:49:37,840 --> 06:49:40,400
Everything that I have been doing,
9228
06:49:40,400 --> 06:49:43,040
most of what I've been doing is in ASCII,
9229
06:49:43,920 --> 06:49:47,000
the Latin character set, the character set that,
9230
06:49:47,000 --> 06:49:49,320
you know, United States, Europe,
9231
06:49:49,320 --> 06:49:53,060
lots of Western civilizations use this character set.
9232
06:49:53,060 --> 06:49:57,360
And if you go back to the 1950s and 1960s,
9233
06:49:57,360 --> 06:50:00,160
they, we were happy to have one computer
9234
06:50:00,160 --> 06:50:02,000
and we didn't care what the character set was
9235
06:50:02,000 --> 06:50:04,040
as long as what you typed on the keyboard
9236
06:50:04,040 --> 06:50:05,280
came out on the printer,
9237
06:50:05,280 --> 06:50:08,360
the internal representation didn't matter.
9238
06:50:08,360 --> 06:50:13,320
And as the 70s and 80s came along, certainly 70s,
9239
06:50:13,320 --> 06:50:14,720
we needed some interoperability.
9240
06:50:14,720 --> 06:50:16,840
And so they standardized that character set,
9241
06:50:16,840 --> 06:50:18,740
but they standardized that character set,
9242
06:50:18,740 --> 06:50:22,120
certainly in the West, that did not represent anything.
9243
06:50:22,120 --> 06:50:26,020
And so if you look at this sheet,
9244
06:50:26,020 --> 06:50:27,880
basically what it's telling you
9245
06:50:27,880 --> 06:50:30,680
is for the various characters,
9246
06:50:30,680 --> 06:50:32,280
there's some non-printing characters,
9247
06:50:32,280 --> 06:50:34,560
white space, non-printing characters,
9248
06:50:34,560 --> 06:50:35,860
and then here's some printing characters
9249
06:50:35,860 --> 06:50:37,880
like the ampersand, the zero,
9250
06:50:37,880 --> 06:50:39,820
and then the uppercase characters,
9251
06:50:39,820 --> 06:50:41,520
and then the lowercase characters.
9252
06:50:41,520 --> 06:50:45,040
And there's 128 of these possible values.
9253
06:50:45,040 --> 06:50:48,880
And there is nothing even for Spanish or French in here.
9254
06:50:48,880 --> 06:50:50,840
And it's also why, by the way,
9255
06:50:50,840 --> 06:50:54,000
uppercase letters in Latin sort lower
9256
06:50:54,000 --> 06:50:55,200
than lowercase letters,
9257
06:50:55,200 --> 06:50:57,480
and we saw that in some of the string stuff.
9258
06:50:57,480 --> 06:51:00,560
And what these do is it maps and says, okay.
9259
06:51:03,240 --> 06:51:07,460
And a lowercase a maps to the number integer number 97,
9260
06:51:07,460 --> 06:51:12,460
which in base 16 is 61, and in octal it's 141.
9261
06:51:12,480 --> 06:51:14,840
But in binary, it's eight bit numbers.
9262
06:51:14,840 --> 06:51:16,860
And so these are eight bits,
9263
06:51:18,160 --> 06:51:19,520
otherwise known as a byte.
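You can check those numbers for lowercase a yourself. This is just a quick sketch using built-in functions:

```python
n = ord("a")
print(n)                  # 97
print(hex(n))             # 0x61
print(oct(n))             # 0o141
print(format(n, "08b"))   # 01100001
```

The last line pads the binary form out to eight digits, which is the one byte being talked about here.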
9264
06:51:20,760 --> 06:51:22,200
And they're very efficient.
9265
06:51:22,200 --> 06:51:24,520
Like when you buy a disk drive,
9266
06:51:24,520 --> 06:51:26,520
it's megabytes or gigabytes or whatever,
9267
06:51:26,520 --> 06:51:30,080
that's how many of these kind of characters it can store.
9268
06:51:30,080 --> 06:51:33,520
But unfortunately, this doesn't work
9269
06:51:33,520 --> 06:51:35,280
for more complex characters.
9270
06:51:35,280 --> 06:51:38,680
You can figure out these numbers inside of Python
9271
06:51:38,680 --> 06:51:41,820
by using the ord function.
9272
06:51:41,820 --> 06:51:44,000
And so you say, what is the ordinal
9273
06:51:44,000 --> 06:51:47,360
or the numeric representation of the uppercase h,
9274
06:51:47,360 --> 06:51:49,920
lowercase e, and newline is a character as well.
9275
06:51:49,920 --> 06:51:53,320
And so like 10 is the ordinal position of newline.
9276
06:51:53,320 --> 06:51:54,920
And this actually has to do with sorting
9277
06:51:54,920 --> 06:51:59,360
so that lowercase e is higher than uppercase h.
9278
06:51:59,360 --> 06:52:01,920
And that's just because in the simplest of sorts,
9279
06:52:01,920 --> 06:52:04,240
we just sort them numerically.
9280
06:52:04,240 --> 06:52:07,360
So newline, if you go back to the previous little sheet,
9281
06:52:07,360 --> 06:52:10,680
newline is this 10 right here, it's that 10,
9282
06:52:10,680 --> 06:52:12,680
which is a line feed and that's a 10.
9283
06:52:12,680 --> 06:52:15,400
And that's why when we print newline out, we get a 10.
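A quick sketch of the ord calls just described, plus the sorting consequence:

```python
print(ord("H"))              # 72
print(ord("e"))              # 101
print(ord("\n"))             # 10
print("H" < "e")             # True, because 72 is less than 101
print(sorted(["e", "H"]))    # ['H', 'e']
```

The comparison and the sort both fall out of the numbers, so uppercase letters come before lowercase ones.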
9284
06:52:16,680 --> 06:52:19,900
And so again, in the early days when strings were simple,
9285
06:52:19,900 --> 06:52:22,560
we just represented them as one byte per character.
9286
06:52:22,560 --> 06:52:27,560
But the problem is that as we have gotten more complex
9287
06:52:28,080 --> 06:52:30,360
and in today's modern world, it's simply unacceptable
9288
06:52:30,360 --> 06:52:32,980
to say that the only thing computers can understand
9289
06:52:32,980 --> 06:52:34,040
is ASCII.
9290
06:52:34,040 --> 06:52:37,420
And so this leads us
9291
06:52:37,420 --> 06:52:39,080
from the simplest of character sets
9292
06:52:39,080 --> 06:52:42,720
to a super complex character set called Unicode,
9293
06:52:42,720 --> 06:52:46,120
which basically has room for over a million characters,
9294
06:52:46,120 --> 06:52:49,780
enough potential characters for every language
9295
06:52:49,780 --> 06:52:51,400
and every character set.
9296
06:52:51,400 --> 06:52:53,640
And because there's so much space in Unicode,
9297
06:52:53,640 --> 06:52:57,680
it's easy to take very small variations of characters
9298
06:52:57,680 --> 06:52:58,680
and give them a space.
9299
06:52:58,680 --> 06:53:01,080
It's so large that you can have,
9300
06:53:02,800 --> 06:53:05,960
you can have pretty much any character that you want.
9301
06:53:05,960 --> 06:53:07,220
So that's Unicode.
9302
06:53:08,480 --> 06:53:12,840
The problem is that if we sent Unicode across the network,
9303
06:53:12,840 --> 06:53:14,600
it would be way too large.
9304
06:53:14,600 --> 06:53:17,920
It'd be this UTF-32, which instead of being
9305
06:53:17,920 --> 06:53:21,160
eight bits per character would be four bytes per character.
9306
06:53:21,160 --> 06:53:24,440
And so it would take all of the data that we build
9307
06:53:24,440 --> 06:53:29,440
and make it four times larger and it'd be very difficult.
9308
06:53:29,440 --> 06:53:33,780
And so what they've come up with is ways to compress this.
9309
06:53:35,020 --> 06:53:37,600
And UTF-16 is this weird thing.
9310
06:53:37,600 --> 06:53:42,160
UTF-32 is really sort of the full Unicode pretty much.
9311
06:53:42,160 --> 06:53:44,920
UTF-16 covers the same characters in two or four bytes each.
9312
06:53:44,920 --> 06:53:47,860
It's used in some countries.
9313
06:53:47,860 --> 06:53:52,580
But the best practice for moving data across the internet
9314
06:53:52,580 --> 06:53:56,000
or in a file that you're gonna move between computers
9315
06:53:56,000 --> 06:53:58,200
is what's called UTF-8.
9316
06:53:58,200 --> 06:54:01,800
And so what happens is that UTF-32 is fixed length.
9317
06:54:01,800 --> 06:54:06,060
ASCII is one byte.
9318
06:54:08,080 --> 06:54:10,680
UTF-16 is two or four bytes, UTF-32 is four bytes.
9319
06:54:10,680 --> 06:54:14,560
And UTF-8 has dynamic length,
9320
06:54:14,560 --> 06:54:16,940
meaning that it is one to four bytes.
9321
06:54:16,940 --> 06:54:18,560
And if it's only one byte long,
9322
06:54:18,560 --> 06:54:20,680
it's perfectly compatible with ASCII,
9323
06:54:20,680 --> 06:54:24,400
meaning that an ASCII file is also UTF-8.
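A small sketch comparing how many bytes a character takes in each encoding. The four sample characters and the name table are choices made for this illustration; the little-endian codec names are used only so that no byte order mark is counted.

```python
# Sample characters: ASCII letter, accented letter, euro sign, emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

table = {}
for ch in samples:
    # Count the bytes each encoding needs for this single character.
    table[ch] = {enc: len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(repr(ch), table[ch])

# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_bytes)   # True
```

UTF-8 grows from one to four bytes as the characters get less common, while UTF-32 is always four.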
9324
06:54:24,400 --> 06:54:25,840
And so here's this little sheet.
9325
06:54:25,840 --> 06:54:28,420
It's not critical that you understand this graph too much,
9326
06:54:28,420 --> 06:54:30,640
but basically as time passed,
9327
06:54:30,640 --> 06:54:34,960
from 2000 on, UTF-8 kept climbing, and by 2014,
9328
06:54:34,960 --> 06:54:38,040
pretty much overwhelmingly the documents on the internet
9329
06:54:38,040 --> 06:54:40,800
that you might retrieve are UTF-8.
9330
06:54:40,800 --> 06:54:44,100
Now, so UTF-8 is the recommended practice
9331
06:54:44,100 --> 06:54:48,040
and it's sort of a compression. UTF-8 can represent
9332
06:54:48,040 --> 06:54:50,640
all the things UTF-32 can represent.
9333
06:54:50,640 --> 06:54:52,760
It's just a more compact encoding of it,
9334
06:54:52,760 --> 06:54:56,160
with an overlap of ASCII, which is awesome.
9335
06:54:56,160 --> 06:54:57,260
It's what you want.
9336
06:54:58,520 --> 06:54:59,740
I don't even talk anymore.
9337
06:54:59,740 --> 06:55:02,440
So in Python, we have always had
9338
06:55:02,440 --> 06:55:05,540
sort of two ways of representing strings.
9339
06:55:05,540 --> 06:55:10,480
In Python 2, the normal string was a byte string,
9340
06:55:10,480 --> 06:55:13,680
was an ASCII string, was a Latin string.
9341
06:55:13,680 --> 06:55:15,640
And if you wanted to represent Unicode,
9342
06:55:15,640 --> 06:55:18,540
there was a separate kind of object that we had.
9343
06:55:18,540 --> 06:55:21,560
And so you would do that.
9344
06:55:21,560 --> 06:55:25,960
And in Python 3.0 or later,
9345
06:55:25,960 --> 06:55:28,720
one of the main features of Python 3
9346
06:55:28,720 --> 06:55:31,400
was to make Unicode and string the same.
9347
06:55:31,400 --> 06:55:33,640
So that means inside of Python,
9348
06:55:33,640 --> 06:55:36,940
when you have a string variable, it's Unicode.
9349
06:55:36,940 --> 06:55:40,840
Whereas inside of Python 2, it was a byte variable.
9350
06:55:40,840 --> 06:55:44,160
And so now we have this notion,
9351
06:55:44,160 --> 06:55:47,520
separately, in both Python 2 and Python 3,
9352
06:55:47,520 --> 06:55:49,740
where we have byte variables.
9353
06:55:49,740 --> 06:55:54,360
And so byte variables are, in effect, an array of bytes.
9354
06:55:54,360 --> 06:55:57,280
So if there's ABC, that means it's three bytes,
9355
06:55:57,280 --> 06:55:58,560
it's three bytes long.
9356
06:55:58,560 --> 06:56:01,240
Whereas a string might be, a three-character string
9357
06:56:01,240 --> 06:56:04,140
might be anywhere from three to 12 bytes long.
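A short sketch of the difference between a string and a bytes value in Python 3. The variable names and the sample characters are made up for this illustration.

```python
text = "abc"      # str: a sequence of Unicode characters
raw = b"abc"      # bytes: a sequence of integers from 0 to 255

print(type(text).__name__, len(text))   # str 3
print(type(raw).__name__, len(raw))     # bytes 3
print(raw[0])                            # 97, indexing bytes gives an integer

# A three-character string can need more than three bytes once encoded.
three_kanji = "\u65e5\u672c\u8a9e"
three_emoji = "\U0001F600" * 3
print(len(three_kanji), len(three_kanji.encode("utf-8")))   # 3 9
print(len(three_emoji), len(three_emoji.encode("utf-8")))   # 3 12
```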
9358
06:56:05,120 --> 06:56:08,720
So Python 2 had bytes and strings that were the same.
9359
06:56:08,720 --> 06:56:11,680
So bytes and strings are the same,
9360
06:56:11,680 --> 06:56:13,000
and Unicode is weird.
9361
06:56:13,000 --> 06:56:14,580
And in Python 3,
9362
06:56:18,040 --> 06:56:21,320
strings and Unicode are the same, and bytes are weird.
9363
06:56:21,320 --> 06:56:24,520
Okay, and so that's what we've got to deal with.
9364
06:56:24,520 --> 06:56:29,520
And there'll be times when we get bytes from APIs,
9365
06:56:29,600 --> 06:56:32,320
when we call things, we have to then figure out
9366
06:56:32,320 --> 06:56:33,960
what kind of thing those bytes contain.
9367
06:56:33,960 --> 06:56:35,920
Because the bytes might contain ASCII,
9368
06:56:35,920 --> 06:56:39,460
they might contain UTF-8, they might contain various things.
9369
06:56:39,460 --> 06:56:43,080
And so internally, all the strings in Python 3 are Unicode.
9370
06:56:44,040 --> 06:56:46,660
Most of the time, if you're inside the program,
9371
06:56:46,660 --> 06:56:48,960
or reading and writing files, it just works.
9372
06:56:48,960 --> 06:56:50,880
And that's why we haven't mentioned it.
9373
06:56:50,880 --> 06:56:52,560
But now that we're talking over sockets,
9374
06:56:52,560 --> 06:56:55,080
and we're talking to the sort of random world out there,
9375
06:56:55,080 --> 06:56:57,240
we have to be a little more aware
9376
06:56:57,240 --> 06:56:58,360
of the data we're dealing with.
9377
06:56:58,360 --> 06:57:01,280
Now, the good news is 98% of the time,
9378
06:57:01,280 --> 06:57:04,800
or 95% of the time, it's UTF-8,
9379
06:57:04,800 --> 06:57:08,000
which might also include ASCII, and so it's quite nice.
9380
06:57:08,000 --> 06:57:10,440
But we have to be aware of this.
9381
06:57:11,440 --> 06:57:15,040
And so if we are going to take data
9382
06:57:15,040 --> 06:57:17,640
that comes off of the network as bytes,
9383
06:57:17,640 --> 06:57:20,520
then we have to make sure that we interpret it,
9384
06:57:20,520 --> 06:57:23,380
or decode it in the right way,
9385
06:57:23,380 --> 06:57:25,960
so that internally the strings, which are Unicode,
9386
06:57:25,960 --> 06:57:27,680
are properly represented.
9387
06:57:27,680 --> 06:57:30,760
And so that's why when we read data in
9388
06:57:30,760 --> 06:57:33,500
from a network connection like a socket,
9389
06:57:33,500 --> 06:57:35,560
we have to say, hey, decode it.
9390
06:57:35,560 --> 06:57:37,440
Now, there's a couple things going on
9391
06:57:37,440 --> 06:57:39,320
at that moment of decode.
9392
06:57:40,320 --> 06:57:42,040
And so this is where we're doing it.
9393
06:57:42,040 --> 06:57:45,280
We see this, we have to manage this in this code,
9394
06:57:45,280 --> 06:57:47,320
where we, before we send this stuff,
9395
06:57:47,320 --> 06:57:50,760
we're gonna encode it, which takes a Unicode string
9396
06:57:50,760 --> 06:57:53,240
and turns it into UTF-8 bytes.
9397
06:57:53,240 --> 06:57:54,320
There's actually a parameter here
9398
06:57:54,320 --> 06:57:56,320
that you could do it different than UTF-8,
9399
06:57:56,320 --> 06:57:57,840
but no one ever does.
9400
06:57:57,840 --> 06:57:59,700
You might have to for certain situations,
9401
06:57:59,700 --> 06:58:03,640
but so that says that we're gonna encode this into UTF-8
9402
06:58:03,640 --> 06:58:06,540
before we send it, and then when we get something back,
9403
06:58:06,540 --> 06:58:08,760
before we print it, we're gonna decode it.
9404
06:58:08,760 --> 06:58:10,900
And that's how this ends up working out.
9405
06:58:11,920 --> 06:58:14,180
And if you look at the documentation,
9406
06:58:14,180 --> 06:58:17,040
you will see that sometimes it says it's a string,
9407
06:58:17,040 --> 06:58:18,120
or it's bytes.
9408
06:58:18,120 --> 06:58:22,120
And so you take a byte array,
9409
06:58:22,120 --> 06:58:23,820
and you decode it to get a string,
9410
06:58:23,820 --> 06:58:26,680
and you take a string and encode it to get a byte array.
9411
06:58:26,680 --> 06:58:29,280
And so that's what we're doing.
9412
06:58:29,280 --> 06:58:32,120
So you can think of the process as this way,
9413
06:58:32,120 --> 06:58:36,860
and that is the network has these UTF-8,
9414
06:58:36,860 --> 06:58:40,680
mostly UTF-8 resources, not ASCII.
9415
06:58:40,680 --> 06:58:42,160
If it's ASCII, it's okay.
9416
06:58:42,160 --> 06:58:44,840
So you read with the receive.
9417
06:58:44,840 --> 06:58:47,160
So this receive here pulls data,
9418
06:58:47,160 --> 06:58:49,480
well, we have a Unicode string,
9419
06:58:49,480 --> 06:58:50,780
let's start with the send.
9420
06:58:50,780 --> 06:58:53,480
So up here, we have a Unicode string,
9421
06:58:53,480 --> 06:58:54,720
that's a Unicode string,
9422
06:58:54,720 --> 06:58:56,680
even though there's no special characters in it,
9423
06:58:56,680 --> 06:58:58,880
no Asian characters or French characters,
9424
06:58:58,880 --> 06:59:00,800
that's a Unicode string.
9425
06:59:00,800 --> 06:59:02,400
And before we can send it,
9426
06:59:02,400 --> 06:59:04,320
we have to send it in UTF-8.
9427
06:59:04,320 --> 06:59:07,440
If that had Asian characters, it'd be okay,
9428
06:59:07,440 --> 06:59:09,240
and that would be set up just right,
9429
06:59:09,240 --> 06:59:11,000
so that the UTF-8 would be right.
9430
06:59:11,000 --> 06:59:13,820
So we encode it first, and that's cmd.
9431
06:59:13,820 --> 06:59:15,240
This is now bytes, okay?
9432
06:59:15,240 --> 06:59:18,440
cmd is bytes, and then we actually send the bytes.
9433
06:59:18,440 --> 06:59:19,800
And that goes across the network.
9434
06:59:19,800 --> 06:59:22,280
We get back our thing, and we receive,
9435
06:59:22,280 --> 06:59:26,820
and we receive into data, well, data is bytes, not string.
9436
06:59:26,820 --> 06:59:27,760
It's bytes.
9437
06:59:27,760 --> 06:59:29,440
We can say how big it is.
9438
06:59:29,440 --> 06:59:31,760
It functions kinda like a string, and it has len,
9439
06:59:31,760 --> 06:59:34,520
except that len counts bytes, not characters,
9440
06:59:34,520 --> 06:59:37,520
because some of it might be multi-byte UTF-8.
9441
06:59:37,520 --> 06:59:39,200
And then all we have to do is say decode.
9442
06:59:39,200 --> 06:59:41,480
Again, you could, if you were dealing with a situation
9443
06:59:41,480 --> 06:59:42,600
where you weren't expecting it
9444
06:59:42,600 --> 06:59:45,560
to typically be UTF-8 or ASCII,
9445
06:59:45,560 --> 06:59:48,040
you could tell it UTF-16 or something,
9446
06:59:48,040 --> 06:59:49,320
and it's more complex,
9447
06:59:49,320 --> 06:59:51,480
but the simple thing is to just say,
9448
06:59:51,480 --> 06:59:53,240
I'm gonna clean up my data on the way in,
9449
06:59:53,240 --> 06:59:54,940
I'm gonna clean it up by running it through decode,
9450
06:59:54,940 --> 06:59:56,960
and I'm gonna encode stuff on the way out.
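A minimal sketch of encoding on the way out and decoding on the way in. To stay runnable without a network, it assumes socket.socketpair() as a stand-in for a real connection to a web server; the request line is modeled on the course's socket example, and the reply text is invented for this illustration.

```python
import socket

# Two connected sockets in one process stand in for the client and the server.
client, server = socket.socketpair()

# Strings inside Python 3 are Unicode; encode() turns the request into bytes.
cmd = "GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n".encode()  # UTF-8 by default
client.sendall(cmd)

request = server.recv(512)                      # the other side receives bytes
server.sendall("caf\u00e9\n".encode("utf-8"))   # pretend reply with one non-ASCII character

data = client.recv(512)                         # data is bytes, not str
text = data.decode()                            # decode() defaults to UTF-8 and returns a str
print(len(data), len(text))                     # 6 bytes, 5 characters
print(text.rstrip())

client.close()
server.close()
```

With a real server the only change is how the socket gets connected; the encode before send and decode after recv stay the same.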
9451
06:59:56,960 --> 07:00:01,440
And so sockets are the place where this comes into play.
9452
07:00:01,440 --> 07:00:04,500
And so you'll see, we'll always do this encode and decode
9453
07:00:04,500 --> 07:00:07,720
every time we're moving data between outside of Python
9454
07:00:07,720 --> 07:00:09,680
and inside of Python.
9455
07:00:09,680 --> 07:00:11,960
So now that we've talked a little bit about character sets,
9456
07:00:11,960 --> 07:00:15,080
we're going to make this even easier
9457
07:00:15,080 --> 07:00:16,280
so you don't have to use sockets.
9458
07:00:16,280 --> 07:00:20,040
urllib is a bit of Python code in the library
9459
07:00:20,040 --> 07:00:22,040
that does all the socket stuff for you.
9460
07:00:22,040 --> 07:00:27,040
Okay, so now we're going to write a web browser again
9461
07:00:28,480 --> 07:00:31,120
in Python, but it's going to even be shorter
9462
07:00:31,120 --> 07:00:32,320
than what we did before.
9463
07:00:32,320 --> 07:00:34,540
We did it in 10 lines using sockets.
9464
07:00:34,540 --> 07:00:37,900
Now we're gonna do it in four lines with urllib.
9465
07:00:37,900 --> 07:00:41,380
So urllib really exists just because the idea
9466
07:00:41,380 --> 07:00:43,520
of opening a connection, sending a GET request,
9467
07:00:43,520 --> 07:00:45,700
sending the new line, retrieving the stuff,
9468
07:00:45,700 --> 07:00:48,080
breaking the headers out, doing all this stuff,
9469
07:00:48,080 --> 07:00:50,520
that's so common, why not put it in a library
9470
07:00:50,520 --> 07:00:52,280
to save ourselves some effort.
9471
07:00:52,280 --> 07:00:54,520
So here's how we do it.
9472
07:00:54,520 --> 07:00:58,600
We're going to read it in, we're gonna import this library
9473
07:00:58,600 --> 07:01:01,000
so it's not part, we had to import sockets before,
9474
07:01:01,000 --> 07:01:02,920
but we're gonna import urllib now.
9475
07:01:02,920 --> 07:01:04,880
And so this is really quite simple.
9476
07:01:04,880 --> 07:01:07,240
It's like elegantly simple.
9477
07:01:07,240 --> 07:01:09,880
You say urllib, that's the library,
9478
07:01:09,880 --> 07:01:11,920
request, that's a module within the library,
9479
07:01:11,920 --> 07:01:12,960
and urlopen is a function.
9480
07:01:12,960 --> 07:01:17,120
So let's call urlopen and then give it the URL.
9481
07:01:17,120 --> 07:01:19,760
Now that's a string which it's gonna encode automatically
9482
07:01:19,760 --> 07:01:22,360
for us, so it's taking care of all kinds of things
9483
07:01:22,360 --> 07:01:24,520
for us, it does the GET, it does the encode.
9484
07:01:24,520 --> 07:01:26,120
Look back at that previous code.
9485
07:01:26,120 --> 07:01:29,360
That's kind of what urllib is doing for us, okay?
9486
07:01:29,360 --> 07:01:32,720
Now what urllib also does is it makes the connection,
9487
07:01:32,720 --> 07:01:36,720
encodes the GET request, and then it actually retrieves,
9488
07:01:36,720 --> 07:01:38,840
at this moment, it retrieves all the headers
9489
07:01:38,840 --> 07:01:40,080
and keeps them for you for later.
9490
07:01:40,080 --> 07:01:42,560
You can get the headers, but we're not gonna see the headers.
9491
07:01:42,560 --> 07:01:45,560
And it returns to you an object that looks
9492
07:01:45,560 --> 07:01:47,440
pretty much like a file handle.
9493
07:01:47,440 --> 07:01:51,480
Because you can put this in the for clause after the in.
9494
07:01:51,480 --> 07:01:56,480
Now it's going to read, run that loop one time
9495
07:01:56,900 --> 07:01:58,540
for every line of this file.
9496
07:01:59,480 --> 07:02:02,020
And so the lines we're gonna get back are bytes
9497
07:02:02,020 --> 07:02:03,240
and so we have to say decode.
9498
07:02:03,240 --> 07:02:05,160
It doesn't do that for us automatically.
9499
07:02:05,160 --> 07:02:06,720
We are gonna have to decode them
9500
07:02:06,720 --> 07:02:08,780
and that's because we might need to decode them
9501
07:02:08,780 --> 07:02:10,560
with a particular character set here.
9502
07:02:10,560 --> 07:02:11,880
And then we're gonna do our strip
9503
07:02:11,880 --> 07:02:12,920
and we're gonna just print this out.
9504
07:02:12,920 --> 07:02:15,380
So that's just, that's like open a file,
9505
07:02:15,380 --> 07:02:16,440
read through it and print it.
9506
07:02:16,440 --> 07:02:18,880
This is open a URL, read through and print it.
9507
07:02:18,880 --> 07:02:22,040
And that's as simple as it is.
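A sketch of the open, loop, decode, strip pattern described above. To keep it runnable without a network, it assumes a local temporary file exposed as a file:// URL in place of the web address; the helper name read_url_lines and the sample text are made up for this illustration.

```python
import pathlib
import tempfile
import urllib.request

def read_url_lines(url):
    """Open a URL and return its lines as decoded, stripped strings."""
    lines = []
    with urllib.request.urlopen(url) as fhand:
        for line in fhand:                        # each line arrives as bytes
            lines.append(line.decode().strip())   # decode to str, then strip the newline
    return lines

# Stand-in for a web address: a temp file handed to urlopen as a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.txt"
    path.write_bytes(b"But soft what light\nthrough yonder window breaks\n")
    lines = read_url_lines(path.as_uri())

for line in lines:
    print(line)
```

Passing an http:// address to the same function should behave the same way, with the headers already consumed by urlopen.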
9508
07:02:22,040 --> 07:02:23,080
And so that's what happens.
9509
07:02:23,080 --> 07:02:26,400
This is romeo.txt and it prints out.
9510
07:02:26,400 --> 07:02:29,440
Now the thing to notice is that there are no headers here.
9511
07:02:29,440 --> 07:02:32,760
The headers have been sort of consumed in the urlopen.
9512
07:02:32,760 --> 07:02:36,080
Again, there is a way to say, hey, give me my headers.
9513
07:02:36,080 --> 07:02:38,920
But for now, this is just gonna eat the headers
9514
07:02:38,920 --> 07:02:41,660
and keep them and then you get to read all the data
9515
07:02:41,660 --> 07:02:44,240
and the loop runs and this loop runs four times
9516
07:02:44,240 --> 07:02:45,220
and out come the four lines.
9517
07:02:45,220 --> 07:02:47,000
You can go ahead and run this one.
9518
07:02:47,000 --> 07:02:48,680
It's super easy.
9519
07:02:48,680 --> 07:02:51,360
I mean literally super easy.
9520
07:02:51,360 --> 07:02:53,640
And if you, you can do anything you want.
9521
07:02:53,640 --> 07:02:55,040
I mean treat it like a file.
9522
07:02:55,040 --> 07:02:57,400
You just have to remember to do the decode bit
9523
07:02:57,400 --> 07:02:59,160
when you treat it like a file.
9524
07:02:59,160 --> 07:03:01,360
And so in that code, we import it.
9525
07:03:01,360 --> 07:03:02,320
We're gonna open it.
9526
07:03:02,320 --> 07:03:04,240
We're going to make a dictionary.
9527
07:03:04,240 --> 07:03:05,560
We're gonna loop through.
9528
07:03:05,560 --> 07:03:06,400
We're gonna split it.
9529
07:03:06,400 --> 07:03:08,260
We have to add the decode just to make sure
9530
07:03:08,260 --> 07:03:10,680
because that line is bytes, not string.
9531
07:03:11,560 --> 07:03:14,140
And then we're gonna go, you know, our words.
9532
07:03:14,140 --> 07:03:16,980
We're gonna go through the line and then each line
9533
07:03:16,980 --> 07:03:18,220
we're gonna bounce through the words.
9534
07:03:18,220 --> 07:03:20,500
The inner for loop is bouncing through the words
9535
07:03:20,500 --> 07:03:22,320
and then we're gonna go to the next line
9536
07:03:22,320 --> 07:03:25,360
and then we make ourselves a dictionary
9537
07:03:25,360 --> 07:03:26,620
and we print that dictionary out.
9538
07:03:26,620 --> 07:03:31,340
Now this is, this in effect, other than, you know,
9539
07:03:31,340 --> 07:03:34,140
importing this, opening it differently and doing the decode,
9540
07:03:34,140 --> 07:03:36,820
this is exactly how we would process a file.
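A sketch of the word-count loop just walked through. It again assumes a local file:// URL as a stand-in for a web address so that it runs offline; the helper name count_words and the sample text are made up for this illustration.

```python
import pathlib
import tempfile
import urllib.request

def count_words(url):
    """Return a dictionary mapping each word at the URL to its frequency."""
    counts = dict()
    with urllib.request.urlopen(url) as fhand:
        for line in fhand:
            words = line.decode().split()   # bytes -> str, then split on whitespace
            for word in words:
                counts[word] = counts.get(word, 0) + 1
    return counts

# Stand-in for a web address: a temp file handed to urlopen as a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.txt"
    path.write_bytes(b"the sun and the moon\nthe sun\n")
    counts = count_words(path.as_uri())

print(counts)   # {'the': 3, 'sun': 2, 'and': 1, 'moon': 1}
```

Apart from the import, the urlopen call, and the decode, this is the same loop used for counting words in a local file.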
9541
07:03:36,820 --> 07:03:39,980
And so by using urllib, you really sort of reduce
9542
07:03:39,980 --> 07:03:44,500
the complexity of retrieving and reading network resources
9543
07:03:44,500 --> 07:03:46,420
to the same complexity of reading
9544
07:03:46,420 --> 07:03:49,260
and dealing with a file locally on your hard drive,
9545
07:03:49,260 --> 07:03:51,460
which is kind of pretty.
9546
07:03:51,460 --> 07:03:55,960
So one of the things then we can do is read web pages.
9547
07:03:55,960 --> 07:03:58,900
That was a text file but you can get HTML
9548
07:03:58,900 --> 07:04:01,540
and so here's how you read a web page.
9549
07:04:01,540 --> 07:04:03,380
And it's the same kind of code.
9550
07:04:03,380 --> 07:04:07,300
We open a, we open a URL.
9551
07:04:07,300 --> 07:04:10,060
This one happens to have HTML in it and we read through it
9552
07:04:10,060 --> 07:04:11,300
and out comes the HTML.
9553
07:04:11,300 --> 07:04:13,660
Remember that the headers are there
9554
07:04:13,660 --> 07:04:16,440
but they've been eaten by urlopen for us.
9555
07:04:16,440 --> 07:04:18,500
And now we could write a browser that would parse
9556
07:04:18,500 --> 07:04:19,840
these less thans and greater thans
9557
07:04:19,840 --> 07:04:23,180
and make links, et cetera, et cetera, et cetera.
9558
07:04:25,260 --> 07:04:30,120
So if you can come up with ways to find these links,
9559
07:04:30,120 --> 07:04:32,820
you could actually write a bit of code
9560
07:04:32,820 --> 07:04:34,340
that would then have a loop that would go up
9561
07:04:34,340 --> 07:04:35,300
and open a new one.
9562
07:04:35,300 --> 07:04:36,740
Pull out the links, open a new one.
9563
07:04:36,740 --> 07:04:38,700
Pull out the links, open a new one.
9564
07:04:38,700 --> 07:04:39,540
And so you could.
9565
07:04:39,540 --> 07:04:42,020
You could make a program
9566
07:04:42,020 --> 07:04:45,340
that would retrieve a page, find the links in the page
9567
07:04:45,340 --> 07:04:46,780
and then retrieve those links.
9568
07:04:46,780 --> 07:04:49,580
And we'll actually do that before the end of the class.
9569
07:04:50,540 --> 07:04:53,020
And so Python is a very popular language at Google
9570
07:04:53,020 --> 07:04:57,220
and I wonder if I'm gonna, I think it's a pretty safe bet
9571
07:04:57,220 --> 07:04:59,480
that the first crawler that they wrote
9572
07:04:59,480 --> 07:05:02,380
to crawl the web to build the index was Python
9573
07:05:02,380 --> 07:05:06,700
because literally that's all it takes to read web pages.
9574
07:05:06,700 --> 07:05:10,900
And pull those web pages into your web crawler database.
9575
07:05:10,900 --> 07:05:12,020
So I don't know.
9576
07:05:12,020 --> 07:05:14,460
Are those the first four lines ever written at Google?
9577
07:05:14,460 --> 07:05:16,340
Who knows?
9578
07:05:16,340 --> 07:05:18,240
So the next thing that we'll talk about
9579
07:05:18,240 --> 07:05:21,420
is how you handle that HTML.
9580
07:05:21,420 --> 07:05:23,480
HTML is kind of yucky and nasty
9581
07:05:23,480 --> 07:05:26,220
and so it's not as simple as regular expressions.
9582
07:05:26,220 --> 07:05:28,080
Regular expressions might help.
9583
07:05:28,080 --> 07:05:29,980
String parsing and split might help
9584
07:05:29,980 --> 07:05:31,520
but it's just too crazy.
9585
07:05:31,520 --> 07:05:34,100
So we'll talk a little bit about how to use a library
9586
07:05:34,100 --> 07:05:36,760
to make HTML parsing a lot easier.
9587
07:05:41,500 --> 07:05:44,580
We are going to be talking about some code.
9588
07:05:44,580 --> 07:05:47,060
If you wanna download all the code, it's right here.
9589
07:05:47,060 --> 07:05:49,220
It's all in a single big zip file.
9590
07:05:49,220 --> 07:05:51,740
And of all the sample code,
9591
07:05:51,740 --> 07:05:54,740
the one I'm gonna talk about is urllib1.py.
9592
07:05:54,740 --> 07:05:57,380
It is not very exciting.
9593
07:05:57,380 --> 07:05:58,980
It's short.
9594
07:05:58,980 --> 07:06:02,960
That's what's kinda nice about Python code.
9595
07:06:02,960 --> 07:06:05,700
And it's really, if we go and take a look at the code
9596
07:06:05,700 --> 07:06:08,700
we played with just previously, which is socket,
9597
07:06:08,700 --> 07:06:11,660
the idea here is urllib is something that Python
9598
07:06:11,660 --> 07:06:15,160
has produced for us to make socket communications
9599
07:06:15,160 --> 07:06:17,660
and HTTP communications a lot better.
9600
07:06:17,660 --> 07:06:21,440
So socket, this is making socket calls underneath it
9601
07:06:21,440 --> 07:06:24,360
but there's a library that makes this quite simple.
9602
07:06:24,360 --> 07:06:27,020
And so we have to do some imports.
9603
07:06:27,020 --> 07:06:29,300
So instead of importing socket, we'll import these.
9604
07:06:29,300 --> 07:06:30,580
We are going to create a handle.
9605
07:06:30,580 --> 07:06:34,180
It's urllib.request.urlopen, and we just pass in a string.
9606
07:06:34,180 --> 07:06:35,780
So we're not encoding this.
9607
07:06:35,780 --> 07:06:37,140
We're not sending the GET command.
9608
07:06:37,140 --> 07:06:39,820
All the stuff we did in the previous sockets example
9609
07:06:39,820 --> 07:06:42,700
is gone and then we can just put this as a for loop.
9610
07:06:42,700 --> 07:06:45,660
And so we're not using this lower level read and write code.
9611
07:06:45,660 --> 07:06:46,820
We're just using a for loop.
9612
07:06:46,820 --> 07:06:51,320
And so that literally is gonna read the text line by line.
9613
07:06:51,320 --> 07:06:53,940
And the line does come back as an array of bytes
9614
07:06:53,940 --> 07:06:56,580
so we have to do a decode but then we got a string
9615
07:06:56,580 --> 07:06:57,880
and then we can do a strip on it.
9616
07:06:57,880 --> 07:07:02,880
So this is like a super simple, super simple.
9617
07:07:06,420 --> 07:07:07,440
So there we go.
9618
07:07:07,440 --> 07:07:10,060
Now the interesting thing is you also don't see the headers.
9619
07:07:10,060 --> 07:07:11,700
We just read the contents.
9620
07:07:11,700 --> 07:07:13,580
Now it turns out in urllib,
9621
07:07:13,580 --> 07:07:16,480
and we'll see this in a later, more complex application,
9622
07:07:16,480 --> 07:07:18,520
you can get the headers if you want.
9623
07:07:18,520 --> 07:07:20,860
You can get various other things.
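A hedged sketch of reading the headers that urlopen keeps aside. It assumes the headers attribute of the returned object, and it uses a local file:// URL as a stand-in so that it runs offline; a real web server would send a different and longer set of headers.

```python
import pathlib
import tempfile
import urllib.request

# Stand-in for a web address: a temp file handed to urlopen as a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.txt"
    path.write_bytes(b"But soft what light\n")
    with urllib.request.urlopen(path.as_uri()) as fhand:
        headers = dict(fhand.headers.items())   # header names and values kept by urlopen
        body = fhand.read().decode()            # the body still has to be decoded

for name, value in headers.items():
    print(name, ":", value)
print(body.strip())
```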
9624
07:07:20,860 --> 07:07:27,860
So that's urllib, a simple urllib tool.
9625
07:07:30,380 --> 07:07:33,180
Now we can also use this in urlwords.py
9626
07:07:33,180 --> 07:07:35,400
to show you something quite interesting.
9627
07:07:35,400 --> 07:07:38,860
And that is if you look at this from right here
9628
07:07:38,860 --> 07:07:42,260
other than the decode, this is exactly the code we wrote
9629
07:07:42,260 --> 07:07:45,140
to compute the words, right?
9630
07:07:45,140 --> 07:07:47,360
So other than this line.decode,
9631
07:07:47,360 --> 07:07:49,880
this is just open something up.
9632
07:07:49,880 --> 07:07:51,880
In this case, we're gonna open a url.
9633
07:07:51,880 --> 07:07:52,920
We're gonna create a dictionary.
9634
07:07:52,920 --> 07:07:55,900
We're gonna loop through each of the lines in that thing.
9635
07:07:55,900 --> 07:07:57,560
We're gonna decode them and then split them.
9636
07:07:57,560 --> 07:07:59,100
So once you do line.decode,
9637
07:07:59,100 --> 07:08:02,160
this is now a legitimate internal Python string.
9638
07:08:02,160 --> 07:08:05,640
We split it, we run through the words, and run the counts.
9639
07:08:05,640 --> 07:08:09,320
And so this is exactly like code that we did before
9640
07:08:09,320 --> 07:08:10,680
to run counts.
9641
07:08:10,680 --> 07:08:13,700
And so python3
9642
07:08:14,620 --> 07:08:16,400
urlwords.py.
9643
07:08:16,400 --> 07:08:21,040
And so that gives us a dictionary
9644
07:08:21,040 --> 07:08:22,640
which is the word frequency.
9645
07:08:22,640 --> 07:08:25,920
And we could do all kinds of crazy stuff in here
9646
07:08:25,920 --> 07:08:28,280
with sorting and all the kinds of things.
9647
07:08:28,280 --> 07:08:31,000
The important thing is once you've done this and this,
9648
07:08:31,000 --> 07:08:33,440
the code just needs to decode these lines
9649
07:08:33,440 --> 07:08:34,680
when you first get them.
9650
07:08:35,760 --> 07:08:39,800
It really works the same way; urllib
9651
07:08:39,800 --> 07:08:44,800
makes URLs function inside Python very much like files.
9652
07:08:44,800 --> 07:08:48,320
So these are short and to the point and very simple
9653
07:08:48,320 --> 07:08:50,320
and I hope that they were useful to you.
9654
07:08:53,640 --> 07:08:55,720
So now we're going to talk about what you would do
9655
07:08:55,720 --> 07:08:57,560
with a web page once you've retrieved it
9656
07:08:57,560 --> 07:08:58,960
in a Python program.
9657
07:08:58,960 --> 07:09:00,500
We call this web scraping.
9658
07:09:01,480 --> 07:09:03,560
And so web scraping or web spidering
9659
07:09:03,560 --> 07:09:05,240
is the act of retrieving a web page,
9660
07:09:05,240 --> 07:09:06,880
extracting the links from that web page,
9661
07:09:06,880 --> 07:09:10,000
making a queue of unretrieved links, and then moving on.
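A rough sketch of the retrieve, extract links, queue loop described above. It is not the course's code: it assumes a tiny made-up in-memory "web" (the PAGES dictionary) instead of real network requests, and it uses the standard library's html.parser as a stand-in for a proper HTML library.

```python
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href value of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical three-page "web" held in memory so the sketch needs no network.
PAGES = {
    "page1": '<a href="page2">two</a> <a href="page3">three</a>',
    "page2": '<a href="page3">three</a> <a href="page1">one</a>',
    "page3": "<p>No links here</p>",
}

def crawl(start):
    """Visit pages breadth-first, queueing each link not seen before."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)                  # "retrieve" the page
        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))    # extract its links
        for link in parser.links:
            if link not in seen:           # queue only unretrieved links
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("page1")
print(visited)   # ['page1', 'page2', 'page3']
```

A real spider would replace the PAGES lookup with a urlopen call plus a decode, and would need to respect each site's rules about robots.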
9662
07:09:10,000 --> 07:09:12,360
And eventually the idea is if you had enough time,
9663
07:09:12,360 --> 07:09:14,200
energy, bandwidth, and storage,
9664
07:09:14,200 --> 07:09:16,960
you could find your way to most of the web pages
9665
07:09:16,960 --> 07:09:20,120
on the internet that point to
9666
07:09:20,120 --> 07:09:22,340
or are pointed to by other web pages.
9667
07:09:23,320 --> 07:09:26,400
And so you might have all kinds of reasons to scrape data.
9668
07:09:26,400 --> 07:09:28,840
You might have a blog that you posted.
9669
07:09:28,840 --> 07:09:32,280
You might have, who knows, maybe you put some data
9670
07:09:32,280 --> 07:09:35,840
in a system, maybe the system's being shut down
9671
07:09:35,840 --> 07:09:38,100
because it's being retired.
9672
07:09:38,100 --> 07:09:39,660
You can do all kinds of things.
9673
07:09:39,660 --> 07:09:41,040
You could write a little thing,
9674
07:09:41,040 --> 07:09:43,000
I was just talking to somebody who wrote a thing
9675
07:09:43,000 --> 07:09:44,720
to retrieve something and check,
9676
07:09:44,720 --> 07:09:46,920
and then send a text when something changed.
9677
07:09:46,920 --> 07:09:47,960
All kinds of stuff.
9678
07:09:47,960 --> 07:09:50,480
Or you might make yourself a search engine.
9679
07:09:50,480 --> 07:09:52,260
But be careful.
9680
07:09:52,260 --> 07:09:55,520
Not all websites are happy about you
9681
07:09:55,520 --> 07:09:58,120
using a robot to retrieve their content.
9682
07:09:58,120 --> 07:09:59,600
Some of the websites, as we'll see,
9683
07:09:59,600 --> 07:10:01,900
demand that you log in and they track what you do,
9684
07:10:01,900 --> 07:10:03,600
and if they think you're doing something bad,
9685
07:10:03,600 --> 07:10:05,340
they will shut your account off.
9686
07:10:05,340 --> 07:10:07,800
Other websites will track what you're doing
9687
07:10:07,800 --> 07:10:11,400
without you logging in, but then shut your address off.
9688
07:10:11,400 --> 07:10:13,240
And so you have to be careful.
9689
07:10:13,240 --> 07:10:14,080
You should read up.
9690
07:10:14,080 --> 07:10:17,240
You should figure out what sites allow you to scrape them.
9691
07:10:17,240 --> 07:10:18,920
Now I have some sites that I've set up
9692
07:10:18,920 --> 07:10:23,720
that you can play with to make it so that it's legit.
9693
07:10:23,720 --> 07:10:27,360
So parsing HTML is difficult.
9694
07:10:27,360 --> 07:10:29,640
Some of the simple examples,
9695
07:10:29,640 --> 07:10:31,560
you could probably write a regular expression,
9696
07:10:31,560 --> 07:10:35,000
or certainly some splitting and some whatever.
9697
07:10:35,000 --> 07:10:37,920
And what you would find is you would write that code.
9698
07:10:37,920 --> 07:10:40,160
And you would retrieve your first five webpages
9699
07:10:40,160 --> 07:10:41,000
and it would seem to work
9700
07:10:41,000 --> 07:10:43,240
and then it would encounter some really weird
9701
07:10:43,240 --> 07:10:45,560
but legitimate HTML,
9702
07:10:45,560 --> 07:10:47,820
or maybe even sort of slightly broken HTML.
9703
07:10:47,820 --> 07:10:50,200
So the web is full of broken HTML,
9704
07:10:50,200 --> 07:10:52,000
and your browsers just look at it and go like,
9705
07:10:52,000 --> 07:10:53,920
oh wow, more broken HTML.
9706
07:10:53,920 --> 07:10:55,280
But they don't put up error messages,
9707
07:10:55,280 --> 07:10:57,440
and so people just leave broken pages up.
9708
07:10:58,360 --> 07:11:01,120
But your Python program is gonna see those broken pages.
9709
07:11:01,120 --> 07:11:02,040
So what you would do is you'd be like,
9710
07:11:02,040 --> 07:11:05,640
oh, here's a new weird way to do an anchor tag.
9711
07:11:05,640 --> 07:11:07,480
I'll change my code.
9712
07:11:07,480 --> 07:11:09,320
And then run for another 100 pages,
9713
07:11:09,320 --> 07:11:12,200
I'm like, oh no, here's a new weird way to do an anchor tag.
9714
07:11:12,200 --> 07:11:14,360
And the problem is that you're gonna find
9715
07:11:14,360 --> 07:11:17,200
a lot of different ways to mess up an anchor tag.
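A minimal sketch of why hand-written patterns keep breaking, assuming a deliberately naive regular expression (the pattern and the sample strings are made up for illustration). Only the first, tidy anchor tag is matched; the other three are all plausible HTML that the pattern misses.

```python
import re

# Naive pattern: lower-case <a, one space, double-quoted href on the same line
NAIVE_HREF = re.compile(r'<a href="(.*?)"')

samples = [
    '<a href="http://example.com/one">one</a>',          # the easy case
    "<a href='http://example.com/two'>two</a>",          # single quotes
    '<a\n   href="http://example.com/three">three</a>',  # line break inside the tag
    '<A HREF="http://example.com/four">four</A>',        # upper-case tag name
]

# One list of matches per sample string
found = [NAIVE_HREF.findall(text) for text in samples]

for text, hits in zip(samples, found):
    print(repr(text), "->", hits)
```

Each missed case could be patched with a more elaborate pattern, but that is exactly the endless patching the lecture warns about.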
9716
07:11:17,200 --> 07:11:18,840
And someone's already done that.
9717
07:11:18,840 --> 07:11:21,160
There's a software library called BeautifulSoup.
9718
07:11:21,160 --> 07:11:24,040
And we have installation instructions on how to use it.
9719
07:11:24,040 --> 07:11:28,080
And really what it is is it's somebody just spent months
9720
07:11:28,080 --> 07:11:30,480
figuring out all the nasty things that could happen
9721
07:11:30,480 --> 07:11:34,840
and compensated for it and gave you a nice wrapped interface
9722
07:11:34,840 --> 07:11:36,760
that just says, look, you give me the HTML
9723
07:11:36,760 --> 07:11:39,080
and I'll give you back the tags, okay?
9724
07:11:39,080 --> 07:11:40,680
And so it's called BeautifulSoup.
9725
07:11:41,560 --> 07:11:43,280
And so you have to install this.
9726
07:11:43,280 --> 07:11:46,040
There's a couple of ways that you can install this.
9727
07:11:46,040 --> 07:11:47,960
If you're good at extending your Python,
9728
07:11:47,960 --> 07:11:51,680
you can just extend and install BeautifulSoup
9729
07:11:51,680 --> 07:11:53,240
for all Python programs.
9730
07:11:53,240 --> 07:11:57,360
If you can't change your computer's configuration
9731
07:11:57,360 --> 07:12:00,040
because you're on a school computer
9732
07:12:00,040 --> 07:12:02,680
or you're using a USB stick or something,
9733
07:12:02,680 --> 07:12:05,720
then there's a way to download this file that I've created
9734
07:12:05,720 --> 07:12:07,080
called bs4.zip.
9735
07:12:07,080 --> 07:12:08,760
And so what you do is you end up with your file
9736
07:12:08,760 --> 07:12:13,000
called urllinks.py
9737
07:12:13,000 --> 07:12:15,440
and then a little folder called bs4,
9738
07:12:15,440 --> 07:12:17,920
which is a folder that has a bunch of files in it
9739
07:12:17,920 --> 07:12:21,080
from the zip file, and then you can run it.
9740
07:12:21,080 --> 07:12:24,960
And so it'll pull it in and you'll import from bs4,
9741
07:12:24,960 --> 07:12:27,000
BeautifulSoup, and that's either gonna pull it in
9742
07:12:27,000 --> 07:12:30,440
from the folder you made, or if you have installed it
9743
07:12:30,440 --> 07:12:34,280
using the Python installer, it will also just work,
9744
07:12:34,280 --> 07:12:36,000
and you don't have to put this folder in.
9745
07:12:36,000 --> 07:12:36,840
So it's up to you.
9746
07:12:36,840 --> 07:12:38,760
You can do it either of two ways.
9747
07:12:39,680 --> 07:12:41,600
So this is a little bit of code.
9748
07:12:41,600 --> 07:12:44,120
Now BeautifulSoup is a complex library,
9749
07:12:44,120 --> 07:12:46,720
and so just because this looks easy,
9750
07:12:46,720 --> 07:12:48,080
doing things in BeautifulSoup,
9751
07:12:48,080 --> 07:12:51,320
you might have to actually read a bit more to figure it out.
9752
07:12:51,320 --> 07:12:53,560
But we're going to just read this.
9753
07:12:53,560 --> 07:12:54,400
We're going to
9754
07:12:57,800 --> 07:12:59,000
import BeautifulSoup.
9755
07:12:59,000 --> 07:13:01,440
We're gonna ask for a url right here.
9756
07:13:01,440 --> 07:13:03,000
We're going to take that url.
9757
07:13:03,000 --> 07:13:03,840
We're gonna open it.
9758
07:13:03,840 --> 07:13:07,400
The urlopen, we give it the url and read the whole thing.
9759
07:13:07,400 --> 07:13:08,600
That means we're not writing a loop.
9760
07:13:08,600 --> 07:13:09,640
We've read the whole thing.
9761
07:13:09,640 --> 07:13:13,720
That's okay as long as you know that the file's not so large.
9762
07:13:13,720 --> 07:13:16,920
And then we're going to pass the data we got back.
9763
07:13:16,920 --> 07:13:19,080
And this is gonna be bytes, but BeautifulSoup knows
9764
07:13:19,080 --> 07:13:21,200
all about bytes and all about UTF-8,
9765
07:13:21,200 --> 07:13:22,400
and it figures that out.
9766
07:13:22,400 --> 07:13:25,520
And you just say, hey, take that stuff I just got
9767
07:13:25,520 --> 07:13:27,600
and tear it apart using the HTML parser.
9768
07:13:27,600 --> 07:13:30,760
And give me back an object, a soup object.
9769
07:13:30,760 --> 07:13:32,520
Now the soup object is something
9770
07:13:32,520 --> 07:13:33,920
that you can run queries against.
9771
07:13:33,920 --> 07:13:34,800
So it parses it.
9772
07:13:34,800 --> 07:13:38,020
It deals with all the imperfections and inconsistencies
9773
07:13:38,020 --> 07:13:40,360
in this HTML byte array.
9774
07:13:42,560 --> 07:13:44,440
And it fixes that and gives that back.
9775
07:13:44,440 --> 07:13:45,680
And so there's various things you can do.
9776
07:13:45,680 --> 07:13:47,920
And you gotta go look at the BeautifulSoup documentation.
9777
07:13:47,920 --> 07:13:50,720
It could be a whole class on BeautifulSoup.
9778
07:13:50,720 --> 07:13:52,880
So here's a thing you can do is this object,
9779
07:13:54,200 --> 07:13:56,440
you can sort of call it like a function
9780
07:13:56,440 --> 07:13:59,160
and say, hey, give me back the anchor tags.
9781
07:13:59,160 --> 07:14:00,640
And anchor tags, of course, are the tags that
9782
07:14:00,640 --> 07:14:05,640
say a href equals blah blah blah, slash a.
9783
07:14:06,280 --> 07:14:08,720
So all of this is an anchor tag.
9784
07:14:08,720 --> 07:14:10,520
And then we're gonna loop through the tags
9785
07:14:10,520 --> 07:14:11,640
because there could be more than one
9786
07:14:11,640 --> 07:14:13,920
of those anchor tags in the file.
9787
07:14:13,920 --> 07:14:15,720
And then we're going to pull out that href.
9788
07:14:15,720 --> 07:14:16,680
And that's what this does.
9789
07:14:16,680 --> 07:14:19,240
We're gonna loop through all the tags and print out the href.
9790
07:14:19,240 --> 07:14:21,400
So if you tell it to go to dr-chuck.com,
9791
07:14:21,400 --> 07:14:25,980
it will tell you the one external link in dr-chuck.com.
9792
07:14:26,880 --> 07:14:29,740
And so I've got an assignment that sort of goes into that
9793
07:14:29,740 --> 07:14:31,560
in some more detail.
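A rough sketch of the link-extraction step just described, not the course's urllinks.py itself. The names `extract_links` and `SAMPLE_HTML` are hypothetical, the HTML is an inline sample rather than a page fetched with urlopen, and the standard-library fallback is an assumption added only so the sketch still runs where bs4 is not available.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # installed package or a local bs4 folder
except ImportError:
    BeautifulSoup = None  # fall back to the standard library below


class _AnchorCollector(HTMLParser):
    """Stand-in used only when bs4 is missing; far less forgiving."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def extract_links(html):
    """Return the href value (or None) of every anchor tag in html."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        # Calling the soup object like a function returns the matching tags
        return [tag.get("href", None) for tag in soup("a")]
    text = html if isinstance(html, str) else html.decode("utf-8", "replace")
    collector = _AnchorCollector()
    collector.feed(text)
    return collector.hrefs


# Bytes, as urlopen(...).read() would return them (sample content only)
SAMPLE_HTML = b"""<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>
<a name="top">no href here</a>"""

for link in extract_links(SAMPLE_HTML):
    print(link)
```

An anchor tag without an href comes back as None, so a real program would probably skip those before following links.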
9794
07:14:31,560 --> 07:14:35,180
But this chapter has been a whole bunch
9795
07:14:35,180 --> 07:14:36,080
of interesting stuff.
9796
07:14:36,080 --> 07:14:40,120
We started with the TCP/IP model and talked about sockets
9797
07:14:40,120 --> 07:14:42,240
that are phone calls between computers.
9798
07:14:42,240 --> 07:14:46,200
And then how application protocols are developed
9799
07:14:46,200 --> 07:14:48,240
to say what we say on those phone calls.
9800
07:14:48,240 --> 07:14:51,280
And we've explored then the HTTP protocol,
9801
07:14:51,280 --> 07:14:54,600
which is probably the most likely thing you're going to see.
9802
07:14:54,600 --> 07:14:56,720
And then we played with all this in Python
9803
07:14:56,720 --> 07:15:00,520
and saw that Python is really good at this.
9804
07:15:00,520 --> 07:15:03,460
You can write extremely simple and small programs
9805
07:15:03,460 --> 07:15:07,120
to do some extremely complex and powerful things.
9806
07:15:07,120 --> 07:15:09,640
And again, that's why people like Python
9807
07:15:09,640 --> 07:15:13,260
because it makes the complex simple.
9808
07:15:18,940 --> 07:15:20,720
We're gonna do a little bit of sample code.
9809
07:15:20,720 --> 07:15:22,880
If you're interested in getting the sample code,
9810
07:15:22,880 --> 07:15:24,320
you can download this zip here
9811
07:15:24,320 --> 07:15:28,080
at Pythonforeverybody.com, materials.php.
9812
07:15:28,080 --> 07:15:31,220
And you will download and you will get all the files.
9813
07:15:32,320 --> 07:15:34,720
And all the files that I'm looking at here.
9814
07:15:34,720 --> 07:15:37,160
And so the one I'm gonna play with today
9815
07:15:37,160 --> 07:15:40,240
is the file called urllinks.py.
9816
07:15:40,240 --> 07:15:44,520
So the first thing you gotta do before urllinks.py works
9817
07:15:44,520 --> 07:15:47,200
is you have got to install BeautifulSoup.
9818
07:15:47,200 --> 07:15:48,720
And I've got some simple instructions
9819
07:15:48,720 --> 07:15:50,480
at the beginning of the file.
9820
07:15:50,480 --> 07:15:55,000
And so one way to do it is to use Python's
9821
07:15:55,000 --> 07:15:58,220
install process to install BeautifulSoup
9822
07:15:58,220 --> 07:16:00,400
for all Python applications.
9823
07:16:00,400 --> 07:16:02,240
And if you are the owner of your computer
9824
07:16:02,240 --> 07:16:03,680
and you're gonna use BeautifulSoup a lot,
9825
07:16:03,680 --> 07:16:05,600
it's a fine idea to do that.
9826
07:16:05,600 --> 07:16:08,460
But I wanna show you a simpler way
9827
07:16:08,460 --> 07:16:10,560
for when you don't own your own computer
9828
07:16:10,560 --> 07:16:13,780
and you just wanna make it so that BeautifulSoup works,
9829
07:16:14,720 --> 07:16:19,280
you can download this file, this file right here.
9830
07:16:19,280 --> 07:16:22,700
The BeautifulSoup 4 zip. Unzip it
9831
07:16:22,700 --> 07:16:25,540
and put it in this same folder.
9832
07:16:25,540 --> 07:16:27,780
And so if you look in this folder,
9833
07:16:27,780 --> 07:16:30,260
I have a subfolder called bs4.
9834
07:16:30,260 --> 07:16:32,940
And that's the unzipped version of this.
9835
07:16:32,940 --> 07:16:33,760
And it has these things.
9836
07:16:33,760 --> 07:16:36,280
I didn't write this code, so I'm sorry if the name is bad,
9837
07:16:36,280 --> 07:16:38,580
but this is the code to bs4.
9838
07:16:38,580 --> 07:16:40,980
And this is what's in bs4.zip.
9839
07:16:40,980 --> 07:16:43,940
And it's in the same folder as
9840
07:16:45,860 --> 07:16:48,100
urllinks.py.
9841
07:16:48,100 --> 07:16:49,960
And so what happens is when you do this
9842
07:16:49,960 --> 07:16:52,340
from bs4 import BeautifulSoup,
9843
07:16:52,340 --> 07:16:55,180
that either can go to sort of this global magic place
9844
07:16:55,180 --> 07:16:56,820
where Python installs stuff
9845
07:16:56,820 --> 07:16:59,220
and pull in the BeautifulSoup object,
9846
07:16:59,220 --> 07:17:03,820
or it can go to the folder bs4 and pull it in, okay?
9847
07:17:03,820 --> 07:17:05,960
And so that's how that works.
9848
07:17:05,960 --> 07:17:08,480
So you have to do one of these two things.
9849
07:17:08,480 --> 07:17:10,380
I prefer to keep it simple,
9850
07:17:10,380 --> 07:17:11,620
download and unzip this file
9851
07:17:11,620 --> 07:17:15,220
and put it in the same folder as this code
9852
07:17:15,220 --> 07:17:16,160
and away you go.
9853
07:17:16,160 --> 07:17:18,560
So from the previous example,
9854
07:17:18,560 --> 07:17:20,600
we're gonna use urllib, of course,
9855
07:17:20,600 --> 07:17:22,540
and then we're going to pull in BeautifulSoup.
9856
07:17:22,540 --> 07:17:24,400
From the BeautifulSoup 4 library,
9857
07:17:24,400 --> 07:17:25,880
we're gonna get the BeautifulSoup object.
9858
07:17:25,880 --> 07:17:28,040
Now, if you do this with SSL,
9859
07:17:28,040 --> 07:17:30,600
if these websites we're gonna play with have SSL,
9860
07:17:30,600 --> 07:17:33,840
you pretty much have to do this little hack.
9861
07:17:33,840 --> 07:17:36,680
And these three lines, don't worry too much about it.
9862
07:17:36,680 --> 07:17:39,640
You can search Google or Stack Overflow
9863
07:17:39,640 --> 07:17:40,960
and figure this out.
9864
07:17:40,960 --> 07:17:42,880
But this is the way that you ignore errors
9865
07:17:42,880 --> 07:17:46,340
when you have SSL certificate errors.
9866
07:17:46,340 --> 07:17:50,240
And so we have to add this parameter context equals ctx,
9867
07:17:50,240 --> 07:17:52,000
which is this variable that we create.
9868
07:17:52,000 --> 07:17:56,040
So this part and this part sort of just do them.
9869
07:17:56,040 --> 07:17:59,040
If you don't need HTTPS, you can take them out, actually.
9870
07:17:59,040 --> 07:18:01,120
Otherwise, you won't be able to do HTTPS sites.
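A sketch of what that "little hack" presumably looks like, using only calls from the standard `ssl` and `urllib.request` modules. The `fetch` helper is a hypothetical name for illustration. Turning verification off like this is fine for playing with practice sites but removes the protection that certificates provide.

```python
import ssl
import urllib.request

# Ignore SSL certificate errors: three lines that build a relaxed context.
# check_hostname must be switched off before verify_mode is set to CERT_NONE.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


def fetch(url):
    """Hypothetical helper: return the whole page at url as bytes."""
    # context=ctx is the extra parameter the lecture mentions
    with urllib.request.urlopen(url, context=ctx) as response:
        return response.read()
```

Nothing is fetched until `fetch` is called with a URL, so the snippet can be pasted in and inspected without touching the network.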
9871
07:18:01,120 --> 07:18:03,240
So let's take a look at what we're doing
9872
07:18:03,240 --> 07:18:06,860
other than dealing with the HTTPS problem.
9873
07:18:08,760 --> 07:18:10,680
Gonna ask the user for a URL.
9874
07:18:10,680 --> 07:18:14,800
We are going to retrieve all the HTML.
9875
07:18:14,800 --> 07:18:17,960
We're gonna do a urlopen, just like we did before.
9876
07:18:17,960 --> 07:18:19,480
Now, this would return us something
9877
07:18:19,480 --> 07:18:21,800
we could loop through line by line with a for loop.
9878
07:18:21,800 --> 07:18:24,120
But instead, we're gonna say, hey, read the whole thing.
9879
07:18:24,120 --> 07:18:28,560
And that basically returns us the entire document
9880
07:18:28,560 --> 07:18:32,320
at that webpage in a single big string
9881
07:18:32,320 --> 07:18:34,380
with new lines at the end of each line.
9882
07:18:34,380 --> 07:18:38,760
And this is not a Unicode string; it's bytes, probably UTF-8.
9883
07:18:38,760 --> 07:18:41,920
But it turns out BeautifulSoup knows how to deal with UTF-8,
9884
07:18:41,920 --> 07:18:44,240
and it also knows how to deal with Unicode strings.
9885
07:18:44,240 --> 07:18:47,080
So what we're saying is BeautifulSoup read through
9886
07:18:47,080 --> 07:18:49,980
and deal with all the nasty bits, right?
9887
07:18:49,980 --> 07:18:54,240
So HTML is like very, very flexible.
9888
07:18:54,240 --> 07:18:59,240
So dr-chuck.com/page1.htm.
9889
07:19:02,760 --> 07:19:04,840
And so if we take a look at the source of this,
9890
07:19:04,840 --> 07:19:09,160
view page source, make this bigger,
9891
07:19:09,160 --> 07:19:10,960
you might be able to do regular expressions,
9892
07:19:10,960 --> 07:19:13,680
but it does things like break stuff across lines.
9893
07:19:13,680 --> 07:19:15,280
There could be a line break here.
9894
07:19:15,280 --> 07:19:17,240
There could be all kinds of things, right?
9895
07:19:17,240 --> 07:19:21,400
And so writing regular expressions or splits or whatever
9896
07:19:21,400 --> 07:19:23,720
is really hard for HTML.
9897
07:19:23,720 --> 07:19:26,280
And so what we do is someone has written this.
9898
07:19:26,280 --> 07:19:27,680
It's called BeautifulSoup.
9899
07:19:30,800 --> 07:19:34,720
And it's basically, this is the code,
9900
07:19:34,720 --> 07:19:38,100
and it's based on a joke from a children's story.
9901
07:19:40,440 --> 07:19:42,240
Basically, someone has just gone through
9902
07:19:42,240 --> 07:19:44,460
and figured out all the bad things that could possibly happen
9903
07:19:44,460 --> 07:19:46,960
when you're reading and parsing HTML.
9904
07:19:46,960 --> 07:19:49,200
So either you use it or you will slowly but surely
9905
07:19:49,200 --> 07:19:52,700
discover for yourself all the things that don't work.
9906
07:19:52,700 --> 07:19:56,240
And so when we look at this line right here,
9907
07:19:56,240 --> 07:19:58,760
this line at a high level is saying,
9908
07:19:58,760 --> 07:20:00,860
we're giving you ugly, nasty HTML
9909
07:20:00,860 --> 07:20:03,180
that could make no sense whatsoever.
9910
07:20:03,180 --> 07:20:06,400
Please read it, use all the brains that you have,
9911
07:20:06,400 --> 07:20:08,820
figure out all the weird stuff for us
9912
07:20:08,820 --> 07:20:10,860
and give us back an object.
9913
07:20:10,860 --> 07:20:11,800
I happen to call it soup.
9914
07:20:11,800 --> 07:20:13,240
You don't have to call it soup.
9915
07:20:13,240 --> 07:20:15,920
An object, and that is a proxy for that HTML,
9916
07:20:15,920 --> 07:20:18,560
but this soup object is clean.
9917
07:20:18,560 --> 07:20:20,840
And so what we can do is we can sort of retrieve
9918
07:20:20,840 --> 07:20:22,200
all the anchor tags.
9919
07:20:22,200 --> 07:20:25,200
So we can talk to this object and say, ask it,
9920
07:20:25,200 --> 07:20:26,780
give me the anchor tags.
9921
07:20:26,780 --> 07:20:28,060
What's an anchor tag?
9922
07:20:28,060 --> 07:20:29,640
Well, if we take a look at this source,
9923
07:20:29,640 --> 07:20:32,560
the anchor tag is the A through the slash A.
9924
07:20:32,560 --> 07:20:33,560
That is the tag.
9925
07:20:33,560 --> 07:20:36,880
It is the tag, it is the attributes that are on the tag,
9926
07:20:36,880 --> 07:20:39,460
it is the text within the tag, and everything.
9927
07:20:39,460 --> 07:20:40,920
And so that's what we're gonna get.
9928
07:20:40,920 --> 07:20:43,280
Now, I call it tags plural,
9929
07:20:43,280 --> 07:20:45,360
not because plural matters at all,
9930
07:20:45,360 --> 07:20:47,000
but because we're gonna get a list of tags.
9931
07:20:47,000 --> 07:20:51,280
Because even though this web page has lots and lots of tags,
9932
07:20:51,280 --> 07:20:53,760
if we look at, say, dr-chuck.com,
9933
07:20:58,720 --> 07:21:01,980
and view source, whoa, that's kinda small.
9934
07:21:01,980 --> 07:21:05,480
View page source, right?
9935
07:21:05,480 --> 07:21:09,540
And we go look for the anchor tags.
9936
07:21:09,540 --> 07:21:11,200
We got 45 of them,
9937
07:21:11,200 --> 07:21:13,720
and they all kinda have weird stuff in them, right?
9938
07:21:13,720 --> 07:21:17,440
So this line will give us back a list of tags.
9939
07:21:17,440 --> 07:21:20,240
It will give us all the anchor tags in this document.
9940
07:21:20,240 --> 07:21:22,880
So it goes, the tag goes from there to there.
9941
07:21:23,920 --> 07:21:25,280
And then what we're gonna do is we're gonna write a loop
9942
07:21:25,280 --> 07:21:26,360
to loop through all the tags.
9943
07:21:26,360 --> 07:21:28,160
So that's basically hopping,
9944
07:21:28,160 --> 07:21:29,800
like it's hopping through the document,
9945
07:21:29,800 --> 07:21:31,880
sort of like this, that's what it's doing.
9946
07:21:31,880 --> 07:21:35,560
Hop, hop, hop, hop, hop, hop.
9947
07:21:35,560 --> 07:21:38,960
And it's pulling out the text of the href attributes.
9948
07:21:38,960 --> 07:21:42,120
So it's gonna pull out this bit right here.
9949
07:21:42,120 --> 07:21:44,800
Oh, whoops, oh darn, that was so cool.
9950
07:21:44,800 --> 07:21:46,680
Cause that's a flaw, look at that.
9951
07:21:46,680 --> 07:21:48,040
This is my own page.
9952
07:21:48,040 --> 07:21:49,920
There is no closing quote here,
9953
07:21:49,920 --> 07:21:52,520
but it's gonna work because BeautifulSoup is like,
9954
07:21:52,520 --> 07:21:54,100
oh, I know what to do about that.
9955
07:21:54,100 --> 07:21:55,420
I can deal with that.
9956
07:21:55,420 --> 07:21:56,800
So let's check to see if that one works,
9957
07:21:56,800 --> 07:21:58,220
cause that's like a mistake.
9958
07:21:58,220 --> 07:22:00,600
But that's one of the things we like about BeautifulSoup.
9959
07:22:00,600 --> 07:22:01,520
So we're gonna read through,
9960
07:22:01,520 --> 07:22:03,240
and then we're gonna pull out all the hrefs.
9961
07:22:03,240 --> 07:22:08,240
So, this is probably thousands of lines of code
9962
07:22:08,520 --> 07:22:10,400
that you really don't want to write.
9963
07:22:10,400 --> 07:22:15,400
So python3 urllinks.py.
9964
07:22:15,760 --> 07:22:17,280
And so let's start with a simple one.
9965
07:22:17,280 --> 07:22:22,280
http://www.dr-chuck.com.
9966
07:22:24,840 --> 07:22:26,040
And it reads it.
9967
07:22:26,040 --> 07:22:27,680
Oh, that's the, no, that's,
9968
07:22:28,680 --> 07:22:30,040
that's actually the hard one,
9969
07:22:30,040 --> 07:22:30,880
cause we got a whole bunch.
9970
07:22:30,880 --> 07:22:33,520
So let's see if the Tsugi one worked.
9971
07:22:33,520 --> 07:22:34,820
It found that one.
9972
07:22:36,360 --> 07:22:38,200
It's right after sakaiproject.org.
9973
07:22:38,200 --> 07:22:39,040
Where is that?
9974
07:22:39,040 --> 07:22:39,880
Is there another Tsugi?
9975
07:22:41,560 --> 07:22:42,840
Oh, no, it didn't find that one.
9976
07:22:42,840 --> 07:22:43,680
That's kind of funky.
9977
07:22:43,680 --> 07:22:46,100
Look, it found it wrong, but that's okay.
9978
07:22:46,100 --> 07:22:47,980
So you see it found all these
9979
07:22:47,980 --> 07:22:50,440
and did a lot of nice stuff for us.
9980
07:22:50,440 --> 07:22:53,440
If we do python3 urllinks.py
9981
07:22:53,440 --> 07:22:54,280
and do the easy one.
9982
07:22:54,280 --> 07:22:57,040
http://www.
9983
07:22:57,040 --> 07:23:02,040
dr-chuck.com/page1.htm.
9984
07:23:03,400 --> 07:23:04,880
We will only see one.
9985
07:23:06,440 --> 07:23:07,480
And there we go.
9986
07:23:07,480 --> 07:23:12,480
Now, the SSL part matters if you are looking at a page
9987
07:23:12,480 --> 07:23:15,820
that has SSL.
9988
07:23:15,820 --> 07:23:19,480
So let's run the same code
9989
07:23:19,480 --> 07:23:24,480
at a page that has SSL: python3 urllinks.py again.
9990
07:23:24,680 --> 07:23:28,680
So I'll go to like https://www.si.umich.edu
9991
07:23:34,640 --> 07:23:35,960
and that will get a bunch of links.
9992
07:23:35,960 --> 07:23:37,640
And so you'll see.
9993
07:23:37,640 --> 07:23:41,400
If it wasn't for that, so all kinds of stuff coming back.
9994
07:23:41,400 --> 07:23:43,520
And if it wasn't for this bit right here
9995
07:23:43,520 --> 07:23:46,600
and this bit right here, this HTTPS wouldn't have worked.
9996
07:23:46,600 --> 07:23:49,400
And it's not that that website had a bad URL.
9997
07:23:49,400 --> 07:23:54,280
It has a certificate that's not in Python's official list.
9998
07:23:55,120 --> 07:23:56,860
And so the URL is okay.
9999
07:23:57,720 --> 07:24:02,160
So that gives you a quick summary
10000
07:24:02,160 --> 07:24:06,520
of using the BeautifulSoup library in Python
10001
07:24:06,520 --> 07:24:08,480
along with urllib.
10002
07:24:12,520 --> 07:24:15,400
Hello and welcome to chapter 13, web services.
10003
07:24:15,400 --> 07:24:17,320
So what we've been doing so far
10004
07:24:17,320 --> 07:24:19,440
is we've been using the request response cycle.
10005
07:24:19,440 --> 07:24:20,760
We've learned about sockets.
10006
07:24:20,760 --> 07:24:22,560
We've learned about urllib.
10007
07:24:22,560 --> 07:24:24,520
And we've actually learned how to pull HTML
10008
07:24:24,520 --> 07:24:27,160
and even flat text off the internet.
10009
07:24:27,160 --> 07:24:28,900
But what we're gonna talk about now
10010
07:24:28,900 --> 07:24:31,000
is using that same request response cycle
10011
07:24:31,000 --> 07:24:35,240
to retrieve information that is specifically designed
10012
07:24:35,240 --> 07:24:36,800
for programmatic consumption.
10013
07:24:36,800 --> 07:24:39,600
So we had to have this BeautifulSoup,
10014
07:24:39,600 --> 07:24:42,240
which sort of did a hack job
10015
07:24:42,240 --> 07:24:45,200
or solved the hard problem of parsing HTML.
10016
07:24:45,200 --> 07:24:47,980
Well, why not produce data in a format
10017
07:24:47,980 --> 07:24:49,680
that makes good sense to a program
10018
07:24:49,680 --> 07:24:51,740
because programs wanna talk to each other.
10019
07:24:51,740 --> 07:24:54,240
If you recall, the whole idea of a socket
10020
07:24:54,240 --> 07:24:57,320
is to have one application process sending data
10021
07:24:57,320 --> 07:24:59,440
to another application process.
10022
07:24:59,440 --> 07:25:02,600
And so if we think about this for a moment
10023
07:25:02,600 --> 07:25:05,600
and we realize that we have all these programs,
10024
07:25:05,600 --> 07:25:07,580
they could be written in different programming languages
10025
07:25:07,580 --> 07:25:08,720
and they're all connected.
10026
07:25:08,720 --> 07:25:11,360
And so they might wanna send data back and forth
10027
07:25:11,360 --> 07:25:13,200
or through the network.
10028
07:25:13,200 --> 07:25:16,520
PHP programs, JavaScript programs, Java programs.
10029
07:25:16,520 --> 07:25:20,200
And so we have to decide on a protocol
10030
07:25:20,200 --> 07:25:22,320
that is independent of any programming language.
10031
07:25:22,320 --> 07:25:24,320
And then we call that the wire protocol
10032
07:25:24,320 --> 07:25:26,420
because if you were to sort of take some connection
10033
07:25:26,420 --> 07:25:30,060
and watch the exact characters that go back and forth,
10034
07:25:30,060 --> 07:25:32,880
that's what you would see if you were monitoring the wire.
10035
07:25:32,880 --> 07:25:35,540
So that's why we call that the wire protocol.
10036
07:25:35,540 --> 07:25:40,440
And so the idea is that we have to agree on a format
10037
07:25:40,440 --> 07:25:41,800
that is going to represent the data
10038
07:25:41,800 --> 07:25:44,080
and we can't make it a Python specific format
10039
07:25:44,080 --> 07:25:45,640
or a Java format.
10040
07:25:45,640 --> 07:25:49,580
And when we take the data from the internal representation,
10041
07:25:49,580 --> 07:25:52,120
maybe a Python dictionary, to send it to the wire,
10042
07:25:52,120 --> 07:25:53,960
we call that act serialization.
10043
07:25:53,960 --> 07:25:56,600
And that is going from sort of the internal representation
10044
07:25:56,600 --> 07:25:59,600
to the serial representation or the wire representation.
10045
07:25:59,600 --> 07:26:02,520
And then here is an example of a person
10046
07:26:02,520 --> 07:26:03,960
with a name and phone number
10047
07:26:03,960 --> 07:26:05,760
using less-thans and greater-thans.
10048
07:26:05,760 --> 07:26:07,360
This is an XML example.
10049
07:26:07,360 --> 07:26:08,480
And then in the far end,
10050
07:26:08,480 --> 07:26:10,480
in a different programming language, it receives this
10051
07:26:10,480 --> 07:26:12,920
and then deserializes it and then turns it
10052
07:26:12,920 --> 07:26:16,560
into some useful structure inside that programming language.
10053
07:26:16,560 --> 07:26:19,800
And so this is an example of a wire protocol
10054
07:26:19,800 --> 07:26:21,160
that's using XML.
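A small sketch of that round trip in Python, assuming a made-up person record (the field names and values are illustrative, not taken from the slide). A dictionary is serialized to an XML string, which is what would travel over the wire, and then deserialized back into a dictionary on the "far end".

```python
import xml.etree.ElementTree as ET

# Internal representation: an ordinary Python dictionary (sample values)
person = {"name": "Chuck", "phone": "303 4456"}


def serialize(record):
    """Turn the dictionary into the wire format: an XML string."""
    root = ET.Element("person")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def deserialize(text):
    """Turn the XML string back into a dictionary."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


wire = serialize(person)
print(wire)               # what would be seen when monitoring the wire
print(deserialize(wire))  # the structure rebuilt by the receiver
```

The receiver could just as well be a Java or PHP program; only the XML text has to be agreed on.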
10055
07:26:21,160 --> 07:26:23,840
And that's one of the formats we're going to talk about.
10056
07:26:23,840 --> 07:26:25,640
Another format that we're gonna talk about
10057
07:26:25,640 --> 07:26:29,500
is a format called JSON, JavaScript Object Notation.
10058
07:26:29,500 --> 07:26:31,960
And it is simpler and easier,
10059
07:26:31,960 --> 07:26:36,320
but it's not as precise and descriptive as XML is.
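For comparison, a minimal sketch of the same idea with JSON, using Python's standard `json` module. The record and its values are again made up for illustration.

```python
import json

# Internal representation: nested dictionaries (sample values only)
person = {
    "name": "Chuck",
    "phone": {"type": "intl", "number": "+1 734 303 4456"},
}

wire = json.dumps(person)    # serialize: dictionary -> JSON text
restored = json.loads(wire)  # deserialize: JSON text -> dictionary

print(wire)
print(restored["phone"]["number"])
```

The JSON text maps almost directly onto dictionaries and lists, which is a large part of why it feels simpler than XML.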
10060
07:26:36,320 --> 07:26:38,880
And so while you'll find that most of the things you run
10061
07:26:38,880 --> 07:26:40,760
into, especially if you're talking to APIs
10062
07:26:40,760 --> 07:26:42,120
of one form or another,
10063
07:26:42,120 --> 07:26:44,840
you'll find that JSON is very common.
10064
07:26:44,840 --> 07:26:47,880
XML still holds sway in places like documents.
10065
07:26:47,880 --> 07:26:49,960
So if you look at docx
10066
07:26:49,960 --> 07:26:52,360
at the end of a Microsoft Word document,
10067
07:26:52,360 --> 07:26:54,840
docx means that it's an XML version
10068
07:26:54,840 --> 07:26:58,520
of the representation of a word processing document.
10069
07:26:58,520 --> 07:27:01,060
So the first thing we'll talk about is XML.
10070
07:27:04,780 --> 07:27:07,920
So one of the two ways that we mark up data is XML.
10071
07:27:07,920 --> 07:27:10,160
The other is JSON. First we'll talk about XML.
10072
07:27:10,160 --> 07:27:12,840
We'll talk about XML for a longer time
10073
07:27:12,840 --> 07:27:14,480
than we talk about JSON.
10074
07:27:14,480 --> 07:27:17,640
XML stands for Extensible Markup Language.
10075
07:27:18,760 --> 07:27:21,480
There were a number of markup languages in the 90s
10076
07:27:21,480 --> 07:27:25,040
that were out there, ways to send data between computers.
10077
07:27:25,040 --> 07:27:28,720
And none of them was like amazingly better than the other,
10078
07:27:28,720 --> 07:27:33,040
but in the early 1990s, as HTML came out,
10079
07:27:33,040 --> 07:27:36,000
the idea that we could use less thans and greater thans,
10080
07:27:36,000 --> 07:27:38,860
you know, or angle brackets, some people call them.
10081
07:27:41,160 --> 07:27:43,040
Once HTML made angle brackets popular
10082
07:27:43,040 --> 07:27:45,200
as a representation format,
10083
07:27:45,200 --> 07:27:46,920
it was pretty natural that we would find
10084
07:27:46,920 --> 07:27:48,560
a data representation format
10085
07:27:48,560 --> 07:27:50,580
that would take a similar approach.
10086
07:27:50,580 --> 07:27:54,520
And so inside XML, we're gonna talk about tags,
10087
07:27:54,520 --> 07:27:55,600
we're gonna talk about attributes,
10088
07:27:55,600 --> 07:27:56,520
we're gonna talk about data,
10089
07:27:56,520 --> 07:27:58,800
and we've already talked about serialization
10090
07:27:58,800 --> 07:28:00,200
and deserialization.
10091
07:28:00,200 --> 07:28:01,920
Serialization is the act of taking data
10092
07:28:01,920 --> 07:28:04,540
inside of a computer in one programming language,
10093
07:28:04,540 --> 07:28:07,040
setting it up for transport, transporting it across,
10094
07:28:07,040 --> 07:28:09,120
and deserialization is taking it back apart
10095
07:28:09,120 --> 07:28:11,440
and turning it back into the data in,
10096
07:28:11,440 --> 07:28:13,880
whatever internal data it needs to be
10097
07:28:13,880 --> 07:28:16,020
in the destination system.
10098
07:28:16,020 --> 07:28:17,600
So here's some basic XML,
10099
07:28:17,600 --> 07:28:19,120
so we can take a look at the various things
10100
07:28:19,120 --> 07:28:20,240
that make up the XML.
10101
07:28:20,240 --> 07:28:23,860
So it's very much like HTML in that we have tags,
10102
07:28:23,860 --> 07:28:24,700
less than, greater than.
10103
07:28:24,700 --> 07:28:26,280
The difference is we get to name the tags
10104
07:28:26,280 --> 07:28:28,920
anything we want rather than the A tag
10105
07:28:28,920 --> 07:28:30,880
or the P tag or the H1 tag.
10106
07:28:30,880 --> 07:28:32,820
And there is a beginning tag and an ending tag,
10107
07:28:32,820 --> 07:28:33,820
and they're bracketed together.
10108
07:28:33,820 --> 07:28:36,120
And there's syntax errors in XML.
10109
07:28:36,120 --> 07:28:38,360
Syntax errors in XML are more severe
10110
07:28:38,360 --> 07:28:40,800
than syntax errors in HTML.
10111
07:28:40,800 --> 07:28:41,960
It's supposed to be right.
10112
07:28:41,960 --> 07:28:44,280
And if you send that XML,
10113
07:28:44,280 --> 07:28:46,980
it's likely that the far end will not understand it.
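A quick sketch of how strict an XML parser is, using Python's standard `xml.etree.ElementTree`. The helper name `is_well_formed` and both sample strings are hypothetical. A browser would probably shrug off the equivalent mistake in HTML, but the XML parser refuses the whole document.

```python
import xml.etree.ElementTree as ET

good = "<person><name>Chuck</name></person>"
bad = "<person><name>Chuck</person>"  # the closing </name> tag is missing


def is_well_formed(text):
    """Return True if the parser accepts text, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(good))
print(is_well_formed(bad))
```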
10114
07:28:47,880 --> 07:28:50,040
So we have a beginning tag and ending tag,
10115
07:28:50,040 --> 07:28:51,580
and so like name and slash name
10116
07:28:51,580 --> 07:28:53,120
are a beginning and ending pair.
10117
07:28:53,120 --> 07:28:55,440
Then there is the actual textual content,
10118
07:28:55,440 --> 07:28:58,040
and that is the material between it.
10119
07:28:58,040 --> 07:28:59,520
And then here's a phone and slash phone,
10120
07:28:59,520 --> 07:29:01,440
and we have this thing called the attribute.
10121
07:29:01,440 --> 07:29:03,320
Key equals value.
10122
07:29:03,320 --> 07:29:04,640
The key doesn't have double quotes.
10123
07:29:04,640 --> 07:29:06,140
The value always has quotes, usually double quotes.
10124
07:29:06,140 --> 07:29:10,980
And this is like href equals on an anchor tag.
10125
07:29:10,980 --> 07:29:13,540
And sometimes you have what's called a self-closing tag
10126
07:29:13,540 --> 07:29:15,360
where you don't actually have a closing tag.
10127
07:29:15,360 --> 07:29:18,640
You have all the data that you need in the attributes,
10128
07:29:18,640 --> 07:29:20,200
and so you don't even bother putting
10129
07:29:20,200 --> 07:29:22,720
an empty text area and a closing tag.
10130
07:29:22,720 --> 07:29:26,120
So that is a start tag, an end tag, attribute,
10131
07:29:26,120 --> 07:29:27,520
and then a self-closing tag.
10132
07:29:27,520 --> 07:29:31,080
Those are some basics of XML.
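To make those terms a little more concrete, here is a small sketch in Python using the standard library's xml.etree.ElementTree. The sample document is an assumption modeled on the name, phone, and email tags described above, not necessarily the exact text on the slide.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: start and end tags, text content,
# an attribute, and a self-closing tag.
sample = '''<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes" />
</person>'''

person = ET.fromstring(sample)

for child in person:
    # child.tag is the tag name, child.text is the text between the tags
    # (None for a self-closing tag), child.attrib holds the key="value" pairs.
    print(child.tag, repr(child.text), child.attrib)
```

The self-closing email tag has no text at all, only its attribute, which is the case described above.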
10133
07:29:31,080 --> 07:29:35,240
In general, XML doesn't care too much about white space.
10134
07:29:35,240 --> 07:29:39,000
It does in the text areas, so in here it matters,
10135
07:29:39,000 --> 07:29:40,680
and in here it matters, but things like
10136
07:29:40,680 --> 07:29:42,360
we can indent this a little bit differently,
10137
07:29:42,360 --> 07:29:44,200
and we tend to indent it in a way
10138
07:29:44,200 --> 07:29:45,760
to make it look reasonable.
10139
07:29:45,760 --> 07:29:47,840
Although once you have programs sending it back and forth,
10140
07:29:47,840 --> 07:29:50,600
they tend to send it more compacted
10141
07:29:50,600 --> 07:29:52,440
just for efficiency purposes.
10142
07:29:53,920 --> 07:29:57,640
So one of the concepts is that there is
10143
07:29:57,640 --> 07:30:00,900
a hierarchical structure within an XML document,
10144
07:30:00,900 --> 07:30:03,200
and there are parent nodes and child nodes,
10145
07:30:03,200 --> 07:30:05,720
and you can think of these as simple nodes
10146
07:30:05,720 --> 07:30:10,000
that are a tag and some data, or a complex element
10147
07:30:10,000 --> 07:30:14,240
that has a tag that includes other tags, some child tags.
10148
07:30:14,240 --> 07:30:15,520
And there's a couple of different ways
10149
07:30:15,520 --> 07:30:17,120
we can take a look at this.
10150
07:30:18,240 --> 07:30:20,860
The simple and more natural way to think about this
10151
07:30:20,860 --> 07:30:23,640
is a tree with parent-child relationships.
10152
07:30:23,640 --> 07:30:25,920
So here we have this A tag on the outside,
10153
07:30:25,920 --> 07:30:27,720
and that's the top level one.
10154
07:30:27,720 --> 07:30:29,820
You can only have one outer tag,
10155
07:30:29,820 --> 07:30:32,880
and you can only, you can't have another tag down here,
10156
07:30:32,880 --> 07:30:35,520
so you have to have one tag that's sort of the root tag
10157
07:30:35,520 --> 07:30:37,840
for everything in this XML document,
10158
07:30:37,840 --> 07:30:41,400
and it has two children, so the C tag and the B tag
10159
07:30:41,400 --> 07:30:45,000
are two children, so the B tag is a child of A,
10160
07:30:45,000 --> 07:30:48,840
and then C has a D and an E tag that are children there,
10161
07:30:48,840 --> 07:30:53,840
and then the textual data we model as a child
10162
07:30:54,220 --> 07:30:56,880
of each of those tags, and you'll see in a bit
10163
07:30:56,880 --> 07:30:58,740
why it's best to do that.
10164
07:30:58,740 --> 07:31:02,520
So that is the way to think about this as a tree,
10165
07:31:02,520 --> 07:31:05,720
to represent that XML as a tree.
10166
07:31:05,720 --> 07:31:08,200
If we add attributes to it, and this is where you kind of
10167
07:31:08,200 --> 07:31:10,800
see why it's nice to take the text area
10168
07:31:10,800 --> 07:31:12,580
and make that be a child of the node,
10169
07:31:12,580 --> 07:31:14,360
an attribute is different.
10170
07:31:14,360 --> 07:31:17,120
So the text is a special kind of child,
10171
07:31:17,120 --> 07:31:18,960
and you can literally have more than one attribute.
10172
07:31:18,960 --> 07:31:23,600
You could have X equals two, you know, zap equals whatever,
10173
07:31:23,600 --> 07:31:26,480
and these could have a couple of different attributes.
10174
07:31:26,480 --> 07:31:29,200
The W attribute is a value of five,
10175
07:31:29,200 --> 07:31:30,720
and that's the five down there,
10176
07:31:30,720 --> 07:31:32,120
and so you could have multiple ones.
10177
07:31:32,120 --> 07:31:34,200
You can only have one text node.
10178
07:31:34,200 --> 07:31:37,000
Now, in the case of A, you have a whole bunch of text nodes,
10179
07:31:37,000 --> 07:31:39,720
but these are because there are child nodes.
10180
07:31:39,720 --> 07:31:43,800
Within one simple node, you can only have one text element.
10181
07:31:43,800 --> 07:31:46,140
You can also think of XML as paths,
10182
07:31:46,140 --> 07:31:48,560
and the easiest way is to sort of look down
10183
07:31:48,560 --> 07:31:52,000
this tree version and look at the path from the parent.
10184
07:31:52,000 --> 07:31:54,840
So you go to A, then the child B, and then X.
10185
07:31:54,840 --> 07:31:58,040
So at position AB, you find X.
10186
07:31:58,040 --> 07:32:01,040
So AB is the path up to the root,
10187
07:32:01,040 --> 07:32:04,780
so ACD, that's this one, is the path to Y,
10188
07:32:05,760 --> 07:32:08,660
and ACE is the path to Z,
10189
07:32:08,660 --> 07:32:10,840
and so you can think of these as paths.
10190
07:32:10,840 --> 07:32:12,840
Part of what we're doing is we're coming up with ways
10191
07:32:12,840 --> 07:32:16,840
to walk through and parse trees of XML data.
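The path idea can be sketched in a few lines of Python. The XML string here is an assumption that matches the tree described above (a at the root, b and c under it, d and e under c), and text_paths is a hypothetical helper name, not something from the course code.

```python
import xml.etree.ElementTree as ET

# Assumed XML matching the tree described above: a is the root,
# b and c are its children, and d and e are children of c.
doc = '<a><b>X</b><c><d>Y</d><e>Z</e></c></a>'

def text_paths(node, prefix=''):
    """Return (path, text) pairs for every node that holds text."""
    path = prefix + '/' + node.tag
    found = []
    if node.text and node.text.strip():
        found.append((path, node.text.strip()))
    for child in node:
        # Walk down into each child, carrying the path so far.
        found.extend(text_paths(child, path))
    return found

paths = text_paths(ET.fromstring(doc))
for path, text in paths:
    print(path, text)
```

Walking the tree this way visits the parent before its children, so each piece of text comes out next to the path from the root that leads to it.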
10192
07:32:18,000 --> 07:32:20,880
So the next thing we'll talk about is how we determine
10193
07:32:20,880 --> 07:32:24,620
if a particular XML document is legal
10194
07:32:24,620 --> 07:32:28,780
or meets the contracts that two applications have set up.
10195
07:32:34,160 --> 07:32:36,060
We're going to do a little bit of code.
10196
07:32:36,060 --> 07:32:38,120
If you want to get your hands on the code,
10197
07:32:38,120 --> 07:32:41,620
go to the materials website, materials.php,
10198
07:32:43,660 --> 07:32:48,460
actually materials.php, and download the sample code.
10199
07:32:48,460 --> 07:32:51,880
The code that we're going to work on today is the XML code,
10200
07:32:51,880 --> 07:32:56,720
and we need to be able to talk XML to work with web services.
10201
07:32:56,720 --> 07:33:01,560
So here's one of the examples from the book, it's xml1.py.
10202
07:33:01,560 --> 07:33:05,480
And so later we'll be pulling XML and JSON from the web,
10203
07:33:05,480 --> 07:33:06,760
but for now we're just going to put it
10204
07:33:06,760 --> 07:33:10,420
in a triple-quoted string, so data,
10205
07:33:10,420 --> 07:33:13,440
and we're going to use a built-in XML parser
10206
07:33:13,440 --> 07:33:15,840
in Python called ElementTree,
10207
07:33:15,840 --> 07:33:19,320
and when we say import xml.etree.ElementTree,
10208
07:33:19,320 --> 07:33:23,480
this as ET gives us basically a shortcut handle for it.
10209
07:33:25,620 --> 07:33:27,560
And so the idea, this is a string,
10210
07:33:27,560 --> 07:33:28,880
it has less thans and greater thans,
10211
07:33:28,880 --> 07:33:31,400
it looks like structured information, and it is,
10212
07:33:31,400 --> 07:33:33,760
but really at this point it's only a string.
10213
07:33:33,760 --> 07:33:35,800
Now we have to call this ET.fromstring
10214
07:33:35,800 --> 07:33:38,960
to read this and give us back a tree object.
10215
07:33:38,960 --> 07:33:40,660
And what it does is this might blow up,
10216
07:33:40,660 --> 07:33:42,520
this code might blow up right here
10217
07:33:42,520 --> 07:33:45,620
if there was a mistake in it.
10218
07:33:45,620 --> 07:33:47,540
Matter of fact, I can probably put a mistake in,
10219
07:33:47,540 --> 07:33:50,200
let's see if I can delete this and save it
10220
07:33:50,200 --> 07:33:54,360
and run this code, and we'll see that it will blow up.
10221
07:33:59,800 --> 07:34:03,040
Right, and so it blew up, here in line eight,
10222
07:34:03,040 --> 07:34:06,040
element tree blew up, I mean it blew up
10223
07:34:07,920 --> 07:34:10,720
in line 12 of the code, which is right here.
10224
07:34:10,720 --> 07:34:15,160
This failed because the line eight of the XML string
10225
07:34:15,160 --> 07:34:17,840
was wrong, so let's put that back in.
10226
07:34:17,840 --> 07:34:20,120
So now it's properly formed XML.
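That failure can be reproduced with a sketch like this one. The broken string is an assumption (a closing name tag has been left out on purpose), so the line numbers it reports will not match the ones in the demo.

```python
import xml.etree.ElementTree as ET

# Assumed broken input: the closing </name> tag is missing.
broken = '''<person>
  <name>Chuck
  <email hide="yes" />
</person>'''

try:
    ET.fromstring(broken)
    parsed = True
except ET.ParseError as err:
    # err.position is a (line, column) pair inside the XML string,
    # which is separate from the line number in the Python file.
    parsed = False
    print('Could not parse:', err)
    print('Position in the XML (line, column):', err.position)
```

Catching ET.ParseError is one way to keep a program from stopping with a traceback when the far end sends XML that is not well formed.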
10227
07:34:20,120 --> 07:34:22,600
So this tree we get back, I name it tree
10228
07:34:22,600 --> 07:34:24,480
just because I always name it tree,
10229
07:34:24,480 --> 07:34:26,200
but you could name it X.
10230
07:34:26,200 --> 07:34:30,480
So the key is tree.find goes and looks for a tag by name,
10231
07:34:30,480 --> 07:34:33,900
and tree has no longer got less thans and greater thans in it,
10232
07:34:33,900 --> 07:34:36,940
it went and turned these into objects
10233
07:34:36,940 --> 07:34:39,800
within objects within objects.
10234
07:34:39,800 --> 07:34:44,560
So tree.find('name') says I would like to find the tag name,
10235
07:34:44,560 --> 07:34:46,880
and that's what this bit is right here,
10236
07:34:46,880 --> 07:34:49,960
and then .text is going within that
10237
07:34:49,960 --> 07:34:52,400
and grabbing that text, okay?
10238
07:34:52,400 --> 07:34:55,320
And if we say tree.find('email'),
10239
07:34:55,320 --> 07:34:57,780
then that's going to give us this,
10240
07:34:57,780 --> 07:35:01,200
and then that's that object, and then .get
10241
07:35:01,200 --> 07:35:04,400
asks for the contents of the hide attribute,
10242
07:35:04,400 --> 07:35:07,200
which is the string yes, okay?
10243
07:35:07,200 --> 07:35:10,240
And so if we run this, now that it's fixed,
10244
07:35:10,240 --> 07:35:13,080
python3 xml1.py, it will pull in
10245
07:35:13,080 --> 07:35:16,820
and get the name and the attributes.
10246
07:35:16,820 --> 07:35:19,880
So it pulled the Chuck out, and so you get this object
10247
07:35:19,880 --> 07:35:21,960
and then you kind of dive into that object.
10248
07:35:21,960 --> 07:35:24,540
And so that's xml1.py.
10249
07:35:24,540 --> 07:35:28,120
If you've got a tag, you can either get the text
10250
07:35:28,120 --> 07:35:31,880
out of the tag, or you can get an attribute out of the tag.
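Here is a minimal sketch of that pattern, reconstructed from the walkthrough rather than copied from the book's file. The sample data is an assumption trimmed down to the two tags that were discussed.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: a name tag with text and an email tag
# that carries only a hide attribute.
data = '''<person>
  <name>Chuck</name>
  <email hide="yes" />
</person>'''

# At this point data is only a string; fromstring turns it into objects.
tree = ET.fromstring(data)

# find returns the tag object; .text pulls out the text inside it.
name = tree.find('name').text

# .get asks for one attribute by its key.
hide = tree.find('email').get('hide')

print('Name:', name)
print('Attr:', hide)
```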
10251
07:35:31,880 --> 07:35:34,480
So now let's take a look at xml2.py.
10252
07:35:34,480 --> 07:35:37,760
Again, we import element tree, and we have a tag,
10253
07:35:37,760 --> 07:35:41,320
and XML's always got to have a single outer tag.
10254
07:35:41,320 --> 07:35:45,400
But this time we're going to have, in effect, a list.
10255
07:35:45,400 --> 07:35:48,740
Now, let's line this up a little better.
10256
07:35:48,740 --> 07:35:52,280
There we go, that looks a little prettier.
10257
07:35:52,280 --> 07:35:56,920
And so users, the fact that it's users doesn't mean anything,
10258
07:35:56,920 --> 07:36:00,080
but we often come up with semantically meaningful names
10259
07:36:00,080 --> 07:36:01,420
for these things.
10260
07:36:01,420 --> 07:36:05,580
Users is going to have, as children, a list of user tags.
10261
07:36:05,580 --> 07:36:08,720
Okay, so the children under users,
10262
07:36:08,720 --> 07:36:13,720
user under users, and then this has each of these as a tag.
10263
07:36:13,720 --> 07:36:16,720
So we want to parse this, and this is a common thing
10264
07:36:16,720 --> 07:36:17,720
we want to do.
10265
07:36:19,720 --> 07:36:22,720
And so, again, the first thing we do is we read the string
10266
07:36:22,720 --> 07:36:24,720
to just take this, it's a triple-quoted string
10267
07:36:24,720 --> 07:36:26,720
going from here to here.
10268
07:36:26,720 --> 07:36:28,720
And then we're going to, instead of doing find,
10269
07:36:28,720 --> 07:36:31,720
which gives us one tag, we're going to do findall
10270
07:36:31,720 --> 07:36:36,720
for the user tag, the user tag that is a child of users.
10271
07:36:36,720 --> 07:36:40,720
And we get back a Python list of the tags,
10272
07:36:40,720 --> 07:36:43,720
not of the text, but of the tags.
10273
07:36:43,720 --> 07:36:46,720
So there's one tag, and there is another tag.
10274
07:36:46,720 --> 07:36:49,720
And so we can do len of that, so we can see that we got two.
10275
07:36:49,720 --> 07:36:53,720
And then we can write a for loop, and this item is going
10276
07:36:53,720 --> 07:36:56,720
to iterate through the tags that are, the user tags
10277
07:36:56,720 --> 07:36:58,720
that are children of users.
10278
07:36:58,720 --> 07:37:01,720
So the first time item is going to be this tag, a tag,
10279
07:37:01,720 --> 07:37:04,720
remember, and then the second time is going to be this tag.
10280
07:37:04,720 --> 07:37:07,720
And so we can do things like find and get,
10281
07:37:07,720 --> 07:37:11,720
just like we did in xml1.
10282
07:37:11,720 --> 07:37:16,720
So running this is not too exciting, python3 xml2.py.
10283
07:37:17,720 --> 07:37:22,720
You see that there are two users that comes from this print
10284
07:37:22,720 --> 07:37:24,720
right here, there are two users in there.
10285
07:37:24,720 --> 07:37:28,720
And the first one, if we go into name, and we go find
10286
07:37:28,720 --> 07:37:33,720
the text within the name tag, within user, then we get Chuck
10287
07:37:33,720 --> 07:37:37,720
and then we get the ID, which is 001, so we find the ID
10288
07:37:37,720 --> 07:37:39,720
within that item, and then we get the text.
10289
07:37:39,720 --> 07:37:44,720
And then we look and we grab the x attribute off of that.
10290
07:37:44,720 --> 07:37:51,720
And so we see Chuck, Chuck 001 and 2, and then in the next
10291
07:37:51,720 --> 07:37:55,720
tag, the for loop continues, and we print that out, okay?
10292
07:37:55,720 --> 07:38:02,720
And so that's just a basic run through of the XML from
10293
07:38:02,720 --> 07:38:06,720
the chapter in the Python book, okay?
10294
07:38:06,720 --> 07:38:07,720
Thanks.
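For reference, the loop just described looks roughly like this. It is a sketch reconstructed from the walkthrough: the first user's values (Chuck, 001, x of 2) come from the run above, while the outer stuff tag placement and the second user's values are assumptions.

```python
import xml.etree.ElementTree as ET

# One outer tag, then users, then a list of user tags.
# The second user's values are assumed for illustration.
data = '''<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(data)

# findall returns a Python list of tag objects, not of text.
users = stuff.findall('users/user')
print('User count:', len(users))

for item in users:
    # Each item is one user tag, so find and get work as they did before.
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))
```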
10295
07:38:10,720 --> 07:38:12,720
So now we're going to talk a little bit about XML schema.
10296
07:38:12,720 --> 07:38:17,720
XML schema is a language that allows you to decide on
10297
07:38:17,720 --> 07:38:21,720
whether or not a particular XML document meets a contract
10298
07:38:21,720 --> 07:38:22,720
and arrangement.
10299
07:38:22,720 --> 07:38:25,720
So you have two pieces of software exchanging data using XML
10300
07:38:25,720 --> 07:38:28,720
and what if one of them, if they're all working, nobody
10301
07:38:28,720 --> 07:38:30,720
really worries too much about it, but if all of a sudden
10302
07:38:30,720 --> 07:38:33,720
one breaks, you change one side and another one breaks,
10303
07:38:33,720 --> 07:38:34,720
whose fault was it, right?
10304
07:38:34,720 --> 07:38:37,720
Was it the side that got changed or the other side?
10305
07:38:37,720 --> 07:38:38,720
And so you could argue.
10306
07:38:38,720 --> 07:38:41,720
So what you like to do is before you set up these arrangements
10307
07:38:41,720 --> 07:38:44,720
between these applications, set up a contract, in a way
10308
07:38:44,720 --> 07:38:48,720
they're kind of like the RFCs are, except that their scope
10309
07:38:48,720 --> 07:38:51,720
is between pairs of applications.
10310
07:38:51,720 --> 07:38:58,720
And so it itself is XML, and it basically, what we do is we
10311
07:38:58,720 --> 07:39:02,720
take an XML document and an XML schema contract, and then we
10312
07:39:02,720 --> 07:39:05,720
either say that's good or that that is bad, and that's called
10313
07:39:05,720 --> 07:39:10,720
validation, a piece of software that validates XML when given
10314
07:39:10,720 --> 07:39:13,720
a schema is called a validator.
10315
07:39:13,720 --> 07:39:17,720
And so an XML document, here we have our little XML document.
10316
07:39:17,720 --> 07:39:19,720
We're passing it to the validator.
10317
07:39:19,720 --> 07:39:23,720
And then we have a schema contract, which is itself XML.
10318
07:39:23,720 --> 07:39:27,720
It's kind of a particular kind of XML, that XS colon complex
10319
07:39:27,720 --> 07:39:29,720
type, that's just a tag.
10320
07:39:29,720 --> 07:39:32,720
Colon is a legitimate character for the name of a tag.
10321
07:39:32,720 --> 07:39:34,720
Name equals person, that's just an attribute.
10322
07:39:34,720 --> 07:39:39,720
And so XML schema is a particular format of XML that
10323
07:39:39,720 --> 07:39:43,720
renders an opinion about what XML is supposed to look like.
10324
07:39:43,720 --> 07:39:47,720
So there's a number of different XML schema languages, the one
10325
07:39:47,720 --> 07:39:50,720
we're going to look at is one that kind of came a little bit
10326
07:39:50,720 --> 07:39:55,720
later, that's very common, called XSD, which is the
10327
07:39:55,720 --> 07:39:58,720
World Wide Web Consortium's schema specification.
10328
07:39:58,720 --> 07:40:02,720
Often you'll find files that have suffixes of .xsd that
10329
07:40:02,720 --> 07:40:06,720
actually contain the XML just like we're going to show you.
10330
07:40:06,720 --> 07:40:10,720
So if you recall, there are simple elements which have text
10331
07:40:10,720 --> 07:40:14,720
children, and then there are complex elements where other
10332
07:40:14,720 --> 07:40:16,720
nodes are children of other nodes.
10333
07:40:16,720 --> 07:40:18,720
And so we can say this.
10334
07:40:18,720 --> 07:40:21,720
And so here we have a little bit of XML, and the XML schema,
10335
07:40:21,720 --> 07:40:23,720
that makes sense with that.
10336
07:40:23,720 --> 07:40:27,720
So what we're saying is the outer tag of this legitimate
10337
07:40:27,720 --> 07:40:31,720
XML is supposed to be a complex tag with a name of person.
10338
07:40:31,720 --> 07:40:34,720
And so there we go, that looks good, good, good.
10339
07:40:34,720 --> 07:40:38,720
Then there is a sequence, and then there is a simple element,
10340
07:40:38,720 --> 07:40:41,720
a name of last name, looks good.
10341
07:40:41,720 --> 07:40:44,720
And it's a string, that looks good.
10342
07:40:44,720 --> 07:40:48,720
Another tag that's named age, that's of type integer,
10343
07:40:48,720 --> 07:40:49,720
that's good.
10344
07:40:49,720 --> 07:40:53,720
And then a thing that's called date born, and then it looks
10345
07:40:53,720 --> 07:40:54,720
like a date.
10346
07:40:54,720 --> 07:40:57,720
So we check all these things, and we can basically say,
10347
07:40:57,720 --> 07:41:02,720
yup, that is a good XML document according to this schema.
10348
07:41:02,720 --> 07:41:05,720
And you don't have to write this generally, but there is
10349
07:41:05,720 --> 07:41:08,720
software that reads these two things and comes back with a
10350
07:41:08,720 --> 07:41:12,720
true or a false, and might even have some detail as to what
10351
07:41:12,720 --> 07:41:17,720
went wrong with this particular schema.
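Python's standard library does not include an XSD validator, so the sketch below is only a hand-written stand-in for one: it hard-codes the same three rules (a string last name, an integer age, a date born) instead of reading a schema file. The names CONTRACT and check_person and the sample values are hypothetical, and the type checks are looser than real XSD types.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical contract: child tag name paired with a function that
# raises ValueError when the text is not of the expected type.
CONTRACT = [
    ('lastname', str),
    ('age', int),
    ('dateborn', date.fromisoformat),
]

def check_person(xml_text):
    """Return a list of problems; an empty list means the document passed."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != 'person':
        problems.append('outer tag should be person, got ' + root.tag)
    tags = [child.tag for child in root]
    expected = [name for name, _ in CONTRACT]
    if tags != expected:
        # The sequence of child tags has to match the contract exactly.
        problems.append('expected children %s, got %s' % (expected, tags))
        return problems
    for child, (name, convert) in zip(root, CONTRACT):
        try:
            convert(child.text or '')
        except ValueError:
            problems.append('%s has a bad value: %r' % (name, child.text))
    return problems

# Assumed sample documents: one that meets the contract and one that does not.
good = ('<person><lastname>Severance</lastname><age>17</age>'
        '<dateborn>2001-04-17</dateborn></person>')
bad = ('<person><lastname>Severance</lastname><age>seventeen</age>'
       '<dateborn>2001-04-17</dateborn></person>')

print(check_person(good))
print(check_person(bad))
```

A real validator does the same kind of job from the schema itself, which is why both sides can agree on the schema and let software settle whose fault a break was.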
10352
07:41:17,720 --> 07:41:21,720
Here's some more that you can do with a schema.
10353
07:41:21,720 --> 07:41:24,720
We can do things like have a complex type, we have a
10354
07:41:24,720 --> 07:41:25,720
sequence.
10355
07:41:25,720 --> 07:41:29,720
Here we have a string, full name, and a string child name.
10356
07:41:29,720 --> 07:41:31,720
But we have this min occurs and max occurs.
10357
07:41:31,720 --> 07:41:34,720
So min occurs is the minimum number of times it can occur,
10358
07:41:34,720 --> 07:41:36,720
and maximum is the maximum.
10359
07:41:36,720 --> 07:41:39,720
So min occurs equals one, max occurs equals one means it's
10360
07:41:39,720 --> 07:41:40,720
required.
10361
07:41:40,720 --> 07:41:42,720
And so this is required, and we don't have two of them.
10362
07:41:42,720 --> 07:41:44,720
Two of them would be an error.
10363
07:41:44,720 --> 07:41:46,720
One of them is fine, so that's good.
10364
07:41:46,720 --> 07:41:50,720
Here the child name is min occurs zero, max occurs ten.
10365
07:41:50,720 --> 07:41:53,720
So we have four here, and so that's good too.
10366
07:41:53,720 --> 07:41:57,720
And so that is another kind of XML schema constraint that
10367
07:41:57,720 --> 07:41:59,720
you can have.
10368
07:41:59,720 --> 07:42:03,720
Here's a few other data types that we can do.
10369
07:42:03,720 --> 07:42:06,720
We've done the string, we've done the date.
10370
07:42:06,720 --> 07:42:07,720
The date looks like this.
10371
07:42:07,720 --> 07:42:12,720
Dates are four digit year, two digit month, two digit day
10372
07:42:12,720 --> 07:42:13,720
with dashes.
10373
07:42:13,720 --> 07:42:15,720
Now there's lots of different ways to represent dates, but
10374
07:42:15,720 --> 07:42:19,720
the nice thing about this, and you have to put the zeros in.
10375
07:42:19,720 --> 07:42:21,720
So zero, nine for September.
10376
07:42:21,720 --> 07:42:23,720
It means that these are sortable as strings.
10377
07:42:23,720 --> 07:42:26,720
So that if you do all your dates this way, they're
10378
07:42:26,720 --> 07:42:27,720
sortable as strings.
10379
07:42:27,720 --> 07:42:29,720
So you could argue what is prettier, but for computers we
10380
07:42:29,720 --> 07:42:30,720
don't worry about that.
10381
07:42:30,720 --> 07:42:33,720
We're arguing about what's the most functional.
10382
07:42:33,720 --> 07:42:37,720
And then the date time is that same date format with zeros
10383
07:42:37,720 --> 07:42:41,720
followed by the letter T, and then followed by hours,
10384
07:42:41,720 --> 07:42:44,720
minutes, seconds, zero filled, right?
10385
07:42:44,720 --> 07:42:49,720
So nine o'clock is zero, nine, and then the time zone,
10386
07:42:49,720 --> 07:42:51,720
which we'll talk about in a second on the next slide.
10387
07:42:51,720 --> 07:42:54,720
You can have decimal numbers and you can have integer
10388
07:42:54,720 --> 07:42:55,720
numbers as well.
10389
07:42:55,720 --> 07:42:58,720
And so we are able to sort of render an opinion as to what
10390
07:42:58,720 --> 07:43:03,720
is good and what is bad in the resulting XML.
10391
07:43:03,720 --> 07:43:05,720
So dates are kind of interesting.
10392
07:43:05,720 --> 07:43:08,720
There's, again, we have lots of different formats of dates,
10393
07:43:08,720 --> 07:43:15,720
you know, nine slash 10 slash 2002, right?
10394
07:43:15,720 --> 07:43:18,720
You know, that's a format of date, but that's one.
10395
07:43:18,720 --> 07:43:21,720
There's another format of the date, which is, you know,
10396
07:43:21,720 --> 07:43:24,720
12 December, whatever.
10397
07:43:24,720 --> 07:43:26,720
And so this is how people show dates.
10398
07:43:26,720 --> 07:43:29,720
Computers don't want to have all those different dates
10399
07:43:29,720 --> 07:43:31,720
and don't want to figure those out.
10400
07:43:31,720 --> 07:43:34,720
They have libraries that produce dates and make them look
10401
07:43:34,720 --> 07:43:36,720
pretty for particular locales.
10402
07:43:36,720 --> 07:43:40,720
But computers really want dates that work best for them.
10403
07:43:40,720 --> 07:43:42,720
So we just say, okay, we're going to have this year,
10404
07:43:42,720 --> 07:43:46,720
month, day, time, and then zero fill, hours, minutes,
10405
07:43:46,720 --> 07:43:50,720
seconds, h, m, s, and then time zone.
10406
07:43:50,720 --> 07:43:53,720
Now computers even prefer a time zone.
10407
07:43:53,720 --> 07:43:56,720
I don't know if you've used something like your Google
10408
07:43:56,720 --> 07:43:59,720
calendar and you take a flight or take a train trip and you
10409
07:43:59,720 --> 07:44:02,720
have a different time zone, everything switches.
10410
07:44:02,720 --> 07:44:05,720
And that's because Google Calendar is not really storing
10411
07:44:05,720 --> 07:44:09,720
the time zone that you're, it's not storing the dates
10412
07:44:09,720 --> 07:44:13,720
in your current time zone, it's storing them in what we call
10413
07:44:13,720 --> 07:44:16,720
universal time or Greenwich Mean Time.
10414
07:44:16,720 --> 07:44:18,720
Zulu Time is another word for that.
10415
07:44:18,720 --> 07:44:22,720
And Z means this time that is the time in, you know,
10416
07:44:22,720 --> 07:44:25,720
London, England, Greenwich Mean Time.
10417
07:44:25,720 --> 07:44:29,720
And so the thing is that that means if this data moves
10418
07:44:29,720 --> 07:44:32,720
between time zones or crosses the international date line
10419
07:44:32,720 --> 07:44:35,720
or standard or daylight saving time or anything like that,
10420
07:44:35,720 --> 07:44:37,720
none of that changes.
10421
07:44:37,720 --> 07:44:41,720
And so we have this internal date and time that's very
10422
07:44:41,720 --> 07:44:45,720
common in situations where computers are exchanging data
10423
07:44:45,720 --> 07:44:49,720
that then gets shown with a time zone converted to the
10424
07:44:49,720 --> 07:44:52,720
time zone or the local format that's the right way to do that.
10425
07:44:52,720 --> 07:44:55,720
And there's a standard for how dates and times are supposed
10426
07:44:55,720 --> 07:44:56,720
to look.
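Both points about this date format can be checked with a short Python sketch. The sample dates and the timestamp are assumptions chosen for illustration, and the fixed five-hour offset stands in for whatever local time zone a viewer happens to be in.

```python
from datetime import datetime, timedelta, timezone

# Zero-filled year-month-day strings sort correctly as plain strings.
dates = ['2002-10-09', '2002-09-24', '2001-12-31']
in_order = sorted(dates)
print(in_order)

# Assumed sample timestamp in the date, T, time, Z layout described above.
stamp = '2002-05-30T09:30:10Z'
utc_time = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)

# Convert to a viewer's zone only for display; a fixed UTC-5 offset
# stands in for a real local time zone here.
local_time = utc_time.astimezone(timezone(timedelta(hours=-5)))
print(utc_time.isoformat())
print(local_time.isoformat())
```

The stored value stays in universal time, and only the displayed form changes, which is the calendar behavior described above.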
10427
07:44:56,720 --> 07:44:59,720
So here's another little example of some stuff.
10428
07:44:59,720 --> 07:45:00,720
Let's see what we got.
10429
07:45:00,720 --> 07:45:03,720
Now, if you see this little question mark XML,
10430
07:45:03,720 --> 07:45:04,720
that's not a problem.
10431
07:45:04,720 --> 07:45:07,720
That just is a way of sort of putting a header on the whole
10432
07:45:07,720 --> 07:45:09,720
document that says it's an XML document,
10433
07:45:09,720 --> 07:45:12,720
telling it that it's a UTF-8 document.
10434
07:45:12,720 --> 07:45:14,720
And that's not really a tag.
10435
07:45:14,720 --> 07:45:17,720
That's sort of like a marker on the file so that you can put
10436
07:45:17,720 --> 07:45:20,720
that there but it doesn't harm the XML.
10437
07:45:20,720 --> 07:45:24,720
The outer tag is this tag right here, XS colon schema.
10438
07:45:24,720 --> 07:45:27,720
And then what else we got?
10439
07:45:27,720 --> 07:45:28,720
We got an address.
10440
07:45:28,720 --> 07:45:30,720
We got a string, string, string, string, string.
10441
07:45:30,720 --> 07:45:32,720
We've seen all those.
10442
07:45:32,720 --> 07:45:35,720
Here we have country and we're going to have a restriction that
10443
07:45:35,720 --> 07:45:39,720
basically says this is a simple string but we're going to make
10444
07:45:39,720 --> 07:45:44,720
it so that you have to list one of these four as the country
10445
07:45:44,720 --> 07:45:45,720
code.
10446
07:45:45,720 --> 07:45:49,720
And so here we are down here and that's UK and that's UK and
10447
07:45:49,720 --> 07:45:53,720
so that is valid XML.
10448
07:45:53,720 --> 07:45:56,720
Another couple of examples here.
10449
07:45:56,720 --> 07:46:00,720
Let's see, string, string, string, string, string.
10450
07:46:00,720 --> 07:46:02,720
Max occurs unbounded.
10451
07:46:02,720 --> 07:46:04,720
That means infinite number.
10452
07:46:04,720 --> 07:46:05,720
There's no limit on the number.
10453
07:46:05,720 --> 07:46:06,720
You can do that.
10454
07:46:06,720 --> 07:46:08,720
Min occurs of zero.
10455
07:46:08,720 --> 07:46:09,720
XS positive integer.
10456
07:46:09,720 --> 07:46:12,720
We've seen integer but you can also say it's got to be positive
10457
07:46:12,720 --> 07:46:13,720
integer.
10458
07:46:13,720 --> 07:46:14,720
Decimal, we've seen that.
10459
07:46:14,720 --> 07:46:17,720
And then use equals required is just another statement that you
10460
07:46:17,720 --> 07:46:18,720
can make.
10461
07:46:18,720 --> 07:46:21,720
I'm not trying to get you to the point where you can do XML
10462
07:46:21,720 --> 07:46:22,720
schema.
10463
07:46:22,720 --> 07:46:25,720
Just get you a sense of the kinds of statements that we can
10464
07:46:25,720 --> 07:46:28,720
speak about when we're talking about what is and is not
10465
07:46:28,720 --> 07:46:31,720
legitimate XML.
10466
07:46:31,720 --> 07:46:34,720
So let's talk a little bit about how we might talk XML inside
10467
07:46:34,720 --> 07:46:35,720
Python.
10468
07:46:35,720 --> 07:46:40,720
And so like most things that are in this extended part of Python
10469
07:46:40,720 --> 07:46:42,720
we have to import something.
10470
07:46:42,720 --> 07:46:45,720
And so this is the name of a library, xml.etree.ElementTree,
10471
07:46:45,720 --> 07:46:48,720
and then as ET this ends up being a shortcut.
10472
07:46:48,720 --> 07:46:51,720
So we don't have to type these long things.
10473
07:46:51,720 --> 07:46:53,720
And so ET is the same as typing that.
10474
07:46:53,720 --> 07:46:55,720
It's almost like a macro.
10475
07:46:55,720 --> 07:46:58,720
Now normally this XML is going to come somewhere from the
10476
07:46:58,720 --> 07:47:01,720
network but I'm just going to put this in a string.
10477
07:47:01,720 --> 07:47:04,720
I'm using a triple quoted string and so that means that this
10478
07:47:04,720 --> 07:47:07,720
triple quoted string starts here and ends here and all these
10479
07:47:07,720 --> 07:47:10,720
new lines that are here are actually part of the string.
10480
07:47:10,720 --> 07:47:12,720
So this is kind of like I opened a file and read the whole
10481
07:47:12,720 --> 07:47:13,720
thing in.
10482
07:47:13,720 --> 07:47:16,720
But just to keep this totally self-contained I'm putting it
10483
07:47:16,720 --> 07:47:17,720
in a string.
10484
07:47:17,720 --> 07:47:21,720
So the XML would come from some server on the other side of the
10485
07:47:21,720 --> 07:47:23,720
network we would get this XML.
10486
07:47:23,720 --> 07:47:25,720
So that's how it would normally work.
10487
07:47:25,720 --> 07:47:26,720
Okay?
10488
07:47:26,720 --> 07:47:31,720
So this is the XML right there.
10489
07:47:31,720 --> 07:47:37,720
And we parse a string of data and we call ET.fromstring.
10490
07:47:37,720 --> 07:47:40,720
So we're passing in the less thans, the new lines, the
10491
07:47:40,720 --> 07:47:43,720
greater thans, all of this stuff we're passing in.
10492
07:47:43,720 --> 07:47:45,720
And this could have syntax errors in it.
10493
07:47:45,720 --> 07:47:50,720
So this might blow up if this had a syntax error like we forgot
10494
07:47:50,720 --> 07:47:51,720
the little slash or something.
10495
07:47:51,720 --> 07:47:52,720
There was a syntax error.
10496
07:47:52,720 --> 07:47:54,720
But this doesn't have a syntax error.
10497
07:47:54,720 --> 07:47:58,720
So then what we do is we get back an object.
10498
07:47:58,720 --> 07:48:01,720
I just happen to call it tree because it kind of is like that
10499
07:48:01,720 --> 07:48:04,720
tree version of the XML.
10500
07:48:04,720 --> 07:48:07,720
That is an object that we can then query to pull data out of
10501
07:48:07,720 --> 07:48:08,720
it.
10502
07:48:08,720 --> 07:48:13,720
So we say tree.find and look for a tag named name.
10503
07:48:13,720 --> 07:48:16,720
So that finds the tag named name, which is this.
10504
07:48:16,720 --> 07:48:18,720
It's everything.
10505
07:48:18,720 --> 07:48:20,720
It's the tag and the text.
10506
07:48:20,720 --> 07:48:22,720
If we want the text, we add dot text.
10507
07:48:22,720 --> 07:48:26,720
And then that dot text, that dot text, that actually
10508
07:48:26,720 --> 07:48:29,720
refines it to only the word chuck.
10509
07:48:29,720 --> 07:48:36,720
And similarly, if we do tree.find('email'), that tree.find('email'),
10510
07:48:36,720 --> 07:48:39,720
that finds the email tag which is this tag.
10511
07:48:39,720 --> 07:48:42,720
It has a child attribute and you can get any of the
10512
07:48:42,720 --> 07:48:43,720
attributes.
10513
07:48:43,720 --> 07:48:44,720
You say dot get.
10514
07:48:44,720 --> 07:48:46,720
There's only one text child.
10515
07:48:46,720 --> 07:48:48,720
But there are many attribute children.
10516
07:48:48,720 --> 07:48:50,720
And so you have to tell it which one you want.
10517
07:48:50,720 --> 07:48:55,720
And so this here, this bit right here, all of that
10518
07:48:55,720 --> 07:48:58,720
will resolve down to that string yes.
10519
07:48:58,720 --> 07:49:00,720
That's what you're going to get there.
10520
07:49:00,720 --> 07:49:01,720
Yes.
10521
07:49:01,720 --> 07:49:05,720
And so you kind of build up these little finds and
10522
07:49:05,720 --> 07:49:06,720
call methods.
10523
07:49:06,720 --> 07:49:10,720
This is clearly not a full introduction to ElementTree.
10524
07:49:10,720 --> 07:49:13,720
But you get the idea that you sort of dive down in with
10525
07:49:13,720 --> 07:49:16,720
these methods, the call methods, the call methods,
10526
07:49:16,720 --> 07:49:20,720
to get little pieces out and parse all of that.
10527
07:49:20,720 --> 07:49:23,720
Here is a different example.
10528
07:49:23,720 --> 07:49:26,720
In this one, again, we're using triple quoted string.
10529
07:49:26,720 --> 07:49:30,720
We always have a single tag on the outside.
10530
07:49:30,720 --> 07:49:32,720
And then I have a complex type of users.
10531
07:49:32,720 --> 07:49:35,720
And in it, there are two user objects.
10532
07:49:35,720 --> 07:49:37,720
So this is kind of like a list.
10533
07:49:37,720 --> 07:49:39,720
So this is more than one of these things.
10534
07:49:39,720 --> 07:49:42,720
So this user can occur more than one time.
10535
07:49:42,720 --> 07:49:47,720
And again, we take this, we pass that into fromstring
10536
07:49:47,720 --> 07:49:51,720
and get back an object. The name stuff
10537
07:49:51,720 --> 07:49:55,720
does not necessarily have to be the same as this outer tag.
10538
07:49:55,720 --> 07:49:56,720
Just a variable.
10539
07:49:56,720 --> 07:50:00,720
This could just as easily be x if I wanted.
10540
07:50:00,720 --> 07:50:03,720
So now what I'm going to say is, hey, stuff,
10541
07:50:03,720 --> 07:50:07,720
I want to find the tag, the path users slash user.
10542
07:50:07,720 --> 07:50:10,720
I want to find all tags that match users slash user.
10543
07:50:10,720 --> 07:50:14,720
So that's going to give me a list of two tags, one tag,
10544
07:50:14,720 --> 07:50:18,720
two tags in a list.
10545
07:50:18,720 --> 07:50:22,720
Tag, tag.
10546
07:50:22,720 --> 07:50:23,720
Oops.
10547
07:50:23,720 --> 07:50:24,720
So two tags.
10548
07:50:24,720 --> 07:50:26,720
Now I can print out how many I get.
10549
07:50:26,720 --> 07:50:29,720
That'll be two in this case because I got two tags.
10550
07:50:29,720 --> 07:50:33,720
And I can actually iterate through the list.
10551
07:50:33,720 --> 07:50:36,720
So I can iterate through the list.
10552
07:50:36,720 --> 07:50:39,720
So this item is going to iterate first to this tag
10553
07:50:39,720 --> 07:50:43,720
and that tag, now it's like in the previous example,
10554
07:50:43,720 --> 07:50:46,720
we can look for the name tag within there
10555
07:50:46,720 --> 07:50:48,720
and pull the text out.
10556
07:50:48,720 --> 07:50:50,720
So we pull that text out, find the name tag,
10557
07:50:50,720 --> 07:50:53,720
find the name tag, and then within that find the text.
10558
07:50:53,720 --> 07:50:58,720
And we can find the ID tag and pull the text of that out.
10559
07:50:58,720 --> 07:51:05,720
So that pulls out this 001 and I've scribbled too much.
10560
07:51:05,720 --> 07:51:10,720
And then we can item, which is, this is item,
10561
07:51:10,720 --> 07:51:13,720
is that whole tag, dot get x.
10562
07:51:13,720 --> 07:51:16,720
So that gets the attribute, that gets the two,
10563
07:51:16,720 --> 07:51:18,720
that two comes down here.
10564
07:51:18,720 --> 07:51:24,720
And then item goes to the next one
10565
07:51:24,720 --> 07:51:28,720
because item is looping through so item iterates down to that one
10566
07:51:28,720 --> 07:51:33,720
and pulls out the name dot text, the ID dot text,
10567
07:51:33,720 --> 07:51:37,720
and the attribute x, and pulls all those pieces out.
10568
07:51:37,720 --> 07:51:39,720
So this is the basic pattern.
10569
07:51:39,720 --> 07:51:43,720
You saw one where you're tearing into a single thing
10570
07:51:43,720 --> 07:51:45,720
and here you're tearing into something
10571
07:51:45,720 --> 07:51:48,720
that is expected to occur more than one time.
10572
07:51:48,720 --> 07:51:54,720
So that's a quick summary of how you talk to XML in Python.
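The repeated-tag pattern can be sketched like this. It assumes an outer stuff tag holding a users tag with two user tags; the second user's values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in XML: two user tags inside users, inside one outer tag.
data = '''<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(data)          # the variable name need not match the outer tag
lst = stuff.findall('users/user')    # every tag matching the path users/user
print('User count:', len(lst))

for item in lst:                     # item is one whole user tag at a time
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))
```

findall gives back a list, so len and a for loop work on it the same way they do on any other Python list.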
10573
07:51:54,720 --> 07:51:57,720
Up next we're going to talk about the other serialization format,
10574
07:51:57,720 --> 07:52:03,720
JavaScript Object Notation.
10575
07:52:03,720 --> 07:52:06,720
So now we're going to talk about the other serialization format,
10576
07:52:06,720 --> 07:52:08,720
JavaScript Object Notation.
10577
07:52:08,720 --> 07:52:10,720
Chances are good as you go out there,
10578
07:52:10,720 --> 07:52:13,720
you will very likely encounter more JSON than you will XML.
10579
07:52:13,720 --> 07:52:15,720
Not that XML is bad.
10580
07:52:15,720 --> 07:52:19,720
XML is better for rich and hierarchical documents,
10581
07:52:19,720 --> 07:52:23,720
whereas JSON is best for just pulling data out of a system
10582
07:52:23,720 --> 07:52:27,720
and moving it between two systems with the minimum of fuss.
10583
07:52:27,720 --> 07:52:29,720
This is Douglas Crockford.
10584
07:52:29,720 --> 07:52:31,720
I have a great interview with him.
10585
07:52:31,720 --> 07:52:34,720
He's a funny guy, very, very smart.
10586
07:52:34,720 --> 07:52:37,720
He claims he didn't invent JSON, he discovered it
10587
07:52:37,720 --> 07:52:41,720
because it really is based on the literal notation for JavaScript.
10588
07:52:41,720 --> 07:52:44,720
And it actually looks a lot like the Python literal notation
10589
07:52:44,720 --> 07:52:47,720
for objects and for lists.
10590
07:52:47,720 --> 07:52:50,720
Now Douglas Crockford has quite a sense of humor.
10591
07:52:50,720 --> 07:52:53,720
He wrote this book called JavaScript the Good Parts,
10592
07:52:53,720 --> 07:52:54,720
that's the little one right there,
10593
07:52:54,720 --> 07:52:56,720
and then JavaScript the Comprehensive Guide,
10594
07:52:56,720 --> 07:52:59,720
and the sense of humor is all the stuff that's in JavaScript
10595
07:52:59,720 --> 07:53:00,720
that's not too useful.
10596
07:53:00,720 --> 07:53:02,720
And while this is sort of tongue in cheek,
10597
07:53:02,720 --> 07:53:04,720
it also is trying to say that JavaScript,
10598
07:53:04,720 --> 07:53:08,720
what Crockford is really saying here is JavaScript is a great language
10599
07:53:08,720 --> 07:53:10,720
as long as you avoid the tricky bits
10600
07:53:10,720 --> 07:53:12,720
and sort of keep it very, very simple.
10601
07:53:12,720 --> 07:53:15,720
And JavaScript is indeed a great language.
10602
07:53:15,720 --> 07:53:17,720
But JSON comes from JavaScript.
10603
07:53:17,720 --> 07:53:20,720
You can read about JSON at JSON.org.
10604
07:53:20,720 --> 07:53:23,720
JSON is not an international standard.
10605
07:53:23,720 --> 07:53:25,720
It's not like an RFC.
10606
07:53:25,720 --> 07:53:26,720
It really is just that
10607
07:53:26,720 --> 07:53:29,720
Douglas Crockford decided to register JSON.org
10608
07:53:29,720 --> 07:53:32,720
and typed in some pages, and people started reading it
10609
07:53:32,720 --> 07:53:33,720
and people started using it.
10610
07:53:33,720 --> 07:53:36,720
And partly that was because it was truly derived
10611
07:53:36,720 --> 07:53:42,720
from the JavaScript literal syntax.
10612
07:53:42,720 --> 07:53:44,720
So we're all ready to code.
10613
07:53:44,720 --> 07:53:48,720
Here is some Python that's going to process some JSON.
10614
07:53:48,720 --> 07:53:49,720
Keep it straight.
10615
07:53:49,720 --> 07:53:51,720
Python process JSON.
10616
07:53:51,720 --> 07:53:54,720
So again, I'm using the triple-quoted string here.
10617
07:53:54,720 --> 07:53:56,720
Now you'll notice the syntax that we are using
10618
07:53:56,720 --> 07:53:59,720
is not angle brackets, but instead curly braces.
10619
07:53:59,720 --> 07:54:02,720
And so the curly brace, and then within the curly brace
10620
07:54:02,720 --> 07:54:05,720
you have key value pairs, name colon chuck,
10621
07:54:05,720 --> 07:54:08,720
and the key colon value, and both sides have quotes.
10622
07:54:08,720 --> 07:54:12,720
You can also have objects within objects, curly brace,
10623
07:54:12,720 --> 07:54:14,720
key value pairs, key value, key value.
10624
07:54:14,720 --> 07:54:16,720
Looks a lot like Python.
10625
07:54:16,720 --> 07:54:18,720
And then you can do this.
10626
07:54:18,720 --> 07:54:21,720
And so this is a structure that has one key value pair
10627
07:54:21,720 --> 07:54:23,720
that's a string, another key value pair that's an object,
10628
07:54:23,720 --> 07:54:25,720
another key value pair that's an object,
10629
07:54:25,720 --> 07:54:28,720
and then these are key values within those contained objects.
10630
07:54:28,720 --> 07:54:33,720
So this is a string that again probably was retrieved
10631
07:54:33,720 --> 07:54:36,720
across the network from some other place.
10632
07:54:36,720 --> 07:54:40,720
And we're going to pass that string into the JSON library
10633
07:54:40,720 --> 07:54:42,720
method called loads; loads stands for load from string.
10634
07:54:42,720 --> 07:54:45,720
So it reads this, parses it, looks at all the white space.
10635
07:54:45,720 --> 07:54:47,720
White space again doesn't matter too much here
10636
07:54:47,720 --> 07:54:49,720
unless it's in between double quotes.
10637
07:54:49,720 --> 07:54:51,720
The white space doesn't matter.
10638
07:54:51,720 --> 07:54:56,720
And so it parses it and then returns us a dictionary.
10639
07:54:56,720 --> 07:54:58,720
So the thing that's different about JSON
10640
07:54:58,720 --> 07:55:03,720
is that its structure and representation are simpler than XML.
10641
07:55:03,720 --> 07:55:07,720
So in Python, everything either comes back as a dictionary
10642
07:55:07,720 --> 07:55:09,720
or a list, or a dictionary within a dictionary
10643
07:55:09,720 --> 07:55:12,720
or a list within a dictionary, but it's all dictionaries.
10644
07:55:12,720 --> 07:55:15,720
It's not a separate structure that you have to do gets
10645
07:55:15,720 --> 07:55:17,720
and finds and findalls and lookups.
10646
07:55:17,720 --> 07:55:18,720
So it's right there.
10647
07:55:18,720 --> 07:55:23,720
So when we get this back, because this is a curly brace,
10648
07:55:23,720 --> 07:55:25,720
info is a dictionary.
10649
07:55:27,720 --> 07:55:32,720
And so we can just use the standard syntax of Python,
10650
07:55:32,720 --> 07:55:34,720
info sub name.
10651
07:55:34,720 --> 07:55:39,720
Well, that will bring, let's clear this.
10652
07:55:41,720 --> 07:55:44,720
So info sub name will go find Chuck.
10653
07:55:44,720 --> 07:55:47,720
So if you compare that with the XML, that's just a lot easier.
10654
07:55:47,720 --> 07:55:51,720
Now, when we have info sub email, that's this thing.
10655
07:55:51,720 --> 07:55:53,720
So info sub email is that thing.
10656
07:55:53,720 --> 07:55:57,720
And then sub hide is this.
10657
07:55:57,720 --> 07:55:59,720
So that's what comes out here.
10658
07:55:59,720 --> 07:56:02,720
So it's really nested dictionaries and lists.
10659
07:56:02,720 --> 07:56:05,720
We haven't seen a list yet, but this is a set of nested
10660
07:56:05,720 --> 07:56:08,720
dictionaries that it parses.
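A small sketch of that nested-dictionary example. The JSON string is a stand-in for the slide, so the phone entry and its values are assumptions.

```python
import json

# Stand-in JSON: one string value and two nested objects.
data = '''{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "+1 734 303 4456"},
  "email": {"hide": "yes"}
}'''

info = json.loads(data)                  # curly brace on the outside, so we get a dict
print('Name:', info['name'])             # plain dictionary lookup
print('Hide:', info['email']['hide'])    # a dictionary within a dictionary
```

There is no find or get step here; the parsed result is already ordinary Python dictionaries.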
10661
07:56:08,720 --> 07:56:11,720
And it's equally simple in other programming languages.
10662
07:56:11,720 --> 07:56:15,720
This is a little more complex version where the outer element
10663
07:56:15,720 --> 07:56:20,720
is a square bracket, which means it's going to be a list.
10664
07:56:20,720 --> 07:56:24,720
And so we have a list of one, comma, two things.
10665
07:56:24,720 --> 07:56:28,720
So this is a list of two dictionaries.
10666
07:56:28,720 --> 07:56:31,720
So there's two dictionaries inside that list.
10667
07:56:31,720 --> 07:56:35,720
So again, we take this string and we load it into,
10668
07:56:35,720 --> 07:56:39,720
use the JSON parser to read the string and give us back.
10669
07:56:39,720 --> 07:56:42,720
In this case, info is a list.
10670
07:56:42,720 --> 07:56:43,720
It's got two items.
10671
07:56:43,720 --> 07:56:46,720
If we print out the length of info, it'll give us two.
10672
07:56:46,720 --> 07:56:48,720
And we're going to iterate through.
10673
07:56:48,720 --> 07:56:51,720
And so if we're going to iterate through,
10674
07:56:51,720 --> 07:56:55,720
item is going to first be this,
10675
07:56:55,720 --> 07:56:58,720
and then it's going to iterate to this.
10676
07:56:58,720 --> 07:57:00,720
And it's going to print out item sub name,
10677
07:57:00,720 --> 07:57:02,720
which is going to print out chuck, item sub id,
10678
07:57:02,720 --> 07:57:05,720
which is going to print out 001.
10679
07:57:05,720 --> 07:57:08,720
Now you'll notice that there are no attributes.
10680
07:57:08,720 --> 07:57:11,720
And that's because JSON is simpler.
10681
07:57:11,720 --> 07:57:14,720
But we can have the x just as another item.
10682
07:57:14,720 --> 07:57:18,720
So we say item sub x, and that's going to print the two out.
10683
07:57:18,720 --> 07:57:20,720
And then it'll iterate to the next one,
10684
07:57:20,720 --> 07:57:23,720
and it'll print out the same thing for those guys.
10685
07:57:23,720 --> 07:57:26,720
And so JSON is simpler because it is,
10686
07:57:26,720 --> 07:57:30,720
you can't represent as complex a data structure,
10687
07:57:30,720 --> 07:57:32,720
or you have to compromise and map it
10688
07:57:32,720 --> 07:57:34,720
into a simpler data structure.
10689
07:57:34,720 --> 07:57:37,720
But then it is lists and dictionaries.
10690
07:57:37,720 --> 07:57:39,720
And so once you've got it parsed,
10691
07:57:39,720 --> 07:57:44,720
it is easier to understand and to make use of.
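The list version can be sketched the same way. The second entry is made up, and x is just another key because JSON has no attributes.

```python
import json

# Stand-in JSON: square bracket on the outside, so this parses to a list.
data = '''[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Brent"}
]'''

info = json.loads(data)
print('User count:', len(info))      # two dictionaries in the list

for item in info:                    # item is one dictionary at a time
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])
```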
10692
07:57:44,720 --> 07:57:45,720
So that was quick.
10693
07:57:45,720 --> 07:57:48,720
So that's partly why everyone likes JSON better,
10694
07:57:48,720 --> 07:57:50,720
is once you have come up with the format
10695
07:57:50,720 --> 07:57:52,720
that you're going to send it back and forth,
10696
07:57:52,720 --> 07:57:54,720
it's easy to make it, and it's easy to read it.
10697
07:57:54,720 --> 07:57:57,720
Now what we're going to talk about is sort of moving up a level.
10698
07:57:57,720 --> 07:58:00,720
If you've got all these data formats
10699
07:58:00,720 --> 07:58:03,720
and URLs that you can hit to pull those data formats down,
10700
07:58:03,720 --> 07:58:08,720
what approach do you take as you start to construct applications
10701
07:58:08,720 --> 07:58:11,720
that increasingly go from a single application
10702
07:58:11,720 --> 07:58:13,720
to a networked application?
10703
07:58:18,720 --> 07:58:21,720
We're playing with the web services chapter right now.
10704
07:58:21,720 --> 07:58:28,720
And if you want to get the materials for this course,
10705
07:58:28,720 --> 07:58:33,720
you can go here and download the sample zip,
10706
07:58:33,720 --> 07:58:34,720
samplecode.zip.
10707
07:58:34,720 --> 07:58:37,720
I've got this all sitting already on my computer.
10708
07:58:37,720 --> 07:58:39,720
I also have the whole thing in GitHub
10709
07:58:39,720 --> 07:58:40,720
if you want to get it out of GitHub.
10710
07:58:40,720 --> 07:58:42,720
So the thing we're talking about now
10711
07:58:42,720 --> 07:58:46,720
is we're talking about the json1.py example from the book.
10712
07:58:46,720 --> 07:58:50,720
And so JSON is kind of like XML except a lot simpler.
10713
07:58:50,720 --> 07:58:52,720
And that's why a lot of people like it.
10714
07:58:52,720 --> 07:58:54,720
It's not that JSON is always better,
10715
07:58:54,720 --> 07:58:57,720
but JSON is better in a lot of situations
10716
07:58:57,720 --> 07:59:00,720
that don't require the complexity of XML.
10717
07:59:00,720 --> 07:59:02,720
So we start with import json.
10718
07:59:02,720 --> 07:59:05,720
JSON is built into Python, but we have to ask to import it.
10719
07:59:05,720 --> 07:59:10,720
Again, we're using a triple-quoted string to put the JSON in there.
10720
07:59:10,720 --> 07:59:14,720
And JSON looks a lot like Python dictionaries, key-value pairs.
10721
07:59:14,720 --> 07:59:15,720
Key-value pairs.
10722
07:59:15,720 --> 07:59:17,720
In this case, this is a key,
10723
07:59:17,720 --> 07:59:21,720
and the value itself is another dictionary,
10724
07:59:21,720 --> 07:59:23,720
or in JSON terms, an object.
10725
07:59:23,720 --> 07:59:25,720
But again, key-value pairs within key-value pairs
10726
07:59:25,720 --> 07:59:27,720
within key-value pairs.
10727
07:59:27,720 --> 07:59:29,720
And all these little cursor guys have to,
10728
07:59:29,720 --> 07:59:33,720
all these little curly-brace guys have to line up properly.
10729
07:59:33,720 --> 07:59:38,720
And so, like all the time, this is a string,
10730
07:59:38,720 --> 07:59:41,720
which we normally would read and decode from the Internet.
10731
07:59:41,720 --> 07:59:43,720
But for now, we're just going to have it in there.
10732
07:59:43,720 --> 07:59:47,720
json.loads says go into the json library, pull out load string,
10733
07:59:47,720 --> 07:59:53,720
and parse this, which turns this set of curly braces, spaces, commas,
10734
07:59:53,720 --> 07:59:56,720
and perhaps syntax errors into a structured object.
10735
07:59:56,720 --> 08:00:00,720
And if we'd made a syntax error in here, then this would blow up.
10736
08:00:00,720 --> 08:00:02,720
But if this doesn't make a syntax error,
10737
08:00:02,720 --> 08:00:06,720
if this doesn't blow up, then we have a structured representation.
10738
08:00:06,720 --> 08:00:10,720
Now, the difference between XML and JSON in Python
10739
08:00:10,720 --> 08:00:15,720
is that this turns into a Python dictionary with key-value pairs.
10740
08:00:15,720 --> 08:00:19,720
And so, once we have this, this is a dictionary.
10741
08:00:19,720 --> 08:00:22,720
And we can say info sub name,
10742
08:00:22,720 --> 08:00:25,720
and that's the exact syntax that we would use to get the dictionary.
10743
08:00:25,720 --> 08:00:28,720
And that's going to extract this value out of there.
10744
08:00:28,720 --> 08:00:33,720
And if we want to go in deeper, we can say info sub email,
10745
08:00:33,720 --> 08:00:37,720
and that's what info sub email is right there, and then sub hide.
10746
08:00:37,720 --> 08:00:40,720
So that's a dictionary within a dictionary.
10747
08:00:40,720 --> 08:00:50,720
So if we run this, python3 json1.py, it digs in really fast.
10748
08:00:50,720 --> 08:00:52,720
And so this is why people tend to like JSON,
10749
08:00:52,720 --> 08:00:54,720
is because you'll read the JSON,
10750
08:00:54,720 --> 08:00:57,720
which is actually a syntax derived from JavaScript,
10751
08:00:57,720 --> 08:01:00,720
but it looks just like the syntax for Python.
10752
08:01:00,720 --> 08:01:04,720
So that's moving an object, a JSON object
10753
08:01:04,720 --> 08:01:07,720
that turns directly into a Python dictionary
10754
08:01:07,720 --> 08:01:09,720
with nested dictionaries.
10755
08:01:09,720 --> 08:01:11,720
Now we're going to look at json2.
10756
08:01:11,720 --> 08:01:14,720
And so in json2, we're going to see a list,
10757
08:01:14,720 --> 08:01:16,720
or an array in JSON terms,
10758
08:01:16,720 --> 08:01:18,720
but it turns into a list in Python terms.
10759
08:01:18,720 --> 08:01:22,720
So this is a list of dictionaries.
10760
08:01:22,720 --> 08:01:25,720
In JavaScript, that would be an array of objects,
10761
08:01:25,720 --> 08:01:27,720
but in Python, it's a list of dictionaries.
10762
08:01:27,720 --> 08:01:30,720
So we'll just pretend that it's a list of dictionaries.
10763
08:01:30,720 --> 08:01:35,720
Again, we load the string, parsing, looking for syntax errors.
10764
08:01:35,720 --> 08:01:42,720
So let's just make a syntax error here and run python json2.py,
10765
08:01:42,720 --> 08:01:44,720
and you'll see where it blows up.
10766
08:01:44,720 --> 08:01:47,720
It blows up at line 15, which is right here.
10767
08:01:47,720 --> 08:01:49,720
It's like this loads blows up.
10768
08:01:49,720 --> 08:01:51,720
Now you could put a try/except around it to save it,
10769
08:01:51,720 --> 08:01:53,720
but we're not going to do that.
10770
08:01:53,720 --> 08:01:54,720
And it even complains.
10771
08:01:54,720 --> 08:01:57,720
It says, look, we're expecting something here in line 11.
10772
08:01:57,720 --> 08:01:59,720
And that's line 11 of the JSON,
10773
08:01:59,720 --> 08:02:02,720
which starts at line 4.
10774
08:02:02,720 --> 08:02:05,720
And so I'll put my little square brace back in
10775
08:02:05,720 --> 08:02:07,720
so it's not syntactically broken.
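If you did want the try/except, one way it might look is sketched below. The broken string is made up, and catching json.JSONDecodeError is one reasonable choice here, not something the sample code does.

```python
import json

# Deliberately broken JSON: the closing square bracket is missing.
broken = '''[
  {"id": "001", "name": "Chuck"}
'''

try:
    info = json.loads(broken)
except json.JSONDecodeError as err:
    # lineno counts lines inside the JSON text, not lines of the Python file.
    info = None
    print('Could not parse JSON:', err.msg, 'at line', err.lineno)
```

The line number in the error is relative to the start of the JSON string, which is why it does not match the line of the Python file where loads was called.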
10776
08:02:07,720 --> 08:02:10,720
So let's run it again and make sure that she runs,
10777
08:02:10,720 --> 08:02:11,720
and yes, she does.
10778
08:02:11,720 --> 08:02:16,720
So this parses it and converts from the JSON syntax
10779
08:02:16,720 --> 08:02:19,720
into a Python, in this case, list,
10780
08:02:19,720 --> 08:02:22,720
because it's got square braces instead of curly braces.
10781
08:02:22,720 --> 08:02:25,720
The previous example had curly braces.
10782
08:02:25,720 --> 08:02:28,720
And we can then take a len of it, and it's an array,
10783
08:02:28,720 --> 08:02:31,720
it's a list, and we see that there are two things in there.
10784
08:02:31,720 --> 08:02:33,720
And then we're going to iterate through,
10785
08:02:33,720 --> 08:02:36,720
and this item is going to iterate through these dictionaries,
10786
08:02:36,720 --> 08:02:38,720
that dictionary followed by that dictionary.
10787
08:02:38,720 --> 08:02:42,720
So the first time it's item sub name,
10788
08:02:42,720 --> 08:02:44,720
which is this value right here,
10789
08:02:44,720 --> 08:02:47,720
and then item sub id, which is this value.
10790
08:02:47,720 --> 08:02:49,720
So you can dig right into this,
10791
08:02:49,720 --> 08:02:51,720
but you're using, you're not using get
10792
08:02:51,720 --> 08:02:55,720
and you're not using the weird extra find or find all or anything.
10793
08:02:55,720 --> 08:02:59,720
You just are going at these structures directly.
10794
08:02:59,720 --> 08:03:02,720
And so you can quickly extract this stuff out,
10795
08:03:02,720 --> 08:03:07,720
and we read through id's name is Chuck.
10796
08:03:07,720 --> 08:03:09,720
Oops, name is Chuck.
10797
08:03:09,720 --> 08:03:12,720
There are no attributes, by the way.
10798
08:03:12,720 --> 08:03:15,720
x is two, and so we had to make x.
10799
08:03:15,720 --> 08:03:16,720
So if you look at the XML,
10800
08:03:16,720 --> 08:03:19,720
we had this concept of attributes on the outer tag.
10801
08:03:19,720 --> 08:03:21,720
These things are also not named.
10802
08:03:21,720 --> 08:03:23,720
We just have to know what we're looking for.
10803
08:03:23,720 --> 08:03:25,720
JSON represents simple structures,
10804
08:03:25,720 --> 08:03:29,720
but it's much simpler to use.
10805
08:03:29,720 --> 08:03:34,720
So I hope this has been useful to you
10806
08:03:34,720 --> 08:03:37,720
and talk to you in a bit about some more JSON.
10807
08:03:41,720 --> 08:03:45,720
So the service-oriented approach is a way we approach solving
10808
08:03:45,720 --> 08:03:47,720
a complex application problem
10809
08:03:47,720 --> 08:03:50,720
where all the data really isn't present in one computer system.
10810
08:03:50,720 --> 08:03:53,720
It's somehow spread out over the internet,
10811
08:03:53,720 --> 08:03:56,720
connected via the internet or internal network.
10812
08:03:56,720 --> 08:04:00,720
And so the idea is that some applications
10813
08:04:00,720 --> 08:04:02,720
just can't contain everything.
10814
08:04:02,720 --> 08:04:05,720
The perfect example is a travel website
10815
08:04:05,720 --> 08:04:08,720
that can book you a flight, book you a car,
10816
08:04:08,720 --> 08:04:11,720
buy tickets, book you a hotel, and do all these things.
10817
08:04:11,720 --> 08:04:15,720
Well, that travel website is neither a hotel
10818
08:04:15,720 --> 08:04:17,720
nor a rental car company nor an airline,
10819
08:04:17,720 --> 08:04:19,720
but what it really does is it talks to all these services
10820
08:04:19,720 --> 08:04:21,720
somewhere else on the web on your behalf,
10821
08:04:21,720 --> 08:04:23,720
and it makes reservations for you.
10822
08:04:23,720 --> 08:04:26,720
And so you have this convenient user interface that says,
10823
08:04:26,720 --> 08:04:28,720
oh, here's your whole vacation.
10824
08:04:28,720 --> 08:04:30,720
I'm going to figure all this stuff out.
10825
08:04:30,720 --> 08:04:33,720
Now you say go, and it goes book, book, book, book,
10826
08:04:33,720 --> 08:04:35,720
and books on all these other systems.
10827
08:04:35,720 --> 08:04:38,720
Now it requires a lot of infrastructure,
10828
08:04:38,720 --> 08:04:41,720
a lot of coordination, and a lot of effort
10829
08:04:41,720 --> 08:04:44,720
to make sure that your application can talk.
10830
08:04:44,720 --> 08:04:46,720
And these other services that are out there in the internet
10831
08:04:46,720 --> 08:04:49,720
have good contracts, and you know exactly
10832
08:04:49,720 --> 08:04:53,720
how to send data to them and get data back from them.
10833
08:04:53,720 --> 08:04:55,720
And so initially, when you're building a
10834
08:04:55,720 --> 08:04:58,720
service-oriented architecture, often you have one application,
10835
08:04:58,720 --> 08:05:01,720
and it's all internal, often it's all one language,
10836
08:05:01,720 --> 08:05:03,720
and then maybe you'll say, oh, wait a sec.
10837
08:05:03,720 --> 08:05:05,720
We want to take part of what we do and put it
10838
08:05:05,720 --> 08:05:08,720
in a second system, and then sort of come up
10839
08:05:08,720 --> 08:05:10,720
with a set of rules between the systems,
10840
08:05:10,720 --> 08:05:15,720
and then more and more and more.
10841
08:05:15,720 --> 08:05:17,720
So now that we're solving our problem
10842
08:05:17,720 --> 08:05:20,720
using a series of cooperating applications
10843
08:05:20,720 --> 08:05:22,720
communicating across the network,
10844
08:05:22,720 --> 08:05:24,720
we're going to talk in a little bit more detail
10845
08:05:24,720 --> 08:05:27,720
about the notion of what we call web services.
10846
08:05:27,720 --> 08:05:30,720
And in this, we're going to take a different perspective.
10847
08:05:30,720 --> 08:05:32,720
Instead of building our application
10848
08:05:32,720 --> 08:05:34,720
and breaking it into pieces, we're
10849
08:05:34,720 --> 08:05:36,720
going to have an application that's going to really consume
10850
08:05:36,720 --> 08:05:37,720
an API from somebody else.
10851
08:05:37,720 --> 08:05:43,720
So there is some other provider of this API that's not us.
10852
08:05:43,720 --> 08:05:46,720
And so if you're going to talk to somebody's data,
10853
08:05:46,720 --> 08:05:50,720
like Google or Amazon or Twitter,
10854
08:05:50,720 --> 08:05:52,720
they're going to say, you have to use our API.
10855
08:05:52,720 --> 08:05:53,720
So what's that?
10856
08:05:53,720 --> 08:05:57,720
So an API is a contract that says, look, if you do this,
10857
08:05:57,720 --> 08:05:59,720
and this and this and this, we're going to give you data this way.
10858
08:05:59,720 --> 08:06:01,720
And they set the rules.
10859
08:06:01,720 --> 08:06:03,720
They tell you what the URLs are.
10860
08:06:03,720 --> 08:06:05,720
They'll tell you if it's XML or JSON.
10861
08:06:05,720 --> 08:06:07,720
And this is called the Application Program Interface.
10862
08:06:07,720 --> 08:06:11,720
And it's something you read and you understand.
10863
08:06:11,720 --> 08:06:14,720
And so you go look at the documentation.
10864
08:06:14,720 --> 08:06:17,720
This is the documentation for the Google Maps API.
10865
08:06:17,720 --> 08:06:20,720
So it turns out that Google knows a lot about maps.
10866
08:06:20,720 --> 08:06:21,720
It knows a lot of data.
10867
08:06:21,720 --> 08:06:23,720
It knows how to search maps.
10868
08:06:23,720 --> 08:06:26,720
And it actually provides some of those features to you
10869
08:06:26,720 --> 08:06:29,720
that your application can take advantage of.
10870
08:06:29,720 --> 08:06:31,720
I took advantage of this at one point
10871
08:06:31,720 --> 08:06:35,720
by asking all the students in one section of one of my online courses
10872
08:06:35,720 --> 08:06:36,720
where they were from.
10873
08:06:36,720 --> 08:06:38,720
And I just let them type in where it was.
10874
08:06:38,720 --> 08:06:41,720
And then I said, well, I don't know how to code any of that.
10875
08:06:41,720 --> 08:06:44,720
So I used this API doing what's called geocoding
10876
08:06:44,720 --> 08:06:48,720
to look all those places up and get precise latitudes and longitudes
10877
08:06:48,720 --> 08:06:50,720
for the ones Google could figure out.
10878
08:06:50,720 --> 08:06:52,720
And that saved me a lot of work.
10879
08:06:52,720 --> 08:06:54,720
Now, these are expensive resources,
10880
08:06:54,720 --> 08:06:58,720
but I could be patient and make use of these resources,
10881
08:06:58,720 --> 08:07:02,720
which as long as you use them not too much, they can be free.
10882
08:07:02,720 --> 08:07:04,720
We'll talk a little bit more about rate limiting
10883
08:07:04,720 --> 08:07:06,720
and what's free and what's not in a bit.
10884
08:07:06,720 --> 08:07:08,720
But you start by reading documentation.
10885
08:07:08,720 --> 08:07:12,720
It says, do this, hit this URL, hit that URL.
10886
08:07:12,720 --> 08:07:14,720
So if you read that documentation,
10887
08:07:14,720 --> 08:07:18,720
you will find that there is a URL that you can hit.
10888
08:07:18,720 --> 08:07:20,720
And they tell you where to go.
10889
08:07:20,720 --> 08:07:22,720
And then you go to this URL.
10890
08:07:22,720 --> 08:07:23,720
You add a question mark.
10891
08:07:23,720 --> 08:07:25,720
And then you say address equals.
10892
08:07:25,720 --> 08:07:27,720
And then Ann plus Arbor.
10893
08:07:27,720 --> 08:07:28,720
And there's all these rules.
10894
08:07:28,720 --> 08:07:30,720
These are called URL encoding rules.
10895
08:07:30,720 --> 08:07:32,720
When you have key values on URLs,
10896
08:07:32,720 --> 08:07:36,720
the plus means space and %2C means comma.
10897
08:07:36,720 --> 08:07:40,720
So these are called URL encoded.
10898
08:07:40,720 --> 08:07:42,720
But don't worry too much about that
10899
08:07:42,720 --> 08:07:44,720
because we're going to have a magic library
10900
08:07:44,720 --> 08:07:46,720
like we always do in Python that takes care of this.
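A quick sketch of that library doing the URL encoding. The service URL below is a placeholder, not the real API address, and the sample address is an assumption.

```python
import urllib.parse

# Placeholder base URL; substitute the one from the API documentation.
serviceurl = 'https://example.com/geocode/json?'

address = 'Ann Arbor, MI'
query = urllib.parse.urlencode({'address': address})   # key=value with URL encoding applied
url = serviceurl + query

print(query)   # address=Ann+Arbor%2C+MI
print(url)
```

The space became a plus and the comma became %2C without us writing any of those rules ourselves.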
10901
08:07:46,720 --> 08:07:49,720
And so if you were to hit this URL,
10902
08:07:49,720 --> 08:07:52,720
you type it in the exact right way in your browser,
10903
08:07:52,720 --> 08:07:54,720
you will get back a JSON document.
10904
08:07:54,720 --> 08:07:56,720
It's an object that has key value pairs.
10905
08:07:56,720 --> 08:07:58,720
The first value is the status,
10906
08:07:58,720 --> 08:08:00,720
then it has these results and it's a list.
10907
08:08:00,720 --> 08:08:02,720
And you dive down and eventually you can kind of find
10908
08:08:02,720 --> 08:08:04,720
the latitude and longitude of the thing
10909
08:08:04,720 --> 08:08:06,720
that you are looking for.
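To make the diving down concrete, here is a sketch that parses a trimmed, made-up response of that general shape. The values are invented, and the location, lat, and lng key names are assumptions about the response layout, so check them against the real documentation.

```python
import json

# Trimmed, made-up response: a status plus a results list, as described above.
sample = '''{
  "status": "OK",
  "results": [
    {
      "formatted_address": "Ann Arbor, MI, USA",
      "geometry": {
        "location": {"lat": 42.2808256, "lng": -83.7430378}
      }
    }
  ]
}'''

js = json.loads(sample)

# Only dig in if the service said OK and gave us at least one result.
if js.get('status') == 'OK' and js['results']:
    location = js['results'][0]['geometry']['location']   # dict, list item 0, dict, dict
    lat = location['lat']
    lng = location['lng']
    print('lat', lat, 'lng', lng)
```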
10910
08:08:06,720 --> 08:08:10,720
And so the idea is can we write a program that can read this?
10911
08:08:10,720 --> 08:08:13,720
And so here's our little program that reads this.
10912
08:08:13,720 --> 08:08:16,720
And a lot of this is sort of comfortable.
10913
08:08:16,720 --> 08:08:19,720
You've already seen some of this.
10914
08:08:19,720 --> 08:08:21,720
You import urllib.
10915
08:08:21,720 --> 08:08:23,720
We have to parse some JSON.
10916
08:08:23,720 --> 08:08:24,720
We grab the URL.
10917
08:08:24,720 --> 08:08:26,720
And then we're going to write a little while loop
10918
08:08:26,720 --> 08:08:28,720
that's going to ask for a location.
10919
08:08:28,720 --> 08:08:30,720
And we can type that location in.
10920
08:08:30,720 --> 08:08:34,720
And we've got to concatenate with this URL
10921
08:08:34,720 --> 08:08:36,720
the location equals.
10922
08:08:36,720 --> 08:08:38,720
And there is a bit of code, a library,
10923
08:08:38,720 --> 08:08:40,720
that's called urllib.parse.urlencode
10924
08:08:40,720 --> 08:08:42,720
that takes the key and the value.
10925
08:08:42,720 --> 08:08:45,720
So the address equals and then whatever this text is
10926
08:08:45,720 --> 08:08:48,720
that we read in from the user, that goes in here.
10927
08:08:48,720 --> 08:08:50,720
And it does that URL encoding with the pluses
10928
08:08:50,720 --> 08:08:51,720
and the %2C.
10929
08:08:51,720 --> 08:08:53,720
And all that stuff is taken care of.
10930
08:08:53,720 --> 08:08:57,720
And that is our URL that we're going to pass to urlopen.
10931
08:08:57,720 --> 08:08:59,720
So we print out that we're going to retrieve it.
10932
08:08:59,720 --> 08:09:00,720
Prints this out.
10933
08:09:00,720 --> 08:09:02,720
And if you look at this, it's too long.
10934
08:09:02,720 --> 08:09:04,720
It has all that fancy stuff on it.
10935
08:09:04,720 --> 08:09:06,720
And then we read it.
10936
08:09:06,720 --> 08:09:08,720
I mean, we open it with urlopen.
10937
08:09:08,720 --> 08:09:10,720
And then we read it and decode it.
10938
08:09:10,720 --> 08:09:14,720
So these two things, hit this URL, decode it.
10939
08:09:14,720 --> 08:09:17,720
And then we retrieved 1669 characters
10940
08:09:17,720 --> 08:09:19,720
because, in this case,
10941
08:09:19,720 --> 08:09:22,720
because we've decoded it, data is a string now.
10942
08:09:22,720 --> 08:09:26,720
What we read was bytes, and data is a string.
10943
08:09:26,720 --> 08:09:30,720
So we read that many characters, 1669 characters.
10944
08:09:30,720 --> 08:09:32,720
And then we're going to take this data
10945
08:09:32,720 --> 08:09:34,720
and we're going to parse it with JSON.
10946
08:09:34,720 --> 08:09:37,720
And we might get bad data here.
10947
08:09:37,720 --> 08:09:39,720
It might blow up, but it might work.
10948
08:09:39,720 --> 08:09:41,720
So in this case, it works.
10949
08:09:41,720 --> 08:09:43,720
We have an error check that basically says,
10950
08:09:43,720 --> 08:09:45,720
if we got a bad thing, we're going to blow up.
10951
08:09:45,720 --> 08:09:47,720
But in this case, it doesn't blow up.
10952
08:09:47,720 --> 08:09:49,720
And so now we're going to sort of dig through.
10953
08:09:49,720 --> 08:09:55,720
And if you go back, let me just go back.
10954
08:09:55,720 --> 08:09:58,720
So the results sub-zero geometry.
10955
08:09:58,720 --> 08:10:00,720
Let's show you how that works.
10956
08:10:00,720 --> 08:10:03,720
So results is the first key.
10957
08:10:03,720 --> 08:10:06,720
So this is a dictionary with a key of results.
10958
08:10:06,720 --> 08:10:08,720
But then it has a list.
10959
08:10:08,720 --> 08:10:12,720
And the zero item, this list starts here and goes there.
10960
08:10:12,720 --> 08:10:14,720
And I'm only going to show part of it,
10961
08:10:14,720 --> 08:10:16,720
but there are many things here.
10962
08:10:16,720 --> 08:10:18,720
So the zero item is this.
10963
08:10:18,720 --> 08:10:20,720
This is the sub-zero.
10964
08:10:20,720 --> 08:10:23,720
And then geometry within that sub-zero item.
10965
08:10:23,720 --> 08:10:29,720
So if we look at that, it is the outer dictionary,
10966
08:10:29,720 --> 08:10:32,720
the first item in the list, sub-geometry.
10967
08:10:32,720 --> 08:10:35,720
So that grabs one part.
10968
08:10:35,720 --> 08:10:40,720
That grabs this part right here.
10969
08:10:40,720 --> 08:10:43,720
And then we're going to go into location and lat.
10970
08:10:43,720 --> 08:10:45,720
And those are just keys within keys,
10971
08:10:45,720 --> 08:10:47,720
a dictionary within a dictionary.
10972
08:10:47,720 --> 08:10:50,720
And so you see it says sub-location, sub-lat.
10973
08:10:50,720 --> 08:10:53,720
And so that is literally going to pull out
10974
08:10:53,720 --> 08:10:55,720
of that complex structure.
10975
08:10:55,720 --> 08:10:57,720
That will pull the latitude out.
10976
08:10:57,720 --> 08:11:00,720
And then in the next line, pull the longitude out.
10977
08:11:00,720 --> 08:11:03,720
So we can pull the latitude and longitude out.
10978
08:11:03,720 --> 08:11:04,720
And then we print it out.
10979
08:11:04,720 --> 08:11:07,720
And we can go into results sub-zero formatted address.
10980
08:11:07,720 --> 08:11:13,720
And that goes into results zero formatted address.
10981
08:11:13,720 --> 08:11:14,720
And that pulls this little bit out.
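Here is a small sketch of that digging, using a hand-made dictionary in place of the parsed reply. The shape follows the walk-through above; the coordinate values are made-up sample numbers.

```python
# A cut-down stand-in for the parsed JSON: an outer dictionary, a 'results'
# list, and dictionaries within dictionaries inside the first list item.
js = {
    'results': [
        {
            'formatted_address': 'Ann Arbor, MI, USA',
            'geometry': {'location': {'lat': 42.28, 'lng': -83.74}},
        }
    ],
    'status': 'OK',
}

# results -> sub-zero -> geometry -> location -> lat (and lng)
lat = js['results'][0]['geometry']['location']['lat']
lng = js['results'][0]['geometry']['location']['lng']
print('lat', lat, 'lng', lng)

# results -> sub-zero -> formatted_address
location = js['results'][0]['formatted_address']
print(location)
```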
10982
08:11:14,720 --> 08:11:16,720
Now it takes a little while to write this stuff.
10983
08:11:16,720 --> 08:11:18,720
And you have to put in a lot of debug prints.
10984
08:11:18,720 --> 08:11:20,720
And you don't necessarily figure out
10985
08:11:20,720 --> 08:11:22,720
this complex bit here at the end.
10986
08:11:22,720 --> 08:11:24,720
But you print it.
10987
08:11:24,720 --> 08:11:25,720
You don't get what you want.
10988
08:11:25,720 --> 08:11:26,720
You say, oh, wait a sec.
10989
08:11:26,720 --> 08:11:27,720
That was an array.
10990
08:11:27,720 --> 08:11:29,720
So I've got to add a little sub-zero there
10991
08:11:29,720 --> 08:11:30,720
to get the first one out of the array.
10992
08:11:30,720 --> 08:11:32,720
But eventually you figure it out.
10993
08:11:32,720 --> 08:11:34,720
And it's not all that difficult.
10994
08:11:34,720 --> 08:11:36,720
It's the first time, first few times you do it.
10995
08:11:36,720 --> 08:11:37,720
I'm like, what am I doing?
10996
08:11:37,720 --> 08:11:39,720
But after a while, you realize, oh, I'm just
10997
08:11:39,720 --> 08:11:42,720
sort of tearing this apart and digging deeper and deeper
10998
08:11:42,720 --> 08:11:45,720
into this data structure, which I just retrieved
10999
08:11:45,720 --> 08:11:47,720
over the internet from Google.
11000
08:11:47,720 --> 08:11:51,720
And I learned something good from that.
11001
08:11:51,720 --> 08:11:54,720
So up next, we're going to talk about how sometimes these APIs
11002
08:11:54,720 --> 08:11:58,720
protect themselves with keys or signatures
11003
08:11:58,720 --> 08:12:01,720
and why that happens and how to solve those problems.
11004
08:12:07,720 --> 08:12:09,720
We are doing some code samples here.
11005
08:12:09,720 --> 08:12:12,720
If you want to follow along, you can download the sample code.
11006
08:12:12,720 --> 08:12:14,720
It's all in a big zip file.
11007
08:12:14,720 --> 08:12:15,720
I've got it.
11008
08:12:15,720 --> 08:12:19,720
We are going to be working with the Google Maps API.
11009
08:12:19,720 --> 08:12:22,720
In the old days, this Maps API was free
11010
08:12:22,720 --> 08:12:26,720
and allowed 2,500 requests per day.
11011
08:12:26,720 --> 08:12:28,720
But now they've made it so that parts of it
11012
08:12:28,720 --> 08:12:31,720
are behind API keys, and you start
11013
08:12:31,720 --> 08:12:33,720
having to use OAuth and stuff.
11014
08:12:33,720 --> 08:12:36,720
But they haven't put it all behind keys. There's this one address service
11015
08:12:36,720 --> 08:12:37,720
that we've been using.
11016
08:12:37,720 --> 08:12:39,720
That continues to work.
11017
08:12:39,720 --> 08:12:41,720
And the basic idea of an API is
11018
08:12:41,720 --> 08:12:42,720
you go read the documentation.
11019
08:12:42,720 --> 08:12:45,720
You find a URL.
11020
08:12:45,720 --> 08:12:47,720
And this is going to Google servers.
11021
08:12:47,720 --> 08:12:49,720
And you pass in the address.
11022
08:12:49,720 --> 08:12:51,720
And we have to pass in the address
11023
08:12:51,720 --> 08:12:53,720
using what's called URL encoding.
11024
08:12:53,720 --> 08:12:55,720
So spaces are pluses.
11025
08:12:55,720 --> 08:12:56,720
That's a comma.
11026
08:12:56,720 --> 08:12:57,720
And then that's a space.
11027
08:12:57,720 --> 08:13:00,720
And so we have to pass this in a certain way.
11028
08:13:00,720 --> 08:13:02,720
But if we do it right, we hit this.
11029
08:13:02,720 --> 08:13:04,720
We're going to get ourselves some JSON back.
11030
08:13:04,720 --> 08:13:05,720
And that's really cool.
11031
08:13:05,720 --> 08:13:09,720
And so deep inside here, we get the real address, a good address.
11032
08:13:09,720 --> 08:13:12,720
We get a geometry.
11033
08:13:12,720 --> 08:13:14,720
We have the location.
11034
08:13:14,720 --> 08:13:15,720
We got the latitude and longitude.
11035
08:13:15,720 --> 08:13:17,720
And we can extract stuff out of here.
11036
08:13:17,720 --> 08:13:18,720
And so we're talking.
11037
08:13:18,720 --> 08:13:21,720
And this one here is still rate limited to 2,500.
11038
08:13:21,720 --> 08:13:23,720
But it's one of the few parts of the Google Maps API
11039
08:13:23,720 --> 08:13:26,720
that is not hidden behind an API key.
11040
08:13:26,720 --> 08:13:27,720
In a later chapter, we'll show you
11041
08:13:27,720 --> 08:13:32,720
how to actually talk with the API key in the geodata code.
11042
08:13:32,720 --> 08:13:36,720
The geoload.py code shows you how to use an API key
11043
08:13:36,720 --> 08:13:39,720
if you want to jump ahead and take a look at that.
11044
08:13:39,720 --> 08:13:42,720
But for now, we're just going to take a look at geojson.py,
11045
08:13:42,720 --> 08:13:44,720
which is going to retrieve one page and tear it apart.
11046
08:13:44,720 --> 08:13:46,720
So let's take a look.
11047
08:13:46,720 --> 08:13:50,720
So we're going to grab the urllib stuff and import json.
11048
08:13:50,720 --> 08:13:51,720
So now we're going to use JSON.
11049
08:13:51,720 --> 08:13:56,720
But we're going to actually pull the data out of the internet.
11050
08:13:56,720 --> 08:14:00,720
And so I just take that service URL for Google Maps API.
11051
08:14:00,720 --> 08:14:02,720
I found that somewhere in the documentation.
11052
08:14:02,720 --> 08:14:05,720
And then I'm going to have a loop that's going to run forever.
11053
08:14:05,720 --> 08:14:08,720
I'm going to ask for the location.
11054
08:14:08,720 --> 08:14:11,720
And then if I hit enter, that's what this is saying,
11055
08:14:11,720 --> 08:14:12,720
get out of the loop.
11056
08:14:12,720 --> 08:14:15,720
And then what I'm going to do is I'm going to concatenate
11057
08:14:15,720 --> 08:14:18,720
the service URL, which is this.
11058
08:14:18,720 --> 08:14:24,720
And this urllib.parse.urlencode is given a dictionary of address equals.
11059
08:14:24,720 --> 08:14:29,720
And this bit right here gives me the string
11060
08:14:29,720 --> 08:14:32,720
that leads to putting this address equals
11061
08:14:32,720 --> 08:14:34,720
but then encoding these spaces the right way.
11062
08:14:34,720 --> 08:14:38,720
So if you type a space, that bit of code turns it into the plus.
11063
08:14:38,720 --> 08:14:39,720
So that's important.
11064
08:14:39,720 --> 08:14:42,720
And I've got the question mark sitting here at the end of that.
11065
08:14:42,720 --> 08:14:45,720
Then what we're going to do is we're just going to do a urlopen
11066
08:14:45,720 --> 08:14:46,720
to get a handle.
11067
08:14:46,720 --> 08:14:48,720
We're going to read the whole document.
11068
08:14:48,720 --> 08:14:51,720
And because it's UTF-8 coming from the outside world
11069
08:14:51,720 --> 08:14:54,720
and we want it turned into Unicode inside our application,
11070
08:14:54,720 --> 08:14:56,720
we say .decode().
11071
08:14:56,720 --> 08:14:58,720
We can ask how many characters we got.
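To make the bytes-versus-string point concrete, here is a tiny sketch that skips the network. The byte string below is an invented stand-in for what reading the handle would return.

```python
# Stand-in for the raw bytes that come back from the outside world (UTF-8).
raw = 'Café in Ann Arbor'.encode('utf-8')

# decode() defaults to UTF-8 and gives us a regular Python string.
data = raw.decode()

print(type(raw).__name__, len(raw))    # length in bytes
print(type(data).__name__, len(data))  # length in characters
```

The two lengths differ here because the accented letter takes two bytes in UTF-8 but counts as one character in the string.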
11072
08:14:58,720 --> 08:15:00,720
And we call json.loads.
11073
08:15:00,720 --> 08:15:02,720
Now up till now we've been just doing loads
11074
08:15:02,720 --> 08:15:03,720
from internal strings.
11075
08:15:03,720 --> 08:15:07,720
But this is now a string that came from the outside world.
11076
08:15:07,720 --> 08:15:11,720
And we'll put a try/except in.
11077
08:15:11,720 --> 08:15:14,720
And we'll set js to be None and that'll be our little trigger.
11078
08:15:14,720 --> 08:15:18,720
Now we can look for, they give us, if we take a look at the output,
11079
08:15:18,720 --> 08:15:20,720
they give us this status of OK.
11080
08:15:20,720 --> 08:15:23,720
And that status can be a problem and it can complain about things.
11081
08:15:23,720 --> 08:15:26,720
So we have to check to see if we got a good status.
11082
08:15:26,720 --> 08:15:30,720
So at this point, if you look at the outer bit of this,
11083
08:15:30,720 --> 08:15:35,720
the outer bit that we get is a curly brace, so it's a dictionary.
11084
08:15:35,720 --> 08:15:40,720
Then there is, within that dictionary, a key results, which is a list.
11085
08:15:40,720 --> 08:15:43,720
But then the second thing in the outer dictionary is status.
11086
08:15:43,720 --> 08:15:54,720
And so we can ask if the word, if we got a false,
11087
08:15:54,720 --> 08:15:56,720
if we got nothing, that will quit.
11088
08:15:56,720 --> 08:16:02,720
If we don't have a status key in that object, or that dictionary,
11089
08:16:02,720 --> 08:16:06,720
or it's not equal to OK, any number of those things,
11090
08:16:06,720 --> 08:16:14,720
if this, or this, or this, any of those is true, we're going to quit.
11091
08:16:14,720 --> 08:16:16,720
We print failure to retrieve and print the data out.
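A minimal sketch of that parse-and-check step follows. The function name parse_reply is hypothetical, chosen just for this illustration, and the extra check that the top level is a dictionary is an added safeguard that the walk-through does not mention.

```python
import json

def parse_reply(data):
    """Return the parsed dictionary, or None if the reply is unusable."""
    try:
        js = json.loads(data)
    except ValueError:          # json.JSONDecodeError is a kind of ValueError
        js = None               # None is our little trigger

    # Added safeguard (an assumption): treat a non-dictionary reply as a failure.
    if not isinstance(js, dict):
        js = None

    # Nothing parsed, or no status key, or status is not OK: any of those quits.
    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)
        return None
    return js

print(parse_reply('{"status": "OK", "results": []}'))
print(parse_reply('this is not JSON'))
```

Printing the raw data on failure is the debugging habit described above: when something quits, you can see what actually came back.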
11092
08:16:16,720 --> 08:16:18,720
And when you're starting to read stuff all over the net,
11093
08:16:18,720 --> 08:16:20,720
you often have to put debugging in here like this,
11094
08:16:20,720 --> 08:16:22,720
like oh, something quit, I've got to figure out.
11095
08:16:22,720 --> 08:16:24,720
And so debugging it.
11096
08:16:24,720 --> 08:16:27,720
Next thing we're going to do is call json.dumps,
11097
08:16:27,720 --> 08:16:33,720
which is the opposite of loads, which takes this dictionary that includes arrays,
11098
08:16:33,720 --> 08:16:36,720
and we're going to pretty print it with an indent of four.
11099
08:16:36,720 --> 08:16:38,720
And then we're going to print that out.
11100
08:16:38,720 --> 08:16:41,720
And so if you look at my code, we'll see that the first thing we do,
11101
08:16:41,720 --> 08:16:44,720
once we've parsed it, is we print it back out so we can see it.
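As a quick sketch of that pretty-printing step, with a small hand-made dictionary standing in for the parsed reply:

```python
import json

# Stand-in for the dictionary (with a list inside) that json.loads gave us.
js = {'status': 'OK', 'results': [{'formatted_address': 'Ann Arbor, MI, USA'}]}

# dumps is the opposite of loads: dictionary in, string out.
# indent=4 spreads the structure over several lines, indented four spaces per level.
text = json.dumps(js, indent=4)
print(text)
```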
11102
08:16:44,720 --> 08:16:46,720
And then we're going to dig into it.
11103
08:16:46,720 --> 08:16:48,720
So let's go ahead and run this code.
11104
08:16:48,720 --> 08:16:55,720
python geojson.py.
11105
08:16:55,720 --> 08:16:58,720
One of these days, I will always type python3.
11106
08:16:58,720 --> 08:17:01,720
Ann Arbor, Michigan.
11107
08:17:01,720 --> 08:17:03,720
Okay, so it ran.
11108
08:17:03,720 --> 08:17:05,720
And so you see that it retrieved this URL.
11109
08:17:05,720 --> 08:17:10,720
This URL was constructed and retrieved 1736 characters.
11110
08:17:10,720 --> 08:17:13,720
And it's JSON pretty printed with an indent of four.
11111
08:17:13,720 --> 08:17:17,720
And this is that json.dumps all the way down to here.
11112
08:17:17,720 --> 08:17:19,720
So that's just json.dumps.
11113
08:17:19,720 --> 08:17:21,720
And then it starts extracting.
11114
08:17:21,720 --> 08:17:23,720
So it's going to pull things out.
11115
08:17:23,720 --> 08:17:26,720
Now, when you write this code, it's really easy to look at this and say,
11116
08:17:26,720 --> 08:17:27,720
oh, great, it's easy.
11117
08:17:27,720 --> 08:17:30,720
I tend to have to print this stuff out over and over and over
11118
08:17:30,720 --> 08:17:32,720
as I kind of construct this expression.
11119
08:17:32,720 --> 08:17:36,720
But if we look at it, the outer dictionary,
11120
08:17:36,720 --> 08:17:41,720
the outer dictionary sub results leads to this array.
11121
08:17:41,720 --> 08:17:43,720
And if you go look at this array carefully,
11122
08:17:43,720 --> 08:17:46,720
you find there is only one thing in it.
11123
08:17:46,720 --> 08:17:48,720
And so the results is an array.
11124
08:17:48,720 --> 08:17:53,720
Sub zero gets us this dictionary.
11125
08:17:53,720 --> 08:17:56,720
I keep wanting to say object because that's what it's called.
11126
08:17:56,720 --> 08:17:58,720
And that goes all the way down to here.
11127
08:17:58,720 --> 08:18:00,720
So that's what we get there.
11128
08:18:00,720 --> 08:18:03,720
And then within that, we now have an object.
11129
08:18:03,720 --> 08:18:08,720
And we look for geometry within that object.
11130
08:18:08,720 --> 08:18:10,720
Where is geometry?
11131
08:18:10,720 --> 08:18:11,720
Right there.
11132
08:18:11,720 --> 08:18:13,720
Geometry.
11133
08:18:13,720 --> 08:18:17,720
Geometry goes from there to there.
11134
08:18:17,720 --> 08:18:19,720
There's geometry in there.
11135
08:18:19,720 --> 08:18:20,720
You've got to get used to it.
11136
08:18:20,720 --> 08:18:22,720
That's why it's nice to have this stuff indented.
11137
08:18:22,720 --> 08:18:25,720
Geometry sub lo...
11138
08:18:25,720 --> 08:18:26,720
Oops, come back.
11139
08:18:26,720 --> 08:18:27,720
Come back.
11140
08:18:27,720 --> 08:18:30,720
And then we go to location within that.
11141
08:18:30,720 --> 08:18:32,720
So location within geometry.
11142
08:18:32,720 --> 08:18:35,720
And then within location, we have lat and lng.
11143
08:18:35,720 --> 08:18:41,720
And so this is pulling out this 42 and 83.
11144
08:18:41,720 --> 08:18:43,720
And then so we print that out.
11145
08:18:43,720 --> 08:18:45,720
Take a look.
11146
08:18:45,720 --> 08:18:46,720
And that prints that out.
11147
08:18:46,720 --> 08:18:48,720
Pulls that right out of the JSON.
11148
08:18:48,720 --> 08:18:50,720
These are tricky to write, but after a while you win
11149
08:18:50,720 --> 08:18:53,720
and you get it right and it's just fine.
11150
08:18:53,720 --> 08:18:54,720
Okay.
11151
08:18:54,720 --> 08:18:56,720
And so we do the same thing.
11152
08:18:56,720 --> 08:19:00,720
Results sub-zero formatted address gets us this.
11153
08:19:00,720 --> 08:19:03,720
And so that's how we print the location out.
11154
08:19:03,720 --> 08:19:08,720
And so that's a real quick look at how we would do that
11155
08:19:08,720 --> 08:19:16,720
with the JSON talking to the Google Maps API.
11156
08:19:16,720 --> 08:19:18,720
Okay. Hope this helps.
11157
08:19:21,720 --> 08:19:24,720
Now we're going to talk about API rate limiting and security.
11158
08:19:24,720 --> 08:19:27,720
The key thing is that the Google API
11159
08:19:27,720 --> 08:19:29,720
and the Google data is super valuable.
11160
08:19:29,720 --> 08:19:31,720
And you could build a website that did nothing
11161
08:19:31,720 --> 08:19:34,720
but sort of like asked the person for something
11162
08:19:34,720 --> 08:19:37,720
and then showed them that place and make them be a map searcher.
11163
08:19:37,720 --> 08:19:41,720
And you added so little value and Google did all the hard work.
11164
08:19:41,720 --> 08:19:43,720
And so they protect these somewhat.
11165
08:19:43,720 --> 08:19:46,720
Sometimes they'll say you can only do 50 of these a day
11166
08:19:46,720 --> 08:19:48,720
or 500 a day or whatever.
11167
08:19:48,720 --> 08:19:49,720
That's called rate limiting.
11168
08:19:49,720 --> 08:19:51,720
And sometimes they say you've got to log in.
11169
08:19:51,720 --> 08:19:54,720
You've got to create an account and get a key with us
11170
08:19:54,720 --> 08:19:55,720
and then present your key.
11171
08:19:55,720 --> 08:19:58,720
So that means that your account only gets so many.
11172
08:19:58,720 --> 08:20:00,720
And they keep track of who's using their service
11173
08:20:00,720 --> 08:20:02,720
and how much they're using it.
11174
08:20:02,720 --> 08:20:04,720
Google gives you even sort of a dashboard
11175
08:20:04,720 --> 08:20:05,720
that tells you some of this stuff.
11176
08:20:05,720 --> 08:20:07,720
It's kind of nice.
11177
08:20:07,720 --> 08:20:12,720
And so the other thing is that sometimes an API is free
11178
08:20:12,720 --> 08:20:14,720
and then it becomes popular and they decide
11179
08:20:14,720 --> 08:20:17,720
they're going to put a key on it or a rate limit on it.
11180
08:20:17,720 --> 08:20:19,720
So you've got to kind of play this game with them
11181
08:20:19,720 --> 08:20:23,720
and the rules kind of change as things progress.
11182
08:20:23,720 --> 08:20:26,720
So that geocoding API that we're talking about
11183
08:20:26,720 --> 08:20:31,720
had at one point in time a limit of 2,500 requests a day.
11184
08:20:31,720 --> 08:20:34,720
You can get more requests if you get a key.
11185
08:20:34,720 --> 08:20:38,720
Now another API that we can talk about is the Twitter API.
11186
08:20:38,720 --> 08:20:41,720
Now the Twitter API started out as a free public API
11187
08:20:41,720 --> 08:20:44,720
but then Twitter realized that people were making more money
11188
08:20:44,720 --> 08:20:47,720
off of Twitter's data than Twitter was making off of Twitter's data.
11189
08:20:47,720 --> 08:20:51,720
And so Twitter makes it so that you have to have an account.
11190
08:20:51,720 --> 08:20:54,720
You can only request data from their API
11191
08:20:54,720 --> 08:20:57,720
if you use your account key to sign that.
11192
08:20:57,720 --> 08:21:01,720
And so there's a whole series of getting and issuing keys
11193
08:21:01,720 --> 08:21:03,720
and then using those keys.
11194
08:21:03,720 --> 08:21:07,720
And I'll just give you a short summary of the kind of code
11195
08:21:07,720 --> 08:21:13,720
that it takes to build those requests up that have to be signed.
11196
08:21:13,720 --> 08:21:16,720
So you'll look through the Twitter documentation
11197
08:21:16,720 --> 08:21:21,720
and it'll say, oh, this URL to get the tweets, et cetera, et cetera.
11198
08:21:21,720 --> 08:21:24,720
And it says do a get request to this URL and that URL
11199
08:21:24,720 --> 08:21:26,720
and maybe substitute a little bit of things here
11200
08:21:26,720 --> 08:21:28,720
for the screen name you're looking for
11201
08:21:28,720 --> 08:21:30,720
or how many tweets you want.
11202
08:21:30,720 --> 08:21:35,720
And they tell you how to carefully construct these URLs.
11203
08:21:35,720 --> 08:21:39,720
And so here's an example bit of code that talks to Twitter.
11204
08:21:39,720 --> 08:21:42,720
For now, I'll ignore the security bit.
11205
08:21:42,720 --> 08:21:45,720
That's all hidden in this twurl.
11206
08:21:45,720 --> 08:21:46,720
So it looks a lot like the last one.
11207
08:21:46,720 --> 08:21:48,720
We're going to use json and urllib.
11208
08:21:48,720 --> 08:21:52,720
And we have found that this is the API name, blah, blah, blah, blah, blah,
11209
08:21:52,720 --> 08:21:56,720
list.json, getting a friend list for a particular person.
11210
08:21:56,720 --> 08:22:01,720
And so that is the base URL that we're going to do.
11211
08:22:01,720 --> 08:22:04,720
And we're going to ask for a Twitter account.
11212
08:22:04,720 --> 08:22:06,720
If we hit enter, we're going to break out.
11213
08:22:06,720 --> 08:22:08,720
And twurl.augment, we're going to say,
11214
08:22:08,720 --> 08:22:12,720
give me the first five friends of this particular screen name,
11215
08:22:12,720 --> 08:22:14,720
the one we just read in from input.
11216
08:22:14,720 --> 08:22:16,720
And this twurl you'll see in a second,
11217
08:22:16,720 --> 08:22:20,720
it adds a bunch of stuff to prove that you are who you are.
11218
08:22:20,720 --> 08:22:21,720
It's signing that URL.
11219
08:22:21,720 --> 08:22:23,720
So you're sending a signed URL,
11220
08:22:23,720 --> 08:22:26,720
which is nothing more than a whole bunch of crazy characters.
11221
08:22:26,720 --> 08:22:27,720
We'll see that in a second.
11222
08:22:27,720 --> 08:22:28,720
We retrieve it.
11223
08:22:28,720 --> 08:22:30,720
And this is pretty straightforward.
11224
08:22:30,720 --> 08:22:34,720
We can just open the URL, read it, and decode it.
11225
08:22:34,720 --> 08:22:37,720
Decode solves the UTF-8 thing.
11226
08:22:37,720 --> 08:22:39,720
Makes it all so that data is a real string
11227
08:22:39,720 --> 08:22:41,720
and it's in Unicode internally.
11228
08:22:41,720 --> 08:22:43,720
Now we can actually get the headers.
11229
08:22:43,720 --> 08:22:48,720
Remember I told you earlier that urlopen bypasses the headers,
11230
08:22:48,720 --> 08:22:49,720
but it stored them for later.
11231
08:22:49,720 --> 08:22:51,720
And we can say, hey, give me back those headers.
11232
08:22:51,720 --> 08:22:54,720
And that gives us back a dictionary of headers.
11233
08:22:54,720 --> 08:22:56,720
And the headers, if you go all the way back,
11234
08:22:56,720 --> 08:22:59,720
are a bunch of key value pairs.
11235
08:22:59,720 --> 08:23:02,720
Key colon value in the headers.
11236
08:23:02,720 --> 08:23:04,720
And in Twitter, if you read the documentation,
11237
08:23:04,720 --> 08:23:06,720
there's this x-rate-limit-remaining header
11238
08:23:06,720 --> 08:23:09,720
that tells you, each time it returns a
11239
08:23:09,720 --> 08:23:11,720
response to the API call that you made,
11240
08:23:11,720 --> 08:23:13,720
it says, look, you've got 12 left.
11241
08:23:13,720 --> 08:23:14,720
You've got 11 left.
11242
08:23:14,720 --> 08:23:15,720
You've got 10.
11243
08:23:15,720 --> 08:23:16,720
So you can print that out.
11244
08:23:16,720 --> 08:23:19,720
So this prints out how many you've got left.
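Here is a small sketch of reading that header once the headers are in a dictionary. The header values below are invented sample data, and the helper name remaining_calls is hypothetical. With a real response from urlopen, a dictionary like this could presumably be built with dict(response.getheaders()).

```python
def remaining_calls(headers):
    """Look up the rate-limit header; returns None if the server did not send it."""
    return headers.get('x-rate-limit-remaining')

# Invented sample headers: key/value pairs, as described above.
headers = {
    'content-type': 'application/json;charset=utf-8',
    'x-rate-limit-remaining': '14',
}

print('Remaining', remaining_calls(headers))
```

Using get rather than square brackets means a missing header gives None instead of a traceback.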
11245
08:23:19,720 --> 08:23:21,720
Then we parse the JSON data.
11246
08:23:21,720 --> 08:23:24,720
We're going to print it so we can debug it.
11247
08:23:24,720 --> 08:23:27,720
This is dumps, dump to string, and then print it.
11248
08:23:27,720 --> 08:23:29,720
Indent equals four.
11249
08:23:29,720 --> 08:23:31,720
This is called pretty printing.
11250
08:23:31,720 --> 08:23:34,720
And it's indenting things really nicely
11251
08:23:34,720 --> 08:23:36,720
so that you can make more sense of it.
11252
08:23:36,720 --> 08:23:38,720
Whereas when these things are talking,
11253
08:23:38,720 --> 08:23:39,720
when programs are talking to each other,
11254
08:23:39,720 --> 08:23:44,720
they don't really make the output look particularly pretty.
11255
08:23:44,720 --> 08:23:48,720
And then if you, we're going to go through,
11256
08:23:48,720 --> 08:23:50,720
we have the outer thing of users.
11257
08:23:50,720 --> 08:23:52,720
And we're going to print out the screen name
11258
08:23:52,720 --> 08:23:55,720
and go grab the, for each user in users,
11259
08:23:55,720 --> 08:23:56,720
we're going to print their screen name.
11260
08:23:56,720 --> 08:23:59,720
We're going to grab their status text and print that out.
11261
08:23:59,720 --> 08:24:02,720
And so this is what that data looks like.
11262
08:24:02,720 --> 08:24:04,720
Kind of chopped a bit.
11263
08:24:04,720 --> 08:24:08,720
So the thing we get is an outer layer.
11264
08:24:08,720 --> 08:24:11,720
We get users and then we get a list.
11265
08:24:11,720 --> 08:24:13,720
And here's the first user.
11266
08:24:13,720 --> 08:24:14,720
Now, if you look at the actual data,
11267
08:24:14,720 --> 08:24:15,720
it's much larger than this.
11268
08:24:15,720 --> 08:24:18,720
Here's the second user and then we have status text,
11269
08:24:18,720 --> 08:24:22,720
status text and the screen name.
11270
08:24:22,720 --> 08:24:26,720
And so those are the bits that we're extracting from that.
11271
08:24:26,720 --> 08:24:28,720
If you look, we're going to grab the screen name.
11272
08:24:28,720 --> 08:24:31,720
We're going to grab the status text and away you go.
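A short sketch of that loop, run against a hand-made dictionary rather than a live API reply. The key names follow the walk-through above (users, screen name, status, text); the user data itself is invented, and the check for a missing status is an added precaution.

```python
# Invented stand-in for the parsed reply: an outer dictionary with a list of users.
js = {
    'users': [
        {'screen_name': 'alice', 'status': {'text': 'Hello from Ann Arbor'}},
        {'screen_name': 'bob'},  # a user with no status
    ]
}

lines = []
for u in js['users']:
    lines.append(u['screen_name'])
    if 'status' not in u:
        lines.append('   * No status found')
        continue
    # Dictionary within a dictionary; keep only the first 50 characters.
    lines.append('  ' + u['status']['text'][:50])

print('\n'.join(lines))
```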
11273
08:24:31,720 --> 08:24:37,720
So you can start with this,
11274
08:24:37,720 --> 08:24:39,720
but you realize that once you're looking at this
11275
08:24:39,720 --> 08:24:41,720
and you're printing this out with pretty printing,
11276
08:24:41,720 --> 08:24:43,720
you can sort of work your way in
11277
08:24:43,720 --> 08:24:46,720
knowing that it's either a dictionary or a list.
11278
08:24:46,720 --> 08:24:47,720
If it's a dictionary, you look up the key.
11279
08:24:47,720 --> 08:24:50,720
If it's a list, you say which position it is
11280
08:24:50,720 --> 08:24:52,720
and then you get more dictionaries within dictionaries
11281
08:24:52,720 --> 08:24:54,720
within dictionaries and away you go.
11282
08:24:54,720 --> 08:24:58,720
And so this code actually, when it runs,
11283
08:24:58,720 --> 08:25:01,720
it prints out the screen name and then that status
11284
08:25:01,720 --> 08:25:02,720
and the next person.
11285
08:25:02,720 --> 08:25:04,720
So it's my first five, in that case,
11286
08:25:04,720 --> 08:25:07,720
my first five friends and their most recent status,
11287
08:25:07,720 --> 08:25:09,720
the first five people.
11288
08:25:09,720 --> 08:25:12,720
Now, let's talk a little bit about how this security works.
11289
08:25:12,720 --> 08:25:15,720
And so you have to go to the website.
11290
08:25:15,720 --> 08:25:16,720
You have to have a Twitter account.
11291
08:25:16,720 --> 08:25:19,720
You can't talk to the Twitter API without a Twitter account.
11292
08:25:19,720 --> 08:25:23,720
And then you go to this website and then you set up a key.
11293
08:25:23,720 --> 08:25:26,720
You say, I'm going to build an application
11294
08:25:26,720 --> 08:25:28,720
that is going to consume the Twitter API.
11295
08:25:28,720 --> 08:25:30,720
And then you go in, you have to work through.
11296
08:25:30,720 --> 08:25:33,720
There's documentation on how all this stuff works.
11297
08:25:33,720 --> 08:25:35,720
You set up an API key.
11298
08:25:35,720 --> 08:25:36,720
You set the application.
11299
08:25:36,720 --> 08:25:39,720
So I made a key called Python on my laptop.
11300
08:25:39,720 --> 08:25:42,720
And it gives us some values.
11301
08:25:42,720 --> 08:25:44,720
It gives us a consumer key, a consumer secret,
11302
08:25:44,720 --> 08:25:46,720
a token key, and a token secret.
11303
08:25:46,720 --> 08:25:48,720
And you get to regenerate these.
11304
08:25:48,720 --> 08:25:51,720
And there's this file called hidden.py.
11305
08:25:51,720 --> 08:25:54,720
And you edit it and copy and paste all the stuff
11306
08:25:54,720 --> 08:25:58,720
from those pages, those four values, into these strings.
11307
08:25:58,720 --> 08:26:02,720
Now, if you download my code, I don't have my keys in there.
11308
08:26:02,720 --> 08:26:04,720
I got some placeholders for this stuff.
11309
08:26:04,720 --> 08:26:07,720
So you've got to get to this web page that's on Twitter,
11310
08:26:07,720 --> 08:26:09,720
copy these things in,
11311
08:26:09,720 --> 08:26:13,720
and then the twurl code will start to work.
11312
08:26:13,720 --> 08:26:16,720
It uses a technology called OAuth,
11313
08:26:16,720 --> 08:26:19,720
which is a way to sign a URL
11314
08:26:19,720 --> 08:26:22,720
in a way that proves that you have the key and the secret
11315
08:26:22,720 --> 08:26:25,720
and the tokens.
11316
08:26:25,720 --> 08:26:27,720
And it can't be modified in the middle.
11317
08:26:27,720 --> 08:26:29,720
So once you send this URL,
11318
08:26:29,720 --> 08:26:31,720
they can check the key and the secret
11319
08:26:31,720 --> 08:26:32,720
to make sure that you truly signed it
11320
08:26:32,720 --> 08:26:34,720
without actually sending the key and the secret.
11321
08:26:34,720 --> 08:26:36,720
It's actually kind of cool and fascinating,
11322
08:26:36,720 --> 08:26:39,720
but we won't go into it in great detail here.
11323
08:26:39,720 --> 08:26:44,720
And so if you look at the code in twurl.py,
11324
08:26:44,720 --> 08:26:46,720
this is the code that does it.
11325
08:26:46,720 --> 08:26:51,720
It actually pulls in an OAuth library and that hidden.py.
11326
08:26:51,720 --> 08:26:54,720
That is that code that you've got.
11327
08:26:54,720 --> 08:26:58,720
And it's got the consumer key, the consumer secret.
11328
08:26:58,720 --> 08:26:59,720
Secrets.
11329
08:26:59,720 --> 08:27:03,720
This is pulling that from hidden.py.
11330
08:27:03,720 --> 08:27:06,720
This is a lot of stuff that's using this OAuth library.
11331
08:27:06,720 --> 08:27:08,720
Don't worry too much about that.
11332
08:27:08,720 --> 08:27:11,720
Eventually it produces a URL that looks like this.
11333
08:27:11,720 --> 08:27:13,720
And what happens is this was the base URL
11334
08:27:13,720 --> 08:27:14,720
you were told to use.
11335
08:27:14,720 --> 08:27:16,720
Then you have count equals two
11336
08:27:16,720 --> 08:27:18,720
and screen_name equals drchuck.
11337
08:27:18,720 --> 08:27:22,720
Those parts are your parameters to that web service call.
11338
08:27:22,720 --> 08:27:26,720
And then all this OAuth stuff is produced
11339
08:27:26,720 --> 08:27:28,720
by this OAuth code
11340
08:27:28,720 --> 08:27:30,720
and the consumer key and the secret.
11341
08:27:30,720 --> 08:27:33,720
What happens is the key gets sent,
11342
08:27:33,720 --> 08:27:37,720
the key gets sent and the secret does not get sent,
11343
08:27:37,720 --> 08:27:40,720
but they send the signature which is based on the secret
11344
08:27:40,720 --> 08:27:43,720
and then what it does is it rechecks the signature
11345
08:27:43,720 --> 08:27:44,720
on the far end.
11346
08:27:44,720 --> 08:27:48,720
The signature is a long string, and they check it by regenerating the signature,
11347
08:27:48,720 --> 08:27:51,720
because the secret is available to both you
11348
08:27:51,720 --> 08:27:54,720
to generate the signature and to them to check the signature.
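A rough sketch of the idea just described, using only Python's standard hmac and hashlib modules. This is not the full OAuth signing procedure and not what twurl.py does internally; the secret and the request string are made-up values, and HMAC-SHA1 is assumed here only for illustration.

```python
import hashlib
import hmac


def sign(message: str, secret: str) -> str:
    """Return a hex HMAC-SHA1 digest of message, keyed by secret."""
    return hmac.new(secret.encode(), message.encode(), hashlib.sha1).hexdigest()


# Hypothetical values, for illustration only.
shared_secret = "not-a-real-secret"
request = "count=2&screen_name=drchuck"

# The client signs the request and sends the signature, never the secret.
sent_signature = sign(request, shared_secret)

# The far end holds the same secret, regenerates the signature, and compares.
server_signature = sign(request, shared_secret)
signature_ok = hmac.compare_digest(sent_signature, server_signature)

# If the request is changed in transit, the regenerated signature no longer matches.
tampered = sign("count=200&screen_name=drchuck", shared_secret)
tampered_ok = hmac.compare_digest(sent_signature, tampered)

print(signature_ok, tampered_ok)
```

The point is only that both sides can compute the same string because both know the secret, so the secret itself never has to travel in the URL.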
11349
08:27:54,720 --> 08:27:57,720
So it's kind of like a hash, et cetera, et cetera.
11350
08:27:57,720 --> 08:27:59,720
You don't have to worry about all this.
11351
08:27:59,720 --> 08:28:01,720
These URLs get really long
11352
08:28:01,720 --> 08:28:03,720
and the values that you need are in there,
11353
08:28:03,720 --> 08:28:06,720
the name of the URL is in there, and you call this routine
11354
08:28:06,720 --> 08:28:09,720
called augment that takes a URL and then parameters
11355
08:28:09,720 --> 08:28:12,720
and then augments it by adding all this OAuth stuff.
11356
08:28:12,720 --> 08:28:16,720
And so that's why it's called augment to augment the URL.
11357
08:28:16,720 --> 08:28:18,720
And once you've got this set up and hidden.py working,
11358
08:28:18,720 --> 08:28:21,720
then you sort of just augment the URL and then hit it.
11359
08:28:21,720 --> 08:28:24,720
Now, you know, if you don't have the right keys or secrets
11360
08:28:24,720 --> 08:28:25,720
or you don't have an account on Twitter,
11361
08:28:25,720 --> 08:28:26,720
then it's going to blow up.
11362
08:28:26,720 --> 08:28:27,720
But if you get it set up,
11363
08:28:27,720 --> 08:28:31,720
you will be able to talk to the Twitter API with this.
11364
08:28:31,720 --> 08:28:33,720
So this whole web services section,
11365
08:28:33,720 --> 08:28:36,720
we've done quite a bit of stuff, right?
11366
08:28:36,720 --> 08:28:40,720
We've looked at how instead of reading HTML or flat text,
11367
08:28:40,720 --> 08:28:44,720
we are creating structured data according to contracts,
11368
08:28:44,720 --> 08:28:46,720
whether it be XML or JSON.
11369
08:28:46,720 --> 08:28:48,720
We can retrieve and parse that information
11370
08:28:48,720 --> 08:28:50,720
in a deterministic way.
11371
08:28:50,720 --> 08:28:53,720
We talked about schemas that define the contracts
11372
08:28:53,720 --> 08:28:56,720
so that you know if the data you're getting is wrong,
11373
08:28:56,720 --> 08:28:57,720
you could know who to blame
11374
08:28:57,720 --> 08:28:59,720
because the schema gets violated.
11375
08:28:59,720 --> 08:29:03,720
And we've played with APIs where you're talking to someone else
11376
08:29:03,720 --> 08:29:05,720
who's defining what the rules are
11377
08:29:05,720 --> 08:29:07,720
and how to read their documentation.
11378
08:29:07,720 --> 08:29:11,720
And even if they have an API key or need to sign URLs,
11379
08:29:11,720 --> 08:29:14,720
we showed you a little bit about how to do that.
11380
08:29:19,720 --> 08:29:21,720
We're doing some code, sample code,
11381
08:29:21,720 --> 08:29:24,720
playing through some code samples.
11382
08:29:24,720 --> 08:29:27,720
And you can get this by downloading it.
11383
08:29:27,720 --> 08:29:29,720
I've got this whole thing downloaded.
11384
08:29:29,720 --> 08:29:32,720
And I've got all the files here.
11385
08:29:32,720 --> 08:29:35,720
And these are the files we're going to play with today.
11386
08:29:35,720 --> 08:29:38,720
Today what we're going to do is talk to the Twitter API.
11387
08:29:38,720 --> 08:29:41,720
And the one thing we've got to learn about the Twitter API
11388
08:29:41,720 --> 08:29:43,720
is we have to authorize ourselves.
11389
08:29:43,720 --> 08:29:48,720
And so we have to make sure that we have a Twitter account
11390
08:29:48,720 --> 08:29:50,720
and then we get some keys.
11391
08:29:50,720 --> 08:29:52,720
And so in this particular application,
11392
08:29:52,720 --> 08:29:54,720
if you want to duplicate what I'm doing,
11393
08:29:54,720 --> 08:29:56,720
you have to go to apps.twitter.com,
11394
08:29:56,720 --> 08:29:58,720
click this create new application button,
11395
08:29:58,720 --> 08:30:00,720
and then get some codes.
11396
08:30:00,720 --> 08:30:03,720
And the codes show up as soon as you hit this button
11397
08:30:03,720 --> 08:30:04,720
and then one more button,
11398
08:30:04,720 --> 08:30:07,720
which I'm not going to do on screen.
11399
08:30:07,720 --> 08:30:10,720
And so what happens is there are four codes
11400
08:30:10,720 --> 08:30:13,720
that you've got to put in this file hidden.py.
11401
08:30:13,720 --> 08:30:15,720
The consumer key, the consumer secret, the token key,
11402
08:30:15,720 --> 08:30:16,720
and token secret.
11403
08:30:16,720 --> 08:30:18,720
These are just messed up,
11404
08:30:18,720 --> 08:30:20,720
so I'll show you how this works
11405
08:30:20,720 --> 08:30:22,720
and it blows up at first,
11406
08:30:22,720 --> 08:30:25,720
and then I'll put my keys in here without showing you.
11407
08:30:25,720 --> 08:30:28,720
But basically, this is a little file you've got to edit
11408
08:30:28,720 --> 08:30:30,720
or these Twitter ones don't work.
11409
08:30:30,720 --> 08:30:32,720
You'll see what happens.
11410
08:30:32,720 --> 08:30:35,720
So the first one I'm going to do is do the simplest one of all.
11411
08:30:35,720 --> 08:30:38,720
And that is, I call this thing twtest,
11412
08:30:38,720 --> 08:30:42,720
and it just is going to go ask for the user timeline.
11413
08:30:42,720 --> 08:30:43,720
And we can take a look at this.
11414
08:30:43,720 --> 08:30:46,720
And we're going to take the URL
11415
08:30:46,720 --> 08:30:48,720
and we're going to augment the URL.
11416
08:30:48,720 --> 08:30:49,720
This is the base.
11417
08:30:49,720 --> 08:30:52,720
We found this looking at the Twitter API documentation.
11418
08:30:52,720 --> 08:30:54,720
We're going to pass a parameter of screen_name,
11419
08:30:54,720 --> 08:30:56,720
drchuck, and a count of two.
11420
08:30:56,720 --> 08:30:58,720
So this is just a Python dictionary.
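As a small sketch of what "parameters in a dictionary" means, here is how a plain dictionary turns into the query string on the end of a URL. The base URL below is an assumption taken from the Twitter API documentation of that era, and this shows only the unsigned part; the augment routine adds the OAuth parameters on top of this.

```python
from urllib.parse import urlencode

# Assumed base URL for the user timeline call; check the API docs for the real one.
base_url = "https://api.twitter.com/1.1/statuses/user_timeline.json"

# The parameters are just an ordinary Python dictionary.
params = {"screen_name": "drchuck", "count": "2"}

# urlencode turns the dictionary into key=value pairs joined by "&".
plain_url = base_url + "?" + urlencode(params)
print(plain_url)
```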
11421
08:30:58,720 --> 08:31:03,720
And augment comes from this little bit of code called twurl.
11422
08:31:03,720 --> 08:31:07,720
And this uses a bit of code called oauth,
11423
08:31:07,720 --> 08:31:12,720
which is built into Python as well, right?
11424
08:31:12,720 --> 08:31:14,720
Yeah, that's built into Python as well.
11425
08:31:14,720 --> 08:31:18,720
And it augments the URL and it takes the key,
11426
08:31:18,720 --> 08:31:22,720
the secret, the token key, and does a thing and signs it
11427
08:31:22,720 --> 08:31:24,720
and then makes this big, long, ugly URL,
11428
08:31:24,720 --> 08:31:26,720
which you will soon see,
11429
08:31:26,720 --> 08:31:29,720
and it's a signature of the URL.
11430
08:31:29,720 --> 08:31:32,720
So we pass this data back and forth to Twitter
11431
08:31:32,720 --> 08:31:36,720
with a signature and then they recheck the signature
11432
08:31:36,720 --> 08:31:38,720
and it's a digital signature that knows that
11433
08:31:38,720 --> 08:31:41,720
this URL came from a program that knows the key,
11434
08:31:41,720 --> 08:31:43,720
secret, and token and token secret.
11435
08:31:43,720 --> 08:31:47,720
And so this augment basically is something
11436
08:31:47,720 --> 08:31:52,720
that I wrote, twurl, augment, is something I wrote
11437
08:31:52,720 --> 08:31:55,720
to make it easier to add all these oauth parameters.
11438
08:31:55,720 --> 08:32:00,720
And you feed this code by putting your data into hidden.py.
11439
08:32:00,720 --> 08:32:02,720
Lots of people get this to work, so don't worry.
11440
08:32:02,720 --> 08:32:06,720
It's kind of cool when you finally get it to work.
11441
08:32:06,720 --> 08:32:08,720
So let's take a look at what it does.
11442
08:32:08,720 --> 08:32:10,720
Just know that this makes an awesome URL
11443
08:32:10,720 --> 08:32:12,720
that does all the security.
11444
08:32:12,720 --> 08:32:15,720
And we'll see one of those URLs.
11445
08:32:15,720 --> 08:32:18,720
So ignore the certificate errors.
11446
08:32:18,720 --> 08:32:22,720
This has to do with the fact that we're using HTTPS
11447
08:32:22,720 --> 08:32:24,720
and Python doesn't have enough certificates
11448
08:32:24,720 --> 08:32:26,720
put into it by default for a lot of reasons,
11449
08:32:26,720 --> 08:32:29,720
but our quick and dirty way is to turn them off.
11450
08:32:29,720 --> 08:32:32,720
Thank you, Python, for reducing security by teaching us
11451
08:32:32,720 --> 08:32:34,720
so that this is the best way to do it.
11452
08:32:34,720 --> 08:32:37,720
That's a grumpy moment on my part.
11453
08:32:37,720 --> 08:32:40,720
So what we're going to do is we're going to do a urlopen.
11454
08:32:40,720 --> 08:32:43,720
This bit here is to shut off the security checking
11455
08:32:43,720 --> 08:32:45,720
for the SSL certificate.
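A minimal sketch of what "shutting off the security checking" usually looks like with Python's standard ssl module. The variable name ctx is an assumption, and this is the quick and dirty approach described here, not something to do in production code.

```python
import ssl

# Start from the normal, secure defaults...
ctx = ssl.create_default_context()

# ...then turn the checks off. check_hostname has to be disabled first,
# otherwise setting verify_mode to CERT_NONE raises a ValueError.
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# The context would then be passed along when opening the URL, for example:
#   connection = urllib.request.urlopen(url, context=ctx)
print(ctx.verify_mode)
```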
11456
08:32:45,720 --> 08:32:47,720
And then we're going to read all the data.
11457
08:32:47,720 --> 08:32:49,720
And then we're going to print it out.
11458
08:32:49,720 --> 08:32:53,720
And we're also going to ask the connection,
11459
08:32:53,720 --> 08:32:56,720
this URL, remember I told you a long time ago
11460
08:32:56,720 --> 08:33:00,720
that urllib eats the headers, but you can get them back.
11461
08:33:00,720 --> 08:33:02,720
And now we're going to ask to get a dictionary
11462
08:33:02,720 --> 08:33:03,720
of the headers back.
11463
08:33:03,720 --> 08:33:05,720
And so we'll print those out.
11464
08:33:05,720 --> 08:33:08,720
So this is really kind of just testing the body
11465
08:33:08,720 --> 08:33:10,720
and the headers and printing them out
11466
08:33:10,720 --> 08:33:12,720
sort of in as raw a way as we can.
11467
08:33:12,720 --> 08:33:14,720
So let's go run this.
11468
08:33:14,720 --> 08:33:17,720
Now, this is going to fail the first time we do it
11469
08:33:17,720 --> 08:33:21,720
because we haven't put the hidden variables in there.
11470
08:33:21,720 --> 08:33:27,720
So if I say python3 twtest.py, it's going to run and blow up.
11471
08:33:27,720 --> 08:33:31,720
And it's going to give you this 401 authorization required.
11472
08:33:31,720 --> 08:33:34,720
That's a good sign because that means that you haven't yet
11473
08:33:34,720 --> 08:33:37,720
updated your values in hidden.py.
11474
08:33:37,720 --> 08:33:42,720
And so this is that augmented URL.
11475
08:33:42,720 --> 08:33:45,720
And you can see the consumer key and the consumer secret
11476
08:33:45,720 --> 08:33:48,720
and the OAuth token and whatever.
11477
08:33:48,720 --> 08:33:51,720
Okay, so these tokens are like wrong.
11478
08:33:51,720 --> 08:33:54,720
These aren't, oops, control C.
11479
08:33:54,720 --> 08:33:56,720
They aren't real.
11480
08:33:56,720 --> 08:33:58,720
But you'll notice it doesn't have the two secrets,
11481
08:33:58,720 --> 08:34:01,720
the token secret and the consumer secret.
11482
08:34:01,720 --> 08:34:03,720
And that's all actually encoded in this signature.
11483
08:34:03,720 --> 08:34:07,720
It turns out that you need to have the key and the token,
11484
08:34:07,720 --> 08:34:11,720
I mean the secret and the token secret to generate the signature.
11485
08:34:11,720 --> 08:34:13,720
And where is the signature?
11486
08:34:13,720 --> 08:34:15,720
Oh, there's the signature, right?
11487
08:34:15,720 --> 08:34:16,720
There's the signature.
11488
08:34:16,720 --> 08:34:20,720
And so this signature combined with the nonce,
11489
08:34:20,720 --> 08:34:22,720
you can only do, this signature has a time
11490
08:34:22,720 --> 08:34:24,720
and includes all kinds of things.
11491
08:34:24,720 --> 08:34:28,720
So even if you type this in, well, you'll see these go by.
11492
08:34:28,720 --> 08:34:31,720
And it's not really breaking my security too much
11493
08:34:31,720 --> 08:34:32,720
when you see these afterwards.
11494
08:34:32,720 --> 08:34:34,720
So don't get all excited when you say,
11495
08:34:34,720 --> 08:34:37,720
oh, you revealed your token and your key.
11496
08:34:37,720 --> 08:34:39,720
Well, I can reveal my token and key,
11497
08:34:39,720 --> 08:34:41,720
but I'm not gonna reveal the secret.
11498
08:34:41,720 --> 08:34:45,720
Okay, so this adds all this OAuth stuff, OAuth nonce,
11499
08:34:45,720 --> 08:34:47,720
OAuth timestamp.
11500
08:34:47,720 --> 08:34:49,720
And these timestamps and nonces make it
11501
08:34:49,720 --> 08:34:53,720
so that you can't replay my URL even if you see the exact URL.
11502
08:34:53,720 --> 08:34:55,720
Once I hit it, then you can't hit it again.
11503
08:34:55,720 --> 08:34:57,720
And so that's what the nonce does.
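Here is a toy sketch of how a server could use a timestamp and a nonce to refuse a replayed URL. This is an illustration of the idea only, not Twitter's actual logic; the five-minute window, the function name accept, and the in-memory set are all assumptions.

```python
import time

MAX_AGE_SECONDS = 300  # assumed window; a real service picks its own
seen_nonces = set()    # nonces the server has already accepted


def accept(nonce, timestamp, now=None):
    """Return True only for a fresh request with a nonce not seen before."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False  # too old (or too far in the future)
    if nonce in seen_nonces:
        return False  # same nonce again means a replay
    seen_nonces.add(nonce)
    return True


now = 1_000_000.0
first = accept("abc123", now, now=now)         # first use of this nonce
replay = accept("abc123", now, now=now)        # exact same URL hit again
stale = accept("xyz789", now - 600, now=now)   # new nonce, but ten minutes old
print(first, replay, stale)
```

Because the nonce and timestamp are covered by the signature, someone who copies the URL cannot simply swap in fresh values without knowing the secrets.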
11504
08:34:57,720 --> 08:35:00,720
So I'm gonna close hidden.py here.
11505
08:35:00,720 --> 08:35:05,720
And I'm going to update hidden.py in another window.
11506
08:35:20,720 --> 08:35:24,720
Okay, so I just, in another window, I updated hidden.py.
11507
08:35:24,720 --> 08:35:26,720
I'm not gonna show you that.
11508
08:35:26,720 --> 08:35:29,720
But now I'm gonna run python3 twtest.py.
11509
08:35:29,720 --> 08:35:32,720
So twurl is going to read hidden.py.
11510
08:35:32,720 --> 08:35:34,720
And now these keys and secrets are my real ones
11511
08:35:34,720 --> 08:35:36,720
that I haven't shown you.
11512
08:35:36,720 --> 08:35:37,720
So this should work.
11513
08:35:37,720 --> 08:35:38,720
Fingers crossed.
11514
08:35:40,720 --> 08:35:41,720
Yay, it worked.
11515
08:35:41,720 --> 08:35:44,720
Okay, so it worked.
11516
08:35:44,720 --> 08:35:46,720
So I'm calling Twitter.
11517
08:35:46,720 --> 08:35:47,720
Here's the URL.
11518
08:35:47,720 --> 08:35:51,720
Now don't worry, the token and the consumer key
11519
08:35:51,720 --> 08:35:53,720
are not enough to break into my account.
11520
08:35:53,720 --> 08:35:56,720
And neither is the signature because you can't replay this.
11521
08:35:56,720 --> 08:36:00,720
In about five minutes, you can't replay this anymore, okay?
11522
08:36:00,720 --> 08:36:04,720
So you can't generate the signature.
11523
08:36:04,720 --> 08:36:06,720
I've done one.
11524
08:36:06,720 --> 08:36:09,720
The signature includes the time and date.
11525
08:36:09,720 --> 08:36:14,720
So you can't, trust me, go read up on OAuth.
11526
08:36:14,720 --> 08:36:15,720
Don't worry.
11527
08:36:15,720 --> 08:36:16,720
I haven't really revealed anything.
11528
08:36:16,720 --> 08:36:18,720
But, so the first thing we see is this.
11529
08:36:18,720 --> 08:36:21,720
So we see, and we should put like the line of dashes here.
11530
08:36:21,720 --> 08:36:23,720
This is the JSON.
11531
08:36:23,720 --> 08:36:24,720
It ain't very pretty.
11532
08:36:24,720 --> 08:36:25,720
It's not very pretty.
11533
08:36:25,720 --> 08:36:28,720
Okay, and so that's the JSON from there to there.
11534
08:36:28,720 --> 08:36:30,720
It's just what most APIs give us back.
11535
08:36:30,720 --> 08:36:32,720
It's really dense JSON, right?
11536
08:36:32,720 --> 08:36:34,720
And so this is a byte array.
11537
08:36:34,720 --> 08:36:37,720
Remember how you have to do a .decode?
11538
08:36:37,720 --> 08:36:39,720
I didn't do a .decode here.
11539
08:36:39,720 --> 08:36:41,720
And so Python is telling us
11540
08:36:41,720 --> 08:36:44,720
this is a byte array, which is a raw set of bytes
11541
08:36:44,720 --> 08:36:49,720
that came from the internet, which probably are UTF-8.
11542
08:36:49,720 --> 08:36:52,720
And if I put a decode here, then it would decode,
11543
08:36:52,720 --> 08:36:56,720
if I say data.decode() there, then it would be fine.
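A tiny sketch of the difference between the raw bytes and the decoded string. The sample bytes are made up; in the real program they would come from reading the connection.

```python
# Hypothetical raw bytes, as they might arrive from the network (UTF-8 encoded).
raw = b'[{"text": "caf\xc3\xa9"}]'

# Printing bytes shows the b'...' prefix and escaped non-ASCII bytes.
print(raw)

# decode() defaults to UTF-8 and gives back a regular Python string.
text = raw.decode()
print(text)
```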
11544
08:36:56,720 --> 08:36:57,720
But we don't care.
11545
08:36:57,720 --> 08:36:58,720
This was just a dump.
11546
08:36:58,720 --> 08:36:59,720
Do we get anything?
11547
08:36:59,720 --> 08:37:03,720
And so then, here, let's do this.
11548
08:37:03,720 --> 08:37:04,720
Print.
11549
08:37:04,720 --> 08:37:07,720
I'll just make this code different.
11550
08:37:07,720 --> 08:37:10,720
Put some equal signs here, a lot of equal signs.
11551
08:37:10,720 --> 08:37:16,720
So we can easily see where the thing starts and stops.
11552
08:37:16,720 --> 08:37:17,720
So we'll run that again.
11553
08:37:17,720 --> 08:37:19,720
If you look at those URLs.
11554
08:37:19,720 --> 08:37:21,720
So that was all of that stuff.
11555
08:37:21,720 --> 08:37:26,720
And then, this is the headers.
11556
08:37:26,720 --> 08:37:28,720
And so the headers, again, are not pretty.
11557
08:37:28,720 --> 08:37:30,720
If you get the headers, it's a dictionary.
11558
08:37:30,720 --> 08:37:33,720
You got cache control, no cache, comma.
11559
08:37:33,720 --> 08:37:36,720
This is the string, key value.
11560
08:37:36,720 --> 08:37:38,720
You got to find your commas key value.
11561
08:37:38,720 --> 08:37:43,720
But the one that's really interesting here is,
11562
08:37:43,720 --> 08:37:45,720
which one is it?
11563
08:37:45,720 --> 08:37:47,720
x-rate-limit-remaining, right there.
11564
08:37:47,720 --> 08:37:49,720
x-rate-limit-remaining.
11565
08:37:49,720 --> 08:37:52,720
So that means that for this particular API,
11566
08:37:52,720 --> 08:37:57,720
and this header tells me that I've got 898 calls left.
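A short sketch of pulling that number out of the headers once they are in a dictionary. The sample dictionary is invented for illustration, the helper name calls_remaining is hypothetical, and the lowercase key is an assumption about how the server spells the header.

```python
def calls_remaining(headers):
    """Return the remaining call count as an int, or None if the header is absent."""
    value = headers.get("x-rate-limit-remaining")
    return int(value) if value is not None else None


# Hypothetical headers, shaped like a dictionary built from the response.
sample_headers = {
    "cache-control": "no-cache",
    "x-rate-limit-remaining": "898",
}

remaining = calls_remaining(sample_headers)
print("Remaining", remaining)
```

Header values arrive as strings, so converting to int makes it possible to compare the count against a threshold before making more calls.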
11567
08:37:57,720 --> 08:38:00,720
And this is when I will get more calls,
11568
08:38:00,720 --> 08:38:05,720
and yeah, so let's see, yeah.
11569
08:38:05,720 --> 08:38:06,720
So watch.
11570
08:38:06,720 --> 08:38:07,720
I'm going to do this again,
11571
08:38:07,720 --> 08:38:11,720
and you will see that I can only do this 897 more times now.
11572
08:38:11,720 --> 08:38:13,720
Do, do, do, run it.
11573
08:38:13,720 --> 08:38:15,720
I can only do this 897.
11574
08:38:15,720 --> 08:38:18,720
So I am being tracked at this point.
11575
08:38:18,720 --> 08:38:20,720
I am being tracked by Twitter.
11576
08:38:20,720 --> 08:38:23,720
Twitter knows that it's Dr. Chuck that's doing this,
11577
08:38:23,720 --> 08:38:26,720
and Dr. Chuck has done 900.
11578
08:38:26,720 --> 08:38:28,720
He's done 899, 897.
11579
08:38:28,720 --> 08:38:30,720
And if I keep running this,
11580
08:38:30,720 --> 08:38:32,720
eventually Twitter will tell me,
11581
08:38:32,720 --> 08:38:34,720
you got to wait for a while.
11582
08:38:34,720 --> 08:38:36,720
And that's because Twitter doesn't want me,
11583
08:38:36,720 --> 08:38:37,720
under my Dr. Chuck account,
11584
08:38:37,720 --> 08:38:40,720
pulling out like lots and lots of stuff out of Twitter
11585
08:38:40,720 --> 08:38:43,720
and making my own website.
11586
08:38:43,720 --> 08:38:45,720
I do actually have my own Twitter website,
11587
08:38:45,720 --> 08:38:47,720
using some cool software.
11588
08:38:47,720 --> 08:38:58,720
www.drchuck.com slash Twitter.
11589
08:38:58,720 --> 08:38:59,720
And this I have to run,
11590
08:38:59,720 --> 08:39:02,720
and it rate limits and causes all kinds of, you know, whatever.
11591
08:39:02,720 --> 08:39:08,720
So, okay, so Twitter rate limit.
11592
08:39:08,720 --> 08:39:11,720
So, I'll save that.
11593
08:39:11,720 --> 08:39:12,720
So that's twtest.
11594
08:39:12,720 --> 08:39:14,720
This is just to test it, okay?
11595
08:39:14,720 --> 08:39:16,720
Because we're doing, I want to do something interesting.
11596
08:39:16,720 --> 08:39:19,720
So we're not parsing the JSON that comes back.
11597
08:39:19,720 --> 08:39:21,720
We're not doing anything tricky with this.
11598
08:39:21,720 --> 08:39:23,720
And away we go.
11599
08:39:23,720 --> 08:39:28,720
So, let's take a look at some more code.
11600
08:39:28,720 --> 08:39:30,720
I think I don't need this anymore.
11601
08:39:30,720 --> 08:39:34,720
So now, I am going to parse this.
11602
08:39:34,720 --> 08:39:36,720
So most of this looks the same.
11603
08:39:36,720 --> 08:39:38,720
I've got that same user timeline JSON.
11604
08:39:38,720 --> 08:39:40,720
I'm going to ignore the SSL certificates.
11605
08:39:40,720 --> 08:39:41,720
I'm going to write a loop.
11606
08:39:41,720 --> 08:39:43,720
So I'm going to ask the Twitter,
11607
08:39:43,720 --> 08:39:48,720
I'm going to print,
11608
08:39:48,720 --> 08:39:50,720
I'm going to get a Twitter account and quit
11609
08:39:50,720 --> 08:39:52,720
if it's a blank line, if I just hit Enter.
11610
08:39:52,720 --> 08:39:55,720
I'm going to use the Twitter URL augment the same way.
11611
08:39:55,720 --> 08:39:58,720
That's going to do all the signing using the values from hidden.py.
11612
08:39:58,720 --> 08:39:59,720
I retrieve it.
11613
08:39:59,720 --> 08:40:02,720
And I'm going to retrieve it, ignoring the SSL errors.
11614
08:40:02,720 --> 08:40:03,720
And then I'm going to decode it.
11615
08:40:03,720 --> 08:40:04,720
This time I'm going to decode it
11616
08:40:04,720 --> 08:40:06,720
so that I get a real Unicode string.
11617
08:40:06,720 --> 08:40:08,720
And I'm going to print the first 250 characters of it.
11618
08:40:08,720 --> 08:40:10,720
I'm going to grab the headers.
11619
08:40:10,720 --> 08:40:15,720
And I'm going to print the remaining, the rate limit.
11620
08:40:15,720 --> 08:40:20,720
So this is sort of a very simple version of this same thing.
11621
08:40:20,720 --> 08:40:22,720
It really is decoding the data
11622
08:40:22,720 --> 08:40:24,720
and only printing the first 250 characters.
11623
08:40:24,720 --> 08:40:36,720
So let's run that.
11624
08:40:36,720 --> 08:40:40,720
Dr. Chuck, boom, and it's got 896.
11625
08:40:40,720 --> 08:40:43,720
So that's just a little simpler version of that
11626
08:40:43,720 --> 08:40:45,720
with a little less brutal debugging.
11627
08:40:45,720 --> 08:40:47,720
Okay, so now let's do something even more fun.
11628
08:40:47,720 --> 08:40:51,720
Let's go to twitter2.py and tear it apart.
11629
08:40:51,720 --> 08:40:55,720
And so again, we're going to look at my friends list
11630
08:40:55,720 --> 08:40:58,720
or someone else, anybody's friends list.
11631
08:40:58,720 --> 08:41:00,720
We're going to ask for the friends
11632
08:41:00,720 --> 08:41:02,720
and ask for the screen name,
11633
08:41:02,720 --> 08:41:04,720
ask for the first five friends,
11634
08:41:04,720 --> 08:41:07,720
and then look at their statuses,
11635
08:41:07,720 --> 08:41:09,720
open it, decode it, get the headers,
11636
08:41:09,720 --> 08:41:10,720
print the rate limit remaining.
11637
08:41:10,720 --> 08:41:13,720
All this stuff is the same as in twitter1.
11638
08:41:13,720 --> 08:41:16,720
But now we're going to parse the JSON.
11639
08:41:16,720 --> 08:41:18,720
I'm not even putting this in a try and except
11640
08:41:18,720 --> 08:41:20,720
because, hey, I'm talking to Twitter.
11641
08:41:20,720 --> 08:41:24,720
I'm going to guess that Twitter's going to give me the right stuff.
11642
08:41:24,720 --> 08:41:26,720
You'll probably want to put a try and except here.
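A minimal sketch of what putting the parse in a try and except could look like. The helper name parse_json is hypothetical, and returning None on failure is just one possible choice.

```python
import json


def parse_json(data):
    """Return the parsed object, or None if data is not valid JSON."""
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        return None


good = parse_json('{"users": []}')
bad = parse_json("<html>Service unavailable</html>")
print(good, bad)
```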
11643
08:41:26,720 --> 08:41:28,720
Then I'm going to do a debug print.
11644
08:41:28,720 --> 08:41:31,720
I'm going to do a JSON pretty print.
11645
08:41:31,720 --> 08:41:33,720
Let's make that be 2 so it looks a little better.
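A quick sketch of the pretty print with an indent of 2, using the standard json module. The sample data is invented and much smaller than what the API returns.

```python
import json

# Hypothetical, tiny stand-in for the JSON the API returns.
data = '{"users": [{"screen_name": "example_user", "status": {"text": "Hello"}}]}'

js = json.loads(data)

# indent=2 puts each nested level on its own line, two spaces deeper.
pretty = json.dumps(js, indent=2)
print(pretty)
```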
11646
08:41:33,720 --> 08:41:36,720
And then, well, I'm going to run it
11647
08:41:36,720 --> 08:41:39,720
and then you're going to see how we have to parse this
11648
08:41:39,720 --> 08:41:41,720
and we're going to see that it's a list.
11649
08:41:41,720 --> 08:41:43,720
So we're done with that.
11650
08:41:43,720 --> 08:41:46,720
And now we're running twitter2.py.
11651
08:41:46,720 --> 08:41:48,720
So I'm going to go to Dr. Chuck
11652
08:41:48,720 --> 08:41:51,720
and this is going to ask the question
11653
08:41:51,720 --> 08:41:53,720
who Dr. Chuck's friends are.
11654
08:41:53,720 --> 08:41:56,720
Okay, let's go to the top.
11655
08:41:56,720 --> 08:42:01,720
So it hit this API and it has the screen name
11656
08:42:01,720 --> 08:42:04,720
Dr. Chuck count equals 5 and all this OAuth stuff.
11657
08:42:04,720 --> 08:42:08,720
Again, this is not a security breach by showing you all of this
11658
08:42:08,720 --> 08:42:11,720
because the signature, the secrets aren't there.
11659
08:42:11,720 --> 08:42:18,720
Okay, so if we look at it, it's an outer object or dictionary
11660
08:42:18,720 --> 08:42:23,720
and then the outer has a users which is a list.
11661
08:42:23,720 --> 08:42:25,720
And then each user has some stuff in it.
11662
08:42:25,720 --> 08:42:27,720
So this one's Stephanie Teasley.
11663
08:42:27,720 --> 08:42:29,720
It's got her screen name.
11664
08:42:29,720 --> 08:42:31,720
It's got some descriptions.
11665
08:42:31,720 --> 08:42:32,720
Keep on going.
11666
08:42:32,720 --> 08:42:35,720
It's got her status, her latest status.
11667
08:42:35,720 --> 08:42:37,720
For my friend, her status.
11668
08:42:37,720 --> 08:42:40,720
Her source, where she's at.
11669
08:42:40,720 --> 08:42:43,720
I don't know, man, she's got a lot of stuff here.
11670
08:42:43,720 --> 08:42:44,720
Okay, there we go.
11671
08:42:44,720 --> 08:42:46,720
That was the first one.
11672
08:42:46,720 --> 08:42:50,720
And then the next one that I'm following is live EDU, etc.
11673
08:42:50,720 --> 08:42:52,720
So you'll see that this is an array.
11674
08:42:52,720 --> 08:42:55,720
So that outer thing is an array of users.
11675
08:42:55,720 --> 08:43:00,720
Now, js here is a dictionary.
11676
08:43:00,720 --> 08:43:03,720
So I can say for u in js['users'].
11677
08:43:03,720 --> 08:43:05,720
Well, js['users'] is a list.
11678
08:43:05,720 --> 08:43:08,720
So the first u is gonna be this Stephanie Teasley u
11679
08:43:08,720 --> 08:43:10,720
and the second u is gonna be live EDU.
11680
08:43:10,720 --> 08:43:14,720
So that's all it took to get through all that stuff
11681
08:43:14,720 --> 08:43:16,720
and figure that out.
11682
08:43:16,720 --> 08:43:21,720
And then I'm gonna say, get me the screen name of my person.
11683
08:43:21,720 --> 08:43:22,720
So let's go in here.
11684
08:43:22,720 --> 08:43:27,720
So that's gonna pull Stephanie Teasley out.
11685
08:43:27,720 --> 08:43:31,720
Then I'm gonna go find her status.
11686
08:43:31,720 --> 08:43:35,720
Let's find her somewhere in here.
11687
08:43:35,720 --> 08:43:39,720
u['status']['text'].
11688
08:43:39,720 --> 08:43:40,720
Come on.
11689
08:43:40,720 --> 08:43:41,720
Okay, there's status.
11690
08:43:41,720 --> 08:43:45,720
Status is all this stuff.
11691
08:43:45,720 --> 08:43:48,720
More, more, more, more, more, more, more.
11692
08:43:48,720 --> 08:43:50,720
Right there, that's status.
11693
08:43:50,720 --> 08:43:52,720
u['status'] is that.
11694
08:43:52,720 --> 08:43:56,720
And then u['status']['text'] is this stuff.
11695
08:43:56,720 --> 08:44:00,720
So it's gonna extract this bit right here.
11696
08:44:00,720 --> 08:44:03,720
And so u['status']['text'].
11697
08:44:03,720 --> 08:44:05,720
And I print out the first 50 characters
11698
08:44:05,720 --> 08:44:07,720
of the screen name status.
11699
08:44:07,720 --> 08:44:10,720
And I do that for the first five
11700
08:44:10,720 --> 08:44:15,720
because I told it I only wanted five.
11701
08:44:15,720 --> 08:44:17,720
And then of course I get to see the rate limit.
11702
08:44:17,720 --> 08:44:18,720
So let's go down to the bottom.
11703
08:44:18,720 --> 08:44:21,720
So all of this is the debug print
11704
08:44:21,720 --> 08:44:23,720
of the JSON I got back.
11705
08:44:23,720 --> 08:44:25,720
Here is the program starting to print.
11706
08:44:25,720 --> 08:44:27,720
Here is the screen name of my first friend.
11707
08:44:27,720 --> 08:44:29,720
And here's the first 50 characters
11708
08:44:29,720 --> 08:44:31,720
of her most recent status.
11709
08:44:31,720 --> 08:44:34,720
Here is the screen name of my,
11710
08:44:34,720 --> 08:44:37,720
and these are in reverse order of who I've been following.
11711
08:44:37,720 --> 08:44:39,720
So I've been playing with this live coding stuff.
11712
08:44:39,720 --> 08:44:42,720
So I'm following them.
11713
08:44:42,720 --> 08:44:44,720
What?
11714
08:44:44,720 --> 08:44:51,720
KeyError: 'status', that didn't work.
11715
08:44:51,720 --> 08:44:56,720
Why not?
11716
08:44:56,720 --> 08:44:59,720
Oh, that's because live coding TV
11717
08:44:59,720 --> 08:45:01,720
somehow doesn't have a status.
11718
08:45:01,720 --> 08:45:03,720
So most of these work,
11719
08:45:03,720 --> 08:45:05,720
so now you'll get to see me fix something.
11720
08:45:05,720 --> 08:45:08,720
And when you download it, it'll be fixed.
11721
08:45:08,720 --> 08:45:09,720
And so it says KeyError: 'status'.
11722
08:45:09,720 --> 08:45:13,720
So that means that I've got to do a thing that says,
11723
08:45:13,720 --> 08:45:34,720
if 'status' not in u, print 'No status found'.
11724
08:45:34,720 --> 08:45:36,720
Continue.
11725
08:45:36,720 --> 08:45:38,720
Since sometimes there's no statuses.
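Here is a sketch of the loop with that fix in place, run against a small invented sample rather than real API output. The screen names and status text are placeholders, and collecting the output in a list is only so the result is easy to inspect.

```python
import json

# Hypothetical sample shaped like the friends list: a dictionary with a
# "users" list, where the second user has no "status" key at all.
data = """{"users": [
  {"screen_name": "first_friend", "status": {"text": "Just posted something"}},
  {"screen_name": "second_friend"},
  {"screen_name": "third_friend", "status": {"text": "Another update"}}
]}"""

js = json.loads(data)

output = []
for u in js["users"]:
    output.append(u["screen_name"])
    if "status" not in u:
        # Without this check, u["status"] raises KeyError for this user.
        output.append("   * No status found")
        continue
    s = u["status"]["text"]
    output.append("  " + s[:50])  # first 50 characters of the status

print("\n".join(output))
```

The continue skips straight to the next user, so one account without a status no longer stops the whole loop.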
11726
08:45:38,720 --> 08:45:39,720
Who would have thought?
11727
08:45:39,720 --> 08:45:42,720
I did not know that.
11728
08:45:42,720 --> 08:45:48,720
Yeah, so u.
11729
08:45:48,720 --> 08:45:51,720
Okay, so let's run this again.
11730
08:45:51,720 --> 08:45:55,720
Did I get to see my remaining?
11731
08:45:55,720 --> 08:45:58,720
Actually, let me change the order of this.
11732
08:45:58,720 --> 08:46:01,720
Let me put this down here.
11733
08:46:01,720 --> 08:46:03,720
That'll be wrong from the slides,
11734
08:46:03,720 --> 08:46:08,720
but it'll be prettier now.
11735
08:46:08,720 --> 08:46:14,720
Let's put the headers after the dump of the data.
11736
08:46:14,720 --> 08:46:17,720
Okay, so let's run it again.
11737
08:46:17,720 --> 08:46:18,720
Did I save it?
11738
08:46:18,720 --> 08:46:19,720
Yeah.
11739
08:46:19,720 --> 08:46:21,720
Dr. Chuck.
11740
08:46:21,720 --> 08:46:22,720
Blah.
11741
08:46:22,720 --> 08:46:23,720
Whole bunch of stuff.
11742
08:46:23,720 --> 08:46:25,720
So I got 13 remaining calls on this one.
11743
08:46:25,720 --> 08:46:27,720
So it's not the same as the other one.
11744
08:46:27,720 --> 08:46:29,720
I don't get to call this too many more times,
11745
08:46:29,720 --> 08:46:32,720
so hopefully I'll get the debugging to work.
11746
08:46:32,720 --> 08:46:35,720
Sort of.
11747
08:46:35,720 --> 08:46:37,720
I got a bad space here.
11748
08:46:37,720 --> 08:46:39,720
No, not status found.
11749
08:46:39,720 --> 08:46:40,720
No status found.
11750
08:46:40,720 --> 08:46:43,720
And I need to put three spaces there.
11751
08:46:43,720 --> 08:46:44,720
No status found.
11752
08:46:44,720 --> 08:46:45,720
I'll make an asterisk.
11753
08:46:45,720 --> 08:46:48,720
So let's run it again.
11754
08:46:48,720 --> 08:46:50,720
See, I got 13 remaining.
11755
08:46:50,720 --> 08:46:53,720
So it's important you write code that's aware of your remaining.
11756
08:46:53,720 --> 08:46:56,720
That's why I made that so obvious.
11757
08:46:56,720 --> 08:46:57,720
I'll retrieve all that.
11758
08:46:57,720 --> 08:47:00,720
I got 12 remaining, but my code starts to look.
11759
08:47:00,720 --> 08:47:01,720
Dang it.
11760
08:47:01,720 --> 08:47:03,720
I now have another space here.
11761
08:47:03,720 --> 08:47:04,720
Hang on.
11762
08:47:04,720 --> 08:47:06,720
Got to fix that.
11763
08:47:06,720 --> 08:47:08,720
I need yet another space.
11764
08:47:08,720 --> 08:47:23,720
Hopefully, I can make this as pretty as I want it to work.
11765
08:47:23,720 --> 08:47:25,720
Oh, wait a sec.
11766
08:47:25,720 --> 08:47:26,720
I didn't even do Dr. Chuck.
11767
08:47:26,720 --> 08:47:28,720
I did that wrong.
11768
08:47:28,720 --> 08:47:29,720
Typed my name wrong.
11769
08:47:29,720 --> 08:47:30,720
OK.
11770
08:47:30,720 --> 08:47:33,720
So now it works.
11771
08:47:33,720 --> 08:47:34,720
Oh, well.
11772
08:47:34,720 --> 08:47:40,720
So now I have my five most recent friends, and they are this.
11773
08:47:40,720 --> 08:47:42,720
Steph Teasley, live edu official,
11774
08:47:42,720 --> 08:47:47,720
LivecodingTV, Nancy Gilby, and Greg E. Kruger.
11775
08:47:47,720 --> 08:47:49,720
And so there are their statuses.
11776
08:47:49,720 --> 08:47:55,720
And I tore all this JSON apart using twitter2.py.
11777
08:47:55,720 --> 08:48:01,720
Of course, after fixing hidden.py, which I'm not going to show you,
11778
08:48:01,720 --> 08:48:05,720
because it actually contains my real consumer key and consumer secret,
11779
08:48:05,720 --> 08:48:10,720
you're seeing the consumer key and the token key go by on each of these URLs.
11780
08:48:10,720 --> 08:48:12,720
But what you're not seeing is these two things,
11781
08:48:12,720 --> 08:48:16,720
which are the things I'm protecting, so that it's not a problem.
11782
08:48:16,720 --> 08:48:17,720
OK.
11783
08:48:17,720 --> 08:48:19,720
So I will send that up.
11784
08:48:19,720 --> 08:48:21,720
But there you go.
11785
08:48:21,720 --> 08:48:22,720
Welcome.
11786
08:48:22,720 --> 08:48:25,720
I hope you found this useful.
11787
08:48:25,720 --> 08:48:27,720
The code will be fixed when you take a look at it
11788
08:48:27,720 --> 08:48:31,720
and download it here from samplecode.zip.
11789
08:48:34,720 --> 08:48:36,720
Hello, and welcome to Python Objects.
11790
08:48:36,720 --> 08:48:39,720
I'm Charles Severance, and we're well on our way
11791
08:48:39,720 --> 08:48:43,720
to getting through all this material in Python.
11792
08:48:43,720 --> 08:48:45,720
So this lecture is in a weird place.
11793
08:48:45,720 --> 08:48:48,720
I even debated where to put it in the book.
11794
08:48:48,720 --> 08:48:52,720
I don't really want to teach you how to write a lot of object-oriented programming,
11795
08:48:52,720 --> 08:48:55,720
but we're going to start using objects.
11796
08:48:55,720 --> 08:48:58,720
And I want to be able to use the terminology.
11797
08:48:58,720 --> 08:49:01,720
And so as much as anything, this lecture is about terminology
11798
08:49:01,720 --> 08:49:04,720
and understanding the words, things like methods
11799
08:49:04,720 --> 08:49:08,720
and method signatures and variables and inheritance.
11800
08:49:08,720 --> 08:49:10,720
And so think of this as a terminology lecture
11801
08:49:10,720 --> 08:49:14,720
rather than a learn-how-to-program or learn-how-to-use-this lecture.
11802
08:49:14,720 --> 08:49:17,720
It's not something you're going to figure out right away.
11803
08:49:17,720 --> 08:49:19,720
And there'll come a time when you as a programmer
11804
08:49:19,720 --> 08:49:21,720
really want to start using object-oriented programming.
11805
08:49:21,720 --> 08:49:24,720
It's really a powerful and wonderful technique.
11806
08:49:24,720 --> 08:49:27,720
But I think it's too early as a beginning programmer
11807
08:49:27,720 --> 08:49:30,720
to really say, oh, let's write a bunch of objects.
11808
08:49:30,720 --> 08:49:34,720
So just relax and enjoy and learn this material
11809
08:49:34,720 --> 08:49:38,720
and think of it as sort of a theoretical thing
11810
08:49:38,720 --> 08:49:43,720
rather than a how-to-program thing.
11811
08:49:43,720 --> 08:49:47,720
And so part of this is we're going to start reading data structures
11812
08:49:47,720 --> 08:49:52,720
and data on how to use all these libraries, etc.
11813
08:49:52,720 --> 08:49:54,720
And we're going to see the word objects, right?
11814
08:49:54,720 --> 08:49:56,720
And then we're going to start hearing them.
11815
08:49:56,720 --> 08:49:58,720
And I want you to be able to read the Python documentation
11816
08:49:58,720 --> 08:50:00,720
so that you understand what's going on.
11817
08:50:00,720 --> 08:50:04,720
And so the word objects should make sense to you
11818
08:50:04,720 --> 08:50:06,720
even though you're not going to write a lot of objects
11819
08:50:06,720 --> 08:50:07,720
or any programming.
11820
08:50:07,720 --> 08:50:11,720
And so page upon page upon page, database stuff,
11821
08:50:11,720 --> 08:50:13,720
which we're going to talk about soon,
11822
08:50:13,720 --> 08:50:15,720
uses objects all over the place.
11823
08:50:15,720 --> 08:50:18,720
And BeautifulSoup uses objects.
11824
08:50:18,720 --> 08:50:21,720
We've kind of been using them, and I've been waving my hands
11825
08:50:21,720 --> 08:50:23,720
and I use the word method without defining it.
11826
08:50:23,720 --> 08:50:28,720
But now it's really time to define it and go to it.
11827
08:50:28,720 --> 08:50:33,720
So I want to review from the very beginning
11828
08:50:33,720 --> 08:50:35,720
what we think of as a program.
11829
08:50:35,720 --> 08:50:38,720
So the classic program, my favorite little minimum program,
11830
08:50:38,720 --> 08:50:41,720
is our little elevator floor converter,
11831
08:50:41,720 --> 08:50:43,720
which converts from European elevator floors
11832
08:50:43,720 --> 08:50:45,720
to United States elevator floors.
11833
08:50:45,720 --> 08:50:50,720
And the key to this is that it's input, processing, and output.
11834
08:50:50,720 --> 08:50:54,720
And this is a good way to model any program.
11835
08:50:54,720 --> 08:50:57,720
And in that process, we've got variables,
11836
08:50:57,720 --> 08:51:00,720
and we've got logic, we've got algorithms,
11837
08:51:00,720 --> 08:51:02,720
we've got loops that we write, we've got all kinds of things.
11838
08:51:02,720 --> 08:51:09,720
And we construct a series of steps to achieve some goal.
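The input, processing, and output pattern described here can be sketched roughly like this. The function name and the hard-coded input are assumptions for illustration, not the code from the slide:

```python
# A minimal sketch of input -> processing -> output.
# europe_to_us_floor is a hypothetical name chosen for this example.
def europe_to_us_floor(europe_floor):
    # Processing: European floor 0 (ground) is US floor 1, so add one.
    return int(europe_floor) + 1

europe_floor = '0'                       # Input: stands in for what a user would type
us_floor = europe_to_us_floor(europe_floor)
print('US floor number is', us_floor)    # Output
```

In a real program the input line would probably call input() instead of using a fixed string.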
11839
08:51:09,720 --> 08:51:11,720
In object-oriented, and frankly,
11840
08:51:11,720 --> 08:51:13,720
you've been using object-oriented all along,
11841
08:51:13,720 --> 08:51:16,720
the program has lots of objects.
11842
08:51:16,720 --> 08:51:18,720
And we're sort of putting stuff into these objects,
11843
08:51:18,720 --> 08:51:21,720
taking stuff out of one object and putting it into another object.
11844
08:51:21,720 --> 08:51:23,720
And you've actually been doing this all along.
11845
08:51:23,720 --> 08:51:25,720
As soon as you're looking at dictionaries and lists,
11846
08:51:25,720 --> 08:51:27,720
you're doing objects.
11847
08:51:27,720 --> 08:51:31,720
And so an object is quite a little thing.
11848
08:51:31,720 --> 08:51:34,720
It's sort of its own little space inside of a program
11849
08:51:34,720 --> 08:51:38,720
that contains code and data.
11850
08:51:38,720 --> 08:51:40,720
And so we're working together.
11851
08:51:40,720 --> 08:51:42,720
All these objects are now working together.
11852
08:51:42,720 --> 08:51:44,720
It's a bit of self-contained code and data.
11853
08:51:44,720 --> 08:51:48,720
And it is one way to take a very complex problem
11854
08:51:48,720 --> 08:51:52,720
and make it easier by breaking it into separate things
11855
08:51:52,720 --> 08:51:54,720
that can be engineered and developed separately.
11856
08:51:54,720 --> 08:51:56,720
So you'd be using string objects,
11857
08:51:56,720 --> 08:51:58,720
or maybe you'd use BeautifulSoup or something.
11858
08:51:58,720 --> 08:52:00,720
These are powerful capabilities,
11859
08:52:00,720 --> 08:52:02,720
and if you had to look at all of them,
11860
08:52:02,720 --> 08:52:04,720
it's just, hey, here's a thing, use this object,
11861
08:52:04,720 --> 08:52:06,720
it'll do these things for you,
11862
08:52:06,720 --> 08:52:08,720
and there's lots of details inside of it.
11863
08:52:08,720 --> 08:52:10,720
Just don't look at it, don't worry about it.
11864
08:52:10,720 --> 08:52:12,720
And so there's boundaries, the things that you can use,
11865
08:52:12,720 --> 08:52:14,720
things that you can look at,
11866
08:52:14,720 --> 08:52:16,720
and things that really you don't bother looking at.
11867
08:52:16,720 --> 08:52:18,720
You go read the documentation and use it,
11868
08:52:18,720 --> 08:52:20,720
and away it goes.
11869
08:52:20,720 --> 08:52:23,720
But then someone had to write that, and so they built an object.
11870
08:52:23,720 --> 08:52:25,720
So what we're going to do is look a little bit
11871
08:52:25,720 --> 08:52:32,720
under the covers of what it takes to build some of these objects.
11872
08:52:32,720 --> 08:52:34,720
And so if we think of this program
11873
08:52:34,720 --> 08:52:36,720
that originally just sort of did processing,
11874
08:52:36,720 --> 08:52:39,720
we can think of it as having some kind of an input, right,
11875
08:52:39,720 --> 08:52:41,720
coming into our program.
11876
08:52:41,720 --> 08:52:43,720
And we have a string object, a dictionary object,
11877
08:52:43,720 --> 08:52:46,720
maybe eventually some objects like a database object
11878
08:52:46,720 --> 08:52:48,720
or an object that we eventually define.
11879
08:52:48,720 --> 08:52:50,720
And you can think of us, we're receiving data,
11880
08:52:50,720 --> 08:52:52,720
it comes in an object, which is a string object,
11881
08:52:52,720 --> 08:52:55,720
or you start putting the strings in dictionaries
11882
08:52:55,720 --> 08:52:58,720
and do whatever, we pull out a list of them,
11883
08:52:58,720 --> 08:53:01,720
and so you can think of data as moving between these objects.
11884
08:53:01,720 --> 08:53:05,720
And like I say, even strings, in the first week,
11885
08:53:05,720 --> 08:53:08,720
first lecture, first week, first everything,
11886
08:53:08,720 --> 08:53:13,720
we were using objects, and we've been using them all along.
11887
08:53:13,720 --> 08:53:16,720
And so you can think of every string and every dictionary
11888
08:53:16,720 --> 08:53:20,720
as a little program all by itself that has a bit of code
11889
08:53:20,720 --> 08:53:22,720
and a bit of data.
11890
08:53:22,720 --> 08:53:25,720
And so a string has the data, which includes all the characters
11891
08:53:25,720 --> 08:53:28,720
that make up the string, but then there is a method called
11892
08:53:28,720 --> 08:53:31,720
upper that does uppercase, or rstrip,
11893
08:53:31,720 --> 08:53:34,720
that strips off the whitespace from the right.
11894
08:53:34,720 --> 08:53:36,720
And so it's like they're almost little programs
11895
08:53:36,720 --> 08:53:38,720
that have inputs and outputs themselves,
11896
08:53:38,720 --> 08:53:40,720
and we can make lots of them.
11897
08:53:40,720 --> 08:53:45,720
And there's lots of cooperating objects that make up an application.
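As a small illustration of a string carrying both data and code, here is a sketch using the two methods just mentioned. The variable names are made up for this example:

```python
# The string object holds the data (its characters, including trailing spaces)
# and comes with methods (code) that work on that data.
greeting = 'hello there   '

shouted = greeting.upper()    # method that returns an uppercase copy
trimmed = greeting.rstrip()   # method that returns a copy without trailing whitespace

print(repr(shouted))
print(repr(trimmed))
print(repr(greeting))         # the original string is unchanged
```

Each call takes the string as its input and hands back a new string as its output, which is the "little program" idea in miniature.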
11898
08:53:45,720 --> 08:53:48,720
And one of the nice things about the object-oriented pattern
11899
08:53:48,720 --> 08:53:53,720
is that they form boundaries, and within the boundary,
11900
08:53:53,720 --> 08:53:55,720
if you're inside the object, you can say,
11901
08:53:55,720 --> 08:53:58,720
look, I'm going to build you a string object or a database object
11902
08:53:58,720 --> 08:54:01,720
or a BeautifulSoup object, and I'm going to build this capability
11903
08:54:01,720 --> 08:54:03,720
and I'm going to give it to you in the form of an interface,
11904
08:54:03,720 --> 08:54:05,720
and I'm not really going to care how you use it.
11905
08:54:05,720 --> 08:54:08,720
And so we have this sort of visibility wall
11906
08:54:08,720 --> 08:54:11,720
where I'm going to make an object and I'm going to let you use it,
11907
08:54:11,720 --> 08:54:14,720
and the maker of the object doesn't necessarily have to know
11908
08:54:14,720 --> 08:54:17,720
every single thing about the use of that object.
11909
08:54:17,720 --> 08:54:20,720
But so just like inside the object, they don't have to worry
11910
08:54:20,720 --> 08:54:23,720
about what you're doing with the object outside of it.
11911
08:54:23,720 --> 08:54:25,720
When you're outside the object, you don't have to worry
11912
08:54:25,720 --> 08:54:27,720
about what's going on inside of it.
11913
08:54:27,720 --> 08:54:30,720
We, as the user of the object, we talk to its interface
11914
08:54:30,720 --> 08:54:32,720
and we get things from it and give things to it
11915
08:54:32,720 --> 08:54:34,720
and use functionality within that object,
11916
08:54:34,720 --> 08:54:36,720
but we don't have to look inside of this.
11917
08:54:36,720 --> 08:54:38,720
We can just say, oh, it's a nice little magical thing.
11918
08:54:38,720 --> 08:54:40,720
We read the documentation, we read a web page,
11919
08:54:40,720 --> 08:54:43,720
and it told us to do this, this, and this, and away you go.
11920
08:54:43,720 --> 08:54:46,720
And so it is sort of this isolation boundary
11921
08:54:46,720 --> 08:54:50,720
that works both for the programmer who's writing the object
11922
08:54:50,720 --> 08:54:53,720
and the programmer who's using the object.
11923
08:54:53,720 --> 08:54:56,720
And so it's a very nice pattern,
11924
08:54:56,720 --> 08:54:59,720
and so you'll see how we're going to build code
11925
08:54:59,720 --> 08:55:01,720
and we're going to group it together,
11926
08:55:01,720 --> 08:55:06,720
and then we're going to be using it sort of as a big blob of stuff.
11927
08:55:06,720 --> 08:55:09,720
So some definitions in this space,
11928
08:55:09,720 --> 08:55:13,720
words that I want you to understand.
11929
08:55:13,720 --> 08:55:16,720
When we're going to create one of these things,
11930
08:55:16,720 --> 08:55:20,720
one of these objects, instances, that has some data in it
11931
08:55:20,720 --> 08:55:23,720
and some code in it, we have to be able to define
11932
08:55:23,720 --> 08:55:24,720
the shape of this object.
11933
08:55:24,720 --> 08:55:26,720
What code will each object have in it
11934
08:55:26,720 --> 08:55:29,720
and what data will each object have in it?
11935
08:55:29,720 --> 08:55:31,720
And that's called a class.
11936
08:55:31,720 --> 08:55:33,720
The key to a class in this little picture
11937
08:55:33,720 --> 08:55:36,720
that I've got up here in all these slides is a key.
11938
08:55:36,720 --> 08:55:37,720
The class is a template.
11939
08:55:37,720 --> 08:55:39,720
It's not the thing itself, so it's a cookie cutter.
11940
08:55:39,720 --> 08:55:42,720
It knows a lot about how cookies are made,
11941
08:55:42,720 --> 08:55:45,720
and if you have cookie dough and you hit the thing,
11942
08:55:45,720 --> 08:55:47,720
then you make as many cookies as you want.
11943
08:55:47,720 --> 08:55:52,720
And so this nice little cookie picture is a great, you know,
11944
08:55:52,720 --> 08:55:54,720
mental model of how it works.
11945
08:55:54,720 --> 08:56:03,720
The class is the template,
11946
08:56:03,720 --> 08:56:07,720
and then the objects are all of the cookies
11947
08:56:07,720 --> 08:56:09,720
that are made from that template.
11948
08:56:09,720 --> 08:56:12,720
But the template defines the shape and the nature of the class.
11949
08:56:12,720 --> 08:56:16,720
So the code that we write is going of each of the objects.
11950
08:56:16,720 --> 08:56:19,720
The code we write is the class code,
11951
08:56:19,720 --> 08:56:21,720
and then later we say, oh, let's take that template
11952
08:56:21,720 --> 08:56:24,720
and make ourselves an object or an instance.
11953
08:56:24,720 --> 08:56:27,720
Now, as we're defining a class,
11954
08:56:27,720 --> 08:56:30,720
we have two basic things that we put in the class.
11955
08:56:30,720 --> 08:56:32,720
And there's a couple of different terminologies for this.
11956
08:56:32,720 --> 08:56:34,720
One is method, which is code.
11957
08:56:34,720 --> 08:56:36,720
It's like a function that lives inside of a class.
11958
08:56:36,720 --> 08:56:39,720
Not a function that lives inside your program,
11959
08:56:39,720 --> 08:56:40,720
but one that lives inside of a class.
11960
08:56:40,720 --> 08:56:42,720
And so this is a scoping thing.
11961
08:56:42,720 --> 08:56:45,720
A method is really just a function,
11962
08:56:45,720 --> 08:56:47,720
but it lives inside the class.
11963
08:56:47,720 --> 08:56:50,720
And then fields or attributes are data items that are in the class.
11964
08:56:50,720 --> 08:56:53,720
And so they're variables that are defined in the class.
11965
08:56:53,720 --> 08:56:56,720
You can define variables outside the class that you use in your program,
11966
08:56:56,720 --> 08:56:57,720
and you've been doing that all along.
11967
08:56:57,720 --> 08:56:59,720
But if you're saying, I'm going to build this capability
11968
08:56:59,720 --> 08:57:02,720
and it's going to have data inside of it and code inside of it,
11969
08:57:02,720 --> 08:57:05,720
the code is the method or message, and the data is the field or attribute.
11970
08:57:05,720 --> 08:57:13,720
And there are just two different sets of terminology.
11971
08:57:13,720 --> 08:57:17,720
Method is what I'll probably use. If you look in some object-oriented patterns
11972
08:57:17,720 --> 08:57:20,720
like Smalltalk or Apple.
11973
08:57:20,720 --> 08:57:22,720
They often call these messages.
11974
08:57:22,720 --> 08:57:26,720
So you can either access a method inside of a class or an object,
11975
08:57:26,720 --> 08:57:28,720
or you can send a message to the object.
11976
08:57:28,720 --> 08:57:30,720
The same is true for field and attribute.
11977
08:57:30,720 --> 08:57:32,720
It's just a chunk of data that's in the object
11978
08:57:32,720 --> 08:57:37,720
that you may or may not have the right to access.
11979
08:57:37,720 --> 08:57:40,720
So like I said, a class is a template.
11980
08:57:40,720 --> 08:57:44,720
It defines the characteristics of the objects that we're going to use it to make.
11981
08:57:44,720 --> 08:57:48,720
It is the cookie cutter.
11982
08:57:48,720 --> 08:57:52,720
So dog is sort of the exemplar.
11983
08:57:52,720 --> 08:57:54,720
Lassie is a particular dog.
11984
08:57:54,720 --> 08:57:57,720
And so dog has fur and dog barks, and dogs do all these things.
11985
08:57:57,720 --> 08:58:01,720
And so we know something about dogs, but it doesn't mean we have a dog, right?
11986
08:58:01,720 --> 08:58:06,720
And the class is a more abstract concept that when it's time to get a dog,
11987
08:58:06,720 --> 08:58:09,720
we know certain things about dogs.
11988
08:58:09,720 --> 08:58:16,720
Instances or objects are once we say, oh, time to make a cookie from the template.
11989
08:58:16,720 --> 08:58:17,720
Time to get a dog.
11990
08:58:17,720 --> 08:58:19,720
We know something about dogs.
11991
08:58:19,720 --> 08:58:23,720
That's the creation of an object, and we call them instances,
11992
08:58:23,720 --> 08:58:24,720
instance of a class.
11993
08:58:24,720 --> 08:58:28,720
So the class doesn't exist, but we say,
11994
08:58:28,720 --> 08:58:31,720
make me a new object using this class as its template.
11995
08:58:31,720 --> 08:58:33,720
Oh, and now make me another one.
11996
08:58:33,720 --> 08:58:36,720
And so we can have many, many objects from one class.
11997
08:58:36,720 --> 08:58:41,720
So just like many cookies from one cookie cutter.
11998
08:58:41,720 --> 08:58:45,720
Method is a bit of code that lives inside of an object.
11999
08:58:45,720 --> 08:58:50,720
It's like a function, but it's scoped to within the object or within the class.
12000
08:58:50,720 --> 08:58:54,720
Okay, so that kind of gets us started on some of the terminology,
12001
08:58:54,720 --> 08:59:03,720
and we'll come back and we'll take a look at how we write code that's object-oriented.
12002
08:59:03,720 --> 08:59:06,720
Okay, so now that we've gotten through the definitions,
12003
08:59:06,720 --> 08:59:08,720
let's work into some sample code.
12004
08:59:08,720 --> 08:59:10,720
But hey, look at this.
12005
08:59:10,720 --> 08:59:13,720
We've got ourselves a cookie cutter and some cookies.
12006
08:59:13,720 --> 08:59:17,720
So remember that a class is a template.
12007
08:59:17,720 --> 08:59:19,720
It's not the actual thing.
12008
08:59:19,720 --> 08:59:23,720
An object is an instance of a class.
12009
08:59:23,720 --> 08:59:27,720
So you have to take the class and do something to make the object.
12010
08:59:27,720 --> 08:59:30,720
And actually you can see here some other classes.
12011
08:59:30,720 --> 08:59:34,720
Clearly a sort of a snowflake class and a gingerbread man class.
12012
08:59:34,720 --> 08:59:36,720
That's an object, object, object.
12013
08:59:36,720 --> 08:59:41,720
Somewhere out here there is a snowflake class and a gingerbread class.
12014
08:59:41,720 --> 08:59:46,720
But we've got a snowman object and a snowman object and a snowman class.
12015
08:59:46,720 --> 08:59:51,720
So class is the template.
12016
08:59:51,720 --> 08:59:53,720
Object is the instance.
12017
08:59:53,720 --> 08:59:54,720
So here's a bit of Python code.
12018
08:59:54,720 --> 08:59:57,720
So let's take a look at what we've got here.
12019
08:59:57,720 --> 08:59:59,720
Class is a new reserved word, kind of like def.
12020
08:59:59,720 --> 09:00:01,720
We have the name of the class.
12021
09:00:01,720 --> 09:00:03,720
That is a name that we choose.
12022
09:00:03,720 --> 09:00:07,720
That's the name by which we'll refer to this class for the rest of this program.
12023
09:00:07,720 --> 09:00:09,720
And it has a colon at the end of it,
12024
09:00:09,720 --> 09:00:13,720
which means it starts an indented block, which ends when we deindent.
12025
09:00:13,720 --> 09:00:16,720
Inside the class there are generally two things.
12026
09:00:16,720 --> 09:00:19,720
There is some data, and this just looks like an assignment statement in the class,
12027
09:00:19,720 --> 09:00:20,720
x equals zero.
12028
09:00:20,720 --> 09:00:22,720
And then there is a def.
12029
09:00:22,720 --> 09:00:24,720
This looks just like a function.
12030
09:00:24,720 --> 09:00:27,720
And then it starts with a def, has a colon, indents.
12031
09:00:27,720 --> 09:00:29,720
That function finishes right there.
12032
09:00:29,720 --> 09:00:33,720
The difference is this is a method because it lives inside of a class.
12033
09:00:33,720 --> 09:00:36,720
And so there is no function called party.
12034
09:00:36,720 --> 09:00:40,720
There's a function called party within party animal class.
12035
09:00:40,720 --> 09:00:43,720
And we'll talk in a second about this self thing.
12036
09:00:43,720 --> 09:00:47,720
It is the way that inside this code we refer back to that variable.
12037
09:00:47,720 --> 09:00:50,720
So this is not actually executing any code.
12038
09:00:50,720 --> 09:00:54,720
It's sort of remembering the template, defining the class party animal.
12039
09:00:54,720 --> 09:00:57,720
This is what we call constructing.
12040
09:00:57,720 --> 09:01:00,720
We're constructing, using the party animal template or class,
12041
09:01:00,720 --> 09:01:02,720
we are making a party animal.
12042
09:01:02,720 --> 09:01:06,720
And then once we make that, we stick it in the variable an.
12043
09:01:06,720 --> 09:01:10,720
And then we're going to call this party animal, this party method,
12044
09:01:10,720 --> 09:01:12,720
three times: one, two, three.
12045
09:01:12,720 --> 09:01:15,720
Now this self thing, and we'll take a look at the self.
12046
09:01:15,720 --> 09:01:19,720
The self ends up being an alias of an.
12047
09:01:19,720 --> 09:01:21,720
And so you can look at this syntax.
12048
09:01:21,720 --> 09:01:23,720
It's just kind of an equivalent of this syntax.
12049
09:01:23,720 --> 09:01:27,720
It's calling the party method within the party animal class
12050
09:01:27,720 --> 09:01:30,720
and passing the instance in as the first parameter.
12051
09:01:30,720 --> 09:01:36,720
And so self ends up being an alias of an each time these are called.
12052
09:01:36,720 --> 09:01:39,720
Now if we make a different variable and a second object,
12053
09:01:39,720 --> 09:01:43,720
which we will eventually, you will see that that works a little bit differently.
12054
09:01:43,720 --> 09:01:48,720
And so this syntax is a short version of that syntax.
12055
09:01:48,720 --> 09:01:55,720
So if we watch how this executes, it starts up here,
12056
09:01:55,720 --> 09:01:59,720
it just defines it, and then we construct it.
12057
09:01:59,720 --> 09:02:04,720
And that's what basically constructing it, we know how to construct it
12058
09:02:04,720 --> 09:02:08,720
because we look at the class and we make a variable x, we make some code party,
12059
09:02:08,720 --> 09:02:11,720
and then we construct that, that's what the party animal does,
12060
09:02:11,720 --> 09:02:13,720
and then we assign that into an.
12061
09:02:13,720 --> 09:02:17,720
And so an is now pointing at that.
12062
09:02:17,720 --> 09:02:22,720
And then when we call the party method, that basically takes this an
12063
09:02:22,720 --> 09:02:27,720
and passes it in as the first parameter, which is used as self.
12064
09:02:27,720 --> 09:02:32,720
And so self.x, which is what we're doing in this line right here,
12065
09:02:32,720 --> 09:02:37,720
self.x is a variable, x starts out as zero.
12066
09:02:37,720 --> 09:02:41,720
x starts out as zero because when it was constructed it was set to zero.
12067
09:02:41,720 --> 09:02:44,720
So we're in here, an is an alias of self.
12068
09:02:44,720 --> 09:02:48,720
It looks up self.x, which is zero, adds one to it, and so this becomes one.
12069
09:02:48,720 --> 09:02:51,720
And then we print so far, so far one.
12070
09:02:51,720 --> 09:02:54,720
And then the code returns and it goes down and does it again.
12071
09:02:54,720 --> 09:02:58,720
And x becomes two, prints out so far two, comes back down,
12072
09:02:58,720 --> 09:03:02,720
and does the last time, calls it again, self.x is two,
12073
09:03:02,720 --> 09:03:06,720
add one to it and stick it back in, so this becomes three,
12074
09:03:06,720 --> 09:03:09,720
and we print out three, and then the program finishes.
12075
09:03:09,720 --> 09:03:13,720
And so you can think of this as constructing the object,
12076
09:03:13,720 --> 09:03:19,720
and then associating it with this an variable.
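The walkthrough above refers to code on a slide that the transcript does not show. A sketch of what it likely looks like, assuming the class is spelled PartyAnimal and the message printed is "So far":

```python
# Defining the class only records the template; no party code runs yet.
class PartyAnimal:
    x = 0                          # data: starts at zero for each new object

    def party(self):               # method: a function that lives inside the class
        self.x = self.x + 1        # self refers to the object the method was called on
        print('So far', self.x)

an = PartyAnimal()                 # construct an object and store it in an

an.party()                         # short form: an is passed in as self
an.party()
PartyAnimal.party(an)              # long form of the same call, passing an explicitly
```

The last line is the longer syntax mentioned in the lecture; it does the same thing as an.party(), so the counter reaches three.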
12077
09:03:19,720 --> 09:03:22,720
Now that we've created this object, we can play around with things
12078
09:03:22,720 --> 09:03:24,720
we've played around with before, dir and type.
12079
09:03:24,720 --> 09:03:31,720
We use dir and type to kind of inspect variables and types and objects.
12080
09:03:31,720 --> 09:03:34,720
So we've been using objects all along.
12081
09:03:34,720 --> 09:03:37,720
This code here says, hey, make me an empty list.
12082
09:03:37,720 --> 09:03:42,720
Well, it turns out that what we're saying is there is already a list class
12083
09:03:42,720 --> 09:03:46,720
inside of Python, and we're constructing an empty list.
12084
09:03:46,720 --> 09:03:51,720
And when we get back this empty list, we're assigning that into x.
12085
09:03:51,720 --> 09:03:55,720
So x, in a sense, contains or points to an empty list.
12086
09:03:55,720 --> 09:03:57,720
So then we say, hey, what is in x?
12087
09:03:57,720 --> 09:03:59,720
What kind of thing is x? Well, it's a list.
12088
09:03:59,720 --> 09:04:01,720
This is a thing. It's a list type.
12089
09:04:01,720 --> 09:04:04,720
Lists have lists of things in them.
12090
09:04:04,720 --> 09:04:07,720
And, you know, use append and all the things we've been doing before,
12091
09:04:07,720 --> 09:04:08,720
they're just objects.
12092
09:04:08,720 --> 09:04:12,720
And then the dir, if you remember the dir, the dir is the capabilities.
12093
09:04:12,720 --> 09:04:15,720
And there's all these internal capabilities that do things like
12094
09:04:15,720 --> 09:04:19,720
implement the bracket operator, et cetera, those double underscore ones.
12095
09:04:19,720 --> 09:04:21,720
We can ignore them, although you can even look them up
12096
09:04:21,720 --> 09:04:23,720
and figure out what they mean if you feel like it.
12097
09:04:23,720 --> 09:04:27,720
But the methods that we tend to call are in this class.
12098
09:04:27,720 --> 09:04:32,720
And so things like x.sort, I've always told you,
12099
09:04:32,720 --> 09:04:35,720
that is the sort method within the x thing.
12100
09:04:35,720 --> 09:04:38,720
And the dot operator is the operator that we use
12101
09:04:38,720 --> 09:04:40,720
to look something up within an object.
12102
09:04:40,720 --> 09:04:43,720
And so you've been using the syntax all along.
12103
09:04:43,720 --> 09:04:47,720
x.sort, dictionary.items, all of those are methods
12104
09:04:47,720 --> 09:04:50,720
within the corresponding class.
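Here is a short sketch of inspecting a list with type and dir. Filtering out the double underscore names is an addition for readability, not something shown in the lecture:

```python
x = list()                  # construct an empty list object from the built-in list class
print(type(x))              # shows that x is a list

# dir lists the capabilities; skip the internal double underscore ones.
names = [name for name in dir(x) if not name.startswith('__')]
print(names)

x.append(3)                 # the dot looks up the append method within x
x.append(1)
x.sort()                    # and here it looks up the sort method
print(x)
```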
12105
09:04:50,720 --> 09:04:53,720
If we take a look at this line of code that we've been doing
12106
09:04:53,720 --> 09:04:57,720
for a very long time, which says, oh, stick hello there into y.
12107
09:04:57,720 --> 09:05:01,720
If I reword that as more OO, or object-oriented,
12108
09:05:01,720 --> 09:05:06,720
what this single quote does is say, make me a string object
12109
09:05:06,720 --> 09:05:11,720
and put some text in it, and then when that is done being constructed,
12110
09:05:11,720 --> 09:05:13,720
stick that into y.
12111
09:05:13,720 --> 09:05:17,720
Right? And so y now points to a string object
12112
09:05:17,720 --> 09:05:20,720
that's been preinitialized to the string hello there.
12113
09:05:20,720 --> 09:05:23,720
Now that's a long way of saying hello there ends up in y.
12114
09:05:23,720 --> 09:05:26,720
But in OO terms we can talk about that.
12115
09:05:26,720 --> 09:05:30,720
If we do a dir of that, we see a whole bunch of internal methods,
12116
09:05:30,720 --> 09:05:32,720
which have double underscores.
12117
09:05:32,720 --> 09:05:34,720
And then we see all kinds of methods that we've been using.
12118
09:05:34,720 --> 09:05:37,720
We've been using methods like upper.
12119
09:05:37,720 --> 09:05:39,720
We've been using methods like find.
12120
09:05:39,720 --> 09:05:44,720
We've been using methods like rstrip, right?
12121
09:05:44,720 --> 09:05:46,720
We've been using these methods.
12122
09:05:46,720 --> 09:05:51,720
So we're going to like y.rstrip, parentheses.
12123
09:05:51,720 --> 09:05:54,720
Again, that's a method, that's an object.
12124
09:05:54,720 --> 09:05:59,720
Not a class, it's an object, and that is the object lookup operator.
12125
09:05:59,720 --> 09:06:03,720
Now if we do the same thing to code that we've built,
12126
09:06:03,720 --> 09:06:07,720
or a class that we've built, so now we have a party animal class.
12127
09:06:07,720 --> 09:06:10,720
Remember this up to here is just definition.
12128
09:06:10,720 --> 09:06:13,720
Now we construct it, and we store it in an.
12129
09:06:13,720 --> 09:06:17,720
So an is a variable that contains an object of type party animal.
12130
09:06:17,720 --> 09:06:21,720
We ask it what type it is, and it prints out here.
12131
09:06:21,720 --> 09:06:25,720
It says this is a class, and it's __main__.PartyAnimal.
12132
09:06:25,720 --> 09:06:28,720
And this whole thing here is __main__.
12133
09:06:28,720 --> 09:06:30,720
It's scoped to __main__.
12134
09:06:30,720 --> 09:06:32,720
But you can see that you have made a new type.
12135
09:06:32,720 --> 09:06:35,720
You built a type by using this class keyword.
12136
09:06:35,720 --> 09:06:37,720
And then we use the dir.
12137
09:06:37,720 --> 09:06:39,720
Remember, dir looks for capabilities.
12138
09:06:39,720 --> 09:06:44,720
And again, you will see a whole bunch of underscore things.
12139
09:06:44,720 --> 09:06:46,720
They have meaning, you can look them up.
12140
09:06:46,720 --> 09:06:49,720
But eventually you'll see the two things that you've put in it.
12141
09:06:49,720 --> 09:06:53,720
One is the method party, and the other is the attribute, or field x.
12142
09:06:53,720 --> 09:06:57,720
And again, these are the things that you can say, an.x.
12143
09:06:57,720 --> 09:07:00,720
Or an.party.
12144
09:07:00,720 --> 09:07:06,720
Because this dot is the object operator, the object lookup operator that says,
12145
09:07:06,720 --> 09:07:09,720
look up in the object an, the thing x.
12146
09:07:09,720 --> 09:07:11,720
Or look up in the object an, the thing party.
12147
09:07:11,720 --> 09:07:13,720
Okay?
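The same inspection works on a class you write yourself. This sketch assumes the PartyAnimal class from earlier in the lecture and again filters out the double underscore names, which is an addition for readability:

```python
class PartyAnimal:
    x = 0

    def party(self):
        self.x = self.x + 1
        print('So far', self.x)

an = PartyAnimal()

print(type(an))             # a class scoped to the module where it was defined

# Keep only the names we put in the class ourselves.
public = [name for name in dir(an) if not name.startswith('__')]
print(public)

print(an.x)                 # dot looks up the attribute x in the object an
an.party()                  # dot looks up the method party in the object an
```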
12148
09:07:15,720 --> 09:07:19,720
So up next we'll talk a little bit about how objects are created and destroyed.
12149
09:07:19,720 --> 09:07:22,720
We also call that object life cycle.
12150
09:07:22,720 --> 09:07:27,720
Now I'm going to talk a little bit about object life cycle.
12151
09:07:27,720 --> 09:07:32,720
And what we mean by object life cycle is the act of creating and destroying these objects.
12152
09:07:32,720 --> 09:07:35,720
And I've been using this term constructor already.
12153
09:07:35,720 --> 09:07:41,720
And so when we declare a variable, whether it's a string or a dictionary or a party animal,
12154
09:07:41,720 --> 09:07:43,720
we create them and then they're discarded,
12155
09:07:43,720 --> 09:07:46,720
and there's all this dynamic memory that comes and goes.
12156
09:07:46,720 --> 09:07:53,720
And we as the writers of objects have the ability to insert ourselves at the moment of object creation
12157
09:07:53,720 --> 09:07:55,720
and at the moment of object destruction.
12158
09:07:55,720 --> 09:08:00,720
And we make special functions that we call the constructor, the object constructor,
12159
09:08:00,720 --> 09:08:03,720
or the class constructor, and the destructor.
12160
09:08:03,720 --> 09:08:05,720
And we don't actually explicitly call them.
12161
09:08:05,720 --> 09:08:10,720
They're called automatically by Python on our behalf.
12162
09:08:10,720 --> 09:08:13,720
And so the constructor is much more commonly used.
12163
09:08:13,720 --> 09:08:18,720
It's used to set up any initial values of variables if necessary, etc., etc.
12164
09:08:18,720 --> 09:08:24,720
Destructors, we'll cover them, but they're used very rarely.
12165
09:08:24,720 --> 09:08:26,720
So here's a bit of code that we've got.
12166
09:08:26,720 --> 09:08:31,720
It's our party animal, and a lot of it is the same as what we've been doing so far.
12167
09:08:31,720 --> 09:08:35,720
So we have this variable x, and the constructor has a special name,
12168
09:08:35,720 --> 09:08:38,720
underscore, underscore, init, underscore, underscore.
12169
09:08:38,720 --> 09:08:42,720
Again, we pass in the instance of the object, self.
12170
09:08:42,720 --> 09:08:45,720
And in this one, all we're going to do is print out that you're constructed.
12171
09:08:45,720 --> 09:08:47,720
And here's this code that we've had before.
12172
09:08:47,720 --> 09:08:51,720
And now we have underscore, underscore, del, underscore, underscore, and then we pass in self.
12173
09:08:51,720 --> 09:08:54,720
And we'll just print out that we're being destructed
12174
09:08:54,720 --> 09:09:01,720
and what the current value of x is for that particular instance.
12175
09:09:01,720 --> 09:09:04,720
So let's go ahead and run this.
12176
09:09:04,720 --> 09:09:07,720
And so, again, this doesn't really run any code up to here.
12177
09:09:07,720 --> 09:09:11,720
That just defines party animal, but this is the constructing of it.
12178
09:09:11,720 --> 09:09:15,720
And basically that says, oh, and it really kind of creates these variables,
12179
09:09:15,720 --> 09:09:17,720
and then it also runs the constructor.
12180
09:09:17,720 --> 09:09:23,720
And so in this case, this line right here is causing the I am constructed message to come out.
12181
09:09:23,720 --> 09:09:29,720
Then we do an.party, an.party, and that says, you know, one and two.
12182
09:09:29,720 --> 09:09:31,720
And here's an interesting thing.
12183
09:09:31,720 --> 09:09:37,720
We're actually going to destroy this variable by throwing it away. an no longer points at that object.
12184
09:09:37,720 --> 09:09:39,720
an is going to point to 42.
12185
09:09:39,720 --> 09:09:42,720
So we're going to sort of overwrite an and put 42 in it.
12186
09:09:42,720 --> 09:09:46,720
And at that point, Python's like, oh, this whole little object that I just created,
12187
09:09:46,720 --> 09:09:52,720
somewhere it's out here, it's vaporizing it and throwing it away.
12188
09:09:52,720 --> 09:09:57,720
And so before this line completes, it actually calls our destructor on our behalf.
12189
09:09:57,720 --> 09:09:59,720
And so that message comes out.
12190
09:09:59,720 --> 09:10:05,720
So we are allowed as the builder of these objects to add these little chunks of code that says,
12191
09:10:05,720 --> 09:10:11,720
I want to be involved at the moment this object is created, and I want to be involved at the moment that this object is destroyed.
12192
09:10:11,720 --> 09:10:16,720
Now, in this last line, an is no longer a party animal.
12193
09:10:16,720 --> 09:10:18,720
an is now an integer.
12194
09:10:18,720 --> 09:10:20,720
It's got a 42 in it.
12195
09:10:20,720 --> 09:10:23,720
It's gone. It's been created. It was used, and then it was destroyed.
12196
09:10:23,720 --> 09:10:26,720
So you've got to be careful if you overwrite something.
12197
09:10:26,720 --> 09:10:29,720
You can sort of throw the object away.
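The walkthrough above can be sketched as follows. The printed messages and the class body are assumptions based on what is said in the lecture. Note also that the destructor running immediately when an is overwritten is the behavior of the standard CPython interpreter; other Python implementations may run it later.

```python
class PartyAnimal:
    x = 0

    def __init__(self):
        # Constructor: called automatically when the object is created.
        print("I am constructed")

    def party(self):
        self.x = self.x + 1
        print("So far", self.x)

    def __del__(self):
        # Destructor: called automatically when the object is discarded.
        print("I am destructed", self.x)


an = PartyAnimal()   # runs __init__
an.party()
an.party()
an = 42              # the PartyAnimal object is thrown away here
print("an contains", an)
```

We never call __init__ or __del__ ourselves; Python calls them on our behalf.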
12198
09:10:29,720 --> 09:10:38,720
So the constructor is a special block of code that's called when the object is created to set the object up.
12199
09:10:38,720 --> 09:10:40,720
So we can create lots of instances.
12200
09:10:40,720 --> 09:10:46,720
Everything we've done so far is we make a class, and then we create one instance, one object.
12201
09:10:46,720 --> 09:10:49,720
And each of these objects ends up being stored in its own variable.
12202
09:10:49,720 --> 09:10:51,720
We have a variable an, and we've been using it.
12203
09:10:51,720 --> 09:10:58,720
But the more interesting thing begins to happen when we have multiple instances of the same class sitting in different variables.
12204
09:10:58,720 --> 09:11:01,720
And it has its own copy of the instance variables.
12205
09:11:01,720 --> 09:11:03,720
So let's take a look at this.
12206
09:11:03,720 --> 09:11:11,720
So this code here, I've taken out the destructor, and it shows a little bit more information.
12207
09:11:11,720 --> 09:11:13,720
So now we're going to put two variables in here.
12208
09:11:13,720 --> 09:11:18,720
We're going to have a current score or whatever and a name, and we're going to start it out as blank.
12209
09:11:18,720 --> 09:11:23,720
And this time we're going to add a parameter onto the constructor.
12210
09:11:23,720 --> 09:11:28,720
And so the self comes in sort of automatically as the object is being constructed.
12211
09:11:28,720 --> 09:11:36,720
But if we put a parameter on the constructor call, which is this party animal call, then this comes in as the z variable.
12212
09:11:36,720 --> 09:11:42,720
And so self is the object itself, and z, this first parameter, is whatever parameter we put here.
12213
09:11:42,720 --> 09:11:46,720
Everything we've done so far has no parameter here, but now we have a parameter here.
12214
09:11:46,720 --> 09:11:51,720
And then that means that when we call this constructor, this line of code comes,
12215
09:11:51,720 --> 09:11:56,720
and then name is no longer blank, name is going to be Sally in this particular thing.
12216
09:11:56,720 --> 09:12:01,720
And then it'll say, oh, self.name, which will be Sally who has been constructed.
12217
09:12:01,720 --> 09:12:07,720
And so then we have this, and that object is now constructed, and then we put it in the variable s.
12218
09:12:07,720 --> 09:12:11,720
And then we call the party method on that, and we construct a different one.
12219
09:12:11,720 --> 09:12:20,720
And so this time it calls, and z is Jim, and we basically have a, oops,
12220
09:12:20,720 --> 09:12:24,720
another copy of this. And so this is how it's going to look.
12221
09:12:24,720 --> 09:12:35,720
As it runs down here, when this is called, it makes one instance and stores that in the variable s.
12222
09:12:35,720 --> 09:12:41,720
And there's a variable x in there, there's a name in there, there's an init method and party, and that's all in here.
12223
09:12:41,720 --> 09:12:48,720
All that stuff is in here. And now we say, let's make, and that's going to have Sally in there.
12224
09:12:48,720 --> 09:12:52,720
All right, Sally in there.
12225
09:12:52,720 --> 09:12:56,720
And then we're going to do another constructor, and so it's going to make a whole new thing,
12226
09:12:56,720 --> 09:13:00,720
and it's going to store that in j, and this one's going to have Jim in it.
12227
09:13:00,720 --> 09:13:05,720
s.party, then this turns into a one, and then we're going to call j.party,
12228
09:13:05,720 --> 09:13:10,720
that turns that into a one, and then s.party will cause this to be a two.
12229
09:13:10,720 --> 09:13:17,720
And so what happens is we have now two objects, one in the variable s and one in the variable j,
12230
09:13:17,720 --> 09:13:20,720
and they have separate copies of their instance variables.
12231
09:13:20,720 --> 09:13:25,720
These are the instance variables, or the object fields, or whatever, but they're the variables.
12232
09:13:25,720 --> 09:13:32,720
But the key is that every time we do a new construction, it duplicates this, and there's another copy of it.
12233
09:13:32,720 --> 09:13:40,720
So there's an x within s. So s.x is this variable, and j.x is that variable.
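Here is a small sketch of the two-instance example, assuming the class looks the way it is described (x, name, a constructor that takes z, and party). The exact print wording is an assumption.

```python
class PartyAnimal:
    x = 0
    name = ""

    def __init__(self, z):
        # self comes in automatically; z is the parameter from the call.
        self.name = z
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


s = PartyAnimal("Sally")
j = PartyAnimal("Jim")

s.party()   # s.x becomes 1
j.party()   # j.x becomes 1
s.party()   # s.x becomes 2, j.x is untouched
```

Each object keeps its own copy of the instance variables, so s.x and j.x change independently.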
12234
09:13:40,720 --> 09:13:54,720
Okay? So the next thing we'll talk about is inheritance, and that's the idea of taking one class and extending it to make something new.
12235
09:13:54,720 --> 09:13:59,720
So the last topic we'll talk about here in object orientation is the notion of inheritance.
12236
09:13:59,720 --> 09:14:06,720
And this is a form of code reuse, and it's one of the more advanced aspects of object-oriented programming.
12237
09:14:06,720 --> 09:14:14,720
So just kind of understand what it is at a high level, and then you know where to come back to when you need to learn a bit more about inheritance.
12238
09:14:14,720 --> 09:14:21,720
So the idea is instead of making a new class from scratch, we actually make a new class by starting with an existing class.
12239
09:14:21,720 --> 09:14:25,720
We are extending it, or another word for this is subclassing.
12240
09:14:25,720 --> 09:14:33,720
And it's sort of a situation where you're like, I've got this code, and I've got this data, and I just need to add a few things to it,
12241
09:14:33,720 --> 09:14:35,720
and then I'll have a whole new thing.
12242
09:14:35,720 --> 09:14:44,720
And as you design objects and what we call object hierarchies, you often do this, and it's a form of sort of real clever code reuse.
12243
09:14:44,720 --> 09:14:51,720
But again, don't necessarily think that you're supposed to know when to use this or why to use this.
12244
09:14:51,720 --> 09:14:55,720
Right now, it's just terminology, okay? Just terminology.
12245
09:14:55,720 --> 09:14:57,720
We call these parent-child relationships.
12246
09:14:57,720 --> 09:15:02,720
The original class is called a parent, and the new class is called the child class.
12247
09:15:02,720 --> 09:15:05,720
So subclasses are another word for this.
12248
09:15:05,720 --> 09:15:07,720
You have a class, and then you subclass it.
12249
09:15:07,720 --> 09:15:14,720
I think extending and inheriting and parent-child are probably better ways of expressing it than subclassing.
12250
09:15:14,720 --> 09:15:17,720
So here's a bit of code. Let's take a look at this.
12251
09:15:17,720 --> 09:15:22,720
This code's unchanged. It's the party animal code that we've been seeing all along.
12252
09:15:22,720 --> 09:15:25,720
It's the one that we construct and put a name in.
12253
09:15:25,720 --> 09:15:27,720
And now what we're going to do is extend it.
12254
09:15:27,720 --> 09:15:31,720
And so you'll notice that this code down here is the part that's doing the extending.
12255
09:15:31,720 --> 09:15:34,720
So we're making a new class, football fan.
12256
09:15:34,720 --> 09:15:39,720
And by putting in parentheses before the colon, party animal, that says,
12257
09:15:39,720 --> 09:15:45,720
football fan inherits everything that is party animal, meaning the x, the name, the init, the party.
12258
09:15:45,720 --> 09:15:48,720
All those methods and data are sitting there.
12259
09:15:48,720 --> 09:15:50,720
And now we're going to add a new variable.
12260
09:15:50,720 --> 09:15:56,720
So football fan has, in addition to all those other variables, it has points, and it has a touchdown method.
12261
09:15:56,720 --> 09:16:03,720
And self.points is added to, we add seven to the points, and then we call the party.
12262
09:16:03,720 --> 09:16:04,720
And that does that.
12263
09:16:04,720 --> 09:16:11,720
So this is calling this method because football fan includes x, name, and party, and init, and everything.
12264
09:16:11,720 --> 09:16:18,720
And all this constructor, so this football fan is really an amalgamation of all these things together.
12265
09:16:18,720 --> 09:16:22,720
Party animal is just this stuff, right?
12266
09:16:22,720 --> 09:16:25,720
And so we still have two classes. We don't just have one.
12267
09:16:25,720 --> 09:16:27,720
We didn't erase the party animal class.
12268
09:16:27,720 --> 09:16:29,720
And so we take a look at the code that we can run here.
12269
09:16:29,720 --> 09:16:32,720
We can say, oh, okay, let's make a party animal, Sally.
12270
09:16:32,720 --> 09:16:41,720
And so that constructs an object like this, and then stores that in s, with an x starting out at zero.
12271
09:16:41,720 --> 09:16:48,720
And then we call this party, oops, better change that color, starts out at zero.
12272
09:16:48,720 --> 09:16:51,720
And then we call the party method, and that changes it to one.
12273
09:16:51,720 --> 09:16:57,720
And so this bit of code, it's as if this part doesn't matter at all because it is a party animal.
12274
09:16:57,720 --> 09:16:59,720
It's not a football fan.
12275
09:16:59,720 --> 09:17:06,720
But now if we take a look at this code down here, take this code down here,
12276
09:17:06,720 --> 09:17:10,720
we're going to construct a football fan and pass in Jim.
12277
09:17:10,720 --> 09:17:13,720
But football fan has no underscore, underscore, init.
12278
09:17:13,720 --> 09:17:20,720
So that actually uses the underscore, underscore, init from party animal because we extended party animal to make football fan.
12279
09:17:20,720 --> 09:17:23,720
So we inherited all of the good that was in there.
12280
09:17:23,720 --> 09:17:27,720
So there it's going to make a name, a variable x, which is going to start at zero,
12281
09:17:27,720 --> 09:17:32,720
a variable name that's going to have Jim in it, and a variable points that's going to have a zero in it.
12282
09:17:32,720 --> 09:17:38,720
So this j variable has more things in it than the s variable has.
12283
09:17:38,720 --> 09:17:46,720
And so we can call j.party, and if we call j.party, that goes here and adds one to x, right?
12284
09:17:46,720 --> 09:17:48,720
So that adds one to x.
12285
09:17:48,720 --> 09:17:50,720
And then we call j.touchdown.
12286
09:17:50,720 --> 09:17:55,720
Well, that comes down in here and adds seven to the points, right?
12287
09:17:55,720 --> 09:17:58,720
And then calls party within it.
12288
09:17:58,720 --> 09:18:04,720
So self.party is the current object, i.e. self and j are the same thing, right?
12289
09:18:04,720 --> 09:18:13,720
Self.party, and then it goes up here and passes self in, and it adds one to the x, in this case, of this j variable.
12290
09:18:13,720 --> 09:18:15,720
So this becomes two.
12291
09:18:15,720 --> 09:18:20,720
And that's where it prints out seven and two, and away you go.
12292
09:18:20,720 --> 09:18:28,720
And so it's a way for you to kind of take all this stuff and stuff it into a class by making a new class
12293
09:18:28,720 --> 09:18:33,720
and just add the extending bits, the bits that are in addition to the other stuff.
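The inheritance example can be sketched like this. The class and method names follow the lecture (party animal, football fan, touchdown, points); the parameter name nam and the print wording are assumptions.

```python
class PartyAnimal:
    x = 0
    name = ""

    def __init__(self, nam):
        self.name = nam
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


# PartyAnimal in parentheses means FootballFan extends PartyAnimal.
class FootballFan(PartyAnimal):
    points = 0

    def touchdown(self):
        self.points = self.points + 7
        self.party()   # inherited method; self is the same object as j
        print(self.name, "points", self.points)


s = PartyAnimal("Sally")
s.party()

j = FootballFan("Jim")   # uses the inherited __init__
j.party()
j.touchdown()            # points becomes 7, x becomes 2
```

FootballFan only contains the extending bits; everything else comes from the parent class, and PartyAnimal itself is left unchanged.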
12294
09:18:33,720 --> 09:18:38,720
So like I said, inheritance is a powerful and wonderful concept.
12295
09:18:38,720 --> 09:18:47,720
It's a form of, excellent form of reuse, but basically the whole purpose of this lecture was
12296
09:18:47,720 --> 09:18:52,720
so that I could in the future just use these words and you would understand them as compared to,
12297
09:18:52,720 --> 09:18:56,720
I just want to say method, and I've been saying method all along, and it's high time that I defined it.
12298
09:18:56,720 --> 09:18:59,720
So let's just review one last time.
12299
09:18:59,720 --> 09:19:01,720
Class is a template.
12300
09:19:01,720 --> 09:19:03,720
It is not actually a thing.
12301
09:19:03,720 --> 09:19:05,720
It is a shape of a thing.
12302
09:19:05,720 --> 09:19:09,720
And we define it and say when we make one of these things, it's going to have these variables in it,
12303
09:19:09,720 --> 09:19:11,720
it's going to have these methods in it.
12304
09:19:11,720 --> 09:19:16,720
Attributes are variables within a class. A method is a function that's inside of a class.
12305
09:19:16,720 --> 09:19:21,720
Object is once we construct a class, we get back an object.
12306
09:19:21,720 --> 09:19:26,720
And so object here is the snowman cookies.
12307
09:19:26,720 --> 09:19:29,720
Class is the snowman cookie cutter.
12308
09:19:29,720 --> 09:19:38,720
And a constructor is a bit of code that sets up our object, our instance, when it first is created.
12309
09:19:38,720 --> 09:19:47,720
And inheritance is this ability to create a new class but take and, in effect, import all the capabilities of an existing class.
12310
09:19:47,720 --> 09:19:51,720
So object-oriented is awesome.
12311
09:19:51,720 --> 09:19:54,720
For the rest of this class, we're not going to write any object code.
12312
09:19:54,720 --> 09:19:57,720
We're not going to use class at all, but we are going to use objects.
12313
09:19:57,720 --> 09:20:00,720
Literally, you've been using objects from the beginning of this course.
12314
09:20:00,720 --> 09:20:09,720
As soon as you said, print, whoops, as soon as you said, you know, x equals 'hi', that's an object.
12315
09:20:09,720 --> 09:20:15,720
And as soon as you said x.upper, you were calling a method, right?
12316
09:20:15,720 --> 09:20:17,720
You've been calling a method all along.
12317
09:20:17,720 --> 09:20:25,720
When you're doing something like fh equals open, this thing you're getting back, that's an object.
12318
09:20:25,720 --> 09:20:28,720
And then you do fh.read or whatever.
12319
09:20:28,720 --> 09:20:31,720
You're calling a method with the dot operator.
12320
09:20:31,720 --> 09:20:33,720
So you've been using objects all along.
12321
09:20:33,720 --> 09:20:41,720
Now I'm just finally explaining to you when I say call the read method or call the upper method or what's this little dot and why is that there?
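A short illustration of the point that strings and file handles are already objects. To keep the sketch self-contained, io.StringIO stands in for a real file opened with open(); that substitution and the sample text are assumptions, not part of the lecture.

```python
import io

x = "hi"
print(type(x))       # <class 'str'>: x is an object of class str
print(x.upper())     # calling the upper method with the dot operator

# Stand-in for fh = open("somefile.txt"); it offers the same read method.
fh = io.StringIO("line one\nline two\n")
print(type(fh))
text = fh.read()     # calling the read method on the file-like object
print(text)
```

With a real file, the only change would be replacing the StringIO line with a call to open().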
12322
09:20:41,720 --> 09:20:54,720
So again, it's time for us to understand that, but it will take you a long time before you encounter a problem that's large enough where as part of your solution, you're going to make a new object.
12323
09:20:54,720 --> 09:20:56,720
But when you do, it's really a powerful thing.
12324
09:20:56,720 --> 09:21:02,720
I mean, it's a really bad idea for me as a teacher to say, oh, write a bunch of objects.
12325
09:21:02,720 --> 09:21:04,720
It's premature for that.
12326
09:21:04,720 --> 09:21:08,720
Later is when you will actually learn how to use objects.
12327
09:21:08,720 --> 09:21:11,720
And you'll be like, oh, thank heaven that these objects are here.
12328
09:21:11,720 --> 09:21:12,720
Okay?
12329
09:21:12,720 --> 09:21:14,720
So that's all for now.
12330
09:21:14,720 --> 09:21:15,720
Thanks for listening.
12331
09:21:15,720 --> 09:21:16,720
See you on the net.
12332
09:21:20,720 --> 09:21:23,720
Hello and welcome to our chapter on databases.
12333
09:21:23,720 --> 09:21:30,720
We're going to learn a lot in this chapter, learn a whole new programming language, SQL, and learn how to use that.
12334
09:21:30,720 --> 09:21:37,720
So you're going to need a new piece of software to run all of the exercises that I'm going to do called SQLite Browser.
12335
09:21:37,720 --> 09:21:39,720
We're using a database called SQLite.
12336
09:21:39,720 --> 09:21:40,720
Go ahead and download this.
12337
09:21:40,720 --> 09:21:42,720
You might have to pause and come back if you like.
12338
09:21:42,720 --> 09:21:46,720
Go to sqlitebrowser.org and download it and install it.
12339
09:21:46,720 --> 09:21:50,720
While you're doing that, we'll talk a little bit about the history.
12340
09:21:50,720 --> 09:22:01,720
So in the old days, 1960s, 1970s, I started doing computing in 1975, we didn't have a lot of storage.
12341
09:22:01,720 --> 09:22:06,720
I mean, this is 16 gigabytes right here, and we didn't even have megabytes.
12342
09:22:06,720 --> 09:22:10,720
I mean, the computer I had had a few megabytes of stuff.
12343
09:22:10,720 --> 09:22:12,720
Well, so we didn't have a lot of disk drives.
12344
09:22:12,720 --> 09:22:18,720
And so permanent storage was often sequential in these tapes, these tape drives that we had.
12345
09:22:18,720 --> 09:22:24,720
Tapes and tape drives were the scalable part of storage because you could just make more tapes and you could rack them up.
12346
09:22:24,720 --> 09:22:28,720
And so that was our way of greatly increasing the storage of the computer.
12347
09:22:28,720 --> 09:22:30,720
The problem they had was that they were sequential.
12348
09:22:30,720 --> 09:22:33,720
You read it, it advances, read it, advance, read and advance.
12349
09:22:33,720 --> 09:22:39,720
Now, interestingly, we've been writing programs that do this, that everything we've written so far pretty much reads the whole file,
12350
09:22:39,720 --> 09:22:42,720
reads the whole web page, reads this, everything we read it.
12351
09:22:42,720 --> 09:22:44,720
We either read in a loop or read the whole thing.
12352
09:22:44,720 --> 09:22:46,720
And that's because we have plenty of memory.
12353
09:22:46,720 --> 09:22:49,720
But we're still reading sequentially.
12354
09:22:49,720 --> 09:22:57,720
And so the way you would do this when you didn't have enough spinning storage or online storage is you'd use offline storage.
12355
09:22:57,720 --> 09:22:59,720
But the trick would be that you would sort it.
12356
09:22:59,720 --> 09:23:04,720
So let's imagine that you're a bank and you have a bunch of accounts, only a few of which are active on any day.
12357
09:23:04,720 --> 09:23:14,720
And you have a tape that has, in account number order from low to high, the prior balance, last night's balance of every one of your bank accounts.
12358
09:23:14,720 --> 09:23:21,720
And then you do all the transactions and you record how much money was taken in or out for each account number.
12359
09:23:21,720 --> 09:23:22,720
And then you sort those transactions.
12360
09:23:22,720 --> 09:23:26,720
And then what you do is what we call the sequential master update.
12361
09:23:26,720 --> 09:23:31,720
And that is, you would write a program that would read the first transaction and hold on to it.
12362
09:23:31,720 --> 09:23:34,720
Say, okay, this is account 45.
12363
09:23:34,720 --> 09:23:36,720
Then it would read the first account, like one.
12364
09:23:36,720 --> 09:23:37,720
And it would copy one.
12365
09:23:37,720 --> 09:23:41,720
And then it would read two and read like seven, eight, 42, 43.
12366
09:23:41,720 --> 09:23:43,720
Then it would read like 44.
12367
09:23:43,720 --> 09:23:49,720
And then it would read 45, but now it would change that and write the new 45 and read the next thing.
12368
09:23:49,720 --> 09:23:50,720
And so this might be 60.
12369
09:23:50,720 --> 09:23:53,720
And it would read a bunch of stuff and copy a bunch of stuff.
12370
09:23:53,720 --> 09:23:57,720
And then it would finally get to 60 and it would merge the add or subtract.
12371
09:23:57,720 --> 09:23:59,720
And so the old balance ended up here.
12372
09:23:59,720 --> 09:24:01,720
And the new balance ended up here.
12373
09:24:01,720 --> 09:24:03,720
And you had to only make one pass through the data.
12374
09:24:03,720 --> 09:24:05,720
So it was super efficient.
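The sequential master update can be sketched in Python with lists standing in for the tapes. The function name, the sample account numbers, and the amounts are all made up for illustration, and the sketch assumes every transaction refers to an account that exists on the master tape.

```python
def sequential_master_update(master, transactions):
    """Merge sorted transactions into a sorted master in one pass.

    master:       (account, old_balance) pairs, sorted by account
    transactions: (account, amount) pairs, sorted by account
    Yields (account, new_balance) pairs in account order.
    Assumes every transaction's account appears in master.
    """
    tx_iter = iter(transactions)
    pending = next(tx_iter, None)      # read one transaction and hold on to it

    for account, balance in master:    # read the old tape front to back
        # Apply every held transaction that belongs to this account.
        while pending is not None and pending[0] == account:
            balance += pending[1]
            pending = next(tx_iter, None)
        yield account, balance         # write to the new tape


old_tape = [(1, 100), (2, 250), (44, 10), (45, 500), (60, 75)]
todays_transactions = [(45, -100), (60, 25), (60, -50)]

new_tape = list(sequential_master_update(old_tape, todays_transactions))
print(new_tape)
# [(1, 100), (2, 250), (44, 10), (45, 400), (60, 50)]
```

Accounts with no transactions are simply copied, and each tape is read only once, which is why both inputs have to be sorted first.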
12375
09:24:05,720 --> 09:24:07,720
So we had all these mechanisms to sort.
12376
09:24:07,720 --> 09:24:10,720
We used to do punch cards and have sorters and all these things.
12377
09:24:10,720 --> 09:24:14,720
And then these things would run for hours.
12378
09:24:14,720 --> 09:24:19,720
And if you watch old TV shows, these tapes are spinning and these things are running back and forth.
12379
09:24:19,720 --> 09:24:22,720
These are simply reading and writing tapes.
12380
09:24:22,720 --> 09:24:29,720
And that's how we did a lot of data processing because we could store far more on a tape drive than we could on a disk.
12381
09:24:29,720 --> 09:24:35,720
And with racks of tape drives, we could scale the storage that our computers had.
12382
09:24:35,720 --> 09:24:37,720
And so that's the way we did data processing.
12383
09:24:37,720 --> 09:24:41,720
But it meant that the only way you knew what the old balance was
12384
09:24:41,720 --> 09:24:45,720
was it was the balance as of this morning before your bank started.
12385
09:24:45,720 --> 09:24:47,720
You don't know what the balance was for the day.
12386
09:24:47,720 --> 09:24:54,720
And that led to things like you can never withdraw more than $100 a day or something like that
12387
09:24:54,720 --> 09:24:56,720
because you don't know what the old balance was.
12388
09:24:56,720 --> 09:24:59,720
Or you might go withdraw $100 at a couple of different branches.
12389
09:24:59,720 --> 09:25:04,720
And so they weren't able to look your stuff up right away.
12390
09:25:04,720 --> 09:25:09,720
Now, it didn't take long until the disk drives got better and better and better.
12391
09:25:09,720 --> 09:25:15,720
And you could store the entire accounts, all the accounts and their current balances, on computers.
12392
09:25:15,720 --> 09:25:21,720
And then the problem becomes what happens if sort of in the middle of the afternoon you want to update a balance?
12393
09:25:21,720 --> 09:25:25,720
Well, do you want to read all your data and then write a brand new one?
12394
09:25:25,720 --> 09:25:27,720
And say that takes like 10 minutes.
12395
09:25:27,720 --> 09:25:32,720
That means for that 10 minutes, only one person can be updating their bank balance.
12396
09:25:32,720 --> 09:25:37,720
And so because we could randomly access this data, we didn't have to read it all sequentially.
12397
09:25:37,720 --> 09:25:40,720
The trick was is how do you spread the data out?
12398
09:25:40,720 --> 09:25:43,720
And then how do you make it so you can change a balance?
12399
09:25:43,720 --> 09:25:45,720
This is, of course, second nature today.
12400
09:25:45,720 --> 09:25:49,720
But how do you make it so you change the balance here without changing the balance there?
12401
09:25:49,720 --> 09:25:52,720
And you can have multiple people going simultaneously to these things.
12402
09:25:52,720 --> 09:25:57,720
And make sure that you can't say withdraw money at two different locations simultaneously
12403
09:25:57,720 --> 09:26:00,720
and somehow have your bank balance get corrupted by that.
12404
09:26:00,720 --> 09:26:02,720
So there's a lot of debate on how to do that.
12405
09:26:02,720 --> 09:26:05,720
And in early days, we just did sequential master update.
12406
09:26:05,720 --> 09:26:12,720
But increasingly, we wanted to make better use of the random nature of our computers and our storage.
12407
09:26:12,720 --> 09:26:15,720
And so that's what led to databases.
12408
09:26:15,720 --> 09:26:24,720
Databases are the science of how you make use of rotating random access data, permanent data,
12409
09:26:24,720 --> 09:26:29,720
in a way that allows you to read, modify, and update that simultaneously from many different locations.
12410
09:26:29,720 --> 09:26:32,720
And yet keep the data completely consistent.
12411
09:26:32,720 --> 09:26:36,720
And so this led to a study of a thing called relational databases.
12412
09:26:36,720 --> 09:26:41,720
And relational databases are not the only databases that happened.
12413
09:26:41,720 --> 09:26:43,720
We had many other kinds of databases.
12414
09:26:43,720 --> 09:26:44,720
And there was a debate.
12415
09:26:44,720 --> 09:26:49,720
And I remember in the 70s and the 80s, there were folks that said, oh, no, no, no.
12416
09:26:49,720 --> 09:26:50,720
You can do index sequential.
12417
09:26:50,720 --> 09:26:51,720
That's the way to do it.
12418
09:26:51,720 --> 09:26:58,720
And relational databases weren't all that popular the first time that I saw them.
12419
09:26:58,720 --> 09:27:01,720
I didn't like relational databases.
12420
09:27:01,720 --> 09:27:06,720
Relational databases had an inherent advantage because they were based on some really powerful mathematics.
12421
09:27:06,720 --> 09:27:11,720
And the interesting thing is, early on, the relational databases were slower.
12422
09:27:11,720 --> 09:27:17,720
But eventually, they figured out how to sort of bring all the cleverness to bear to make relational databases fast.
12423
09:27:17,720 --> 09:27:21,720
And so relational databases are a pretty advanced technology.
12424
09:27:21,720 --> 09:27:24,720
And there are companies like Oracle that are very, very wealthy.
12425
09:27:24,720 --> 09:27:29,720
And their primary product for many, many years was nothing more than a clever database product,
12426
09:27:29,720 --> 09:27:32,720
a clever piece of software that was really good at solving this problem.
12427
09:27:32,720 --> 09:27:36,720
And that's how important this problem was to computing.
12428
09:27:36,720 --> 09:27:39,720
If you read about databases, you're going to see two sets of terminology.
12429
09:27:39,720 --> 09:27:46,720
One set of terminology comes from the mathematical background and has to do with the underlying math,
12430
09:27:46,720 --> 09:27:49,720
things like relations, tuples, and attributes.
12431
09:27:49,720 --> 09:27:54,720
That's kind of like the fancy math version of it.
12432
09:27:54,720 --> 09:27:58,720
And programmers kind of think of them as rows and columns inside of a table.
12433
09:27:58,720 --> 09:28:03,720
And so if you look at sort of fancy theory, you'll see words that look like this.
12434
09:28:03,720 --> 09:28:05,720
And they're just full of this and the connection.
12435
09:28:05,720 --> 09:28:07,720
Now, all this is important and true.
12436
09:28:07,720 --> 09:28:13,720
And if you really want to get good, you sort of begin to understand the nature that we model data at connections
12437
09:28:13,720 --> 09:28:19,720
or at sort of intersection points, rather than just modeling data as a flat file the way we do.
12438
09:28:19,720 --> 09:28:26,720
But for now, we're going to, as programmers, think of this as just like, oh, it's like a super fast spreadsheet.
12439
09:28:26,720 --> 09:28:28,720
The super fast part is the math.
12440
09:28:28,720 --> 09:28:30,720
For us, the rows, columns, and tables are spreadsheets.
12441
09:28:30,720 --> 09:28:34,720
So think in a spreadsheet of sheets, sheet, sheet, sheet.
12442
09:28:34,720 --> 09:28:39,720
And that's like a table, a named thing like tracks or albums, artists or genres.
12443
09:28:39,720 --> 09:28:43,720
And then there is rows, and each row has a different kind of data.
12444
09:28:43,720 --> 09:28:44,720
And then there's columns.
12445
09:28:44,720 --> 09:28:49,720
And we sort of specialize the first column in many spreadsheets to say what's in there.
12446
09:28:49,720 --> 09:28:50,720
This is not really the data.
12447
09:28:50,720 --> 09:28:52,720
This is like metadata.
12448
09:28:52,720 --> 09:28:54,720
It's like the titles in this first column.
12449
09:28:54,720 --> 09:28:56,720
That's not really the data, and the data starts here.
12450
09:28:56,720 --> 09:29:02,720
And we have different kinds of data like strings and numbers, et cetera, et cetera, for each of the rows.
12451
09:29:02,720 --> 09:29:08,720
And literally, you can get away with this as sort of about 80% of databases.
12452
09:29:08,720 --> 09:29:10,720
It's just a really super cool spreadsheet.
12453
09:29:10,720 --> 09:29:15,720
But under the covers, it is far more powerful than that.
12454
09:29:15,720 --> 09:29:21,720
So one of the early arguments that happened was, again, what the programming model for this was.
12455
09:29:21,720 --> 09:29:27,720
And a lot of folks wanted a programming model that reflected how the data was actually stored.
12456
09:29:27,720 --> 09:29:34,720
The notion of structured query language came about in a way to express what you wanted to happen
12457
09:29:34,720 --> 09:29:37,720
and allow that to be sort of a very abstract expression.
12458
09:29:37,720 --> 09:29:40,720
Select all records that meet this criteria.
12459
09:29:40,720 --> 09:29:43,720
Not read, read, read, read, read, read.
12460
09:29:43,720 --> 09:29:47,720
And so structured query language is not a procedural language.
12461
09:29:47,720 --> 09:29:52,720
It is a declarative language where you're simply saying what you want.
12462
09:29:52,720 --> 09:29:54,720
And then somebody writes the loop.
12463
09:29:54,720 --> 09:29:58,720
The database actually does the loop, but it's a way for you to avoid actually writing the loop.
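A small sketch of the contrast described here, using Python's built-in sqlite3 module. The Tracks table and its rows are made up for illustration and are not from the lecture. The first half writes the loop by hand; the second half only states the condition and lets SQLite do the looping.

```python
import sqlite3

# Illustrative data only; this table and these rows are assumptions.
rows = [("Hey Jude", 431), ("Yesterday", 125), ("Let It Be", 243)]

# Procedural style: we write the loop and the test ourselves.
long_tracks_loop = []
for title, seconds in rows:
    if seconds > 200:
        long_tracks_loop.append(title)

# Declarative style: we say what we want and the database does the loop.
conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
cur = conn.cursor()
cur.execute("CREATE TABLE Tracks (title TEXT, seconds INTEGER)")
cur.executemany("INSERT INTO Tracks (title, seconds) VALUES (?, ?)", rows)
cur.execute("SELECT title FROM Tracks WHERE seconds > 200 ORDER BY title")
long_tracks_sql = [row[0] for row in cur.fetchall()]
conn.close()

print(sorted(long_tracks_loop))
print(long_tracks_sql)
```

Both approaches give the same titles; the difference is who writes the loop.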
12464
09:29:58,720 --> 09:30:00,720
Now, that turns out to be the power of databases.
12465
09:30:00,720 --> 09:30:05,720
Because the cleverness in how to write the loop is a way that you would probably never figure out
12466
09:30:05,720 --> 09:30:09,720
how to be most supremely optimal when it comes to writing the loop.
12467
09:30:09,720 --> 09:30:13,720
As you'll see toward the end of joining many tables together and selecting and throwing away
12468
09:30:13,720 --> 09:30:15,720
and getting down a count or whatever.
12469
09:30:15,720 --> 09:30:18,720
Someone has figured out how to do that really, really well.
12470
09:30:18,720 --> 09:30:22,720
So the idea was, is you would express, you know, we're going to create some data,
12471
09:30:22,720 --> 09:30:25,720
we're going to retrieve some data, we're going to insert and delete it.
12472
09:30:25,720 --> 09:30:27,720
Create, read, crud.
12473
09:30:27,720 --> 09:30:30,720
C-R-U-D.
12474
09:30:30,720 --> 09:30:33,720
Create, read, update, and delete, crud.
12475
09:30:33,720 --> 09:30:34,720
And so that's what this does.
12476
09:30:34,720 --> 09:30:37,720
It's a language that does this very simply.
12477
09:30:37,720 --> 09:30:43,720
Now, the applications that we're going to use this for are more of a data analysis application.
12478
09:30:43,720 --> 09:30:46,720
We've been doing data analysis through the whole course.
12479
09:30:46,720 --> 09:30:51,720
And the kinds of things that we'll see in the remaining chapters is we'll take some raw data file.
12480
09:30:51,720 --> 09:30:53,720
These might actually come across the network.
12481
09:30:53,720 --> 09:31:00,720
And we'll write some Python programs to play with that data, parse it, clean it up, make sense of it, you know.
12482
09:31:00,720 --> 09:31:02,720
And then write it into a database.
12483
09:31:02,720 --> 09:31:07,720
And this might be a slow process, this might be really nasty, and this might be a way to have very clean data.
12484
09:31:07,720 --> 09:31:13,720
And then we'll write another Python program to sort of read this, read through it, and it's all efficient and pretty.
12485
09:31:13,720 --> 09:31:22,720
And then we can produce files and maybe we'll visualize it or do further analysis in R, Excel, or a JavaScript visualization framework.
12486
09:31:22,720 --> 09:31:30,720
And so in this situation, you will be the person who is both sort of writing the programs, database administrator,
12487
09:31:30,720 --> 09:31:35,720
and you can, using SQLite Browser, play and look at the database kind of in a raw way.
12488
09:31:35,720 --> 09:31:40,720
And the first part of this, we are mostly going to be using SQLite Browser just to talk straight to a database.
12489
09:31:40,720 --> 09:31:46,720
Later, we'll write Python programs that read and write data and visualize the data.
12490
09:31:46,720 --> 09:31:48,720
So this is what we're going to do first.
12491
09:31:48,720 --> 09:31:51,720
And then second, we're going to do this part right here.
12492
09:31:51,720 --> 09:31:53,720
That's the second thing we're going to do.
12493
09:31:53,720 --> 09:31:59,720
Now, another really common use of applications and something that if you continue learning more about programming,
12494
09:31:59,720 --> 09:32:08,720
is that you will want to write an online application like Amazon or a company or Twitter
12495
09:32:08,720 --> 09:32:12,720
that's got a website and it stores dynamic data in databases.
12496
09:32:12,720 --> 09:32:17,720
And so the picture for that is similar but different than the picture we're going to start out with.
12497
09:32:17,720 --> 09:32:23,720
And so the way this usually works is that you, the end user, uses a web browser, talks to the application,
12498
09:32:23,720 --> 09:32:27,720
and the developer writes the application software.
12499
09:32:27,720 --> 09:32:31,720
And that application software stores its data in a database.
12500
09:32:31,720 --> 09:32:35,720
And inside that database, we talk to the database using SQL.
12501
09:32:35,720 --> 09:32:38,720
And all the data is actually stored here and the magic happens.
12502
09:32:38,720 --> 09:32:42,720
The data server is that database software that's so precious and valuable.
12503
09:32:42,720 --> 09:32:48,720
And then there's another person, often called the database administrator, who has direct access to the data.
12504
09:32:48,720 --> 09:33:01,720
And these roles in medium and large projects are kept separate mostly because the production,
12505
09:33:01,720 --> 09:33:07,720
while it's running and live, the developer leaves the data alone and works on, say, the next version of the software.
12506
09:33:07,720 --> 09:33:15,720
And then the developer has a test version of the application that they run on their computer where they're doing all that stuff.
12507
09:33:15,720 --> 09:33:23,720
And so this database administrator is a role in a large project where we have to run production and keep production careful,
12508
09:33:23,720 --> 09:33:26,720
keep production in good shape.
12509
09:33:26,720 --> 09:33:30,720
So the database administrator has this responsibility for the production aspects of the data.
12510
09:33:30,720 --> 09:33:34,720
And you may be working in a situation where you're not actually controlling the data.
12511
09:33:34,720 --> 09:33:36,720
The database server is on different computers.
12512
09:33:36,720 --> 09:33:40,720
You have a little special access and you write programs to sort of read the data.
12513
09:33:40,720 --> 09:33:47,720
And so the database administrator is the person who is asked by the organization to administer that data.
12514
09:33:47,720 --> 09:33:54,720
The data that we develop, and we'll do this in the second part of these lectures, conforms to a data model.
12515
09:33:54,720 --> 09:33:55,720
That's the metadata.
12516
09:33:55,720 --> 09:33:56,720
Is this an integer?
12517
09:33:56,720 --> 09:33:57,720
Is this a string?
12518
09:33:57,720 --> 09:33:59,720
You know, how many columns is this?
12519
09:33:59,720 --> 09:34:02,720
And the data model turns out to be very, very important.
12520
09:34:02,720 --> 09:34:06,720
And there's a lot of science to building an effective data model that leads to really good performance.
12521
09:34:06,720 --> 09:34:14,720
And it's a collaborative activity between the application developers and the database administrator to make it so it's efficient,
12522
09:34:14,720 --> 09:34:17,720
runs in production, et cetera, et cetera, et cetera.
12523
09:34:17,720 --> 09:34:21,720
There's a lot of products out there that you may encounter.
12524
09:34:21,720 --> 09:34:22,720
We're going to be using SQLite.
12525
09:34:22,720 --> 09:34:27,720
SQLite's a little tiny database server, and it's built into so many things, and that's why we like it.
12526
09:34:27,720 --> 09:34:34,720
But if you're going to work at a large organization, you can easily run into Oracle, which is the number one commercial product.
12527
09:34:34,720 --> 09:34:40,720
Microsoft has a thing called SQL Server, which is a commercial product, and it's also very popular and very effective.
12528
09:34:40,720 --> 09:34:45,720
The more popular open source, there's things called Postgres.
12529
09:34:45,720 --> 09:34:46,720
There's MySQL.
12530
09:34:46,720 --> 09:34:49,720
And MySQL recently was sort of bought by Oracle.
12531
09:34:49,720 --> 09:34:56,720
And there is a copy of that called MariaDB that doesn't belong to Oracle, MariaDB.
12532
09:34:56,720 --> 09:35:06,720
And so most of the SQL that we're going to learn is common across these database systems because SQL is a standard.
12533
09:35:06,720 --> 09:35:11,720
But then there are parts that weren't part of the original standard where each database vendor has done things a little bit different.
12534
09:35:11,720 --> 09:35:18,720
But there is a core common subset that does the basic create, read, update, and delete operations.
12535
09:35:18,720 --> 09:35:21,720
So SQLite is very popular.
12536
09:35:21,720 --> 09:35:24,720
You probably have it in your cell phone 10, 12 times.
12537
09:35:24,720 --> 09:35:26,720
Your web browser has a database engine in it.
12538
09:35:26,720 --> 09:35:29,720
Your car has a few databases in it.
12539
09:35:29,720 --> 09:35:33,720
And so SQLite is what's called an embedded database system.
12540
09:35:33,720 --> 09:35:35,720
Python comes built in with it.
12541
09:35:35,720 --> 09:35:39,720
You just import sqlite3 and away you go.
12542
09:35:39,720 --> 09:35:49,720
And so it's very, very popular because it's free, it's open source, and it's such a tiny little piece of software that you just include it in other pieces of software
12543
09:35:49,720 --> 09:35:53,720
and use it to solve the data management problems of those pieces of software.
12544
09:35:53,720 --> 09:35:56,720
Like your browser might use SQLite to store your bookmarks.
12545
09:35:56,720 --> 09:35:59,720
Now you think, oh, there's only how many bookmarks can you have.
12546
09:35:59,720 --> 09:36:01,720
But what if you need it to be fast?
12547
09:36:01,720 --> 09:36:03,720
And what if there's like people that have 10,000 bookmarks?
12548
09:36:03,720 --> 09:36:04,720
There probably are.
12549
09:36:04,720 --> 09:36:05,720
Do you still want it fast?
12550
09:36:05,720 --> 09:36:06,720
Do you want to be able to search?
12551
09:36:06,720 --> 09:36:11,720
And so you get all that by using a database like SQLite.
12552
09:36:11,720 --> 09:36:18,720
And so again, we're going to encourage you to download the SQLite browser so you can follow along with what we're going to do coming up next.
12553
09:36:18,720 --> 09:36:20,720
And so here is the SQLite browser.
12554
09:36:20,720 --> 09:36:22,720
Here's what it looks like.
12555
09:36:22,720 --> 09:36:24,720
And it's just a desktop application.
12556
09:36:24,720 --> 09:36:33,720
And coming up next, we'll start playing with this desktop application and see how it works.
12557
09:36:33,720 --> 09:36:34,720
So now we're going to make a database.
12558
09:36:34,720 --> 09:36:36,720
We're going to use SQLite browser.
12559
09:36:36,720 --> 09:36:39,720
Hopefully you've downloaded it so you can follow along.
12560
09:36:39,720 --> 09:36:44,720
And I've got this handout, this basic database handout that saves you from having to type all these things.
12561
09:36:44,720 --> 09:36:48,720
So bring that up in your web browser.
12562
09:36:48,720 --> 09:36:51,720
And so that gives you all of the commands that I'm going to type now.
12563
09:36:51,720 --> 09:37:03,720
And so you could pull them out of the, either the web page or the, you can pull them out of the slides or you can pull them out of that, out of that.
12564
09:37:03,720 --> 09:37:07,720
So I'm going to bring up the database browser here.
12565
09:37:07,720 --> 09:37:09,720
Database browser.
12566
09:37:09,720 --> 09:37:12,720
Now the thing that's going to happen, you'll see this happen on my desktop.
12567
09:37:12,720 --> 09:37:15,720
I'm going to make a new database and you have to store it somewhere.
12568
09:37:15,720 --> 09:37:24,720
And so I'm going to put it on my desktop and I'm going to call it py4efund.
12569
09:37:24,720 --> 09:37:29,720
And so we should see a new file on my desktop right there, py4efund.
12570
09:37:29,720 --> 09:37:34,720
Now that's a file that you don't want to edit with a text editor or anything like that.
12571
09:37:34,720 --> 09:37:42,720
This is a database that you're, this is a file that's to be read by SQLite browser and nothing else.
12572
09:37:42,720 --> 09:37:51,720
Okay, so we're going to create a table and I'm going to make a table called users with a column called name that's a text and a column called email.
12573
09:37:51,720 --> 09:37:54,720
So I'm going to, it's already asking me to make a table.
12574
09:37:54,720 --> 09:38:02,720
I'm going to call this users and I'm going to add a field that is called name and I'm going to add a text.
12575
09:38:02,720 --> 09:38:07,720
And I'm going to add another field called email and I'm going to make that be text.
12576
09:38:07,720 --> 09:38:15,720
Now the key thing here is we are in effect making columns and rendering an opinion as to exactly what the column is supposed to be used for.
12577
09:38:15,720 --> 09:38:17,720
And we're not allowed to violate that.
12578
09:38:17,720 --> 09:38:26,720
It's not like, oh, we'll do whatever you want because the database is optimizing its storage based on our contract that we're effectively making the contract ourselves.
12579
09:38:26,720 --> 09:38:33,720
We could make these columns anything we wanted, but we're just going to, we have to, we're going to contract with ourselves.
12580
09:38:33,720 --> 09:38:34,720
And you can see it's kind of small here.
12581
09:38:34,720 --> 09:38:41,720
You can see there's a create table and that's on the slide and that's the, the, the SQL way of doing that.
12582
09:38:41,720 --> 09:38:44,720
This user interface is just helping us write SQL.
12583
09:38:44,720 --> 09:38:46,720
So now I'm going to just say, okay.
12584
09:38:46,720 --> 09:38:58,720
And if you take a look, you can see that I now have a table users and I can look at my database structure and the table users and away we go.
12585
09:38:58,720 --> 09:39:08,720
And so, so now that's, that is creating it. And like I said, here in the slides is the create statement or on the web page, there's the create statement that could have done it.
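The same table can be created from Python rather than through the SQLite browser dialog. This is only a sketch: it uses an in-memory database instead of a file on the desktop, and the capitalised table name Users is an assumption. The two follow-up queries just confirm what was created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory instead of a file on disk
cur = conn.cursor()

# A table called Users with two text columns, name and email.
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# sqlite_master is SQLite's built-in catalog of tables.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = [row[0] for row in cur.fetchall()]

# PRAGMA table_info lists each column; fields 1 and 2 are its name and declared type.
cur.execute("PRAGMA table_info(Users)")
columns = [(row[1], row[2]) for row in cur.fetchall()]
conn.close()

print(tables)
print(columns)
```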
12586
09:39:08,720 --> 09:39:14,720
Now we can insert some data.
12587
09:39:14,720 --> 09:39:28,720
Let's add a new record to this database users and we'll call this guy name Charles, csev@umich.edu.
12588
09:39:28,720 --> 09:39:32,720
So now we have a record. So it's kind of like a database spreadsheet.
12589
09:39:32,720 --> 09:39:43,720
Now that's not the SQL way to do it. There's SQL sort of going on in the background, but if we really want to do this using SQL, we're going to use the insert statement.
12590
09:39:43,720 --> 09:39:51,720
And the insert statement looks like this.
12591
09:39:51,720 --> 09:39:58,720
The SQL syntax sometimes has extra words. INSERT INTO is actually an SQL keyword.
12592
09:39:58,720 --> 09:40:08,720
The name of the table, the columns, and then the word VALUES, and then a one-to-one correspondence between the columns and the values in parentheses.
12593
09:40:08,720 --> 09:40:13,720
So it looks kind of like a tuple in Python, but we're nowhere near Python right now.
12594
09:40:13,720 --> 09:40:20,720
Okay, and so that's what we're going to do. And so I'm going to grab this.
12595
09:40:20,720 --> 09:40:28,720
Kristen. And I'm going to go over here to my SQLite browser and say execute SQL.
12596
09:40:28,720 --> 09:40:37,720
So now I can say paste that in and then hit this little run button and that's going to submit the SQL to SQLite and then update that file.
12597
09:40:37,720 --> 09:40:41,720
And it says query executed successfully and away we go.
12598
09:40:41,720 --> 09:40:46,720
So if I go back now and I look at the data, I see that there's two things in here.
12599
09:40:46,720 --> 09:40:51,720
And now I can actually insert all the rest of these. Let's go back to my little bit of stuff here.
12600
09:40:51,720 --> 09:41:04,720
Let's put all these other rows in. It turns out that if I go into the execute SQL and I want to do more than one command at a time,
12601
09:41:04,720 --> 09:41:10,720
I can put a semicolon at the end of each one of these things and then I can run them all at the same time.
12602
09:41:10,720 --> 09:41:13,720
I mean, one after another actually is what's going on here.
12603
09:41:13,720 --> 09:41:18,720
So boom, boom, boom, and I take a look at the data and look, I've got all those things in there.
12604
09:41:18,720 --> 09:41:23,720
Now, eventually the thing that's going to generate that SQL is a program, not us.
12605
09:41:23,720 --> 09:41:27,720
This is we're being the database administrator, so we're sort of doing things manually.
12606
09:41:27,720 --> 09:41:32,720
Once things get going, you write programs, do that insert over and over and over again in Python
12607
09:41:32,720 --> 09:41:35,720
or a web language like PHP or something like that.
12608
09:41:35,720 --> 09:41:39,720
And so that is the insert.
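Here is a sketch of the insert step from Python. The rows are illustrative: the Kristen address is a placeholder, not the one in the handout. A single statement goes through execute(); several statements separated by semicolons can go through executescript(), which runs them one after another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# One statement: table name, column list, VALUES, then one value per column.
cur.execute("INSERT INTO Users (name, email) VALUES ('Charles', 'csev@umich.edu')")

# Several statements, each ended by a semicolon, run one after another.
cur.executescript("""
INSERT INTO Users (name, email) VALUES ('Ted', 'ted@umich.edu');
INSERT INTO Users (name, email) VALUES ('Kristen', 'kristen@example.com');
""")

cur.execute("SELECT COUNT(*) FROM Users")
user_count = cur.fetchone()[0]
conn.close()

print(user_count)
```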
12609
09:41:39,720 --> 09:41:42,720
Now, we can get rid of data.
12610
09:41:42,720 --> 09:41:45,720
And so I'm going to say delete from, that's the key word.
12611
09:41:45,720 --> 09:41:48,720
Users is the name of a table. WHERE is a where clause.
12612
09:41:48,720 --> 09:41:52,720
We'll have lots of where clauses in SQL, which is, it's not like an if.
12613
09:41:52,720 --> 09:41:58,720
In effect, the delete is going towards the whole table and being turned on and off by this where clause.
12614
09:41:58,720 --> 09:42:02,720
So delete from users, if you didn't put the where clause on, will actually delete all the rows.
12615
09:42:02,720 --> 09:42:11,720
But where email equals ted@umich.edu, well, that one is going to make it so it only applies to the rows where that is true.
12616
09:42:11,720 --> 09:42:18,720
So I'm going to go over here in SQL and I'm going to say delete from users where email equals ted@umich.edu
12617
09:42:18,720 --> 09:42:20,720
and then I'm going to run it because it's only one.
12618
09:42:20,720 --> 09:42:22,720
I don't need a semicolon at the end of it.
12619
09:42:22,720 --> 09:42:26,720
And now if I go back and I look at the data, ted is gone.
12620
09:42:26,720 --> 09:42:29,720
Okay.
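A sketch of the delete step in Python, again with an in-memory database and illustrative rows. The WHERE clause limits the delete to matching rows; without it, every row in the table would go.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [("Charles", "csev@umich.edu"), ("Ted", "ted@umich.edu")],
)

# Only rows where the condition is true are removed.
cur.execute("DELETE FROM Users WHERE email = 'ted@umich.edu'")
deleted = cur.rowcount  # number of rows the statement removed

cur.execute("SELECT name FROM Users")
remaining = [row[0] for row in cur.fetchall()]
conn.close()

print(deleted, remaining)
```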
12621
09:42:29,720 --> 09:42:34,720
Update. So the update says,
12622
09:42:34,720 --> 09:42:38,720
UPDATE is a keyword, users is the name of the table, SET is a keyword,
12623
09:42:38,720 --> 09:42:41,720
and then this is column equals new value, and then a where clause.
12624
09:42:41,720 --> 09:42:46,720
Again, this update, if we didn't have a where clause, would change every row in the table.
12625
09:42:46,720 --> 09:42:55,720
And so where email equals csev@umich.edu.
12626
09:42:55,720 --> 09:42:59,720
Oh, I got to change that because I already got the name to be Charles.
12627
09:42:59,720 --> 09:43:00,720
So you see the name is already Charles.
12628
09:43:00,720 --> 09:43:05,720
So I'll just execute here.
12629
09:43:05,720 --> 09:43:07,720
Make this be Chuck. So we see it.
12630
09:43:07,720 --> 09:43:09,720
And then I run it.
12631
09:43:09,720 --> 09:43:12,720
Then you take a look at the data and it's changed.
12632
09:43:12,720 --> 09:43:13,720
That's it.
12633
09:43:13,720 --> 09:43:15,720
That's an update statement.
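The update can be sketched the same way. The rows are illustrative; the statement changes the name to Chuck only on the row whose email matches, and leaves the other row alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [("Charles", "csev@umich.edu"), ("Ted", "ted@umich.edu")],
)

# UPDATE <table> SET <column> = <new value> WHERE <condition>
cur.execute("UPDATE Users SET name = 'Chuck' WHERE email = 'csev@umich.edu'")

cur.execute("SELECT name, email FROM Users ORDER BY email")
users = cur.fetchall()
conn.close()

print(users)
```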
12634
09:43:15,720 --> 09:43:16,720
We're doing, you're doing great.
12635
09:43:16,720 --> 09:43:18,720
You're doing great.
12636
09:43:18,720 --> 09:43:30,720
And so the next thing we're going to do is we're going to take a look at how we retrieve data.
12637
09:43:30,720 --> 09:43:33,720
Now this is the select statement, select star.
12638
09:43:33,720 --> 09:43:38,720
You have a list of columns and star means all columns from is a keyword and then the name of a table.
12639
09:43:38,720 --> 09:43:42,720
So this select star from users is the kind of thing you type all the time.
12640
09:43:42,720 --> 09:43:46,720
As a matter of fact, it's what SQLite browser is doing internally to cause this to happen.
12641
09:43:46,720 --> 09:43:51,720
But we can do it by hand by saying select star from users and then run it.
12642
09:43:51,720 --> 09:43:56,720
And so then we get a little record set that is those four records that are sitting there.
12643
09:43:56,720 --> 09:43:58,720
We can also throw a where clause on the end of it.
12644
09:43:58,720 --> 09:44:05,720
So we say select star from users where email equals csev at umich.edu.
12645
09:44:05,720 --> 09:44:11,720
And that again, the select star from users goes at the whole table and the where clause goes at the whole table
12646
09:44:11,720 --> 09:44:14,720
and then filters out all of the things except one record.
12647
09:44:14,720 --> 09:44:19,720
So the where clause is send it to the table but then filter based on whatever.
12648
09:44:19,720 --> 09:44:22,720
And so it only shows us that.
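A sketch of the two reads just described, with illustrative rows (the Kristen address is a placeholder). The first query returns every row and every column; the second adds a WHERE clause that filters the result down to one record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [
        ("Charles", "csev@umich.edu"),
        ("Ted", "ted@umich.edu"),
        ("Kristen", "kristen@example.com"),
    ],
)

# Star means all columns; with no WHERE clause we get all rows.
cur.execute("SELECT * FROM Users")
all_rows = cur.fetchall()

# The WHERE clause keeps only the rows where the condition is true.
cur.execute("SELECT * FROM Users WHERE email = 'csev@umich.edu'")
one_row = cur.fetchall()
conn.close()

print(len(all_rows))
print(one_row)
```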
12649
09:44:22,720 --> 09:44:26,720
Okay, we're cruising right along here.
12650
09:44:26,720 --> 09:44:30,720
You can also put an order by clause on there.
12651
09:44:30,720 --> 09:44:34,720
So we can say select star from users order by email.
12652
09:44:34,720 --> 09:44:37,720
So that's a column.
12653
09:44:37,720 --> 09:44:40,720
Select star from users order by email.
12654
09:44:40,720 --> 09:44:42,720
And so that orders by email.
12655
09:44:42,720 --> 09:44:48,720
Or we can change it by to name and we can say descending.
12656
09:44:48,720 --> 09:44:52,720
So that's the name and descending order.
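The two orderings can be sketched like this, with illustrative rows (the Kristen address is a placeholder). ORDER BY sorts ascending by default; adding DESC reverses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [
        ("Ted", "ted@umich.edu"),
        ("Charles", "csev@umich.edu"),
        ("Kristen", "kristen@example.com"),
    ],
)

# Ascending by the email column.
cur.execute("SELECT email FROM Users ORDER BY email")
by_email = [row[0] for row in cur.fetchall()]

# Descending by the name column.
cur.execute("SELECT name FROM Users ORDER BY name DESC")
by_name_desc = [row[0] for row in cur.fetchall()]
conn.close()

print(by_email)
print(by_name_desc)
```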
12657
09:44:52,720 --> 09:44:59,720
Sorting and selecting are good things that databases are really good at.
12658
09:44:59,720 --> 09:45:02,720
So this is the summary of what I've told you.
12659
09:45:02,720 --> 09:45:05,720
So the databases do create, read, update and delete crud.
12660
09:45:05,720 --> 09:45:10,720
And we've done all those things except we did create, delete, update, read.
12661
09:45:10,720 --> 09:45:11,720
That's what we did.
12662
09:45:11,720 --> 09:45:13,720
And that's the summary of SQL.
12663
09:45:13,720 --> 09:45:19,720
And so you might be saying why did I take so long to learn such a simple and elegant and beautiful language
12664
09:45:19,720 --> 09:45:21,720
because it's not really exciting.
12665
09:45:21,720 --> 09:45:27,720
It's an extremely simple language that's very predictable and you're like that's pretty easy.
12666
09:45:27,720 --> 09:45:34,720
And it turns out that some of you may have been using SQL in situations maybe with Microsoft Access or something.
12667
09:45:34,720 --> 09:45:37,720
Or actually type in this stuff and you just kind of typed it
12668
09:45:37,720 --> 09:45:40,720
and you never realized that you were learning a programming language.
12669
09:45:40,720 --> 09:45:44,720
That's why I like SQL and that's a very declarative language and it's very straightforward.
12670
09:45:44,720 --> 09:45:49,720
It's much easier to learn SQL than it is to learn Python.
12671
09:45:49,720 --> 09:45:53,720
Because in Python you have to figure out how loops work and how iteration variables work
12672
09:45:53,720 --> 09:45:55,720
and you'll notice there's none of that.
12673
09:45:55,720 --> 09:45:59,720
But the key is we've only started to understand the power.
12674
09:45:59,720 --> 09:46:06,720
That's the simple ability to move around and update data and read data randomly using these simple sets of commands.
12675
09:46:06,720 --> 09:46:14,720
But up next we're going to look at how you do this with data models and relationships and really multiple tables.
12676
09:46:19,720 --> 09:46:21,720
Hello and welcome to a code walkthrough.
12677
09:46:21,720 --> 09:46:25,720
In this bit of code we're talking about the emaildb.py.
12678
09:46:25,720 --> 09:46:34,720
This is a beautiful little example in that it sort of reduces talking to the database to kind of its pure essence.
12679
09:46:34,720 --> 09:46:40,720
And so we'll start out this code and we import sqlite3 just to get the library there.
12680
09:46:40,720 --> 09:46:46,720
We make a connection and in databases we sort of end up with an open that's two steps.
12681
09:46:46,720 --> 09:46:53,720
There's the connection to the database which checks access to the file and the cursor is kind of like our handle.
12682
09:46:53,720 --> 09:47:00,720
It's not as simple as you just open it and read it but you open it and then you send SQL commands through the cursor
12683
09:47:00,720 --> 09:47:03,720
and then you get your responses through that same cursor.
12684
09:47:03,720 --> 09:47:07,720
So cur here is the variable that we're interested in.
12685
09:47:07,720 --> 09:47:13,720
And the first thing that we're going to do is we're going to, we've got this file.
12686
09:47:13,720 --> 09:47:17,720
It will either create this file and right now this file doesn't exist.
12687
09:47:17,720 --> 09:47:21,720
It's going to be in the same directory.
12688
09:47:21,720 --> 09:47:25,720
There's no emaildb.
12689
09:47:25,720 --> 09:47:28,720
So this is actually going to create the file when it runs.
12690
09:47:28,720 --> 09:47:33,720
And then the first thing we're going to do is drop the table if it exists. Drop table is a bit of SQL.
12691
09:47:33,720 --> 09:47:37,720
The if exists just keeps this from blowing up if we start it with a fresh database.
12692
09:47:37,720 --> 09:47:41,720
And in this case there is no file there so we are starting with a fresh database.
12693
09:47:41,720 --> 09:47:45,720
So this will accomplish absolutely nothing which is just fine.
12694
09:47:45,720 --> 09:47:47,720
Now we're using triple quotes here.
12695
09:47:47,720 --> 09:47:50,720
I'm just kind of using that to make this a little bit easier to read.
12696
09:47:50,720 --> 09:47:53,720
I probably could pull those lines up a bit.
12697
09:47:53,720 --> 09:47:57,720
This one's actually small enough that I could, maybe I'll just do that.
12698
09:47:57,720 --> 09:48:03,720
Let's do that. Let's bring that baby right up and turn this into a single quote.
12699
09:48:03,720 --> 09:48:06,720
That's short enough.
12700
09:48:06,720 --> 09:48:10,720
But triple quote is just, this one here is a little longer so I'll use triple quote.
12701
09:48:10,720 --> 09:48:13,720
So we're going to drop table. That's going to do nothing first time through.
12702
09:48:13,720 --> 09:48:15,720
Then we're going to do a create table.
12703
09:48:15,720 --> 09:48:18,720
Now sometimes your application will have like a read me or something.
12704
09:48:18,720 --> 09:48:21,720
It says go run these commands to set the database up.
12705
09:48:21,720 --> 09:48:25,720
But we're able to just set this database up in this particular application.
12706
09:48:25,720 --> 09:48:30,720
We'll see later ones where we're going to leave the database and not start it fresh.
12707
09:48:30,720 --> 09:48:33,720
And in this one we can do the same.
12708
09:48:33,720 --> 09:48:38,720
But in this one we could but we're just going to start fresh by dropping the table.
12709
09:48:38,720 --> 09:48:40,720
So we'll create it.
12710
09:48:40,720 --> 09:48:44,720
We're going to have an email and a count.
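The setup described so far can be sketched as below. This is a reconstruction from the walkthrough, not a copy of emaildb.py, and it connects to an in-memory database rather than a file so nothing is left on disk. The helper name reset_counts_table is hypothetical. Calling it twice shows that DROP TABLE IF EXISTS does nothing on a fresh database and wipes the table on a later run.

```python
import sqlite3

def reset_counts_table(cur):
    """Start fresh: drop Counts if it is there, then create it again."""
    cur.execute("DROP TABLE IF EXISTS Counts")
    # Triple quotes only make longer SQL easier to read.
    cur.execute("""
        CREATE TABLE Counts (email TEXT, count INTEGER)
    """)

# Opening a database is two steps: a connection, then a cursor.
conn = sqlite3.connect(":memory:")  # assumption: the lecture uses a file instead
cur = conn.cursor()

reset_counts_table(cur)  # fresh database: the DROP accomplishes nothing
cur.execute("INSERT INTO Counts (email, count) VALUES ('csev@umich.edu', 1)")
reset_counts_table(cur)  # second run: the old table and its row are discarded

cur.execute("SELECT COUNT(*) FROM Counts")
row_count = cur.fetchone()[0]
conn.close()

print(row_count)
```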
12711
09:48:44,720 --> 09:48:49,720
Basically what we're doing here is we're really going to pretend that this is a dictionary.
12712
09:48:49,720 --> 09:48:53,720
If you recall when I said dictionary, a dictionary is like an in-memory database.
12713
09:48:53,720 --> 09:48:56,720
Well, now we're using a database to do a database.
12714
09:48:56,720 --> 09:49:00,720
But the first thing we're going to do here is pretend it's a dictionary.
12715
09:49:00,720 --> 09:49:01,720
So that's a little crazy.
12716
09:49:01,720 --> 09:49:04,720
So these next lines of code hopefully are pretty familiar to you, right?
12717
09:49:04,720 --> 09:49:12,720
Get a file name, loop through it, check to see if it's, you know, grab mbox-short by default
12718
09:49:12,720 --> 09:49:15,720
so we can press the enter key and then loop through it, right?
12719
09:49:15,720 --> 09:49:21,720
And so this little part right here, this is our basic loop that we're doing.
12720
09:49:21,720 --> 09:49:25,720
And so, you know, that is pretty normal.
12721
09:49:25,720 --> 09:49:31,720
And if we look at this line right here, that line right there is the line that is,
12722
09:49:31,720 --> 09:49:36,720
that line right there makes sure that we can only get the from lines.
12723
09:49:36,720 --> 09:49:38,720
We've done that a bunch of times and we're going to split it.
12724
09:49:38,720 --> 09:49:42,720
We're not going to strip the right because the split's going to take care of that.
12725
09:49:42,720 --> 09:49:48,720
And then we're going to grab the email address, which of course in the from line is the second part.
12726
09:49:48,720 --> 09:49:52,720
And then we will have that.
12727
09:49:52,720 --> 09:49:53,720
So now we're going to do some database.
12728
09:49:53,720 --> 09:50:01,720
So the first thing we're going to do, this bit right here is kind of like the dictionary part.
12729
09:50:01,720 --> 09:50:05,720
So the first thing that we're going to do is we're going to select count from our database,
12730
09:50:05,720 --> 09:50:08,720
that is an integer, where email equals.
12731
09:50:08,720 --> 09:50:12,720
And this part right here bears some explaining.
12732
09:50:12,720 --> 09:50:15,720
This is going to be csev@umich.edu or whatever.
12733
09:50:15,720 --> 09:50:24,720
Now, it is dangerous to put those strings, especially from user-entered data, into your SQL.
12734
09:50:24,720 --> 09:50:25,720
You technically could.
12735
09:50:25,720 --> 09:50:30,720
I could make this be email equals csev@umich.edu.
12736
09:50:30,720 --> 09:50:31,720
I'd have to escape the quotes and stuff.
12737
09:50:31,720 --> 09:50:34,720
But this question mark is a placeholder.
12738
09:50:34,720 --> 09:50:39,720
And this is a way to basically make sure that we don't allow SQL injection.
12739
09:50:39,720 --> 09:50:43,720
Go Google SQL injection to get a sense of what that is.
12740
09:50:43,720 --> 09:50:48,720
It's more of an issue in online applications.
12741
09:50:48,720 --> 09:50:53,720
But in this application, we're just being good.
12742
09:50:53,720 --> 09:50:58,720
So the way this works is this is a placeholder in this SQL that will ultimately be replaced by this.
12743
09:50:58,720 --> 09:51:00,720
Now, you could have several question marks.
12744
09:51:00,720 --> 09:51:02,720
We only have one in here.
12745
09:51:02,720 --> 09:51:04,720
And so you give a tuple.
12746
09:51:04,720 --> 09:51:07,720
And if we just put email, it won't turn into a tuple.
12747
09:51:07,720 --> 09:51:09,720
This is a one tuple, basically.
12748
09:51:09,720 --> 09:51:14,720
This little weird parenthesis, email, comma, parenthesis.
12749
09:51:14,720 --> 09:51:16,720
That is a tuple with only one thing in it.
12750
09:51:16,720 --> 09:51:19,720
And that's just the weird Python syntax.
12751
09:51:19,720 --> 09:51:21,720
It's rare that I apologize for Python syntax.
12752
09:51:21,720 --> 09:51:24,720
But that's a little bit less than pretty.
12753
09:51:24,720 --> 09:51:25,720
But it's OK.
12754
09:51:25,720 --> 09:51:26,720
It's a tuple.
12755
09:51:26,720 --> 09:51:31,720
And normally, if there were two of these, then there would be email, name, dot, dot, dot, dot.
12756
09:51:31,720 --> 09:51:33,720
OK?
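A sketch of the placeholder and the one-element tuple, using an in-memory Counts table with one illustrative row. The hostile string is a made-up example of input that would change the query if it were pasted into the SQL text; passed through the question mark, it is treated as plain data and matches nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")
cur.execute("INSERT INTO Counts (email, count) VALUES (?, ?)", ("csev@umich.edu", 4))

email = "csev@umich.edu"
not_a_tuple = (email)   # parentheses alone do nothing: this is still a string
one_tuple = (email,)    # the trailing comma is what makes a one-element tuple

# One question mark, so the second argument is a tuple with one value.
cur.execute("SELECT count FROM Counts WHERE email = ?", one_tuple)
safe_result = cur.fetchall()

# Hypothetical hostile input: as a bound value it is only ever compared as text.
hostile = "nobody' OR '1'='1"
cur.execute("SELECT count FROM Counts WHERE email = ?", (hostile,))
hostile_result = cur.fetchall()
conn.close()

print(type(not_a_tuple).__name__, type(one_tuple).__name__)
print(safe_result, hostile_result)
```

With two placeholders the tuple would simply carry two values, one per question mark, in order.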
12757
09:51:33,720 --> 09:51:38,720
So this cur.execute is actually not really retrieving the data.
12758
09:51:38,720 --> 09:51:44,720
In a way, it's looking at the SQL and making sure that maybe it might verify that the table name is right
12759
09:51:44,720 --> 09:51:46,720
or if there's any syntax errors, et cetera, et cetera.
12760
09:51:46,720 --> 09:51:49,720
So this actually is not really reading the data.
12761
09:51:49,720 --> 09:51:52,720
But we have prepared this cursor.
12762
09:51:52,720 --> 09:51:55,720
This is kind of like the opening of a file.
12763
09:51:55,720 --> 09:51:57,720
But what we're opening is a record set.
12764
09:51:57,720 --> 09:52:03,720
We're opening a set of records that are going to be this wherever it's true.
12765
09:52:03,720 --> 09:52:06,720
So it's like we're going to read this like a file.
12766
09:52:06,720 --> 09:52:08,720
Now, later things will loop through this.
12767
09:52:08,720 --> 09:52:10,720
But we're only going to say, hey, grab that first one.
12768
09:52:10,720 --> 09:52:13,720
We could have even put maybe a limit clause on there or something.
12769
09:52:13,720 --> 09:52:16,720
Grab the first one and give it back in row.
12770
09:52:16,720 --> 09:52:25,720
And so row is going to be the information that we get from the database.
12771
09:52:25,720 --> 09:52:31,720
And so if there are no records that meet this, then row is going to be None.
12772
09:52:31,720 --> 09:52:34,720
So here's kind of, again, like the get.
12773
09:52:34,720 --> 09:52:38,720
Here's like the get, where if the row wasn't there,
12774
09:52:38,720 --> 09:52:43,720
because the way we're doing this is we're going to end up with this row in the database.
12775
09:52:43,720 --> 09:52:45,720
Here is this database.
12776
09:52:45,720 --> 09:52:46,720
And there's going to be two columns.
12777
09:52:46,720 --> 09:52:47,720
And there's a bunch of rows.
12778
09:52:47,720 --> 09:52:55,720
And then here's going to be csev 4 and gen 3 and steven 6, right?
12779
09:52:55,720 --> 09:52:56,720
So these are the counts.
12780
09:52:56,720 --> 09:53:02,720
And so we're grabbing this variable out if it's csev that we're grabbing.
12781
09:53:02,720 --> 09:53:03,720
And that's going to come into here, right?
12782
09:53:03,720 --> 09:53:05,720
That's going to show up in here.
12783
09:53:05,720 --> 09:53:13,720
And that row is actually, it turns out that the row is a list,
12784
09:53:13,720 --> 09:53:15,720
but we're only getting one thing.
12785
09:53:15,720 --> 09:53:18,720
And what we really are doing is if we searched through and we got through
12786
09:53:18,720 --> 09:53:22,720
and there was nothing, then row is None means that there was none
12787
09:53:22,720 --> 09:53:27,720
and we're seeing, like, gen for the first time and we have to insert it.
12788
09:53:27,720 --> 09:53:31,720
So if row is None, we're going to run an insert statement.
12789
09:53:31,720 --> 09:53:34,720
Insert into counts, email count.
12790
09:53:34,720 --> 09:53:37,720
Now we've got to set it to one because it's the first time we've seen it.
12791
09:53:37,720 --> 09:53:39,720
So values, and then again the question mark.
12792
09:53:39,720 --> 09:53:43,720
The question mark basically says, hey, I'm going to have a value in this tuple
12793
09:53:43,720 --> 09:53:45,720
and there's an ordering to the tuple.
12794
09:53:45,720 --> 09:53:49,720
And so there's only one question here, one question mark placeholder here
12795
09:53:49,720 --> 09:53:50,720
and then one is the initial count.
12796
09:53:50,720 --> 09:53:54,720
So email, question mark, count, one, away we go.
12797
09:53:54,720 --> 09:53:59,720
And so then we have, again, we have a tuple that gives to this execute statement
12798
09:53:59,720 --> 09:54:02,720
just like in that execute statement, the corresponding sort of strings
12799
09:54:02,720 --> 09:54:06,720
or integers that are to be placed by each of the questions.
12800
09:54:06,720 --> 09:54:09,720
So when this runs, there's going to be a new record
12801
09:54:09,720 --> 09:54:13,720
and there's going to be a one that's put in there into that new record.
12802
09:54:13,720 --> 09:54:16,720
If on the other hand we pull back a row that exists,
12803
09:54:16,720 --> 09:54:18,720
we're going to get this four number.
12804
09:54:18,720 --> 09:54:21,720
And you might think we want to take this four number and add to it,
12805
09:54:21,720 --> 09:54:25,720
but in databases it's always better to do an update
12806
09:54:25,720 --> 09:54:29,720
because there might be multiple applications
12807
09:54:29,720 --> 09:54:31,720
that are talking to this database at the same time.
12808
09:54:31,720 --> 09:54:36,720
So no matter what update does is in a single atomic operation
12809
09:54:36,720 --> 09:54:39,720
it turns whatever this number is into one higher
12810
09:54:39,720 --> 09:54:43,720
and we don't have to worry about other pieces of code potentially modifying.
12811
09:54:43,720 --> 09:54:45,720
Now in this case we don't have to worry about that
12812
09:54:45,720 --> 09:54:47,720
because we're the only piece of code,
12813
09:54:47,720 --> 09:54:52,720
but using update to increment something is way better than reading the value
12814
09:54:52,720 --> 09:54:56,720
and then adding one inside of Python
12815
09:54:56,720 --> 09:54:59,720
and then updating the new value which is that's two SQL statements
12816
09:54:59,720 --> 09:55:03,720
but it's also not atomic.
12817
09:55:03,720 --> 09:55:09,720
So if the row is none, if the row exists we just know that it exists
12818
09:55:09,720 --> 09:55:11,720
and we just want to add one to the number.
12819
09:55:11,720 --> 09:55:14,720
We do have the number sitting here in the row variable
12820
09:55:14,720 --> 09:55:16,720
but we don't need it.
12821
09:55:16,720 --> 09:55:21,720
And so we're going to say update counts set count equals count plus one
12822
09:55:21,720 --> 09:55:24,720
column name where email equals and then another placeholder
12823
09:55:24,720 --> 09:55:27,720
and then another tuple for the question mark.
12824
09:55:27,720 --> 09:55:30,720
And so that's what this little bit of code does.
12825
09:55:30,720 --> 09:55:34,720
That is kind of the read it, parse it, check to see if it's there,
12826
09:55:34,720 --> 09:55:37,720
if it's not, insert it, if it is, update it.
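The check, insert, or update pattern just described might look roughly like this. It is a sketch and not the course's emaildb.py: it uses an in-memory database, made-up addresses, and a helper name (count_email) chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database, just for the sketch
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")

def count_email(cur, email):
    """Insert email with a count of 1, or bump its count if it already exists."""
    cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
    row = cur.fetchone()             # first matching row, or None if no match
    if row is None:
        cur.execute("INSERT INTO Counts (email, count) VALUES (?, 1)", (email,))
    else:
        # Let the database do the increment in one statement
        cur.execute("UPDATE Counts SET count = count + 1 WHERE email = ?", (email,))

for email in ["csev@example.com", "gen@example.com", "csev@example.com"]:
    count_email(cur, email)
conn.commit()

counts = dict(cur.execute("SELECT email, count FROM Counts"))
print(counts)
```

The value in row is never used for the increment; it only tells us whether the record is already there.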
12827
09:55:37,720 --> 09:55:41,720
And so then we see this conn.commit.
12828
09:55:41,720 --> 09:55:46,720
And this conn.commit, basically the way it works is that the database
12829
09:55:46,720 --> 09:55:50,720
is efficiently keeping some of the information in memory
12830
09:55:50,720 --> 09:55:53,720
and at some point it has to write all that stuff out to disk.
12831
09:55:53,720 --> 09:55:56,720
So you can choose at times where you put this commit.
12832
09:55:56,720 --> 09:55:59,720
Right now we're going to commit every time through this loop
12833
09:55:59,720 --> 09:56:01,720
but you might commit every tenth time through the loop
12834
09:56:01,720 --> 09:56:05,720
because the commit will take some time because it forces everything
12835
09:56:05,720 --> 09:56:08,720
to be written to disk and these can run really fast
12836
09:56:08,720 --> 09:56:10,720
and the commit is the slowest part here.
12837
09:56:10,720 --> 09:56:13,720
So sometimes we do things like commit every tenth record
12838
09:56:13,720 --> 09:56:15,720
or every hundredth record.
12839
09:56:15,720 --> 09:56:18,720
If it's an online system which is not what this is,
12840
09:56:18,720 --> 09:56:21,720
you have to commit at the end of every sort of screenping.
12841
09:56:21,720 --> 09:56:24,720
But for this kind of a system because we're putting so much in,
12842
09:56:24,720 --> 09:56:26,720
this is kind of a bulk insert,
12843
09:56:26,720 --> 09:56:28,720
we might come up with a thing where we,
12844
09:56:28,720 --> 09:56:31,720
you know, every tenth time we do a commit.
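Committing every tenth record could be sketched like this. The batch size, the generated addresses, and the commits counter are assumptions for the example; the counter is only there to make the batching visible.

```python
import sqlite3

COMMIT_EVERY = 10                    # assumed batch size

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")

emails = [f"user{i % 7}@example.com" for i in range(35)]   # made-up input
commits = 0

for n, email in enumerate(emails, start=1):
    cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
    if cur.fetchone() is None:
        cur.execute("INSERT INTO Counts (email, count) VALUES (?, 1)", (email,))
    else:
        cur.execute("UPDATE Counts SET count = count + 1 WHERE email = ?", (email,))
    if n % COMMIT_EVERY == 0:        # only force a write on every tenth record
        conn.commit()
        commits += 1

conn.commit()                        # final commit picks up the leftover records
commits += 1

total = cur.execute("SELECT SUM(count) FROM Counts").fetchone()[0]
print(commits, total)
```

The final commit after the loop matters: without it, the records after the last multiple of ten would not be committed.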
12845
09:56:31,720 --> 09:56:34,720
But ultimately what this will do when this is running
12846
09:56:34,720 --> 09:56:37,720
is it will build up slowly but surely adding new records
12847
09:56:37,720 --> 09:56:40,720
and then one one and then it will build two and a three
12848
09:56:40,720 --> 09:56:42,720
and all these things and add another one, that will be one.
12849
09:56:42,720 --> 09:56:43,720
It will do this thing, right?
12850
09:56:43,720 --> 09:56:47,720
And then at the end of the day that is what's going to be in the database.
12851
09:56:47,720 --> 09:56:55,720
Now, so now we're, so let's take a look what's in the database
12852
09:56:55,720 --> 09:56:57,720
and now we can actually read the database.
12853
09:56:57,720 --> 09:57:01,720
And so in the database we're going to run a select
12854
09:57:01,720 --> 09:57:04,720
and we're going to say we're going to select the email and count
12855
09:57:04,720 --> 09:57:06,720
from counts, order by count, descending.
12856
09:57:06,720 --> 09:57:08,720
So look at that, isn't that cool?
12857
09:57:08,720 --> 09:57:11,720
We're getting the top ten because databases are good at sorting
12858
09:57:11,720 --> 09:57:13,720
and they're good at all these other things.
12859
09:57:13,720 --> 09:57:15,720
So we're going to then execute this
12860
09:57:15,720 --> 09:57:19,720
and then we're going to ask for the rows one at a time
12861
09:57:19,720 --> 09:57:23,720
and the rows are going to be a tuple
12862
09:57:23,720 --> 09:57:26,720
and row sub zero will be email and row sub one will be count.
12863
09:57:26,720 --> 09:57:29,720
So we run all this stuff and then we close the connection
12864
09:57:29,720 --> 09:57:31,720
and away we go, okay?
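Reading the rows back in sorted order might look like this. It is a sketch with sample rows loaded by hand; the LIMIT 10 is an assumption based on the "top ten" remark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")
cur.executemany(
    "INSERT INTO Counts (email, count) VALUES (?, ?)",
    [("csev@example.com", 4), ("gen@example.com", 3), ("steven@example.com", 6)],
)

top = []
sql = "SELECT email, count FROM Counts ORDER BY count DESC LIMIT 10"
for row in cur.execute(sql):         # each row comes back as a tuple
    top.append((row[0], row[1]))     # row[0] is email, row[1] is count
    print(row[0], row[1])

cur.close()
conn.close()
```

The sorting happens inside the database, so Python only has to loop over rows that are already in order.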
12865
09:57:31,720 --> 09:57:34,720
So let's go ahead and run this.
12866
09:57:34,720 --> 09:57:37,720
Let's go ahead and run all this stuff.
12867
09:57:37,720 --> 09:57:42,720
python3 emaildb.py.
12868
09:57:42,720 --> 09:57:49,720
It asks for a file name, mbox-short.
12869
09:57:49,720 --> 09:57:50,720
I can hit enter, right?
12870
09:57:50,720 --> 09:57:53,720
mbox-short and that's it and it looks just like that
12871
09:57:53,720 --> 09:57:55,720
and it counts it and away we go.
12872
09:57:55,720 --> 09:57:59,720
Now the difference is at this point we have a file,
12873
09:57:59,720 --> 09:58:06,720
emaildb.sqlite and we can run the SQLite browser
12874
09:58:06,720 --> 09:58:12,720
and we can then open this database
12875
09:58:12,720 --> 09:58:13,720
and we can see what's in there.
12876
09:58:13,720 --> 09:58:14,720
So here we go.
12877
09:58:14,720 --> 09:58:16,720
It has made an SQLite database.
12878
09:58:16,720 --> 09:58:18,720
We have a table of counts
12879
09:58:18,720 --> 09:58:21,720
and then we can take a look at the data and there we go.
12880
09:58:21,720 --> 09:58:25,720
We've got the data and we can do this.
12881
09:58:25,720 --> 09:58:27,720
And so let me close this.
12882
09:58:27,720 --> 09:58:33,720
It's important at times when you don't want necessarily to have,
12883
09:58:33,720 --> 09:58:35,720
well let's see if we can cause it to lock up.
12884
09:58:35,720 --> 09:58:39,720
Let me run this again and it's going to drop this table.
12885
09:58:39,720 --> 09:58:44,720
So I'm going to run the code again
12886
09:58:44,720 --> 09:58:51,720
but this time I am going to do the full one, mbox.txt.
12887
09:58:51,720 --> 09:58:55,720
Now we'll see what happens here but it ran
12888
09:58:55,720 --> 09:58:58,720
and now so what we have to do then to see this, this data
12889
09:58:58,720 --> 09:59:01,720
is from the previous run but if we want the most recent one
12890
09:59:01,720 --> 09:59:03,720
we hit refresh and then away we go
12891
09:59:03,720 --> 09:59:05,720
and so we can see this stuff.
12892
09:59:05,720 --> 09:59:09,720
And so this is just a real simple start
12893
09:59:09,720 --> 09:59:12,720
to see how you can connect some of the stuff that we've been doing
12894
09:59:12,720 --> 09:59:14,720
but store the data in a database.
12895
09:59:14,720 --> 09:59:16,720
But the nice thing about the database
12896
09:59:16,720 --> 09:59:20,720
is that it can store this stuff from run to run.
12897
09:59:20,720 --> 09:59:23,720
Even though in this case we're dropping the table every time
12898
09:59:23,720 --> 09:59:26,720
in later things we will see how we can store data from run to run
12899
09:59:26,720 --> 09:59:29,720
to give ourselves more restartable processes.
12900
09:59:29,720 --> 09:59:35,720
Cheers.
12901
09:59:35,720 --> 09:59:37,720
We're going to do some code walkthrough
12902
09:59:37,720 --> 09:59:39,720
and if you want to follow through with the code
12903
09:59:39,720 --> 09:59:44,720
you can download the sample code from Python for Everybody.
12904
09:59:44,720 --> 09:59:48,720
And so the code that we're going to play with is the Twitter Spider code
12905
09:59:48,720 --> 09:59:54,720
that is both talking to the Twitter API and talking to the database.
12906
09:59:54,720 --> 09:59:58,720
And so what we're going to be doing is we're going to run code
12907
09:59:58,720 --> 10:00:01,720
that's going to hit the Twitter API much like we did in a previous chapter
12908
10:00:01,720 --> 10:00:04,720
and we're going to retrieve the data but we're going to remember the data
12909
10:00:04,720 --> 10:00:07,720
so we don't have to retrieve it again.
12910
10:00:07,720 --> 10:00:10,720
And so we're going to keep track of people's friends
12911
10:00:10,720 --> 10:00:14,720
and what we're doing here is sort of illicitly pulling down
12912
10:00:14,720 --> 10:00:17,720
slowly but surely based subject to our rate limit
12913
10:00:17,720 --> 10:00:20,720
we're pulling down who our friends are.
12914
10:00:20,720 --> 10:00:22,720
And so let's take a look.
12915
10:00:22,720 --> 10:00:25,720
We're going to use urllib and urllib.error, and twurl,
12916
10:00:25,720 --> 10:00:30,720
which was code that augments my URL to do all the OAuth calculation.
12917
10:00:30,720 --> 10:00:32,720
We're going to get JSON data back.
12918
10:00:32,720 --> 10:00:34,720
We're going to make a database and we have to import ssl
12919
10:00:34,720 --> 10:00:39,720
because of the way Python doesn't trust any certificates
12920
10:00:39,720 --> 10:00:41,720
no matter how good they are.
12921
10:00:41,720 --> 10:00:44,720
So this is our URL to talk to the Twitter API.
12922
10:00:44,720 --> 10:00:47,720
We're going to make a database and again the way SQLite works
12923
10:00:47,720 --> 10:00:52,720
is if this spider.sqlite doesn't exist, it creates it.
12924
10:00:52,720 --> 10:00:57,720
And we get ourself a cursor and we're going to do a create table.
12925
10:00:57,720 --> 10:01:01,720
This IF NOT EXISTS is not in every SQL, but SQLite 3 does this.
12926
10:01:01,720 --> 10:01:03,720
Create table if it doesn't exist.
12927
10:01:03,720 --> 10:01:07,720
We want to start this over and over unlike the tracks example
12928
10:01:07,720 --> 10:01:10,720
I want to start this over and over and not lose data.
12929
10:01:10,720 --> 10:01:13,720
And this is a spidering process and we'll see a lot of these
12930
10:01:13,720 --> 10:01:17,720
where we want a restartable process where we use a database.
12931
10:01:17,720 --> 10:01:21,720
So if we're starting with nothing and there's no file of spider.sqlite
12932
10:01:21,720 --> 10:01:24,720
it creates this table and it's the name of the person,
12933
10:01:24,720 --> 10:01:27,720
whether we retrieved it or not and how many friends this person has
12934
10:01:27,720 --> 10:01:29,720
that we know of in our database.
12935
10:01:29,720 --> 10:01:33,720
Now this little bit is to deal with the SSL certificate errors.
12936
10:01:33,720 --> 10:01:36,720
The certificates are totally fine but Python doesn't trust any certificates
12937
10:01:36,720 --> 10:01:40,720
by default which is frustrating but whatever.
12938
10:01:40,720 --> 10:01:41,720
So here we're going to have a loop.
12939
10:01:41,720 --> 10:01:43,720
We're going to ask for a Twitter account.
12940
10:01:43,720 --> 10:01:45,720
We have to type quit to quit.
12941
10:01:45,720 --> 10:01:49,720
If we hit enter in this case we're going to actually read from the database
12942
10:01:49,720 --> 10:01:54,720
an unretrieved Twitter person and then grab all that person's friends.
12943
10:01:54,720 --> 10:02:02,720
And so then we're going to do a fetch one, get one
12944
10:02:02,720 --> 10:02:06,720
and that's going to get the name of the first person, the sub zero.
12945
10:02:06,720 --> 10:02:10,720
If we had more things than name here, sub zero is the first of those.
12946
10:02:10,720 --> 10:02:13,720
Fetch one means get one row from the database
12947
10:02:13,720 --> 10:02:16,720
and sub zero means the first column of that first row.
12948
10:02:16,720 --> 10:02:20,720
And if this fails then we've retrieved all the Twitter accounts.
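The fetchone and sub zero step could be sketched as follows. This is an illustration, not the twspider.py source: the account names are made up, the helper name next_unretrieved is hypothetical, and the column layout is assumed from the description above. It catches TypeError specifically, which is what indexing None raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS Twitter "
    "(name TEXT, retrieved INTEGER, friends INTEGER)"
)
cur.executemany(
    "INSERT INTO Twitter (name, retrieved, friends) VALUES (?, ?, ?)",
    [("alice", 1, 2), ("bob", 0, 1)],   # made-up accounts
)

def next_unretrieved(cur):
    """Return the name of one unretrieved account, or None if there are none."""
    cur.execute("SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1")
    try:
        return cur.fetchone()[0]     # first column of the first row
    except TypeError:
        # fetchone() returned None, so there was nothing left to index
        return None

first = next_unretrieved(cur)
cur.execute("UPDATE Twitter SET retrieved = 1 WHERE name = ?", (first,))
second = next_unretrieved(cur)
print(first, second)
```

Once every account has retrieved set to 1, the lookup fails and that is the signal that there is nothing left to spider.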
12949
10:02:20,720 --> 10:02:26,720
And so we're going to augment this Twitter URL using this makes
12950
10:02:26,720 --> 10:02:29,720
you can look at the twurl.py code.
12951
10:02:29,720 --> 10:02:34,720
This basically requires the hidden.py file
12952
10:02:34,720 --> 10:02:37,720
which has your keys and secrets in it.
12953
10:02:37,720 --> 10:02:39,720
You've got to get hidden.py updated.
12954
10:02:39,720 --> 10:02:41,720
I've got it updated but I'm not going to show you
12955
10:02:41,720 --> 10:02:43,720
because it has my keys and secrets in it.
12956
10:02:43,720 --> 10:02:45,720
And so we're only going to take the first five
12957
10:02:45,720 --> 10:02:48,720
which means we're probably not going to find friends of friends of friends.
12958
10:02:48,720 --> 10:02:50,720
It's only the five most recent ones.
12959
10:02:50,720 --> 10:02:54,720
We could run this with a much higher number to get to the
12960
10:02:54,720 --> 10:02:55,720
so we have more than one friend.
12961
10:02:55,720 --> 10:02:58,720
We'll show the URL while we retrieve it.
12962
10:02:58,720 --> 10:03:00,720
We will do our urlopen.
12963
10:03:00,720 --> 10:03:04,720
We'll do a read and then we'll do a decode to make sure that this UTF
12964
10:03:04,720 --> 10:03:07,720
this will give us data in UTF-8 and then decode
12965
10:03:07,720 --> 10:03:11,720
will give us data in Unicode which is what we need inside of Python.
12966
10:03:11,720 --> 10:03:16,720
We will ask for the headers from the connection.
12967
10:03:16,720 --> 10:03:19,720
We'll say give me the headers, give me a dictionary of the headers
12968
10:03:19,720 --> 10:03:23,720
and the x rate limiting header from the Twitter API
12969
10:03:23,720 --> 10:03:27,720
tells us when we're going to be told we can't use this API anymore
12970
10:03:27,720 --> 10:03:29,720
because this is one of those things.
12971
10:03:29,720 --> 10:03:34,720
And then we're going to parse and load the data that we got from Twitter
12972
10:03:34,720 --> 10:03:39,720
and get a, I think it's a list.
12973
10:03:39,720 --> 10:03:41,720
Yeah, it's a list.
12974
10:03:41,720 --> 10:03:45,720
And then we could dump this if you want and yours you can undo that.
12975
10:03:45,720 --> 10:03:49,720
And then what we're going to do is we've just retrieved
12976
10:03:49,720 --> 10:03:52,720
this person's screen name and their friends.
12977
10:03:52,720 --> 10:03:56,720
And so the first thing we want to do is update the database
12978
10:03:56,720 --> 10:03:58,720
and change the retrieved from zero to one.
12979
10:03:58,720 --> 10:04:01,720
And that's because we want, we're going to use this to know about unretrieved.
12980
10:04:01,720 --> 10:04:05,720
So retrieved being one means we've already retrieved it
12981
10:04:05,720 --> 10:04:08,720
and we did retrieve it so for that account we've retrieved it.
12982
10:04:08,720 --> 10:04:10,720
And then what we're going to do is we're going to parse that.
12983
10:04:10,720 --> 10:04:13,720
And so this is similar to the Twitter code we did previously
12984
10:04:13,720 --> 10:04:15,720
in the web services chapter.
12985
10:04:15,720 --> 10:04:16,720
We're going to go through all the users.
12986
10:04:16,720 --> 10:04:17,720
We're going to find their screen name.
12987
10:04:17,720 --> 10:04:20,720
We're going to print the screen name out.
12988
10:04:20,720 --> 10:04:29,720
And then what we're going to do is see if, let's see.
12989
10:04:29,720 --> 10:04:34,720
So we're going through all the users who are the friends of this person
12990
10:04:34,720 --> 10:04:38,720
and we're going to say, oh okay, let's select the friends from Twitter
12991
10:04:38,720 --> 10:04:42,720
where the name is the friend person.
12992
10:04:42,720 --> 10:04:48,720
And what we're going to do is we're going to,
12993
10:04:48,720 --> 10:04:52,720
we're going to do a cur.fetchone of this Twitter,
12994
10:04:52,720 --> 10:04:57,720
the name of the friends, this is the friend screen name, right?
12995
10:04:57,720 --> 10:05:01,720
So we're going to say, oh okay, if we get this,
12996
10:05:01,720 --> 10:05:03,720
we're going to get that friend screen name
12997
10:05:03,720 --> 10:05:07,720
and we're going to get how many friends this particular screen name has.
12998
10:05:07,720 --> 10:05:12,720
If we find a URL, we find it in there,
12999
10:05:12,720 --> 10:05:15,720
we're going to do an update statement and add one to their friend count,
13000
10:05:15,720 --> 10:05:18,720
how many friends they have, and then keep track.
13001
10:05:18,720 --> 10:05:20,720
This count here is not in the database.
13002
10:05:20,720 --> 10:05:22,720
It's just so I can print it out at the end.
13003
10:05:22,720 --> 10:05:26,720
If there is no record for this particular friend,
13004
10:05:26,720 --> 10:05:30,720
we're going to insert them into it new and we're going to say,
13005
10:05:30,720 --> 10:05:33,720
here's the new person that we just saw.
13006
10:05:33,720 --> 10:05:35,720
Here, that's their name.
13007
10:05:35,720 --> 10:05:37,720
We're going to set retrieved to zero
13008
10:05:37,720 --> 10:05:41,720
and we're going to say that they have one friend, okay?
13009
10:05:41,720 --> 10:05:44,720
And then we're going to commit the transaction
13010
10:05:44,720 --> 10:05:47,720
and then we're going to close this at the end, okay?
13011
10:05:47,720 --> 10:05:49,720
So let's go ahead and run this.
13012
10:05:49,720 --> 10:05:52,720
The first time it's going to create an empty database.
13013
10:05:52,720 --> 10:05:56,720
So I'm going to say python3 twspider.
13014
10:05:56,720 --> 10:06:02,720
So ls *.sqlite, nothing there.
13015
10:06:02,720 --> 10:06:06,720
Python3, oops, that's because I removed it.
13016
10:06:06,720 --> 10:06:11,720
Python3 twspider.py.
13017
10:06:11,720 --> 10:06:17,720
Okay, so I'm going to start with a Twitter account, drchuck.
13018
10:06:17,720 --> 10:06:20,720
And so it's doing its retrieval and don't worry,
13019
10:06:20,720 --> 10:06:24,720
showing the token and the signature is not dangerous
13020
10:06:24,720 --> 10:06:26,720
because you don't have the keys or the token,
13021
10:06:26,720 --> 10:06:28,720
I mean the secrets and the token secrets.
13022
10:06:28,720 --> 10:06:29,720
So don't get all too worried.
13023
10:06:29,720 --> 10:06:33,720
So I have 11 calls left, so I got to hope this all works.
13024
10:06:33,720 --> 10:06:35,720
One of my friends is Stephanie Teasley
13025
10:06:35,720 --> 10:06:37,720
and I do these are in reverse order.
13026
10:06:37,720 --> 10:06:45,720
So let's grab Stephanie and ask for Stephanie's friends.
13027
10:06:45,720 --> 10:06:47,720
So now we just retrieve Stephanie's friends
13028
10:06:47,720 --> 10:06:51,720
and here are Stephanie's most recent friends.
13029
10:06:51,720 --> 10:06:54,720
And I can just hit enter and it will randomly pick.
13030
10:06:54,720 --> 10:06:57,720
Let's see if I can in the database.
13031
10:06:57,720 --> 10:07:00,720
Let's open this up, file open database.
13032
10:07:00,720 --> 10:07:01,720
Hope I don't lock myself.
13033
10:07:01,720 --> 10:07:05,720
Sometimes it's a little scary when you look at the database
13034
10:07:05,720 --> 10:07:07,720
and you're just checking.
13035
10:07:07,720 --> 10:07:10,720
So this is what my database looks like.
13036
10:07:10,720 --> 10:07:16,720
We retrieve Stephanie and she has, this is how many people.
13037
10:07:16,720 --> 10:07:20,720
So these are the friends of Stephanie and me
13038
10:07:20,720 --> 10:07:22,720
and these are how many, I'm not in there.
13039
10:07:22,720 --> 10:07:25,720
So we retrieve Stephanie, which was a friend.
13040
10:07:25,720 --> 10:07:28,720
So let's go grab, oh I don't know.
13041
10:07:28,720 --> 10:07:31,720
Let's grab Tim McKay and get that one.
13042
10:07:31,720 --> 10:07:34,720
Remaining 10, I don't have too many of these.
13043
10:07:34,720 --> 10:07:36,720
Tim McKay, right?
13044
10:07:36,720 --> 10:07:38,720
So there we go.
13045
10:07:38,720 --> 10:07:40,720
Remaining nine.
13046
10:07:40,720 --> 10:07:43,720
And so if I do a refresh on this,
13047
10:07:43,720 --> 10:07:45,720
then you see I've got some more folks.
13048
10:07:45,720 --> 10:07:47,720
If I hit enter here, it will retrieve,
13049
10:07:47,720 --> 10:07:51,720
it will pick one randomly based on retrieved being zero.
13050
10:07:51,720 --> 10:07:54,720
So it won't pick Stephanie or Tim because they're one,
13051
10:07:54,720 --> 10:07:56,720
but we have lots of other folks to pick randomly.
13052
10:07:56,720 --> 10:07:58,720
And we'll hit enter.
13053
10:07:58,720 --> 10:08:01,720
So it picked, who did it pick?
13054
10:08:01,720 --> 10:08:06,720
It picked screen name LiveEduTV, which is ironic
13055
10:08:06,720 --> 10:08:09,720
because I'm recording this on LiveEduTV right now.
13056
10:08:09,720 --> 10:08:12,720
And so we can keep hitting refresh and away we go.
13057
10:08:12,720 --> 10:08:16,720
So I'm gonna stop now because I only have eight remaining.
13058
10:08:16,720 --> 10:08:18,720
And so I'm gonna type quit.
13059
10:08:18,720 --> 10:08:22,720
And so we will see how that works.
13060
10:08:22,720 --> 10:08:23,720
So that's how it works.
13061
10:08:23,720 --> 10:08:29,720
Now remember that you've got to edit the hidden.py file
13062
10:08:29,720 --> 10:08:33,720
to make this work because we are talking to the Twitter API.
13063
10:08:33,720 --> 10:08:39,720
If you don't edit that file, it won't work for you.
13064
10:08:39,720 --> 10:08:41,720
Okay, so I hope you find this useful.
13065
10:08:41,720 --> 10:08:46,720
Cheers.
13066
10:08:46,720 --> 10:08:48,720
So now we're gonna take a look at how we deal with
13067
10:08:48,720 --> 10:08:50,720
more than one table, multiple tables.
13068
10:08:50,720 --> 10:08:54,720
Because the real power of SQL and the power of database performance
13069
10:08:54,720 --> 10:08:56,720
has to do with when you start connecting tables together.
13070
10:08:56,720 --> 10:08:58,720
If you go back to that original mathematics,
13071
10:08:58,720 --> 10:09:02,720
it models data at the intersections between the row and the columns.
13072
10:09:02,720 --> 10:09:06,720
And these intersections are the magical bits.
13073
10:09:06,720 --> 10:09:10,720
And so breaking an application to use multiple tables is an art form.
13074
10:09:10,720 --> 10:09:12,720
It takes a while.
13075
10:09:12,720 --> 10:09:15,720
There are some simple basic things that you can learn
13076
10:09:15,720 --> 10:09:17,720
and we'll teach you here.
13077
10:09:17,720 --> 10:09:19,720
And so it's not too hard to learn the basics,
13078
10:09:19,720 --> 10:09:24,720
but then it's much more complex to be super skilled at it.
13079
10:09:24,720 --> 10:09:26,720
And in general, advanced databases, in my mind,
13080
10:09:26,720 --> 10:09:29,720
it's hard to teach advanced databases
13081
10:09:29,720 --> 10:09:33,720
because they're always so contextually grounded.
13082
10:09:33,720 --> 10:09:37,720
You know, something like Twitter or Google,
13083
10:09:37,720 --> 10:09:39,720
the databases are so specialized.
13084
10:09:39,720 --> 10:09:43,720
By the time you make, everyone can do small to medium-sized databases
13085
10:09:43,720 --> 10:09:45,720
using the basic techniques, but at some point,
13086
10:09:45,720 --> 10:09:47,720
once you escape medium-sized databases,
13087
10:09:47,720 --> 10:09:49,720
you end up in these sort of narrow things
13088
10:09:49,720 --> 10:09:52,720
and optimize each database very separately.
13089
10:09:52,720 --> 10:09:54,720
And so I just tell people, you know,
13090
10:09:54,720 --> 10:09:57,720
learn the basics really, really well, write programs,
13091
10:09:57,720 --> 10:10:00,720
and then go do real work.
13092
10:10:00,720 --> 10:10:06,720
But database design is the act of figuring out
13093
10:10:06,720 --> 10:10:09,720
the data that your application is going to want to store
13094
10:10:09,720 --> 10:10:11,720
and spreading that across multiple tables.
13095
10:10:11,720 --> 10:10:12,720
But we don't just do it randomly.
13096
10:10:12,720 --> 10:10:14,720
We do it very much cleverly.
13097
10:10:14,720 --> 10:10:17,720
And if you look at a data model, this is what it looks like.
13098
10:10:17,720 --> 10:10:20,720
And what we're showing here in this data model
13099
10:10:20,720 --> 10:10:24,720
is we are showing five tables,
13100
10:10:24,720 --> 10:10:27,720
and this is kind of a calendar kind of a system,
13101
10:10:27,720 --> 10:10:30,720
and we're seeing the columns that are in each of the tables,
13102
10:10:30,720 --> 10:10:33,720
and then we're seeing the relationships between the tables.
13103
10:10:33,720 --> 10:10:35,720
And even in these relationships,
13104
10:10:35,720 --> 10:10:37,720
there's kind of a little bit of code,
13105
10:10:37,720 --> 10:10:39,720
and when you have an arrow that looks like that,
13106
10:10:39,720 --> 10:10:40,720
there's many of those to one,
13107
10:10:40,720 --> 10:10:43,720
and this is a many-to-one relationship.
13108
10:10:43,720 --> 10:10:44,720
Many-to-one relationship.
13109
10:10:44,720 --> 10:10:46,720
We'll talk all about that stuff.
13110
10:10:46,720 --> 10:10:48,720
But if you go into an organization
13111
10:10:48,720 --> 10:10:51,720
and you have a really large and complex data application,
13112
10:10:51,720 --> 10:10:53,720
they might have something printed out on the wall
13113
10:10:53,720 --> 10:10:54,720
that looks about like this,
13114
10:10:54,720 --> 10:10:57,720
which shows the database tables and connections,
13115
10:10:57,720 --> 10:10:58,720
et cetera, et cetera.
13116
10:10:58,720 --> 10:10:59,720
And they might say,
13117
10:10:59,720 --> 10:11:01,720
oh, your job is to go down in this little corner,
13118
10:11:01,720 --> 10:11:03,720
add one column field there,
13119
10:11:03,720 --> 10:11:05,720
and then do this, and then connect it with this thing over there,
13120
10:11:05,720 --> 10:11:08,720
and then make a screen that shows all these things
13121
10:11:08,720 --> 10:11:09,720
that pulls from this table, this table,
13122
10:11:09,720 --> 10:11:11,720
this table, and that table,
13123
10:11:11,720 --> 10:11:13,720
and that's your job if you're a programmer
13124
10:11:13,720 --> 10:11:16,720
on a large software development project.
13125
10:11:16,720 --> 10:11:19,720
These database models become sort of like
13126
10:11:19,720 --> 10:11:21,720
the core backbone of the knowledge
13127
10:11:21,720 --> 10:11:25,720
that applications are managing and using.
13128
10:11:25,720 --> 10:11:28,720
So the idea is that you take your application,
13129
10:11:28,720 --> 10:11:29,720
we're going to start really simple,
13130
10:11:29,720 --> 10:11:31,720
we're going to take your application,
13131
10:11:31,720 --> 10:11:32,720
and you have to draw a picture.
13132
10:11:32,720 --> 10:11:36,720
And the basic rule, and literally you could spend
13133
10:11:36,720 --> 10:11:40,720
course upon course learning about database normalization,
13134
10:11:40,720 --> 10:11:42,720
but I'm going to distill it into one basic rule,
13135
10:11:42,720 --> 10:11:47,720
and that is never put the same string data in twice.
13136
10:11:47,720 --> 10:11:49,720
So my name, Charles Severance,
13137
10:11:49,720 --> 10:11:51,720
if I build a database well,
13138
10:11:51,720 --> 10:11:53,720
you should go into that database and you'd say,
13139
10:11:53,720 --> 10:11:55,720
okay, the words Charles Severance,
13140
10:11:55,720 --> 10:11:58,720
which is the name of a person, me, in that database,
13141
10:11:58,720 --> 10:11:59,720
only shows up once.
13142
10:11:59,720 --> 10:12:02,720
And what we do instead is we connect things together
13143
10:12:02,720 --> 10:12:05,720
and model my name as a connection to the record
13144
10:12:05,720 --> 10:12:07,720
that has my actual name in it,
13145
10:12:07,720 --> 10:12:09,720
rather than putting my name all these other places.
13146
10:12:09,720 --> 10:12:12,720
And so the idea is to pull duplicate data out
13147
10:12:12,720 --> 10:12:14,720
and make only one copy of it.
13148
10:12:14,720 --> 10:12:17,720
So there is the users, and in there is the user's name,
13149
10:12:17,720 --> 10:12:20,720
and the user name shows up only here,
13150
10:12:20,720 --> 10:12:24,720
and everything else points to the particular user entry.
13151
10:12:24,720 --> 10:12:26,720
So that's the idea.
13152
10:12:26,720 --> 10:12:29,720
And so here is our first application.
13153
10:12:29,720 --> 10:12:32,720
We are working as a startup.
13154
10:12:32,720 --> 10:12:34,720
We just quit all of our jobs,
13155
10:12:34,720 --> 10:12:37,720
and we are going to build a music management application.
13156
10:12:37,720 --> 10:12:38,720
I mean, what a great idea.
13157
10:12:38,720 --> 10:12:40,720
Don't you think that'll be quite successful?
13158
10:12:40,720 --> 10:12:42,720
And so we have mocked up,
13159
10:12:42,720 --> 10:12:44,720
and we have figured out that this is what our
13160
10:12:44,720 --> 10:12:46,720
music management application should look like.
13161
10:12:46,720 --> 10:12:48,720
We want to track people's tracks,
13162
10:12:48,720 --> 10:12:51,720
know something about what artists and albums
13163
10:12:51,720 --> 10:12:52,720
and genre they are,
13164
10:12:52,720 --> 10:12:54,720
and have ratings and how many times we've played them,
13165
10:12:54,720 --> 10:12:55,720
and how long they are.
13166
10:12:55,720 --> 10:12:58,720
Well, that's the data that our application needs to represent.
13167
10:12:58,720 --> 10:13:01,720
And we've done testing on this, and wireframes,
13168
10:13:01,720 --> 10:13:02,720
and everyone loves this.
13169
10:13:02,720 --> 10:13:04,720
It's a great user interface.
13170
10:13:04,720 --> 10:13:06,720
And so this is how it's got to look.
13171
10:13:06,720 --> 10:13:09,720
But we're going to have billions and billions of tracks
13172
10:13:09,720 --> 10:13:10,720
in these things, and so we want to come up
13173
10:13:10,720 --> 10:13:13,720
with an efficient database to handle this.
13174
10:13:13,720 --> 10:13:15,720
And so we're going to take a look at this
13175
10:13:15,720 --> 10:13:16,720
and look at each of the columns,
13176
10:13:16,720 --> 10:13:18,720
and we're going to ask ourselves,
13177
10:13:18,720 --> 10:13:23,720
is this column part of one of our existing objects,
13178
10:13:23,720 --> 10:13:27,720
our existing tables, or does this object
13179
10:13:27,720 --> 10:13:29,720
need a new table of its own?
13180
10:13:29,720 --> 10:13:31,720
And then once we've defined those different objects,
13181
10:13:31,720 --> 10:13:34,720
we connect the tables together and model the connections.
13182
10:13:34,720 --> 10:13:37,720
Now, a little trick to kind of make it a little easier
13183
10:13:37,720 --> 10:13:40,720
on ourselves is we can look in these columns,
13184
10:13:40,720 --> 10:13:42,720
and look in the columns that have duplicate information
13185
10:13:42,720 --> 10:13:44,720
vertically that's string information.
13186
10:13:44,720 --> 10:13:47,720
So a rating is just a number like zero through five.
13187
10:13:47,720 --> 10:13:50,720
So we don't worry too much about integers and numbers
13188
10:13:50,720 --> 10:13:52,720
and that kind of stuff, or whatever.
13189
10:13:52,720 --> 10:13:53,720
But we do look for strings.
13190
10:13:53,720 --> 10:13:55,720
And the problem here is we got like these strings
13191
10:13:55,720 --> 10:13:58,720
occur many times, and so these are the problems.
13192
10:13:58,720 --> 10:14:01,720
And so we have to put these things where there is
13193
10:14:01,720 --> 10:14:04,720
replication of string data kind of in the vertical dimension.
13194
10:14:04,720 --> 10:14:07,720
We have to put those in different tables.
13195
10:14:07,720 --> 10:14:09,720
And so we'll start up.
13196
10:14:09,720 --> 10:14:12,720
Now, the first question that you have to ask yourself
13197
10:14:12,720 --> 10:14:14,720
when you're going to draw this picture of how this data
13198
10:14:14,720 --> 10:14:16,720
is in multiple tables and connected together
13199
10:14:16,720 --> 10:14:19,720
is what is the first one that you're going to write down?
13200
10:14:19,720 --> 10:14:21,720
And this is an interesting debate,
13201
10:14:21,720 --> 10:14:23,720
and often people are sitting in a conference room,
13202
10:14:23,720 --> 10:14:25,720
and people who have experience kind of know what to do.
13203
10:14:25,720 --> 10:14:28,720
Usually if it's a multi-user system,
13204
10:14:28,720 --> 10:14:30,720
like a learning management system,
13205
10:14:30,720 --> 10:14:32,720
the users might be the central concept.
13206
10:14:32,720 --> 10:14:34,720
Perhaps the courses might be the central concept.
13207
10:14:34,720 --> 10:14:36,720
This is a single user system,
13208
10:14:36,720 --> 10:14:38,720
and so you can think, well,
13209
10:14:38,720 --> 10:14:40,720
what is really this application about?
13210
10:14:40,720 --> 10:14:41,720
It's not about people.
13211
10:14:41,720 --> 10:14:43,720
It's one person.
13212
10:14:43,720 --> 10:14:45,720
But it is about tracks.
13213
10:14:45,720 --> 10:14:47,720
And so we can say, okay,
13214
10:14:47,720 --> 10:14:51,720
here we'll take the track is probably the sort of
13215
10:14:51,720 --> 10:14:55,720
most foundational notion of this application.
13216
10:14:55,720 --> 10:14:57,720
And then we can take and say, okay,
13217
10:14:57,720 --> 10:15:01,720
now that we've decided that tracks are the foundational notion,
13218
10:15:01,720 --> 10:15:05,720
which of these columns are simply an attribute of the track?
13219
10:15:05,720 --> 10:15:09,720
Now, really, here is the cheating way and the easy way.
13220
10:15:09,720 --> 10:15:11,720
And this particular one is like these numbers,
13221
10:15:11,720 --> 10:15:14,720
all these numbers, like this number and these numbers.
13222
10:15:14,720 --> 10:15:16,720
Not that one.
13223
10:15:16,720 --> 10:15:18,720
They just go along with track.
13224
10:15:18,720 --> 10:15:19,720
And so we'll put that in.
13225
10:15:19,720 --> 10:15:22,720
We've got the track title, rating, length, and count,
13226
10:15:22,720 --> 10:15:25,720
and we put that in.
13227
10:15:25,720 --> 10:15:28,720
And then the question is we've got the remaining things are,
13228
10:15:28,720 --> 10:15:30,720
we've got the artist, we've got the album,
13229
10:15:30,720 --> 10:15:32,720
and we've got the genre.
13230
10:15:32,720 --> 10:15:34,720
And so we can say, okay, well, we can't,
13231
10:15:34,720 --> 10:15:36,720
we've got some vertical duplication,
13232
10:15:36,720 --> 10:15:37,720
so we're going to say, okay,
13233
10:15:37,720 --> 10:15:40,720
this track probably belongs to an album.
13234
10:15:40,720 --> 10:15:45,720
So let's pull out the album into its own table.
13235
10:15:45,720 --> 10:15:48,720
Oops.
13236
10:15:48,720 --> 10:15:51,720
Pull the album out into its own table.
13237
10:16:03,720 --> 10:16:06,720
Pull the album out into its own table.
13238
10:16:06,720 --> 10:16:07,720
And so that pulls that out.
13239
10:16:07,720 --> 10:16:08,720
And then you say, okay,
13240
10:16:08,720 --> 10:16:10,720
what would be the next thing that we're going to pull out?
13241
10:16:10,720 --> 10:16:12,720
So we've pulled out the track.
13242
10:16:12,720 --> 10:16:14,720
We've got this taken care of, this taken care of, that taken,
13243
10:16:14,720 --> 10:16:16,720
now we've got the album.
13244
10:16:16,720 --> 10:16:18,720
Well, albums belong to artists.
13245
10:16:18,720 --> 10:16:21,720
So let's take out the artist.
13246
10:16:21,720 --> 10:16:24,720
And then we'll pick where the genre belongs,
13247
10:16:24,720 --> 10:16:26,720
and we'll just say that the genre belongs to the track.
13248
10:16:26,720 --> 10:16:28,720
And that's because there might be albums
13249
10:16:28,720 --> 10:16:30,720
with more than one different genre.
13250
10:16:30,720 --> 10:16:32,720
So each album is not necessarily a rock album.
13251
10:16:32,720 --> 10:16:34,720
It could have a rock track and a country track,
13252
10:16:34,720 --> 10:16:36,720
et cetera, et cetera, et cetera.
13253
10:16:36,720 --> 10:16:38,720
And so now what we've got is we've got four tables, right?
13254
10:16:38,720 --> 10:16:39,720
We've got a track table.
13255
10:16:39,720 --> 10:16:41,720
We've got an album table, an artist table,
13256
10:16:41,720 --> 10:16:42,720
and a genre table.
13257
10:16:42,720 --> 10:16:44,720
And if we sort of double check,
13258
10:16:44,720 --> 10:16:46,720
all of the columns that had vertical duplication in them
13259
10:16:46,720 --> 10:16:50,720
now have their own little table.
13260
10:16:50,720 --> 10:16:52,720
So we can eliminate,
13261
10:16:52,720 --> 10:16:55,720
the next thing we'll do is to show how we're going to eliminate
13262
10:16:55,720 --> 10:16:59,720
this vertical data replication
13263
10:16:59,720 --> 10:17:02,720
by showing how you represent these relationships
13264
10:17:02,720 --> 10:17:08,720
that we just created inside of the database.
13265
10:17:08,720 --> 10:17:11,720
Now we're going to represent these relationships in the database.
13266
10:17:11,720 --> 10:17:13,720
And again, what we're trying to solve here
13267
10:17:13,720 --> 10:17:15,720
is this notion of database normalization,
13268
10:17:15,720 --> 10:17:17,720
third normal form.
13269
10:17:17,720 --> 10:17:19,720
There is so much theory, right?
13270
10:17:19,720 --> 10:17:21,720
But in this lecture,
13271
10:17:21,720 --> 10:17:23,720
I'm just going to condense this down to
13272
10:17:23,720 --> 10:17:25,720
don't replicate string data
13273
10:17:25,720 --> 10:17:27,720
and use what are called keys,
13274
10:17:27,720 --> 10:17:30,720
use integer keys to point at those things.
13275
10:17:30,720 --> 10:17:32,720
And we're going to use these integers then to point.
13276
10:17:32,720 --> 10:17:34,720
So assign each row an integer,
13277
10:17:34,720 --> 10:17:36,720
and then we're going to point from one row to another
13278
10:17:36,720 --> 10:17:37,720
using those integers.
13279
10:17:37,720 --> 10:17:40,720
And so we're going to add these special key columns
13280
10:17:40,720 --> 10:17:42,720
to each of the tables.
13281
10:17:42,720 --> 10:17:44,720
And the database will even give us help
13282
10:17:44,720 --> 10:17:46,720
managing those.
13283
10:17:46,720 --> 10:17:48,720
So we still need to keep track of
13284
10:17:48,720 --> 10:17:51,720
who is the creator of the album,
13285
10:17:51,720 --> 10:17:53,720
which album a track belongs to.
13286
10:17:53,720 --> 10:17:55,720
We've got to create these relationships
13287
10:17:55,720 --> 10:17:58,720
and we have to come up with ways to store those relationships.
13288
10:17:58,720 --> 10:18:01,720
And so the idea is we're going to have
13289
10:18:01,720 --> 10:18:04,720
a column in a table which is the key column.
13290
10:18:04,720 --> 10:18:06,720
And we're going to call this the ID column.
13291
10:18:06,720 --> 10:18:07,720
And so this is a row,
13292
10:18:07,720 --> 10:18:08,720
it might have many bits of data here,
13293
10:18:08,720 --> 10:18:11,720
but in this case it's just the name of an artist.
13294
10:18:11,720 --> 10:18:14,720
So this album is going to belong to an artist.
13295
10:18:14,720 --> 10:18:17,720
And we're going to assign a number inside the database.
13296
10:18:17,720 --> 10:18:21,720
And so Led Zeppelin is one and AC/DC is two.
13297
10:18:21,720 --> 10:18:24,720
And so we have this key, this is called a primary key.
13298
10:18:24,720 --> 10:18:26,720
And then later when we want to say that the
13299
10:18:26,720 --> 10:18:31,720
Who Made Who album really was done by AC/DC,
13300
10:18:31,720 --> 10:18:33,720
we put the number two in.
13301
10:18:33,720 --> 10:18:36,720
And so the difference here is instead of saying AC/DC
13302
10:18:36,720 --> 10:18:39,720
in this record we just put the number two
13303
10:18:39,720 --> 10:18:41,720
once we've established this number.
13304
10:18:41,720 --> 10:18:44,720
So we assign keys and then we have these pointers
13305
10:18:44,720 --> 10:18:45,720
that point back.
13306
10:18:45,720 --> 10:18:47,720
And so that's how we model a relationship
13307
10:18:47,720 --> 10:18:50,720
with these small integer numbers.
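A minimal sketch of that pointer idea in plain Python, before any SQL is involved. The dictionary and variable names here are illustrative assumptions, not something shown in the lecture.

```python
# Artist rows keyed by an integer primary key; each name is stored exactly once.
artists = {1: "Led Zeppelin", 2: "AC/DC"}

# The album row stores the integer 2 (a foreign key) instead of the string "AC/DC".
album = {"id": 1, "title": "Who Made Who", "artist_id": 2}

# Following the pointer recovers the artist name when we need it.
artist_name = artists[album["artist_id"]]
print(album["title"], "is by", artist_name)
```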
13308
10:18:50,720 --> 10:18:53,720
And so there are three basic kind of keys that we use.
13309
10:18:53,720 --> 10:18:56,720
One is the primary key and that is that little ID column
13310
10:18:56,720 --> 10:18:58,720
that is just a number.
13311
10:18:58,720 --> 10:19:00,720
But once we give Led Zeppelin the number one,
13312
10:19:00,720 --> 10:19:05,720
Led Zeppelin has got the key one for the rest of that database.
13313
10:19:05,720 --> 10:19:08,720
The logical key is the text area that we use
13314
10:19:08,720 --> 10:19:09,720
that you might look up.
13315
10:19:09,720 --> 10:19:12,720
So the title of the band or the title of the album,
13316
10:19:12,720 --> 10:19:13,720
that's the logical key.
13317
10:19:13,720 --> 10:19:15,720
And then the foreign key is one of these keys
13318
10:19:15,720 --> 10:19:18,720
that is really pointing to the primary key of another row.
13319
10:19:18,720 --> 10:19:21,720
So that's called a foreign key.
13320
10:19:21,720 --> 10:19:24,720
And you might think that you want to use something
13321
10:19:24,720 --> 10:19:26,720
like an email address as the primary key
13322
10:19:26,720 --> 10:19:28,720
for a user table or something like that.
13323
10:19:28,720 --> 10:19:30,720
The logical key should always be separate
13324
10:19:30,720 --> 10:19:32,720
and there should always be a primary key,
13325
10:19:32,720 --> 10:19:33,720
that integer number.
13326
10:19:33,720 --> 10:19:35,720
Because things like logical keys do change.
13327
10:19:35,720 --> 10:19:37,720
People do get new email addresses.
13328
10:19:37,720 --> 10:19:39,720
And if you've got that email address as a foreign key
13329
10:19:39,720 --> 10:19:42,720
pointing all over the place, it doesn't work out so well.
13330
10:19:42,720 --> 10:19:45,720
And so that's why you use these small integer numbers
13331
10:19:45,720 --> 10:19:47,720
that have no meaning outside.
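A small sketch of why the integer key helps when a logical key changes, using Python's sqlite3 module. The Users and Note tables, their columns, and the email addresses are hypothetical, made up only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY AUTOINCREMENT, email TEXT UNIQUE);
CREATE TABLE Note  (id INTEGER PRIMARY KEY AUTOINCREMENT, user_id INTEGER, body TEXT);
""")

# One user row; the database hands back its integer primary key.
cur.execute("INSERT INTO Users (email) VALUES (?)", ("old@example.com",))
user_id = cur.lastrowid

# Other rows point at the user by number, never by email address.
cur.executemany("INSERT INTO Note (user_id, body) VALUES (?, ?)",
                [(user_id, "first note"), (user_id, "second note")])

# The logical key changes in exactly one row; no Note row has to be touched.
cur.execute("UPDATE Users SET email = ? WHERE id = ?", ("new@example.com", user_id))

email = cur.execute("SELECT email FROM Users WHERE id = ?", (user_id,)).fetchone()[0]
note_count = cur.execute("SELECT COUNT(*) FROM Note WHERE user_id = ?",
                         (user_id,)).fetchone()[0]
print(email, note_count)
conn.close()
```

If the email address itself had been the foreign key, every Note row would have needed the same update.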
13332
10:19:47,720 --> 10:19:49,720
So sometimes if you're on a system and you see a URL
13333
10:19:49,720 --> 10:19:52,720
and you see some number like 422,016,
13334
10:19:52,720 --> 10:19:54,720
you're like, oh, that turns out to probably be
13335
10:19:54,720 --> 10:19:56,720
my primary key in their database.
13336
10:19:56,720 --> 10:19:58,720
So sometimes you can look in a URL
13337
10:19:58,720 --> 10:20:00,720
and you can see these primary keys in the URL,
13338
10:20:00,720 --> 10:20:04,720
but they don't mean anything outside of that particular system.
13339
10:20:04,720 --> 10:20:07,720
So like I said, a foreign key is a key that is
13340
10:20:07,720 --> 10:20:10,720
really pointing at a row in a different table.
13341
10:20:10,720 --> 10:20:13,720
So the album has a primary key for it,
13342
10:20:13,720 --> 10:20:16,720
but the artist underscore ID points to a row
13343
10:20:16,720 --> 10:20:19,720
in the artist table, as we will soon see.
13344
10:20:19,720 --> 10:20:21,720
I have a naming convention.
13345
10:20:21,720 --> 10:20:24,720
And in my naming convention, on this lecture,
13346
10:20:24,720 --> 10:20:26,720
I use ID for the primary key.
13347
10:20:26,720 --> 10:20:29,720
And then artist underscore ID, I use uppercase
13348
10:20:29,720 --> 10:20:30,720
for the table names.
13349
10:20:30,720 --> 10:20:34,720
And then artist underscore ID says this is a key,
13350
10:20:34,720 --> 10:20:37,720
this is just a key that points to the ID key
13351
10:20:37,720 --> 10:20:38,720
of the artist table.
13352
10:20:38,720 --> 10:20:40,720
And so that's what I do, so you'll see.
13353
10:20:40,720 --> 10:20:41,720
And all my stuff, I'll use that.
13354
10:20:41,720 --> 10:20:43,720
It's a convention.
13355
10:20:43,720 --> 10:20:46,720
It's not something SQL forces you to do.
13356
10:20:46,720 --> 10:20:48,720
But you will find when you go to organizations
13357
10:20:48,720 --> 10:20:49,720
and work on their databases,
13358
10:20:49,720 --> 10:20:51,720
these conventions are very important.
13359
10:20:51,720 --> 10:20:53,720
So I can do something and you can understand
13360
10:20:53,720 --> 10:20:55,720
the rules by which I created it.
13361
10:20:55,720 --> 10:20:58,720
Some of these, you'll find this used by some people.
13362
10:20:58,720 --> 10:21:00,720
You'll find completely different conventions,
13363
10:21:00,720 --> 10:21:01,720
and that'll be okay.
13364
10:21:01,720 --> 10:21:03,720
Whatever convention your organization uses,
13365
10:21:03,720 --> 10:21:05,720
learn that convention.
13366
10:21:05,720 --> 10:21:08,720
So now we're going to talk about how we put these keys in
13367
10:21:08,720 --> 10:21:11,720
and then how we actually make the connections
13368
10:21:11,720 --> 10:21:13,720
from one row to another row.
13369
10:21:17,720 --> 10:21:19,720
So now that we know what a primary key,
13370
10:21:19,720 --> 10:21:20,720
logical key, and foreign key are,
13371
10:21:20,720 --> 10:21:22,720
we're going to actually start putting these together
13372
10:21:22,720 --> 10:21:26,720
and creating tables that have these kind of values in them.
13373
10:21:26,720 --> 10:21:28,720
So when we were done, we drew this picture
13374
10:21:28,720 --> 10:21:30,720
that was sort of a logical model
13375
10:21:30,720 --> 10:21:33,720
of how our data would be spread across four tables
13376
10:21:33,720 --> 10:21:35,720
and how those tables are connected.
13377
10:21:35,720 --> 10:21:38,720
Now we have to take this and we have to map it in a way
13378
10:21:38,720 --> 10:21:41,720
that leads to the columns
13379
10:21:41,720 --> 10:21:44,720
and the needed columns in each of our database tables.
13380
10:21:44,720 --> 10:21:45,720
And so here's what we do.
13381
10:21:45,720 --> 10:21:47,720
We basically have to take,
13382
10:21:47,720 --> 10:21:49,720
and for each of these,
13383
10:21:49,720 --> 10:21:51,720
when we're going to build a track table,
13384
10:21:51,720 --> 10:21:53,720
when we're going to build a track table,
13385
10:21:53,720 --> 10:21:54,720
we add a primary key.
13386
10:21:54,720 --> 10:21:57,720
So we just added an ID field to every one of these things.
13387
10:21:57,720 --> 10:21:59,720
And that's so we have a place to store
13388
10:21:59,720 --> 10:22:02,720
the sequence number of this particular row.
13389
10:22:02,720 --> 10:22:03,720
We have logical keys.
13390
10:22:03,720 --> 10:22:04,720
We've just marked those.
13391
10:22:04,720 --> 10:22:05,720
Those are strings.
13392
10:22:05,720 --> 10:22:07,720
And then we have things like, you know,
13393
10:22:07,720 --> 10:22:08,720
rating, length, and count.
13394
10:22:08,720 --> 10:22:09,720
They just kind of go in here.
13395
10:22:09,720 --> 10:22:12,720
And now we have to model a relationship.
13396
10:22:12,720 --> 10:22:14,720
So what we do is we, in the table,
13397
10:22:14,720 --> 10:22:16,720
the relationship starts from,
13398
10:22:16,720 --> 10:22:18,720
we put one more column in,
13399
10:22:18,720 --> 10:22:20,720
and this is the one I will name album ID,
13400
10:22:20,720 --> 10:22:22,720
and that just is an integer column
13401
10:22:22,720 --> 10:22:25,720
that's going to record the album ID.
13402
10:22:25,720 --> 10:22:28,720
So this might be 16, and then 16 goes in here.
13403
10:22:28,720 --> 10:22:31,720
So there's one of these columns that's a foreign key
13404
10:22:31,720 --> 10:22:32,720
that points to this.
13405
10:22:32,720 --> 10:22:33,720
And that's why it's foreign.
13406
10:22:33,720 --> 10:22:35,720
This is a key that's not in the track table.
13407
10:22:35,720 --> 10:22:38,720
This is a key in the album table that we're pointing to.
13408
10:22:38,720 --> 10:22:40,720
And so there's a foreign key.
13409
10:22:40,720 --> 10:22:42,720
And that's what we have to do.
13410
10:22:42,720 --> 10:22:44,720
And we just do that over and over and over again.
13411
10:22:44,720 --> 10:22:47,720
And we quickly convert that picture
13412
10:22:47,720 --> 10:22:49,720
that was a logical picture
13413
10:22:49,720 --> 10:22:52,720
to having every table has a primary key.
13414
10:22:52,720 --> 10:22:54,720
And every time we have a starting point,
13415
10:22:54,720 --> 10:22:56,720
we have a foreign key, foreign key,
13416
10:22:56,720 --> 10:22:57,720
and then foreign key.
13417
10:22:57,720 --> 10:22:59,720
And then we mark these things as logical key,
13418
10:22:59,720 --> 10:23:00,720
logical key, logical key,
13419
10:23:00,720 --> 10:23:02,720
and we'll see how we do that.
13420
10:23:02,720 --> 10:23:03,720
And so that's the picture.
13421
10:23:03,720 --> 10:23:05,720
Now we have a picture of exactly
13422
10:23:05,720 --> 10:23:07,720
how we're going to lay these tables out
13423
10:23:07,720 --> 10:23:10,720
and the fields that we need in these tables.
13424
10:23:10,720 --> 10:23:19,720
So we're going to do a create table statement.
13425
10:23:19,720 --> 10:23:26,720
And I've got this create table statement sitting there.
13426
10:23:26,720 --> 10:23:29,720
And so this one's going to be a little bit different.
13427
10:23:29,720 --> 10:23:32,720
We're going to say create table artist.
13428
10:23:32,720 --> 10:23:35,720
And the ID field is integer.
13429
10:23:35,720 --> 10:23:38,720
And we're going to add all of this stuff.
13430
10:23:38,720 --> 10:23:43,720
This is adding to the column to tell it additional stuff.
13431
10:23:43,720 --> 10:23:44,720
It's a primary key,
13432
10:23:44,720 --> 10:23:46,720
which means we're going to use it to look up a lot.
13433
10:23:46,720 --> 10:23:47,720
It's automatically incremented,
13434
10:23:47,720 --> 10:23:49,720
which means the database is actually going to provide
13435
10:23:49,720 --> 10:23:51,720
this number for us as we insert records.
13436
10:23:51,720 --> 10:23:53,720
It's not allowed to be null.
13437
10:23:53,720 --> 10:23:55,720
It's not allowed to be empty.
13438
10:23:55,720 --> 10:23:56,720
And it's supposed to be unique.
13439
10:23:56,720 --> 10:24:01,720
And then the artist is going to have a name column,
13440
10:24:01,720 --> 10:24:03,720
a name column that's just text.
13441
10:24:03,720 --> 10:24:07,720
So let's do that.
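The create table statement described here might look like the following when run from Python's sqlite3 module instead of the database browser. The in-memory database and the exact ordering of the column modifiers are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# id is an integer primary key that the database increments for us;
# it may not be null and must be unique. name is plain text.
cur.execute("""
CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT
)
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in cur.execute("PRAGMA table_info(Artist)")]
print(columns)
conn.close()
```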
13442
10:24:07,720 --> 10:24:08,720
We already have our users.
13443
10:24:08,720 --> 10:24:11,720
And now we're going to do a create table in this SQL.
13444
10:24:11,720 --> 10:24:12,720
And you can do that.
13445
10:24:12,720 --> 10:24:13,720
That's okay.
13446
10:24:13,720 --> 10:24:14,720
That's totally fine.
13447
10:24:14,720 --> 10:24:15,720
And we have to get this right.
13448
10:24:15,720 --> 10:24:17,720
And we say away we go.
13449
10:24:17,720 --> 10:24:20,720
And so now if I take a look at database structure,
13450
10:24:20,720 --> 10:24:23,720
I've got a users table, which is that users table
13451
10:24:23,720 --> 10:24:26,720
we were playing with before and this artist table.
13452
10:24:26,720 --> 10:24:31,720
Let me go ahead and delete this users table just to say goodbye.
13453
10:24:31,720 --> 10:24:33,720
Okay, so now we have the artist table.
13454
10:24:33,720 --> 10:24:34,720
And we take a look.
13455
10:24:34,720 --> 10:24:35,720
And it's got an ID.
13456
10:24:35,720 --> 10:24:36,720
And it knows all about this stuff.
13457
10:24:36,720 --> 10:24:39,720
Okay?
13458
10:24:39,720 --> 10:24:42,720
So that created the table.
13459
10:24:42,720 --> 10:24:43,720
We're going to keep doing this.
13460
10:24:43,720 --> 10:24:46,720
The next thing that we're going to show here is we're going to show
13461
10:24:46,720 --> 10:24:48,720
the foreign key, right?
13462
10:24:48,720 --> 10:24:50,720
So artist ID is just an integer.
13463
10:24:50,720 --> 10:24:53,720
In some database languages like MySQL and Oracle,
13464
10:24:53,720 --> 10:24:56,720
you would put more stuff here to say this is a foreign key, blah, blah, blah.
13465
10:24:56,720 --> 10:25:00,720
But in SQLite, we keep it simple and just say that is an integer column.
13466
10:25:00,720 --> 10:25:01,720
That's a foreign key.
13467
10:25:01,720 --> 10:25:04,720
The album table has a primary key and a foreign key,
13468
10:25:04,720 --> 10:25:12,720
and then the title.
13469
10:25:12,720 --> 10:25:17,720
So we'll go back and we'll grab that text out of my little page.
13470
10:25:17,720 --> 10:25:19,720
This create table.
13471
10:25:19,720 --> 10:25:22,720
Go back to execute SQL.
13472
10:25:22,720 --> 10:25:28,720
And then run that.
13473
10:25:28,720 --> 10:25:32,720
And we'll continue with just the genre table has an ID on it.
13474
10:25:32,720 --> 10:25:36,720
And primary key, you'll just copy and paste these.
13475
10:25:36,720 --> 10:25:38,720
That whole thing, you do that over and over and over again.
13476
10:25:38,720 --> 10:25:43,720
So we'll go in here and run that one.
13477
10:25:43,720 --> 10:25:50,720
And so the last one we're going to do is the track table.
13478
10:25:50,720 --> 10:25:52,720
And the only thing that's kind of weird about the track table
13479
10:25:52,720 --> 10:25:54,720
is it's got two foreign keys, right?
13480
10:25:54,720 --> 10:25:56,720
It's got an album ID and a genre ID.
13481
10:25:56,720 --> 10:25:59,720
Once you draw the picture, you just sort of literally translate these things.
13482
10:25:59,720 --> 10:26:02,720
It's got two foreign keys and a primary key that's pretty much
13483
10:26:02,720 --> 10:26:08,720
just like all those other primary keys.
13484
10:26:08,720 --> 10:26:12,720
And count is an integer and length is an integer, all that stuff.
13485
10:26:12,720 --> 10:26:14,720
And now we've got it.
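Putting the four create statements together, a sketch in Python's sqlite3 module could look like this. The column names (artist_id, album_id, genre_id, len, rating, count) follow the naming convention described earlier but are assumptions here, since the statements themselves are only shown on screen.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# In SQLite the foreign keys are simply INTEGER columns named <table>_id.
cur.executescript("""
CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT
);
CREATE TABLE Genre (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT
);
CREATE TABLE Album (
    id        INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    artist_id INTEGER,
    title     TEXT
);
CREATE TABLE Track (
    id       INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title    TEXT,
    album_id INTEGER,
    genre_id INTEGER,
    len      INTEGER,
    rating   INTEGER,
    count    INTEGER
);
""")

# List our tables, skipping SQLite's internal bookkeeping tables.
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
print(tables)
conn.close()
```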
13486
10:26:14,720 --> 10:26:15,720
So if we take a look at our database structure,
13487
10:26:15,720 --> 10:26:19,720
we're going to see that our album, genre, and track are all set up.
13488
10:26:19,720 --> 10:26:23,720
And these are the columns that we just made with those create statements.
13489
10:26:23,720 --> 10:26:29,720
Okay?
13490
10:26:29,720 --> 10:26:30,720
So now let's insert some data.
13491
10:26:30,720 --> 10:26:34,720
This first insert statement is kind of important to take a look at.
13492
10:26:34,720 --> 10:26:37,720
So insert into, by the way, the keywords can be upper or lowercase,
13493
10:26:37,720 --> 10:26:39,720
table name, columns.
13494
10:26:39,720 --> 10:26:41,720
Now, this table has two columns.
13495
10:26:41,720 --> 10:26:43,720
It has ID and name.
13496
10:26:43,720 --> 10:26:46,720
But we told the database that ID was auto increment.
13497
10:26:46,720 --> 10:26:48,720
So it's going to actually give us the number.
13498
10:26:48,720 --> 10:26:51,720
It's going to assign the number rather than make us assign.
13499
10:26:51,720 --> 10:26:53,720
We could make it be one, two, three.
13500
10:26:53,720 --> 10:26:55,720
But we say, hey, database, you're good at this.
13501
10:26:55,720 --> 10:26:57,720
Why don't you make it one, two, three?
13502
10:26:57,720 --> 10:27:01,720
And so there is going to be a record that it adds Led Zeppelin.
13503
10:27:01,720 --> 10:27:05,720
So let's take a look at that.
13504
10:27:05,720 --> 10:27:10,720
So we'll insert Led Zeppelin.
13505
10:27:10,720 --> 10:27:13,720
Oops.
13506
10:27:13,720 --> 10:27:16,720
Over to SQL.
13507
10:27:16,720 --> 10:27:18,720
Insert Led Zeppelin and run it.
13508
10:27:18,720 --> 10:27:22,720
So now if I look at database structure and I look at the,
13509
10:27:22,720 --> 10:27:25,720
let's look at browse data and look at the artist table,
13510
10:27:25,720 --> 10:27:27,720
you will see that I put Led Zeppelin in,
13511
10:27:27,720 --> 10:27:30,720
but this ID field here was auto incremented.
13512
10:27:30,720 --> 10:27:33,720
And so it was put there by the database.
13513
10:27:33,720 --> 10:27:44,720
And now when we do the next insert, which is AC/DC,
13514
10:27:44,720 --> 10:27:47,720
and we take a look at the data,
13515
10:27:47,720 --> 10:27:49,720
we will see that AC/DC is two.
13516
10:27:49,720 --> 10:27:52,720
Now, if you're writing this in a program,
13517
10:27:52,720 --> 10:27:54,720
if you're going to write this in a program,
13518
10:27:54,720 --> 10:27:57,720
you can get these numbers back from the database in your program,
13519
10:27:57,720 --> 10:27:59,720
but I'm not writing this in a program,
13520
10:27:59,720 --> 10:28:02,720
so I have to remember that one is Zeppelin
13521
10:28:02,720 --> 10:28:04,720
and two is AC/DC.
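When the inserts are done from a program, the cheat sheet can be replaced by asking the database for the key it just assigned. This is a minimal sketch using Python's sqlite3 module and its cursor.lastrowid attribute; the variable names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.execute("""
CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT
)
""")

# We only supply the name; the database assigns the id.
cur.execute("INSERT INTO Artist (name) VALUES (?)", ("Led Zeppelin",))
zeppelin_id = cur.lastrowid  # key assigned to the row we just inserted

cur.execute("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))
acdc_id = cur.lastrowid

conn.commit()
print(zeppelin_id, acdc_id)
conn.close()
```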
13522
10:28:04,720 --> 10:28:06,720
So I'm going to keep myself a little cheat sheet here
13523
10:28:06,720 --> 10:28:09,720
to remember that because everywhere else in the program
13524
10:28:09,720 --> 10:28:11,720
that we're going to say Led Zeppelin,
13525
10:28:11,720 --> 10:28:13,720
I've got to say one now because the artist,
13526
10:28:13,720 --> 10:28:17,720
the artist ID of one means Led Zeppelin in those rows.
13527
10:28:17,720 --> 10:28:19,720
And so now we're going to go back
13528
10:28:19,720 --> 10:28:22,720
and we're going to take a look at the next one.
13529
10:28:22,720 --> 10:28:24,720
And now we're going to put the genre in.
13530
10:28:24,720 --> 10:28:27,720
If you think about it, we're working from the leaves out.
13531
10:28:27,720 --> 10:28:29,720
The track will be the last table that we'll update
13532
10:28:29,720 --> 10:28:31,720
because you have to define the keys
13533
10:28:31,720 --> 10:28:33,720
for things like rock and metal and Led Zeppelin
13534
10:28:33,720 --> 10:28:35,720
and all those other things.
13535
10:28:35,720 --> 10:28:38,720
And again, even though the genre table has two columns,
13536
10:28:38,720 --> 10:28:40,720
ID and name, we're only going to specify the name
13537
10:28:40,720 --> 10:28:45,720
and let the database assign the value.
13538
10:28:45,720 --> 10:28:47,720
So I'm going to insert both of these
13539
10:28:47,720 --> 10:28:51,720
and use the semicolon trick.
13540
10:28:51,720 --> 10:28:55,720
Put a semicolon here and a semicolon there.
13541
10:28:55,720 --> 10:28:58,720
And run that.
13542
10:28:58,720 --> 10:29:00,720
And so if I take a look at my browse data
13543
10:29:00,720 --> 10:29:02,720
and I look at the genre,
13544
10:29:02,720 --> 10:29:05,720
it's assigned one to rock and two to metal.
13545
10:29:05,720 --> 10:29:07,720
I'm going to write that down.
13546
10:29:07,720 --> 10:29:11,720
One rock, two metal.
13547
10:29:11,720 --> 10:29:13,720
I should have done something like rock and country
13548
10:29:13,720 --> 10:29:14,720
because I can't even tell the difference
13549
10:29:14,720 --> 10:29:17,720
between rock and metal, but whatever.
13550
10:29:17,720 --> 10:29:22,720
My musical skill is not what's at issue in this class.
13551
10:29:22,720 --> 10:29:25,720
So now we're going to put an album in.
13552
10:29:25,720 --> 10:29:27,720
The album is the first thing that has a foreign key.
13553
10:29:27,720 --> 10:29:31,720
So if you remember the thing, the album points to artist.
13554
10:29:31,720 --> 10:29:34,720
And so that means it has a foreign key of artist ID.
13555
10:29:34,720 --> 10:29:36,720
And so we have to explicitly say this
13556
10:29:36,720 --> 10:29:40,720
because the system doesn't know which artist Who Made Who belongs to.
13557
10:29:40,720 --> 10:29:44,720
But we know that Who Made Who is AC/DC and that's two.
13558
10:29:44,720 --> 10:29:46,720
And so we know to put artist ID in.
13559
10:29:46,720 --> 10:29:49,720
So we'll say insert into album title artist ID.
13560
10:29:49,720 --> 10:29:51,720
And so we have to know what this two number is.
13561
10:29:51,720 --> 10:29:59,720
And of course because we have our handy little cheat sheet,
13562
10:29:59,720 --> 10:30:02,720
we can go over to execute and run that.
13563
10:30:02,720 --> 10:30:07,720
And I'll put a semicolon there and a semicolon there and run it.
13564
10:30:07,720 --> 10:30:16,720
And so now we have in the album table, we now have this.
13565
10:30:16,720 --> 10:30:18,720
And so this was assigned.
13566
10:30:18,720 --> 10:30:22,720
And so Who Made Who, you still have to write that down.
13567
10:30:22,720 --> 10:30:30,720
Who Made Who is album one and album two is Led Zeppelin IV.
13568
10:30:30,720 --> 10:30:34,720
That makes it even more complex because the name of the album is a Roman numeral four.
13569
10:30:34,720 --> 10:30:36,720
I'm sure I can figure that out.
13570
10:30:36,720 --> 10:30:37,720
Okay.
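Here is a minimal sketch of that album insert, run from Python with sqlite3 instead of the database browser. The column names (title, artist_id) and the in-memory database are assumptions for illustration. The artist numbers come from the cheat sheet: one is Led Zeppelin, two is AC/DC.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (an assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
""")

# The foreign key has to be given explicitly: 2 is AC/DC, 1 is Led Zeppelin.
cur.execute("INSERT INTO Album (title, artist_id) VALUES (?, ?)", ("Who Made Who", 2))
cur.execute("INSERT INTO Album (title, artist_id) VALUES (?, ?)", ("IV", 1))

albums = cur.execute("SELECT id, title, artist_id FROM Album ORDER BY id").fetchall()
print(albums)
```

The database assigns the album IDs itself; the only number we had to look up by hand was the artist ID.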
13571
10:30:37,720 --> 10:30:42,720
So the next thing that we're going to do is we're going to insert the track record.
13572
10:30:42,720 --> 10:30:45,720
Now if you think about the track record, the track has two foreign keys.
13573
10:30:45,720 --> 10:30:50,720
And it's got a lot of stuff.
13574
10:30:50,720 --> 10:30:51,720
It's got the title.
13575
10:30:51,720 --> 10:30:52,720
It's got the rating, length, and count.
13576
10:30:52,720 --> 10:30:53,720
But then we got the two foreign keys.
13577
10:30:53,720 --> 10:30:56,720
And so we have to know these numbers.
13578
10:30:56,720 --> 10:31:02,720
So this two one, this two one, this one two is the genre.
13579
10:31:02,720 --> 10:31:07,720
We're specifying the genre and the album that this track is from by those numbers.
13580
10:31:07,720 --> 10:31:10,720
Now, again, we have to use this cheat sheet.
13581
10:31:10,720 --> 10:31:15,720
But if this was a program, the program would know that one was Led Zeppelin
13582
10:31:15,720 --> 10:31:19,720
and album one was Who Made Who and two was Led Zeppelin IV.
13583
10:31:19,720 --> 10:31:24,720
And so this kind of stuff is easier for the program to understand
13584
10:31:24,720 --> 10:31:26,720
than for us to keep track of and understand.
13585
10:31:26,720 --> 10:31:29,720
But just so we can get through these few records.
13586
10:31:29,720 --> 10:31:32,720
And that's why I rely so heavily on my cheat sheet.
13587
10:31:32,720 --> 10:31:36,720
So here we are with all these numbers.
13588
10:31:36,720 --> 10:31:38,720
The foreign keys are the tricky part here.
13589
10:31:38,720 --> 10:31:40,720
Everything else is really quite straightforward.
13590
10:31:40,720 --> 10:31:50,720
So now I'm going to insert four records into my track table.
13591
10:31:50,720 --> 10:31:53,720
And then run that.
13592
10:31:53,720 --> 10:31:54,720
Okay.
13593
10:31:54,720 --> 10:31:57,720
So I'll browse data and I look at my track table.
13594
10:31:57,720 --> 10:32:01,720
This column here, this ID, that's the primary key of the track table.
13595
10:32:01,720 --> 10:32:03,720
And then here are the two foreign keys.
13596
10:32:03,720 --> 10:32:08,720
Now, the interesting thing is now there is replication in these columns,
13597
10:32:08,720 --> 10:32:12,720
but the numbers are what's being replicated and that's okay.
13598
10:32:12,720 --> 10:32:17,720
We went to a lot of trouble just to avoid putting Led Zeppelin IV in twice.
13599
10:32:17,720 --> 10:32:20,720
We could have made this a string, but by making this an integer,
13600
10:32:20,720 --> 10:32:23,720
it saves tons of storage and makes it super fast.
13601
10:32:23,720 --> 10:32:28,720
That turns out to be one of the key things that makes databases super fast
13602
10:32:28,720 --> 10:32:32,720
is using these integers.
13603
10:32:32,720 --> 10:32:33,720
So we take a look at all this stuff.
13604
10:32:33,720 --> 10:32:37,720
We see that in a sense by using these little numbers,
13605
10:32:37,720 --> 10:32:39,720
we are pointing to rows in other tables.
13606
10:32:39,720 --> 10:32:41,720
The foreign keys are always pointing.
13607
10:32:41,720 --> 10:32:43,720
They always point to their ID.
13608
10:32:43,720 --> 10:32:44,720
So these foreign keys are out here.
13609
10:32:44,720 --> 10:32:46,720
This is the primary key up here.
13610
10:32:46,720 --> 10:32:49,720
And they always point to a row in another table.
13611
10:32:49,720 --> 10:32:51,720
And so we have modeled all those relationships.
13612
10:32:51,720 --> 10:32:54,720
And you will notice that in this entire database,
13613
10:32:54,720 --> 10:33:01,720
Who Made Who only appears once.
13614
10:33:01,720 --> 10:33:03,720
The word Rock only appears once.
13615
10:33:03,720 --> 10:33:06,720
The word AC/DC only appears once.
13616
10:33:06,720 --> 10:33:09,720
What we have is we have duplication in our data,
13617
10:33:09,720 --> 10:33:12,720
but we are duplicating the relationships,
13618
10:33:12,720 --> 10:33:16,720
i.e. these little integer numbers, not duplicating the data itself.
13619
10:33:16,720 --> 10:33:20,720
And in something this small, it seems irrelevant.
13620
10:33:20,720 --> 10:33:22,720
But if you have billions of records,
13621
10:33:22,720 --> 10:33:25,720
or hundreds of millions of records, it is very relevant.
13622
10:33:25,720 --> 10:33:27,720
Very, very relevant.
13623
10:33:27,720 --> 10:33:29,720
So the next thing we are going to do is take a look
13624
10:33:29,720 --> 10:33:31,720
at how you actually reconnect all this stuff together
13625
10:33:31,720 --> 10:33:36,720
once we have sort of blown it out using these foreign keys
13626
10:33:36,720 --> 10:33:39,720
and hand-constructing all these relationships,
13627
10:33:39,720 --> 10:33:45,720
now how we bring it back together to show the data to the user.
13628
10:33:45,720 --> 10:33:48,720
So now that we have carefully constructed our relationships
13629
10:33:48,720 --> 10:33:53,720
in the tables, we need to reconstruct the data to show our users.
13630
10:33:53,720 --> 10:33:56,720
And you can kind of see how you would go pull this stuff together,
13631
10:33:56,720 --> 10:33:58,720
but there is a wonderful capability in relational databases
13632
10:33:58,720 --> 10:34:02,720
called join that brings this all back together.
13633
10:34:02,720 --> 10:34:06,720
And so we have done this for efficiency of storage,
13634
10:34:06,720 --> 10:34:08,720
efficiency of scanning, etc.
13635
10:34:08,720 --> 10:34:12,720
But we do need to traverse these foreign keys at times.
13636
10:34:12,720 --> 10:34:16,720
And the database software will do this for us automatically.
13637
10:34:16,720 --> 10:34:20,720
So the join operation basically is a way to specify in a select statement
13638
10:34:20,720 --> 10:34:23,720
that you want to pull data out of more than one table
13639
10:34:23,720 --> 10:34:26,720
and then specifying using what is called the on clause
13640
10:34:26,720 --> 10:34:30,720
exactly how you want that data pulled out.
13641
10:34:30,720 --> 10:34:32,720
And so here we go.
13642
10:34:32,720 --> 10:34:37,720
We already have the album table connected to the artist table,
13643
10:34:37,720 --> 10:34:39,720
and the foreign key.
13644
10:34:39,720 --> 10:34:43,720
And we want to, in effect, pull data from both the album and the artist,
13645
10:34:43,720 --> 10:34:45,720
the album title and the artist name.
13646
10:34:45,720 --> 10:34:47,720
And we want to show that.
13647
10:34:47,720 --> 10:34:51,720
And so we're going to say select, which is the same select statement.
13648
10:34:51,720 --> 10:34:52,720
Here's a little different syntax.
13649
10:34:52,720 --> 10:34:54,720
This is the list of fields.
13650
10:34:54,720 --> 10:34:56,720
This is table.field.
13651
10:34:56,720 --> 10:35:02,720
So it's album.title and artist.name, with a comma there, from album.
13652
10:35:02,720 --> 10:35:05,720
And I always start with where the little arrow starts from,
13653
10:35:05,720 --> 10:35:07,720
album joined with.
13654
10:35:07,720 --> 10:35:11,720
So that is going to walk down this connection from album to artist.
13655
10:35:11,720 --> 10:35:14,720
Album joined with artist.
13656
10:35:14,720 --> 10:35:16,720
You don't write "with"; I just say it.
13657
10:35:16,720 --> 10:35:20,720
On, and then this is the condition upon which that join is going to happen.
13658
10:35:20,720 --> 10:35:24,720
When the album's artist ID, which is this column here,
13659
10:35:24,720 --> 10:35:31,720
album's artist ID matches, think of that as is equal to or matches the artist's ID.
13660
10:35:31,720 --> 10:35:36,720
And so it only connects the rows here when there is a match between these two tables.
13661
10:35:36,720 --> 10:35:43,720
And so if we look at this, we see that this one matches this one and this one matches that one.
13662
10:35:43,720 --> 10:35:50,720
And so the join connects conditionally and it connects when the on clause is satisfied.
13663
10:35:50,720 --> 10:35:54,720
And so when this whole join runs, this is what we get.
13664
10:35:54,720 --> 10:35:56,720
So you select all this stuff.
13665
10:35:56,720 --> 10:35:57,720
Now this is an abstraction.
13666
10:35:57,720 --> 10:35:58,720
Are you writing a loop?
13667
10:35:58,720 --> 10:36:00,720
Are you doing two nested loops?
13668
10:36:00,720 --> 10:36:02,720
How are you exactly bringing all this data together?
13669
10:36:02,720 --> 10:36:05,720
We don't care about that because that's the beauty of SQL.
13670
10:36:05,720 --> 10:36:09,720
That's the beauty of how we do this in a database.
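As a sketch of the join just described, here is the album-to-artist query run through Python's sqlite3. The table and column names are assumed to follow the naming convention from the lecture (Album.artist_id points at Artist.id), and the in-memory database is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
""")

# Start from the table the arrow starts from (Album), join to Artist,
# and connect rows only when the foreign key matches the primary key.
rows = cur.execute("""
    SELECT Album.title, Artist.name
    FROM Album JOIN Artist ON Album.artist_id = Artist.id
    ORDER BY Album.id
""").fetchall()

for title, name in rows:
    print(title, "|", name)
```

We never say how to loop over the two tables; the database works that out.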
13671
10:36:09,720 --> 10:36:13,720
So now if we can just run this command, so let's grab this command.
13672
10:36:13,720 --> 10:36:19,720
Select track.title, genre.name from track join genre, that exact query.
13673
10:36:19,720 --> 10:36:21,720
Case of keywords doesn't matter.
13674
10:36:21,720 --> 10:36:24,720
And we go over here and we run this as SQL.
13675
10:36:24,720 --> 10:36:26,720
And we run it.
13676
10:36:26,720 --> 10:36:34,720
We get, oops, I got too far.
13677
10:36:34,720 --> 10:36:35,720
Let's do this one.
13678
10:36:35,720 --> 10:36:37,720
So let's do that one there.
13679
10:36:37,720 --> 10:36:40,720
Select artist name.
13680
10:36:40,720 --> 10:36:42,720
I have to add that one to my little cheat sheet.
13681
10:36:42,720 --> 10:36:46,720
The next time you see the cheat sheet, it'll be right.
13682
10:36:46,720 --> 10:36:53,720
So the title, so this is coming from one table and that's coming from another table.
13683
10:36:53,720 --> 10:36:55,720
And so that's one.
13684
10:36:55,720 --> 10:37:00,720
So here is something we can do that gives us a little more detail on that.
13685
10:37:00,720 --> 10:37:04,720
We can say, so this is where the connection is.
13686
10:37:04,720 --> 10:37:09,720
So you can think of the join as sort of spreading one table and connecting it to the other table.
13687
10:37:09,720 --> 10:37:13,720
And so what we're going to show here is it's exactly the same.
13688
10:37:13,720 --> 10:37:18,720
The thing we're going to do is we're going to add these two columns so you can see where the match happens.
13689
10:37:18,720 --> 10:37:20,720
And so this is one table.
13690
10:37:20,720 --> 10:37:22,720
This is another table.
13691
10:37:22,720 --> 10:37:26,720
And these are the kind of columns in common, even though they're not.
13692
10:37:26,720 --> 10:37:28,720
They're the columns that match.
13693
10:37:28,720 --> 10:37:30,720
This is where the on clause is happening, right?
13694
10:37:30,720 --> 10:37:44,720
We have taken this table joined with this table on these two things connecting with each other.
13695
10:37:44,720 --> 10:37:48,720
In some variants of SQL, this would even be a where clause.
13696
10:37:48,720 --> 10:37:54,720
So you connect these two rows, but only connect them when those two numbers match.
13697
10:37:54,720 --> 10:38:04,720
So you can see, I mean, if we run this, I'll just run this.
13698
10:38:04,720 --> 10:38:08,720
And again, you just see this is where it connects.
13699
10:38:08,720 --> 10:38:18,720
Now, interestingly, we can see what happens and what the purpose of the on clause is if we omit it.
13700
10:38:18,720 --> 10:38:23,720
So this is exactly the same as that previous query, except there's no on clause.
13701
10:38:23,720 --> 10:38:27,720
So it's select all four of those fields from the track joined with the genre.
13702
10:38:27,720 --> 10:38:33,720
So it's basically taking the track table and the genre with a join, but no on clause.
13703
10:38:33,720 --> 10:38:36,720
So it's not filtering for matches.
13704
10:38:36,720 --> 10:38:37,720
This is a match.
13705
10:38:37,720 --> 10:38:38,720
This is a match.
13706
10:38:38,720 --> 10:38:39,720
That's a match.
13707
10:38:39,720 --> 10:38:40,720
That's a match.
13708
10:38:40,720 --> 10:38:44,720
But we don't have an on clause, so the matchingness doesn't matter.
13709
10:38:44,720 --> 10:38:47,720
And so you're going to get all possible combinations.
13710
10:38:47,720 --> 10:38:56,720
And literally, if there were 10 on one side and 30 on the other side, you would get 300 rows in that join.
13711
10:38:56,720 --> 10:39:00,720
So it'd be all combinations, except the on clause reduces the combinations.
13712
10:39:00,720 --> 10:39:02,720
And you might think, whoa, this is really inefficient.
13713
10:39:02,720 --> 10:39:08,720
And I will say that's what my first reaction was when I first saw this, but it's not inefficient.
13714
10:39:08,720 --> 10:39:09,720
That's the beauty of abstraction.
13715
10:39:09,720 --> 10:39:10,720
That's the beauty of SQL.
13716
10:39:10,720 --> 10:39:14,720
You say, do it, and it just figures that out.
13717
10:39:14,720 --> 10:39:20,720
So let me grab this, and you will see that we can run this one as well.
13718
10:39:20,720 --> 10:39:26,720
And that kind of shows you why the on clause is important, because now we have a whole bunch of these things.
13719
10:39:26,720 --> 10:39:29,720
And the on clause just filters that out.
13720
10:39:29,720 --> 10:39:35,720
So if we would just add the on clause back in, then that would only show the ones we showed on the previous slide.
13721
10:39:35,720 --> 10:39:37,720
So that's why the on clause is important.
13722
10:39:37,720 --> 10:39:42,720
The join is like all possible combinations of all pairs of rows between these two tables.
13723
10:39:42,720 --> 10:39:45,720
On is, oh, but only where these two things match.
13724
10:39:45,720 --> 10:39:57,720
And you might think that it's inefficient, but the on clause turns out to be the way it becomes efficient.
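To see the difference the on clause makes, here is a small sketch that runs the same join twice, once without on and once with it. The Track table is cut down to just a title and a genre_id, and the track titles are illustrative stand-ins, not necessarily the rows on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT, genre_id INTEGER);
INSERT INTO Genre (name) VALUES ('Rock');
INSERT INTO Genre (name) VALUES ('Metal');
INSERT INTO Track (title, genre_id) VALUES ('Black Dog', 1);
INSERT INTO Track (title, genre_id) VALUES ('Stairway', 1);
INSERT INTO Track (title, genre_id) VALUES ('About to Rock', 2);
INSERT INTO Track (title, genre_id) VALUES ('Hells Bells', 2);
""")

# No ON clause: every track is paired with every genre.
all_pairs = cur.execute("""
    SELECT Track.title, Track.genre_id, Genre.id, Genre.name
    FROM Track JOIN Genre
""").fetchall()

# With ON: only the pairs where the two numbers match survive.
matched = cur.execute("""
    SELECT Track.title, Track.genre_id, Genre.id, Genre.name
    FROM Track JOIN Genre ON Track.genre_id = Genre.id
    ORDER BY Track.id
""").fetchall()

print(len(all_pairs), "rows without ON,", len(matched), "rows with ON")
```

Four tracks times two genres gives eight combinations; the on clause keeps the four where the foreign key matches the primary key.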
13725
10:39:57,720 --> 10:40:12,720
So now we're going to do the same thing where we're just going to take the track title and the genre.
13726
10:40:12,720 --> 10:40:14,720
We're going to connect that together.
13727
10:40:14,720 --> 10:40:15,720
So we select this.
13728
10:40:15,720 --> 10:40:20,720
We need to join from one table, join to the genre table with an on clause.
13729
10:40:20,720 --> 10:40:22,720
And so we're going to make those connections.
13730
10:40:22,720 --> 10:40:32,720
And the only thing we're going to look at is the title and the genre name.
13731
10:40:32,720 --> 10:40:34,720
Oh, oops.
13732
10:40:34,720 --> 10:40:37,720
And then run that.
13733
10:40:37,720 --> 10:40:39,720
And so we got the title and genre name.
13734
10:40:39,720 --> 10:40:47,720
Now the thing you'll notice is for the first time, we now have replication of string data in a vertical dimension.
13735
10:40:47,720 --> 10:40:51,720
That's okay, because the data is not replicated in the database.
13736
10:40:51,720 --> 10:40:54,720
The data is now replicated as a result of the join.
13737
10:40:54,720 --> 10:41:00,720
And so we are going to reconstruct what the user wants to see. The user originally, all the way back at the beginning,
13738
10:41:00,720 --> 10:41:04,720
wanted to see the duplicate information in the vertical axis.
13739
10:41:04,720 --> 10:41:06,720
But now we're reconstructing it.
13740
10:41:06,720 --> 10:41:11,720
We didn't waste the space or performance in our database, but we still have to show them.
13741
10:41:11,720 --> 10:41:16,720
And so now the next thing we're going to do is a monster.
13742
10:41:16,720 --> 10:41:19,720
We are going to reconstruct across all four tables.
13743
10:41:19,720 --> 10:41:21,720
And you might think this is really hard.
13744
10:41:21,720 --> 10:41:26,720
And sure, it's going to be a little tricky, but as long as you follow the naming convention
13745
10:41:26,720 --> 10:41:31,720
and the naming convention makes sense, we're going to select the track's title, the artist's name,
13746
10:41:31,720 --> 10:41:33,720
the album's title, and the genre name.
13747
10:41:33,720 --> 10:41:38,720
From track, join genre, join album, join artist.
13748
10:41:38,720 --> 10:41:41,720
And so the joins follow the little arrows, right?
13749
10:41:41,720 --> 10:41:46,720
And then the on clause qualifies each of those arrows when to follow the arrow.
13750
10:41:46,720 --> 10:41:49,720
And then this becomes pretty easy.
13751
10:41:49,720 --> 10:41:50,720
It's a foreign key.
13752
10:41:50,720 --> 10:41:55,720
The track's genre ID, that's a foreign key, equals genre.id.
13753
10:41:55,720 --> 10:42:00,720
That's a primary key, and that's a foreign key, because I name it that way.
13754
10:42:00,720 --> 10:42:04,720
And I know that this goes to that genre table because I name it that way.
13755
10:42:04,720 --> 10:42:10,720
And track's album ID is equal to the album's ID, foreign key, primary key.
13756
10:42:10,720 --> 10:42:14,720
And album's artist ID is equal to artist's ID.
13757
10:42:14,720 --> 10:42:18,720
After a while, you can type these pretty fast as long as you follow a naming convention
13758
10:42:18,720 --> 10:42:20,720
and you know the naming convention.
13759
10:42:20,720 --> 10:42:25,720
So this looks like it's really hard to do, but after a while, it's really just a pattern.
13760
10:42:25,720 --> 10:42:30,720
So let's go ahead and run that one.
13761
10:42:30,720 --> 10:42:36,720
And it will, assuming we've done everything right, replicate all the data.
13762
10:42:36,720 --> 10:42:39,720
So there's all kinds of vertical data now being replicated.
13763
10:42:39,720 --> 10:42:41,720
Every column has vertical data.
13764
10:42:41,720 --> 10:42:46,720
Again, it's not in the database; the select and the join are reconstructing vertical data
13765
10:42:46,720 --> 10:42:50,720
as it needs to be shown to the user.
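Here is a sketch of that four-table join in Python. The schema is trimmed to the columns the join needs, the names follow the table_id naming convention, and the track titles are illustrative stand-ins, so treat it as one way the query could look, not a copy of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT,
                    album_id INTEGER, genre_id INTEGER);
INSERT INTO Artist (name) VALUES ('Led Zeppelin'), ('AC/DC');
INSERT INTO Genre (name) VALUES ('Rock'), ('Metal');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2), ('IV', 1);
""")
cur.executemany(
    "INSERT INTO Track (title, album_id, genre_id) VALUES (?, ?, ?)",
    [("Black Dog", 2, 1), ("Stairway", 2, 1),
     ("About to Rock", 1, 2), ("Hells Bells", 1, 2)],
)

# Each join follows one arrow; each ON condition is foreign key = primary key.
rows = cur.execute("""
    SELECT Track.title, Artist.name, Album.title, Genre.name
    FROM Track JOIN Genre JOIN Album JOIN Artist
      ON Track.genre_id = Genre.id
     AND Track.album_id = Album.id
     AND Album.artist_id = Artist.id
    ORDER BY Track.id
""").fetchall()

for row in rows:
    print(row)
```

The artist, album, and genre strings repeat down the output, but each one is still stored only once in its own table.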
13766
10:42:50,720 --> 10:42:55,720
And so, if you've been following along, probably a couple hours later now,
13767
10:42:55,720 --> 10:42:59,720
we started with a picture that was our mock-up of what we wanted our user interface to look like.
13768
10:42:59,720 --> 10:43:04,720
And it had vertical stuff, and we're like, ah, we can't put that in a database model.
13769
10:43:04,720 --> 10:43:07,720
And then we carefully built a database model that didn't have the duplicate data,
13770
10:43:07,720 --> 10:43:09,720
and then we're like, ah, we've got to reconstruct it.
13771
10:43:09,720 --> 10:43:11,720
So we use join to reconstruct it.
13772
10:43:11,720 --> 10:43:16,720
And so, after all that, we went here with a clean little model with four tables
13773
10:43:16,720 --> 10:43:20,720
all beautifully connected together, and then we had to join it all back together.
13774
10:43:20,720 --> 10:43:22,720
So join reconstructs it.
13775
10:43:22,720 --> 10:43:27,720
And again, the key is the storage is efficient, the scanning is efficient,
13776
10:43:27,720 --> 10:43:31,720
and we still use the join to produce the output that we ultimately want
13777
10:43:31,720 --> 10:43:38,720
with all the vertical replication that our users really want to see.
13778
10:43:38,720 --> 10:43:44,720
So one more kind of relationship, that was called a one-to-many relationship.
13779
10:43:44,720 --> 10:43:46,720
That was actually three one-to-many relationships.
13780
10:43:46,720 --> 10:43:56,720
And the other major relationship is what's called a many-to-many relationship.
13781
10:43:56,720 --> 10:44:00,720
We're going to do some code walkthroughs, actually running some code.
13782
10:44:00,720 --> 10:44:03,720
And if you want to follow along with the code,
13783
10:44:03,720 --> 10:44:07,720
the sample code is here in the materials of my Python for Everybody website.
13784
10:44:07,720 --> 10:44:09,720
So you can take a look at that.
13785
10:44:09,720 --> 10:44:13,720
So the code we're going to look at is from the database chapter.
13786
10:44:13,720 --> 10:44:17,720
And we're going to look at tracks.py.
13787
10:44:17,720 --> 10:44:21,720
So a lot of the lectures that I give in this database chapter are just about SQL.
13788
10:44:21,720 --> 10:44:24,720
And this is really about SQL and Python.
13789
10:44:24,720 --> 10:44:27,720
So I'll go through this in some detail.
13790
10:44:27,720 --> 10:44:30,720
So the code that I'm going through is in tracks.
13791
10:44:30,720 --> 10:44:34,720
There's also tracks.zip that you can grab that has these two things.
13792
10:44:34,720 --> 10:44:42,720
It's got this library.xml file, which, if you have iTunes,
13793
10:44:42,720 --> 10:44:45,720
you can export this, or you can just play with my iTunes.
13794
10:44:45,720 --> 10:44:48,720
And so this is also going to review how to read XML.
13795
10:44:48,720 --> 10:44:51,720
So we're going to actually pull all this data.
13796
10:44:51,720 --> 10:44:58,720
And this XML that Apple produces out of iTunes is a little weird
13797
10:44:58,720 --> 10:45:00,720
in that it's kind of key values.
13798
10:45:00,720 --> 10:45:02,720
And so you see key value pairs.
13799
10:45:02,720 --> 10:45:04,720
And it even uses the word dictionary.
13800
10:45:04,720 --> 10:45:07,720
And so it's like, I'm going to make a dictionary that has this,
13801
10:45:07,720 --> 10:45:09,720
then a dictionary within a dictionary.
13802
10:45:09,720 --> 10:45:12,720
This, to me, would be so nice if it was JSON,
13803
10:45:12,720 --> 10:45:16,720
because it's really a list of dictionaries.
13804
10:45:16,720 --> 10:45:20,720
This is a dictionary, then another dictionary, then another dictionary,
13805
10:45:20,720 --> 10:45:22,720
and then the key for that dictionary.
13806
10:45:22,720 --> 10:45:27,720
And it's a weird, weird format.
13807
10:45:27,720 --> 10:45:29,720
But we'll write some Python to be able to read it.
13808
10:45:29,720 --> 10:45:34,720
And so you export that from iTunes.
13809
10:45:34,720 --> 10:45:38,720
And you can use my file, or you can use your file.
13810
10:45:38,720 --> 10:45:41,720
It might be more fun to use your file.
13811
10:45:41,720 --> 10:45:43,720
So here's tracks.py.
13812
10:45:43,720 --> 10:45:45,720
We're going to do some XML.
13813
10:45:45,720 --> 10:45:47,720
And so we import that.
13814
10:45:47,720 --> 10:45:51,720
We're going to import sqlite3 because we want to talk to the database.
13815
10:45:51,720 --> 10:45:53,720
And then we're going to make a database connection.
13816
10:45:53,720 --> 10:45:58,720
And once we run this, you'll see that that file will exist.
13817
10:45:58,720 --> 10:46:03,720
And so right now, if I'm in my tracks data, that file doesn't exist.
13818
10:46:03,720 --> 10:46:07,720
But what we'll see is this is going to actually create it.
13819
10:46:07,720 --> 10:46:11,720
Now remember that we have a cursor, which is sort of like a file handle.
13820
10:46:11,720 --> 10:46:13,720
It's really a database handle, as it were.
13821
10:46:13,720 --> 10:46:17,720
And in order to sort of bootstrap this nicely,
13822
10:46:17,720 --> 10:46:21,720
we are going, because this code is going to run all the time,
13823
10:46:21,720 --> 10:46:24,720
it's going to run and read all of library.xml.
13824
10:46:24,720 --> 10:46:28,720
In later examples, we won't wipe out the database every time.
13825
10:46:28,720 --> 10:46:33,720
And so I'm executing a script, which is a series of SQL commands
13826
10:46:33,720 --> 10:46:35,720
separated by semicolons.
13827
10:46:35,720 --> 10:46:38,720
So I'm going to throw away the artist table, album table, and track table.
13828
10:46:38,720 --> 10:46:41,720
Very similar to the stuff we covered in lecture.
13829
10:46:41,720 --> 10:46:43,720
And then I'm going to do the create table.
13830
10:46:43,720 --> 10:46:45,720
And I'm doing this all automatically.
13831
10:46:45,720 --> 10:46:47,720
And you'll notice this is a triple-quoted string.
13832
10:46:47,720 --> 10:46:50,720
So this is just one big, long string here.
13833
10:46:50,720 --> 10:46:53,720
And it happens to know that it's SQL.
13834
10:46:53,720 --> 10:46:55,720
Thank you, Atom, for that.
13835
10:46:55,720 --> 10:46:57,720
And so it creates all these things.
13836
10:46:57,720 --> 10:47:00,720
Now it's not quite as rich as the data model we built,
13837
10:47:00,720 --> 10:47:02,720
because there's no genres in here.
13838
10:47:02,720 --> 10:47:05,720
And so it's artist, album, track.
13839
10:47:05,720 --> 10:47:08,720
And then there's a foreign key for album ID and a foreign key for artist ID,
13840
10:47:08,720 --> 10:47:15,720
which is sort of a subset of what we were doing.
13841
10:47:15,720 --> 10:47:20,720
And so when that's done, that actually creates all the tables.
13842
10:47:20,720 --> 10:47:23,720
And we'll see those in a moment once we run the code.
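A minimal sketch of that bootstrap step looks something like this. The column definitions are my reconstruction from the description (unique name on artist, unique title on album and track, foreign keys artist_id and album_id), so the real tracks.py may differ in detail. It uses an in-memory database here, where the script in the lecture connects to a file.

```python
import sqlite3

# Assumption: in-memory database; the lecture's script creates a file on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One triple-quoted string holding several SQL commands separated by semicolons.
cur.executescript("""
DROP TABLE IF EXISTS Artist;
DROP TABLE IF EXISTS Album;
DROP TABLE IF EXISTS Track;

CREATE TABLE Artist (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);
CREATE TABLE Album (
    id        INTEGER PRIMARY KEY,
    artist_id INTEGER,
    title     TEXT UNIQUE
);
CREATE TABLE Track (
    id       INTEGER PRIMARY KEY,
    title    TEXT UNIQUE,
    album_id INTEGER,
    len INTEGER, rating INTEGER, count INTEGER
);
""")

tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Because the script drops the tables first, running it again starts from an empty database every time.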
13843
10:47:23,720 --> 10:47:26,720
Then it asks for a file name for the XML.
13844
10:47:26,720 --> 10:47:29,720
And so that's what that is.
13845
10:47:29,720 --> 10:47:34,720
And I wrote a function that does a lookup.
13846
10:47:34,720 --> 10:47:38,720
It's really weird, because if you look at these files,
13847
10:47:38,720 --> 10:47:42,720
like in this dictionary, there is a key.
13848
10:47:42,720 --> 10:47:45,720
And so the key of this dictionary,
13849
10:47:45,720 --> 10:47:47,720
this really should have been a key value pair.
13850
10:47:47,720 --> 10:47:52,720
But so there's this weird thing where the key for an object
13851
10:47:52,720 --> 10:47:54,720
is inside of the object.
13852
10:47:54,720 --> 10:48:00,720
And so we're going to loop through all the children
13853
10:48:00,720 --> 10:48:05,720
in this outer dictionary and find a child tag
13854
10:48:05,720 --> 10:48:06,720
that has a particular key.
13855
10:48:06,720 --> 10:48:08,720
And so you'll see how this works.
13856
10:48:08,720 --> 10:48:12,720
And this was something I was going to use over and over again.
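Here is a sketch of what such a lookup function might look like. It is written from the description above, not copied from tracks.py, and the sample XML is a made-up fragment in the same key-then-value layout.

```python
import xml.etree.ElementTree as ET

def lookup(d, key):
    """Return the text of the element that follows <key>key</key>, or None."""
    found = False
    for child in d:
        if found:
            # The element right after the matching <key> holds the value.
            return child.text
        if child.tag == "key" and child.text == key:
            found = True
    return None

# Hypothetical sample entry, shaped like one track in the exported file.
sample = """<dict>
  <key>Track ID</key><integer>369</integer>
  <key>Name</key><string>Another One Bites The Dust</string>
  <key>Artist</key><string>Queen</string>
</dict>"""

entry = ET.fromstring(sample)
print(lookup(entry, "Name"), "|", lookup(entry, "Artist"))
print(lookup(entry, "Album"))
```

A missing key comes back as None, which is what makes the later sanity checks easy to write.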
13857
10:48:12,720 --> 10:48:14,720
And so the first thing we're going to do
13858
10:48:14,720 --> 10:48:17,720
is we're going to just parse the string, and this is the string.
13859
10:48:17,720 --> 10:48:21,720
And then this, of course, is an XML ElementTree (ET) object.
13860
10:48:21,720 --> 10:48:24,720
And then we're going to say, we're going to do a find all.
13861
10:48:24,720 --> 10:48:26,720
And so this shows how the find all,
13862
10:48:26,720 --> 10:48:28,720
we're going to go to the third-level dictionaries.
13863
10:48:28,720 --> 10:48:32,720
We want to see all of the tracks.
13864
10:48:32,720 --> 10:48:35,720
And so we have a dictionary, and a dictionary, and a dictionary.
13865
10:48:35,720 --> 10:48:40,720
And so what we want is all of these guys.
13866
10:48:40,720 --> 10:48:43,720
All those guys right there.
13867
10:48:43,720 --> 10:48:45,720
Track ID.
13868
10:48:45,720 --> 10:48:47,720
So we're going to get a list of all those.
13869
10:48:47,720 --> 10:48:50,720
That'll be the first one.
13870
10:48:50,720 --> 10:48:51,720
This will be the second one.
13871
10:48:51,720 --> 10:48:59,720
Because the find all says, find the dictionary tag,
13872
10:48:59,720 --> 10:49:02,720
then a dictionary tag within that, and a dictionary tag.
13873
10:49:02,720 --> 10:49:05,720
And then we'll tell how many things we got.
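As a sketch of that find all step, here is a tiny made-up document with the same dict-within-dict-within-dict nesting. The track names and IDs in it are placeholders, and the path string 'dict/dict/dict' is my reading of the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the exported library file.
sample = """<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>369</key>
    <dict>
      <key>Track ID</key><integer>369</integer>
      <key>Name</key><string>First Track</string>
    </dict>
    <key>371</key>
    <dict>
      <key>Track ID</key><integer>371</integer>
      <key>Name</key><string>Second Track</string>
    </dict>
  </dict>
</dict>
</plist>"""

stuff = ET.fromstring(sample)

# Third-level dictionaries: a dict inside a dict inside the top-level dict.
all_tracks = stuff.findall("dict/dict/dict")
print("Dict count:", len(all_tracks))
```

Each item in the list is one track's dictionary, ready to be looped over.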
13874
10:49:05,720 --> 10:49:06,720
And then we're going to loop through,
13875
10:49:06,720 --> 10:49:12,720
and entry is going to iterate through each of these.
13876
10:49:12,720 --> 10:49:15,720
And see, we'll get our name, and our artist.
13877
10:49:15,720 --> 10:49:18,720
Another One Bites the Dust, Queen, and away we go.
13878
10:49:18,720 --> 10:49:21,720
And then the next time through the loop, we'll hit this one.
13879
10:49:21,720 --> 10:49:22,720
Okay?
13880
10:49:22,720 --> 10:49:27,720
So then what we're going to do is we're going to go through all those entries,
13881
10:49:27,720 --> 10:49:31,720
and if there is no track ID, and if that's this track ID field,
13882
10:49:31,720 --> 10:49:32,720
where are you hiding?
13883
10:49:32,720 --> 10:49:34,720
Track ID.
13884
10:49:34,720 --> 10:49:37,720
If we don't have that, we're going to continue.
13885
10:49:37,720 --> 10:49:39,720
And then we're going to look up the name, artist, album,
13886
10:49:39,720 --> 10:49:42,720
play count, rating, and total time.
13887
10:49:42,720 --> 10:49:45,720
Okay?
13888
10:49:45,720 --> 10:49:49,720
And so here they are, play count.
13889
10:49:49,720 --> 10:49:54,720
A lot of those things that we had in the sample lecture that I did.
13890
10:49:54,720 --> 10:49:56,720
And we're going to look those things up.
13891
10:49:56,720 --> 10:49:58,720
And we're going to do some sanity checking.
13892
10:49:58,720 --> 10:50:01,720
If we didn't get a name or an artist or an album, we're going to continue.
13893
10:50:01,720 --> 10:50:03,720
We're going to print them out.
13894
10:50:03,720 --> 10:50:09,720
And then we are going to ask for, get,
13895
10:50:09,720 --> 10:50:13,720
remember how you have to get the primary key of a row so you can use it.
13896
10:50:13,720 --> 10:50:18,720
So the way we're going to do this is we're going to do an insert or ignore.
13897
10:50:18,720 --> 10:50:23,720
And so this or ignore basically says, because I said that the artist's name,
13898
10:50:23,720 --> 10:50:27,720
go up here, I said the artist's name is unique.
13899
10:50:27,720 --> 10:50:30,720
Which means if I attempt to insert the same artist twice,
13900
10:50:30,720 --> 10:50:32,720
it will blow up.
13901
10:50:32,720 --> 10:50:35,720
Okay, because I put this constraint on that.
13902
10:50:35,720 --> 10:50:40,720
Except when I say insert or ignore, that basically says, hey,
13903
10:50:40,720 --> 10:50:43,720
if it's already there, don't insert it again.
13904
10:50:43,720 --> 10:50:47,720
So what I'm doing here is insert or ignore into artist.
13905
10:50:47,720 --> 10:50:49,720
So this is putting a new row into the artist table,
13906
10:50:49,720 --> 10:50:54,720
unless there's already a row in that artist table.
13907
10:50:54,720 --> 10:50:57,720
And the syntax right here, you know,
13908
10:50:57,720 --> 10:51:01,720
the question mark is sort of where this artist variable goes.
13909
10:51:01,720 --> 10:51:03,720
And this is a tuple.
13910
10:51:03,720 --> 10:51:06,720
But I have to sort of put this comma in to force it to be a tuple.
13911
10:51:06,720 --> 10:51:09,720
So this is the way you have a one tuple.
13912
10:51:09,720 --> 10:51:12,720
And then what I need to know is I need to know the primary key
13913
10:51:12,720 --> 10:51:14,720
of this particular artist row.
13914
10:51:14,720 --> 10:51:18,720
Now this line may or may not have actually done the insert.
13915
10:51:18,720 --> 10:51:23,720
And so I need to know what the ID for that particular artist is.
13916
10:51:23,720 --> 10:51:25,720
So I do a select ID from artist where name equals.
13917
10:51:25,720 --> 10:51:30,720
Now it either was already there or I'm getting it fresh and brand new.
13918
10:51:30,720 --> 10:51:33,720
So artist ID equals: I fetch one row,
13919
10:51:33,720 --> 10:51:36,720
and it's going to be the first thing given that I only selected ID.
13920
10:51:36,720 --> 10:51:40,720
And so this artist ID is going to be the ID.
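The insert-or-ignore-then-select pattern can be sketched on its own like this. The helper name artist_id_for is hypothetical, made up for this example, and the table is reduced to an id and a unique name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

def artist_id_for(name):
    """Hypothetical helper: return the primary key for name, inserting if new."""
    # (name,) is a one-tuple; the trailing comma is what makes it a tuple.
    cur.execute("INSERT OR IGNORE INTO Artist (name) VALUES (?)", (name,))
    # Whether or not the insert happened, the row exists now, so look it up.
    cur.execute("SELECT id FROM Artist WHERE name = ?", (name,))
    return cur.fetchone()[0]

first = artist_id_for("Queen")
again = artist_id_for("Queen")   # already there, so the insert is ignored
other = artist_id_for("AC/DC")
print(first, again, other)

row_count = cur.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]
```

Asking for the same artist twice gives back the same ID and leaves only one row in the table.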
13921
10:51:40,720 --> 10:51:47,720
Now I have the foreign key for the album, right?
13922
10:51:47,720 --> 10:51:51,720
And so now I'm going to insert into album the title and artist ID.
13923
10:51:51,720 --> 10:51:53,720
This is the foreign key to the artist table.
13924
10:51:53,720 --> 10:51:57,720
And I got this value that I just moments ago retrieved.
13925
10:51:57,720 --> 10:51:59,720
And I got the album title.
13926
10:51:59,720 --> 10:52:02,720
But this also is insert or ignore.
13927
10:52:02,720 --> 10:52:07,720
Because now if you look, I have unique on the album title.
13928
10:52:07,720 --> 10:52:09,720
Yep, unique's on the album title.
13929
10:52:09,720 --> 10:52:12,720
So that'll do nothing.
13930
10:52:12,720 --> 10:52:13,720
It doesn't blow up.
13931
10:52:13,720 --> 10:52:15,720
Or ignore says don't blow up.
13932
10:52:15,720 --> 10:52:17,720
Just do nothing.
13933
10:52:17,720 --> 10:52:19,720
Because this next line is going to select it.
13934
10:52:19,720 --> 10:52:23,720
And I grab the album's foreign key for either the existing row
13935
10:52:23,720 --> 10:52:25,720
or the new row.
13936
10:52:25,720 --> 10:52:28,720
And then I'm going to insert or replace.
13937
10:52:28,720 --> 10:52:32,720
So what this basically says is if the unique constraint would be violated,
13938
10:52:32,720 --> 10:52:35,720
this turns into an update.
13939
10:52:35,720 --> 10:52:38,720
Now not all SQLs have this but SQLite has this
13940
10:52:38,720 --> 10:52:42,720
that basically says insert or replace.
13941
10:52:42,720 --> 10:52:44,720
Some SQL is totally standard.
13942
10:52:44,720 --> 10:52:47,720
Some things we do are standard; this select statement
13943
10:52:47,720 --> 10:52:50,720
is a totally standard part of SQL.
13944
10:52:50,720 --> 10:52:53,720
Then the insert is totally standard but insert or replace
13945
10:52:53,720 --> 10:52:56,720
and insert or ignore is not totally standard.
13946
10:52:56,720 --> 10:52:57,720
But that's okay.
13947
10:52:57,720 --> 10:52:59,720
It works for SQLite which is what we're doing.
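A minimal sketch of what insert or replace does, assuming a cut-down Track table with made-up columns and sample values (the real table in the walkthrough has more columns). SQLite resolves the conflict by removing the old row and inserting the new one, so the visible effect is update-like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cur = conn.cursor()
# Simplified, hypothetical version of the Track table: title is the unique logical key.
cur.execute("CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT UNIQUE, rating INTEGER)")

# First insert creates the row.
cur.execute("INSERT OR REPLACE INTO Track (title, rating) VALUES ( ?, ? )",
            ("Sample Song", 60))
# Same title again: the UNIQUE constraint would be violated, so the old row is replaced.
cur.execute("INSERT OR REPLACE INTO Track (title, rating) VALUES ( ?, ? )",
            ("Sample Song", 100))
conn.commit()

cur.execute("SELECT title, rating FROM Track")
rows = cur.fetchall()
print(rows)
```

Only one row for the title survives, carrying the most recent rating.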
13948
10:52:59,720 --> 10:53:03,720
And so we have the title, album ID, length, rating, and count.
13949
10:53:03,720 --> 10:53:05,720
And then we have a tuple that does all that stuff.
13950
10:53:05,720 --> 10:53:10,720
And of course the title is unique.
13951
10:53:10,720 --> 10:53:12,720
The title is unique in the track table as well.
13952
10:53:12,720 --> 10:53:14,720
And so we've inserted that.
13953
10:53:14,720 --> 10:53:19,720
So the clever bit here is dealing with new or existing names
13954
10:53:19,720 --> 10:53:21,720
in these three lines.
13955
10:53:21,720 --> 10:53:24,720
And we see that pattern twice here where we're doing that.
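The new-or-existing pattern described here can be sketched roughly as follows. The helper name `artist_id_for` and the sample names are assumptions for illustration; the walkthrough's code does the same steps inline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Artist (
        id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        name TEXT UNIQUE
    )
""")

def artist_id_for(name):
    # Hypothetical helper: insert the name if it is new, do nothing if it already exists.
    cur.execute("INSERT OR IGNORE INTO Artist (name) VALUES ( ? )", (name,))
    # Either way the row exists now, so look up its primary key.
    # (name,) is a one-tuple; the trailing comma is what makes it a tuple.
    cur.execute("SELECT id FROM Artist WHERE name = ? ", (name,))
    return cur.fetchone()[0]

first = artist_id_for("Artist A")
again = artist_id_for("Artist A")   # second call reuses the existing row
other = artist_id_for("Artist B")
conn.commit()

cur.execute("SELECT COUNT(*) FROM Artist")
artist_count = cur.fetchone()[0]
print(first, again, other, artist_count)
```

The same three steps (insert or ignore, select the id, fetch one row) apply to the album table, with the artist's id passed along as the foreign key.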
13956
10:53:24,720 --> 10:53:28,720
Okay, so there's not much left to do except run this code.
13957
10:53:28,720 --> 10:53:30,720
Hopefully it runs.
13958
10:53:30,720 --> 10:53:36,720
python3 tracks.py
13959
10:53:36,720 --> 10:53:39,720
and library.xml.
13960
10:53:39,720 --> 10:53:41,720
Whoosh!
13961
10:53:41,720 --> 10:53:45,720
Okay, so that is my...
13962
10:53:45,720 --> 10:53:51,720
So we found 404 of those dictionaries, three-deep dictionaries.
13963
10:53:51,720 --> 10:53:53,720
And now it's starting to insert them.
13964
10:53:53,720 --> 10:53:55,720
Insert them, insert them, insert them.
13965
10:53:55,720 --> 10:53:58,720
And we can take a look at...
13966
10:53:58,720 --> 10:54:01,720
So we can do an ls -l, or dir on Windows.
13967
10:54:01,720 --> 10:54:04,720
We'll see that we made a track database.
13968
10:54:04,720 --> 10:54:06,720
We extracted the data from this library
13969
10:54:06,720 --> 10:54:08,720
and we made a track database.
13970
10:54:08,720 --> 10:54:10,720
And we have all these foreign keys.
13971
10:54:10,720 --> 10:54:13,720
So let's go and take a look at the SQLite browser.
13972
10:54:13,720 --> 10:54:18,720
File, open database, trackdb.sqlite.
13973
10:54:18,720 --> 10:54:20,720
And come on up.
13974
10:54:20,720 --> 10:54:22,720
Where'd you hide?
13975
10:54:22,720 --> 10:54:24,720
I got it minimized, so there you go.
13976
10:54:24,720 --> 10:54:26,720
Let's look at the database structure.
13977
10:54:26,720 --> 10:54:29,720
We have an album, this is the structure.
13978
10:54:29,720 --> 10:54:32,720
Artist and track, we have no genre.
13979
10:54:32,720 --> 10:54:34,720
And this is all like we did it by hand
13980
10:54:34,720 --> 10:54:37,720
except Python did all this work for us.
13981
10:54:37,720 --> 10:54:40,720
If we take a look at the data and we start from the outside in,
13982
10:54:40,720 --> 10:54:45,720
we have the artist names and their primary keys.
13983
10:54:45,720 --> 10:54:48,720
There's the artist names and primary keys.
13984
10:54:48,720 --> 10:54:54,720
And then we have the albums and we have the artist IDs.
13985
10:54:54,720 --> 10:54:57,720
See the artist IDs, how nice those are.
13986
10:54:57,720 --> 10:55:00,720
So we have the primary key here and the foreign key there
13987
10:55:00,720 --> 10:55:02,720
and then we have the title.
13988
10:55:02,720 --> 10:55:05,720
And if we get to the track,
13989
10:55:05,720 --> 10:55:08,720
we have the album ID and away we go.
13990
10:55:08,720 --> 10:55:12,720
So if I was clever, I should be able to type some SQL.
13991
10:55:12,720 --> 10:55:14,720
Oh, great.
13992
10:55:14,720 --> 10:55:17,720
If I was smart, I'd have had this in a paste buffer.
13993
10:55:17,720 --> 10:55:37,720
So select track.title, album.title, artist.name, I think.
13994
10:55:37,720 --> 10:55:42,720
Artist has names and albums have titles, yes.
13995
10:55:42,720 --> 10:55:52,720
Okay, so I can do that from track, join, album.
13996
10:55:52,720 --> 10:55:57,720
Oops, album, join.
13997
10:55:57,720 --> 10:56:02,720
Let me make that a little bigger.
13998
10:56:02,720 --> 10:56:04,720
Bring that over here.
13999
10:56:04,720 --> 10:56:10,720
Album, track, join, album, join, artist.
14000
10:56:10,720 --> 10:56:21,720
I need an on clause and I can say track.album_id
14001
10:56:21,720 --> 10:56:23,720
equals album.id.
14002
10:56:23,720 --> 10:56:28,720
Notice how I know the name that I named these things
14003
10:56:28,720 --> 10:56:34,720
and album.artist_id equals,
14004
10:56:34,720 --> 10:56:42,720
This is so great when you use a naming convention, artist.id.
14005
10:56:42,720 --> 10:56:45,720
Golly, I think that might work.
14006
10:56:45,720 --> 10:56:49,720
So let's just see what we get when we type that into the SQL box here.
14007
10:56:49,720 --> 10:56:53,720
Execute SQL.
14008
10:56:53,720 --> 10:56:54,720
Run.
14009
10:56:54,720 --> 10:56:57,720
Yay, I got it right the first time.
14010
10:56:57,720 --> 10:57:01,720
So that's basically my nice little joined up track list.
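The join typed here can be sketched in Python roughly like this. The tables are cut down to the key columns and the three rows are made-up sample data, not rows from the actual library export.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE Album  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT UNIQUE);
    CREATE TABLE Track  (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT UNIQUE);

    -- Made-up sample rows, inserted from the outside in.
    INSERT INTO Artist (id, name) VALUES (1, 'Artist A');
    INSERT INTO Album  (id, artist_id, title) VALUES (1, 1, 'Album A');
    INSERT INTO Track  (id, album_id, title) VALUES (1, 1, 'Track A');
""")

# Follow each foreign key to the primary key it points at.
cur.execute("""
    SELECT Track.title, Album.title, Artist.name
    FROM Track JOIN Album JOIN Artist
    ON Track.album_id = Album.id AND Album.artist_id = Artist.id
""")
joined = cur.fetchall()
print(joined)
```

The naming convention (`album_id` pointing at `Album.id`, `artist_id` pointing at `Artist.id`) is what makes the on clause easy to write from memory.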
14011
10:57:01,720 --> 10:57:04,720
Oh, I'm so happy that I got that right the first time.
14012
10:57:04,720 --> 10:57:09,720
Okay, well, so you can play with this yourself.
14013
10:57:09,720 --> 10:57:14,720
Play with this tracks, maybe make an export of your own iTunes library
14014
10:57:14,720 --> 10:57:16,720
and run it with that.
14015
10:57:16,720 --> 10:57:23,720
And so I hope that you found this particular bit of code useful, okay?
14016
10:57:23,720 --> 10:57:28,720
Cheers.
14017
10:57:28,720 --> 10:57:31,720
So our last major topic is called many-to-many relationships
14018
10:57:31,720 --> 10:57:33,720
and up till now everything that we've done
14019
10:57:33,720 --> 10:57:36,720
is what's called a one-to-many relationship.
14020
10:57:36,720 --> 10:57:40,720
And that is there are many tracks associated with one album.
14021
10:57:40,720 --> 10:57:43,720
There are many albums associated with one artist.
14022
10:57:43,720 --> 10:57:46,720
There are many tracks associated with one genre.
14023
10:57:46,720 --> 10:57:49,720
And you can think of labeling and as you look at data models
14024
10:57:49,720 --> 10:57:51,720
they put little labels on each arrow
14025
10:57:51,720 --> 10:57:54,720
that tell you which end of the arrow is the many
14026
10:57:54,720 --> 10:57:56,720
and which end of the arrow is the one.
14027
10:57:56,720 --> 10:57:59,720
And so in this case, the foreign key is pointing to
14028
10:57:59,720 --> 10:58:01,720
there are many of these rows over here,
14029
10:58:01,720 --> 10:58:04,720
many rows that point to one row over here.
14030
10:58:04,720 --> 10:58:06,720
So it's a many-to-one relationship.
14031
10:58:06,720 --> 10:58:07,720
There are various ways.
14032
10:58:07,720 --> 10:58:11,720
Sometimes I'll put two arrows at this end and one arrow at that end.
14033
10:58:11,720 --> 10:58:14,720
But whatever it is, this kind of thing we've been showing
14034
10:58:14,720 --> 10:58:16,720
is a many-to-one relationship.
14035
10:58:16,720 --> 10:58:18,720
And that's probably the most common thing.
14036
10:58:18,720 --> 10:58:22,720
But there are times when you just can't model things
14037
10:58:22,720 --> 10:58:25,720
with a one-to-many relationship.
14038
10:58:25,720 --> 10:58:27,720
So like if you have a mother and children,
14039
10:58:27,720 --> 10:58:31,720
well that's a many-to-one relationship and it's just fine
14040
10:58:31,720 --> 10:58:32,720
and that works fine.
14041
10:58:32,720 --> 10:58:35,720
But sometimes you have a many-to-many relationship
14042
10:58:35,720 --> 10:58:38,720
in that there might be many books.
14043
10:58:38,720 --> 10:58:42,720
One book has many authors and each author has many books.
14044
10:58:42,720 --> 10:58:44,720
And so you don't have like the one side.
14045
10:58:44,720 --> 10:58:45,720
There's no one.
14046
10:58:45,720 --> 10:58:49,720
And so you end up having to build an extra table;
14047
10:58:49,720 --> 10:58:50,720
I call it a connector table.
14048
10:58:50,720 --> 10:58:53,720
They call it a junction table on Wikipedia.
14049
10:58:53,720 --> 10:58:56,720
But we need a little table that allows us to break
14050
10:58:56,720 --> 10:58:59,720
a many-to-many relationship into, in effect,
14051
10:58:59,720 --> 10:59:02,720
two many-to-one relationships and a connector table.
14052
10:59:02,720 --> 10:59:04,720
And so this is a connector table.
14053
10:59:04,720 --> 10:59:06,720
So you could think of this as, you know,
14054
10:59:06,720 --> 10:59:08,720
there are many, many links here
14055
10:59:08,720 --> 10:59:12,720
but we don't have a way to model the many over here to here.
14056
10:59:12,720 --> 10:59:15,720
And so what you do is you basically say,
14057
10:59:15,720 --> 10:59:16,720
oh there's a lot of these things.
14058
10:59:16,720 --> 10:59:18,720
There's many that go to the one.
14059
10:59:18,720 --> 10:59:20,720
The many that go to the one.
14060
10:59:20,720 --> 10:59:23,720
And in here you sort of create that manyness
14061
10:59:23,720 --> 10:59:24,720
that you want to create.
14062
10:59:24,720 --> 10:59:28,720
So it's probably just as easy to look at a sample of this.
14063
10:59:28,720 --> 10:59:32,720
So let's imagine a learning management system
14064
10:59:32,720 --> 10:59:35,720
where you're taking a class and there are some people
14065
10:59:35,720 --> 10:59:37,720
that are teachers and some people that are students
14066
10:59:37,720 --> 10:59:40,720
and many students are members of many classes.
14067
10:59:40,720 --> 10:59:43,720
A student can be part of many classes
14068
10:59:43,720 --> 10:59:45,720
and a class has many students in it.
14069
10:59:45,720 --> 10:59:47,720
So you can't really find the one end.
14070
10:59:47,720 --> 10:59:50,720
And so what we do is we make a table called a membership.
14071
10:59:50,720 --> 10:59:53,720
And in that table of membership we actually often
14072
10:59:53,720 --> 10:59:55,720
don't put a primary key in at all.
14073
10:59:55,720 --> 10:59:58,720
We simply put in two foreign keys.
14074
10:59:58,720 --> 11:00:00,720
And if we're going to put a uniqueness constraint
14075
11:00:00,720 --> 11:00:04,720
we put a combination of the two foreign keys
14076
11:00:04,720 --> 11:00:06,720
as the uniqueness constraint.
14077
11:00:06,720 --> 11:00:09,720
So we say there can be duplicate user IDs
14078
11:00:09,720 --> 11:00:11,720
and duplicate course IDs but there can only be,
14079
11:00:11,720 --> 11:00:14,720
you know, user ID, course ID combinations.
14080
11:00:14,720 --> 11:00:15,720
That has to be unique.
14081
11:00:15,720 --> 11:00:20,720
So you can make unique be more than one column.
14082
11:00:20,720 --> 11:00:23,720
And so if you imagine a course table and a user table
14083
11:00:23,720 --> 11:00:25,720
there's a user ID, the name and email
14084
11:00:25,720 --> 11:00:27,720
and the course has a title and an ID.
14085
11:00:27,720 --> 11:00:29,720
And then we have this little table that just is
14086
11:00:29,720 --> 11:00:33,720
the connector table that sort of points outward.
14087
11:00:33,720 --> 11:00:35,720
And so we can expand this membership.
14088
11:00:35,720 --> 11:00:37,720
So let's take a look at how that works.
14089
11:00:37,720 --> 11:00:41,720
So we're going to create some tables
14090
11:00:41,720 --> 11:00:46,720
and these are very classic tables
14091
11:00:46,720 --> 11:00:49,720
because these are the one end of it.
14092
11:00:49,720 --> 11:00:51,720
So these are the one end of it.
14093
11:00:51,720 --> 11:00:55,720
So it has a primary key, a title, a logical key, email.
14094
11:00:55,720 --> 11:00:57,720
There's a primary key for course and then there's text.
14095
11:00:57,720 --> 11:00:59,720
So we have this unique to kind of indicate
14096
11:00:59,720 --> 11:01:00,720
that it's a logical key.
14097
11:01:00,720 --> 11:01:02,720
We're not going to allow ourselves
14098
11:01:02,720 --> 11:01:04,720
to put any duplicates in here.
14099
11:01:04,720 --> 11:01:09,720
Now the connector table here is the table Member
14100
11:01:09,720 --> 11:01:12,720
and it has two foreign keys, user ID and course ID.
14101
11:01:12,720 --> 11:01:15,720
And you can easily model some data here.
14102
11:01:15,720 --> 11:01:17,720
So I'm going to model role which is going to be
14103
11:01:17,720 --> 11:01:21,720
zero equals student and one equals instructor.
14104
11:01:21,720 --> 11:01:23,720
And then I'm going to indicate that the primary key
14105
11:01:23,720 --> 11:01:26,720
or uniqueness constraint is the combination
14106
11:01:26,720 --> 11:01:28,720
of the user ID and a course ID.
14107
11:01:28,720 --> 11:01:30,720
Now when we say the primary key,
14108
11:01:30,720 --> 11:01:34,720
it both limits our ability to insert duplicates
14109
11:01:34,720 --> 11:01:37,720
but it also allows the database to optimize its scanning
14110
11:01:37,720 --> 11:01:40,720
because it knows that that combination is always unique
14111
11:01:40,720 --> 11:01:43,720
and so it can organize its disk structure
14112
11:01:43,720 --> 11:01:46,720
and storage structure to understand
14113
11:01:46,720 --> 11:01:48,720
how to look things up more efficiently.
14114
11:01:48,720 --> 11:01:50,720
Knowing that once it's found a user ID,
14115
11:01:50,720 --> 11:01:52,720
course ID combination, it doesn't have to look any farther
14116
11:01:52,720 --> 11:01:53,720
because they're unique.
14117
11:01:53,720 --> 11:01:55,720
And so all of these constraints that we add
14118
11:01:55,720 --> 11:01:59,720
speed things up, save storage and make things more efficient.
14119
11:01:59,720 --> 11:02:02,720
But in ways we don't always know exactly how they happened.
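A minimal sketch of the three tables and the composite primary key, assuming column names `user_id`, `course_id`, and `role` as spoken in the lecture. It only demonstrates the uniqueness rule; the storage and lookup optimizations are left to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE User (
        id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        name  TEXT UNIQUE,
        email TEXT
    );
    CREATE TABLE Course (
        id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        title TEXT UNIQUE
    );
    CREATE TABLE Member (
        user_id   INTEGER,
        course_id INTEGER,
        role      INTEGER,          -- 0 = student, 1 = instructor
        PRIMARY KEY (user_id, course_id)
    );
""")

cur.execute("INSERT INTO Member (user_id, course_id, role) VALUES (1, 1, 1)")
# Same user in a different course is fine: only the combination must be unique.
cur.execute("INSERT INTO Member (user_id, course_id, role) VALUES (1, 2, 0)")

try:
    # Same user_id and course_id combination again: rejected.
    cur.execute("INSERT INTO Member (user_id, course_id, role) VALUES (1, 1, 0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

cur.execute("SELECT COUNT(*) FROM Member")
member_count = cur.fetchone()[0]
print(duplicate_rejected, member_count)
```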
14120
11:02:02,720 --> 11:02:05,720
And so let's go ahead and make these.
14121
11:02:05,720 --> 11:02:07,720
Let's go ahead and make these guys.
14122
11:02:07,720 --> 11:02:10,720
I think I will start with a new database.
14123
11:02:10,720 --> 11:02:16,720
I'm going to call it LMS for Learning Management System.
14124
11:02:16,720 --> 11:02:19,720
No, I don't really want to do that one.
14125
11:02:19,720 --> 11:02:22,720
And so I'm going to not create the table.
14126
11:02:22,720 --> 11:02:24,720
I'm going to do everything in SQL.
14127
11:02:24,720 --> 11:02:27,720
And so let me see if it's in my cheat sheet.
14128
11:02:27,720 --> 11:02:28,720
Nope, that's not in my cheat sheet.
14129
11:02:28,720 --> 11:02:30,720
So I have to fix the cheat sheet again for you.
14130
11:02:30,720 --> 11:02:32,720
By the time you see the cheat sheet,
14131
11:02:32,720 --> 11:02:33,720
all these things will be in there.
14132
11:02:33,720 --> 11:02:39,720
So I'm going to go in here and I'm going to grab create table user.
14133
11:02:39,720 --> 11:02:41,720
Actually, I'm going to grab them all.
14134
11:02:41,720 --> 11:02:44,720
Watch this.
14135
11:02:44,720 --> 11:02:46,720
Grab them all.
14136
11:02:46,720 --> 11:02:48,720
Highlight all these.
14137
11:02:48,720 --> 11:02:50,720
Go over to SQLite Browser.
14138
11:02:50,720 --> 11:02:52,720
Blast them all in.
14139
11:02:52,720 --> 11:02:54,720
And then I'll put a semicolon at the end
14140
11:02:54,720 --> 11:02:59,720
of each one of the statements.
14141
11:02:59,720 --> 11:03:01,720
And I want to run them.
14142
11:03:01,720 --> 11:03:05,720
So does it look good?
14143
11:03:05,720 --> 11:03:07,720
Yep, yep, yep.
14144
11:03:07,720 --> 11:03:08,720
So I got a course.
14145
11:03:08,720 --> 11:03:11,720
I got membership, two foreign keys, and I got user.
14146
11:03:11,720 --> 11:03:14,720
So that all looks good.
14147
11:03:14,720 --> 11:03:20,720
So now we're going to have to insert some data in.
14148
11:03:20,720 --> 11:03:22,720
And we're going to insert from the outside in.
14149
11:03:22,720 --> 11:03:24,720
And so we're going to just put the name and email.
14150
11:03:24,720 --> 11:03:27,720
The ID will be automatically assigned for the users.
14151
11:03:27,720 --> 11:03:29,720
And we're going to do the same thing.
14152
11:03:29,720 --> 11:03:33,720
And the ID and the courses will be automatically assigned.
14153
11:03:33,720 --> 11:03:37,720
So let me just grab all this stuff.
14154
11:03:37,720 --> 11:03:38,720
Go into SQL.
14155
11:03:38,720 --> 11:03:40,720
That has the semicolons at the end already for me.
14156
11:03:40,720 --> 11:03:42,720
Thank you very much.
14157
11:03:42,720 --> 11:03:44,720
Now I'm going to run it.
14158
11:03:44,720 --> 11:03:49,720
And if I take a look at my data, now I've got primary keys for the courses.
14159
11:03:49,720 --> 11:03:52,720
And I've got primary keys for the users.
14160
11:03:52,720 --> 11:03:54,720
And I've got nothing in the membership table.
14161
11:03:54,720 --> 11:03:57,720
And I, of course, have to remember what these values are
14162
11:03:57,720 --> 11:04:01,720
because Jane is one, and Ed is two, and Sue is three, right?
14163
11:04:01,720 --> 11:04:05,720
And Python is one, SQL is two, PHP is three.
14164
11:04:05,720 --> 11:04:09,720
And so when I go into membership, I've got two foreign keys here and a role.
14165
11:04:09,720 --> 11:04:13,720
And they just have to be for the course person combination.
14166
11:04:13,720 --> 11:04:17,720
And so it's a little tricky to figure all this stuff out.
14167
11:04:17,720 --> 11:04:19,720
But again, these are just numbers.
14168
11:04:19,720 --> 11:04:22,720
And if you look at these numbers, user ID, course ID, role.
14169
11:04:22,720 --> 11:04:24,720
Well, user ID one is in course one.
14170
11:04:24,720 --> 11:04:26,720
User ID one is in course one as the teacher.
14171
11:04:26,720 --> 11:04:30,720
User ID two is in course one as the student, et cetera, et cetera, et cetera.
14172
11:04:30,720 --> 11:04:34,720
So I'm making these connections by just putting these little numbers in.
14173
11:04:34,720 --> 11:04:39,720
And once again, conveniently, I have all my semicolons perfectly in place.
14174
11:04:39,720 --> 11:04:42,720
So I go to SQL.
14175
11:04:42,720 --> 11:04:44,720
And then I run that.
14176
11:04:44,720 --> 11:04:48,720
And then I take and I look at my membership data, and there it is.
14177
11:04:48,720 --> 11:04:52,720
So two foreign keys and a bit of data modeled at the connection.
14178
11:04:52,720 --> 11:04:53,720
That's the way we say that.
14179
11:04:53,720 --> 11:04:56,720
The role is modeled at the connection.
14180
11:04:56,720 --> 11:05:01,720
So now we build all this stuff up, we can write some queries that take a look at this.
14181
11:05:01,720 --> 11:05:07,720
And so what we're going to do is we're going to look at who's in what course and what role are they.
14182
11:05:07,720 --> 11:05:11,720
And we're going to sort this in a nice way.
14183
11:05:11,720 --> 11:05:14,720
So let's just take a quick look at the code we're writing.
14184
11:05:14,720 --> 11:05:20,720
We're going to do a select from three tables, the user name, the member role, the course title.
14185
11:05:20,720 --> 11:05:25,720
So in effect, we're not showing any of the foreign keys or the primary keys.
14186
11:05:25,720 --> 11:05:29,720
We're going to go from the user table, join to the member table, join to the course table.
14187
11:05:29,720 --> 11:05:31,720
This is pretty easy to write.
14188
11:05:31,720 --> 11:05:33,720
You know there are three tables you want to go across.
14189
11:05:33,720 --> 11:05:37,720
The on clause is also very easy to write, right?
14190
11:05:37,720 --> 11:05:46,720
The on clause models each of these connections, where the member's user ID is equal to the user's ID.
14191
11:05:46,720 --> 11:05:51,720
And where the member's course ID is equal to the course ID.
14192
11:05:51,720 --> 11:05:58,720
So we're going to concatenate all three of these tables together, but we're going to only keep rows where it matters.
14193
11:05:58,720 --> 11:06:03,720
Now this role doesn't participate, but we're going to print that out.
14194
11:06:03,720 --> 11:06:11,720
And we're going to order it by the course title first, and then the member role second, and the name third.
14195
11:06:11,720 --> 11:06:24,720
And so let's run that.
14196
11:06:24,720 --> 11:06:25,720
So we've reconnected it.
14197
11:06:25,720 --> 11:06:27,720
So Ed's the teacher of the PHP class.
14198
11:06:27,720 --> 11:06:29,720
Sue is the student in the PHP class.
14199
11:06:29,720 --> 11:06:31,720
Jane is the teacher in the Python class.
14200
11:06:31,720 --> 11:06:33,720
Ed and Sue are students in the Python class.
14201
11:06:33,720 --> 11:06:38,720
Ed's the teacher in the SQL class, and Jane is the student in the SQL class.
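The query and the result just read out can be reproduced with a sketch like the one below. The membership rows are reconstructed from the spoken result, the email addresses are placeholders, and sorting the role in descending order (so teachers come first within each course) is an assumption inferred from the order in which the rows are read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE User   (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                         name TEXT UNIQUE, email TEXT);
    CREATE TABLE Course (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                         title TEXT UNIQUE);
    CREATE TABLE Member (user_id INTEGER, course_id INTEGER, role INTEGER,
                         PRIMARY KEY (user_id, course_id));

    -- ids are assigned automatically: Jane=1, Ed=2, Sue=3
    INSERT INTO User (name, email) VALUES ('Jane', 'jane@example.com');
    INSERT INTO User (name, email) VALUES ('Ed', 'ed@example.com');
    INSERT INTO User (name, email) VALUES ('Sue', 'sue@example.com');

    -- Python=1, SQL=2, PHP=3
    INSERT INTO Course (title) VALUES ('Python');
    INSERT INTO Course (title) VALUES ('SQL');
    INSERT INTO Course (title) VALUES ('PHP');
""")

# (user_id, course_id, role) with role 1 = teacher, 0 = student
cur.executemany(
    "INSERT INTO Member (user_id, course_id, role) VALUES ( ?, ?, ? )",
    [(1, 1, 1), (2, 1, 0), (3, 1, 0), (1, 2, 0), (2, 2, 1), (2, 3, 1), (3, 3, 0)],
)

cur.execute("""
    SELECT User.name, Member.role, Course.title
    FROM User JOIN Member JOIN Course
    ON Member.user_id = User.id AND Member.course_id = Course.id
    ORDER BY Course.title, Member.role DESC, User.name
""")
roster = cur.fetchall()
for row in roster:
    print(row)
```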
14202
11:06:38,720 --> 11:06:45,720
And so we have many people, there are many students in many classes there, and so we have modeled that.
14203
11:06:45,720 --> 11:06:48,720
But we model that with this sort of table.
14204
11:06:48,720 --> 11:06:55,720
And if you look at a piece of software that I've written called Tsugi, which is a standalone learning management system that's built with learning tools,
14205
11:06:55,720 --> 11:07:04,720
you will see the same membership pattern, where we have a user table, we have a context, which is also the course table,
14206
11:07:04,720 --> 11:07:07,720
and then we have a membership table, and you look, here's these foreign keys.
14207
11:07:07,720 --> 11:07:17,720
Like that's the many side, that's the one side, many to one, and so this is now, in effect, a many-to-many between these two,
14208
11:07:17,720 --> 11:07:22,720
but then it's modeled as a series of many to one, many to one relationships.
14209
11:07:22,720 --> 11:07:28,720
And you see this all the time in all kinds of things where membership or other kinds of things are necessary,
14210
11:07:28,720 --> 11:07:31,720
many to one, or many to many.
14211
11:07:31,720 --> 11:07:38,720
So, with all that, there's so much to learn. It's both easy and complex at the same time.
14212
11:07:38,720 --> 11:07:42,720
It's easy when someone shows you how to do it, but at some point you will learn how to build database models,
14213
11:07:42,720 --> 11:07:46,720
and you realize, oh, it wasn't so bad. It takes a while to get used to them.
14214
11:07:46,720 --> 11:07:49,720
This really just is a quick walk.
14215
11:07:49,720 --> 11:07:58,720
The bottom line is, what we just did seems like it was, wow, that's nice. Do you really have to do that?
14216
11:07:58,720 --> 11:08:02,720
And the answer is, if you're going to scale at all, you absolutely have to,
14217
11:08:02,720 --> 11:08:05,720
because you simply can't read and write data sequentially.
14218
11:08:05,720 --> 11:08:10,720
You can't read through and update one little piece of data in a file by reading all the way through
14219
11:08:10,720 --> 11:08:13,720
and then writing a new copy of the file. That could take seconds,
14220
11:08:13,720 --> 11:08:18,720
and in a system like an online system, you get a hundredth of a second to do something like that,
14221
11:08:18,720 --> 11:08:22,720
and the databases make it so that happens in a thousandth of a second.
14222
11:08:22,720 --> 11:08:25,720
So, ultimately, you simply have to take advantage of this.
14223
11:08:25,720 --> 11:08:29,720
You just can't, if you're going to modify data, you can read data from flat files,
14224
11:08:29,720 --> 11:08:33,720
but even if you're going to read a lot of data, if it's big, it slows down terribly.
14225
11:08:33,720 --> 11:08:38,720
So, it might seem like there's a trade-off that you could debate whether this is worth it,
14226
11:08:38,720 --> 11:08:42,720
but if you're going to deal with a lot of data, you've got no choice.
14227
11:08:42,720 --> 11:08:45,720
It's really not as much a trade-off as you think.
14228
11:08:45,720 --> 11:08:49,720
So, this has been a quick romp through databases.
14229
11:08:49,720 --> 11:08:52,720
We talked a little bit about indexes. There are constraints.
14230
11:08:52,720 --> 11:08:55,720
We talked a little bit about the not null stuff. We've talked about that.
14231
11:08:55,720 --> 11:08:57,720
The uniqueness, that's a constraint.
14232
11:08:57,720 --> 11:09:00,720
Another whole area is what's called transactions,
14233
11:09:00,720 --> 11:09:03,720
and that's the locking of little areas.
14234
11:09:03,720 --> 11:09:07,720
So, you can read an area, then lock it, and then update it to make sure no one else reads it.
14235
11:09:07,720 --> 11:09:12,720
And so, they make sure they either get the version before you looked at it
14236
11:09:12,720 --> 11:09:14,720
or before you change it or after you change it.
14237
11:09:14,720 --> 11:09:23,720
And so, that's how you make sure that you can't do things having to do with bank account balances
14238
11:09:23,720 --> 11:09:24,720
and get yourself in trouble.
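The lecture does not show transaction code, so the following is only a small sketch of the idea using Python's sqlite3 module. The Account table, the `transfer` helper, and the amounts are all made up for illustration. Using the connection as a context manager commits the block if it succeeds and rolls it back if it raises, so a half-finished transfer never sticks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
# Hypothetical table: the CHECK constraint forbids negative balances.
conn.execute("CREATE TABLE Account (name TEXT UNIQUE, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO Account VALUES ('A', 100)")
conn.execute("INSERT INTO Account VALUES ('B', 0)")
conn.commit()

def transfer(amount):
    # Hypothetical helper: move money from A to B as one all-or-nothing unit.
    try:
        with conn:  # commit on success, roll back if an exception escapes the block
            conn.execute("UPDATE Account SET balance = balance + ? WHERE name = 'B'", (amount,))
            conn.execute("UPDATE Account SET balance = balance - ? WHERE name = 'A'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(30)          # succeeds: A has enough
too_much = transfer(500)   # second UPDATE violates the CHECK, so both updates are undone
balances = dict(conn.execute("SELECT name, balance FROM Account"))
print(ok, too_much, balances)
```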
14239
11:09:24,720 --> 11:09:27,720
So, there's a lot to SQL. It's really fascinating.
14240
11:09:27,720 --> 11:09:33,720
SQL is a fascinating thing to use and learn and performance tune and enjoy.
14241
11:09:33,720 --> 11:09:38,720
So, relational databases are cool. This gets us started.
14242
11:09:38,720 --> 11:09:43,720
The big thing is don't allow replication vertically of string data.
14243
11:09:43,720 --> 11:09:46,720
Pull that out into a separate table, establish a primary key,
14244
11:09:46,720 --> 11:09:48,720
and then have foreign keys that point to that primary key.
14245
11:09:48,720 --> 11:09:51,720
It is not just how much data you store.
14246
11:09:51,720 --> 11:09:54,720
It's sort of a way of compressing data.
14247
11:09:54,720 --> 11:09:57,720
You might think strings take no data, but they do.
14248
11:09:57,720 --> 11:10:01,720
Numbers take a lot less data, and it's both how much data that's stored
14249
11:10:01,720 --> 11:10:03,720
but also how much data has to be scanned.
14250
11:10:03,720 --> 11:10:10,720
And that way joins work. That's part of the magic of why Oracle is such a successful company.
14251
11:10:10,720 --> 11:10:14,720
It's a bit of an art form, and it's something that you can work your whole life
14252
11:10:14,720 --> 11:10:16,720
and always get better at.
14253
11:10:22,720 --> 11:10:25,720
Hello, and welcome to our code walkthrough on the roster code.
14254
11:10:25,720 --> 11:10:29,720
So, the learning objective of this is to do a many-to-many table.
14255
11:10:29,720 --> 11:10:34,720
And so, the idea is that we're going to, just like we talked about in lecture,
14256
11:10:34,720 --> 11:10:37,720
we're going to have a set of users, we're going to have a set of courses,
14257
11:10:37,720 --> 11:10:41,720
and then we're going to have a connector table or a many-to-many table
14258
11:10:41,720 --> 11:10:44,720
that basically has two foreign keys.
14259
11:10:44,720 --> 11:10:50,720
So, we are going to use INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE
14260
11:10:50,720 --> 11:10:58,720
as the way to get auto-assignment of the primary keys in the user table and the course table.
14261
11:10:58,720 --> 11:11:03,720
And then we're going to say that the name, which is like a logical key,
14262
11:11:03,720 --> 11:11:06,720
and then the course title, we're going to mark those as unique.
14263
11:11:06,720 --> 11:11:09,720
And we're going to take advantage of that in a moment.
14264
11:11:09,720 --> 11:11:11,720
So, you'll see how we take advantage of that.
14265
11:11:11,720 --> 11:11:16,720
So, what unique means is if you try to insert the same string into this column,
14266
11:11:16,720 --> 11:11:21,720
you know, like Chuck twice, then it's going to fail the second time
14267
11:11:21,720 --> 11:11:24,720
because it's going to refuse to create a new record.
14268
11:11:24,720 --> 11:11:26,720
And so, if we just kind of like take a look,
14269
11:11:26,720 --> 11:11:29,720
we're going to get our roster data from this sample JSON,
14270
11:11:29,720 --> 11:11:32,720
which is just an array of arrays.
14271
11:11:32,720 --> 11:11:35,720
And this is the person's name, the class that they're in,
14272
11:11:35,720 --> 11:11:38,720
and whether they are a teacher or a student.
14273
11:11:38,720 --> 11:11:40,720
And so, we're going to read that.
14274
11:11:40,720 --> 11:11:43,720
So, we need the JSON library and the SQLite library.
14275
11:11:43,720 --> 11:11:46,720
We make a database connection, and we get a cursor.
14276
11:11:46,720 --> 11:11:50,720
The cursor is kind of more like the file handle.
14277
11:11:50,720 --> 11:11:52,720
You send SQL commands to the cursor,
14278
11:11:52,720 --> 11:11:54,720
and then you read the cursor to get the data back.
14279
11:11:54,720 --> 11:11:57,720
The connection can create more than one cursor,
14280
11:11:57,720 --> 11:11:59,720
so you can have more than one set of commands.
14281
11:11:59,720 --> 11:12:05,720
But the cursor is generally like the file handle to the database server.
14282
11:12:05,720 --> 11:12:08,720
And we are going to execute a big script,
14283
11:12:08,720 --> 11:12:10,720
and you'll notice this is a triple-quoted string
14284
11:12:10,720 --> 11:12:12,720
that goes all the way down to here.
14285
11:12:12,720 --> 11:12:15,720
And so, some people would just give this to you in a text file
14286
11:12:15,720 --> 11:12:17,720
and have you cut and paste this,
14287
11:12:17,720 --> 11:12:21,720
and then go run that in your SQLite browser to create them.
14288
11:12:21,720 --> 11:12:24,720
But that's okay, because what we're going to do
14289
11:12:24,720 --> 11:12:26,720
is we're going to set this up.
14290
11:12:26,720 --> 11:12:31,720
It will either reconnect to an existing file named rosterdb.sqlite,
14291
11:12:31,720 --> 11:12:33,720
and if I look where I'm at, I do an ls,
14292
11:12:33,720 --> 11:12:35,720
we find that that file is not there.
14293
11:12:35,720 --> 11:12:37,720
So, the first time I run it, it's going to create it.
14294
11:12:37,720 --> 11:12:40,720
But I want this to start fresh every time,
14295
11:12:40,720 --> 11:12:42,720
so I'm going to wipe out the tables if they exist.
14296
11:12:42,720 --> 11:12:44,720
That way, you can run it over and over and over again,
14297
11:12:44,720 --> 11:12:46,720
in case you make a mistake here.
14298
11:12:46,720 --> 11:12:47,720
Now, I don't have a mistake,
14299
11:12:47,720 --> 11:12:50,720
or hopefully I don't have a mistake on this.
14300
11:12:50,720 --> 11:12:53,720
So, we're going to drop three tables,
14301
11:12:53,720 --> 11:12:55,720
and we're going to create three tables.
14302
11:12:55,720 --> 11:12:59,720
And here, we're going to create the table
14303
11:12:59,720 --> 11:13:03,720
that has two foreign keys, user ID, course ID,
14304
11:13:03,720 --> 11:13:06,720
that are sort of going outwards from the member table,
14305
11:13:06,720 --> 11:13:09,720
and then we're going to model a little bit of the data at the role.
14306
11:13:09,720 --> 11:13:14,720
And I guess this, and again, this is straight from the lecture.
14307
11:13:14,720 --> 11:13:18,720
And the primary key is actually a composite primary key,
14308
11:13:18,720 --> 11:13:20,720
because we're going to look up,
14309
11:13:20,720 --> 11:13:23,720
and it's going to force this to be the combination of user ID
14310
11:13:23,720 --> 11:13:24,720
and course ID to be unique.
14311
11:13:24,720 --> 11:13:27,720
But there can be many user IDs and many course IDs,
14312
11:13:27,720 --> 11:13:32,720
but only one particular combination of a value for user ID and course ID.
14313
11:13:32,720 --> 11:13:34,720
And so, that's what we're basically saying.
14314
11:13:34,720 --> 11:13:37,720
You can be a member of a course, but you can only do that once.
14315
11:13:37,720 --> 11:13:41,720
You can't be like a member of the course a bunch of times.
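The table setup described here can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact file from the lecture: the table and column names (User, Course, Member, user_id, course_id, role) are assumed from what is said aloud, and it uses an in-memory database instead of rosterdb.sqlite so that running it leaves nothing behind.

```python
import sqlite3

# In-memory database for the sketch; the walkthrough connects to a file instead.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Drop the tables if they exist, then recreate them, so every run starts fresh.
cur.executescript('''
DROP TABLE IF EXISTS User;
DROP TABLE IF EXISTS Member;
DROP TABLE IF EXISTS Course;

CREATE TABLE User (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name  TEXT UNIQUE
);

CREATE TABLE Course (
    id     INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title  TEXT UNIQUE
);

CREATE TABLE Member (
    user_id    INTEGER,
    course_id  INTEGER,
    role       INTEGER,
    PRIMARY KEY (user_id, course_id)
);
''')

# Many user_ids and many course_ids are fine...
cur.execute('INSERT INTO Member (user_id, course_id) VALUES (1, 1)')
cur.execute('INSERT INTO Member (user_id, course_id) VALUES (1, 2)')  # same user, another course
cur.execute('INSERT INTO Member (user_id, course_id) VALUES (2, 1)')  # another user, same course

# ...but a repeated combination violates the composite primary key.
try:
    cur.execute('INSERT INTO Member (user_id, course_id) VALUES (1, 1)')
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

member_count = cur.execute('SELECT COUNT(*) FROM Member').fetchone()[0]
print(duplicate_allowed, member_count)
```

The plain INSERT is used here on purpose so the constraint shows up as an error; the walkthrough itself avoids that error with the "or ignore" and "or replace" forms discussed next.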
14316
11:13:41,720 --> 11:13:46,720
So, we're going to, oh, that should be roster data sample.
14317
11:13:46,720 --> 11:13:51,720
That's okay to, oops, fix a bug.
14318
11:13:51,720 --> 11:13:54,720
Save that, roster data sample.
14319
11:13:54,720 --> 11:13:57,720
And so, that's just this file, and it's really just an array,
14320
11:13:57,720 --> 11:13:59,720
and then each row is an array,
14321
11:13:59,720 --> 11:14:02,720
and it's a way for us to get this roster data in.
14322
11:14:02,720 --> 11:14:07,720
And so, once we do json.loads, we're parsing it,
14323
11:14:07,720 --> 11:14:10,720
and then this is going to be an array of arrays.
14324
11:14:10,720 --> 11:14:14,720
And so, for entry in JSON data,
14325
11:14:14,720 --> 11:14:17,720
so entry is going to be one of these things.
14326
11:14:17,720 --> 11:14:20,720
So, entry itself is a row.
14327
11:14:20,720 --> 11:14:23,720
So, an entry sub zero is the name,
14328
11:14:23,720 --> 11:14:27,720
and entry sub one is the title, name, that's the sub zero,
14329
11:14:27,720 --> 11:14:33,720
and that's the sub one of the particular entry that we're looking at.
14330
11:14:33,720 --> 11:14:36,720
And we're going to print it out just for yucks as a tuple.
14331
11:14:36,720 --> 11:14:38,720
So, we make, that's what the two parentheses are.
14332
11:14:38,720 --> 11:14:41,720
This inner thing is a two tuple.
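A small sketch of that parsing loop, assuming the roster file is a JSON array of rows where position 0 is the name and position 1 is the course title. The three rows below are made-up sample data standing in for the real file.

```python
import json

# Hypothetical sample standing in for the roster data file.
str_data = '''[
  ["Charley", "si110", 1],
  ["Mea", "si110", 0],
  ["Charley", "si106", 0]
]'''

json_data = json.loads(str_data)   # parses into a list of lists

pairs = []
for entry in json_data:
    name = entry[0]    # entry sub zero
    title = entry[1]   # entry sub one
    print((name, title))           # the inner parentheses build a two-tuple
    pairs.append((name, title))
```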
14333
11:14:41,720 --> 11:14:45,720
And we're then going to take the person,
14334
11:14:45,720 --> 11:14:49,720
and we're going to do an insert, and this is new, or ignore.
14335
11:14:49,720 --> 11:14:54,720
So, what the, or ignore means is if this insert would cause an error,
14336
11:14:54,720 --> 11:14:58,720
please don't blow up, don't, just ignore that I tried to insert it.
14337
11:14:58,720 --> 11:15:01,720
And so, this is our trick, and it's a beautiful trick.
14338
11:15:01,720 --> 11:15:05,720
It's like a gorgeously beautiful trick here.
14339
11:15:05,720 --> 11:15:09,720
If we insert the name Chuck twice,
14340
11:15:09,720 --> 11:15:13,720
or ignore will just mean that nothing happens, meaning it's already there.
14341
11:15:13,720 --> 11:15:16,720
Okay, so if it's already there, if it's not there, it'll put it in.
14342
11:15:16,720 --> 11:15:19,720
And the unique will guarantee that it only goes in once.
14343
11:15:19,720 --> 11:15:23,720
So, we just, in effect, always attempt to insert it.
14344
11:15:23,720 --> 11:15:25,720
And if it's been there once, then it's all set.
14345
11:15:25,720 --> 11:15:29,720
And so, this insert or ignore is a super powerful mechanism.
14346
11:15:29,720 --> 11:15:31,720
I use it all the time.
14347
11:15:31,720 --> 11:15:35,720
And we have a placeholder in the form of a question mark,
14348
11:15:35,720 --> 11:15:37,720
and then we have, so one of these days,
14349
11:15:37,720 --> 11:15:39,720
we'll have two things that we're asking for.
14350
11:15:39,720 --> 11:15:40,720
As a matter of fact, here it is.
14351
11:15:40,720 --> 11:15:41,720
There's a tuple down here.
14352
11:15:41,720 --> 11:15:44,720
But this is kind of a tuple with one item in it, name,
14353
11:15:44,720 --> 11:15:47,720
and that name is then going to substitute in for there
14354
11:15:47,720 --> 11:15:51,720
while avoiding SQL injection.
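Here is a minimal sketch of the insert-or-ignore idea with a question-mark placeholder. The table layout is an assumption based on the description (an id plus a unique name), and the odd-looking third name is only there to show that the placeholder stores the text as data rather than running it as SQL.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

# A name that would be dangerous if pasted directly into the SQL string.
hostile = "Robert'); DROP TABLE User;--"

for name in ['Chuck', 'Chuck', 'Sally', hostile]:
    # (name, ) is a tuple with one item; it fills in the single ? placeholder.
    cur.execute('INSERT OR IGNORE INTO User (name) VALUES ( ? )', (name, ))

conn.commit()

# The second 'Chuck' was silently ignored because name is UNIQUE.
names = [row[0] for row in cur.execute('SELECT name FROM User ORDER BY id')]
print(names)
```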
14355
11:15:51,720 --> 11:15:53,720
So, this runs.
14356
11:15:53,720 --> 11:15:55,720
It may or may not insert a new record,
14357
11:15:55,720 --> 11:15:58,720
but if Chuck or whomever the name is is not there,
14358
11:15:58,720 --> 11:16:00,720
it will give us a new record.
14359
11:16:00,720 --> 11:16:02,720
And then we are going to get back the ID.
14360
11:16:02,720 --> 11:16:06,720
And so, this is the logical key, and this is the primary key.
14361
11:16:06,720 --> 11:16:10,720
And that primary key is going to be auto-constructed for us,
14362
11:16:10,720 --> 11:16:12,720
and so we need to know what it is.
14363
11:16:12,720 --> 11:16:16,720
So, we say select ID from user where name equals
14364
11:16:16,720 --> 11:16:17,720
and then that same name.
14365
11:16:17,720 --> 11:16:20,720
So, that's Chuck, and so that gives us one.
14366
11:16:20,720 --> 11:16:23,720
And then what we do is we're going to fetch one record
14367
11:16:23,720 --> 11:16:24,720
from the cursor because that's a select
14368
11:16:24,720 --> 11:16:26,720
and it gives us back a cursor.
14369
11:16:26,720 --> 11:16:28,720
There's only hopefully one record there because it's unique.
14370
11:16:28,720 --> 11:16:30,720
I could put a limit one in there,
14371
11:16:30,720 --> 11:16:31,720
but that would be kind of redundant
14372
11:16:31,720 --> 11:16:34,720
because the name is a unique key.
14373
11:16:34,720 --> 11:16:37,720
And then the sub-zero just means if there were more than one thing
14374
11:16:37,720 --> 11:16:40,720
that I was selecting, which we'll see in a bit,
14375
11:16:40,720 --> 11:16:42,720
the sub-zero is just the first thing.
14376
11:16:42,720 --> 11:16:46,720
And so, this is going to give us the integer user ID
14377
11:16:46,720 --> 11:16:50,720
that was assigned, or if we're coming through later
14378
11:16:50,720 --> 11:16:53,720
for Chuck, you know, Chuck later, Charlie later,
14379
11:16:53,720 --> 11:16:54,720
that will be the old one.
14380
11:16:54,720 --> 11:16:57,720
So, this is insert if it doesn't exist,
14381
11:16:57,720 --> 11:17:00,720
and this is get the newly created ID field
14382
11:17:00,720 --> 11:17:02,720
or the original ID field.
14383
11:17:02,720 --> 11:17:06,720
And so, part of this works by having both a logical key
14384
11:17:06,720 --> 11:17:07,720
and a primary key.
14385
11:17:07,720 --> 11:17:10,720
The primary key is auto-generated,
14386
11:17:10,720 --> 11:17:12,720
but the name is a logical key and it's unique.
14387
11:17:12,720 --> 11:17:15,720
And so, that's our trick to get that assigned thing.
14388
11:17:15,720 --> 11:17:18,720
Before, we just looked at it in the user interface
14389
11:17:18,720 --> 11:17:22,720
of SQLite browser and wrote it down,
14390
11:17:22,720 --> 11:17:23,720
but this is how we do it in code.
14391
11:17:23,720 --> 11:17:25,720
So, we need to know what that key is,
14392
11:17:25,720 --> 11:17:26,720
whether it was new or not.
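The two-step pattern (always attempt the insert, then select the id back by the logical key) might look like the sketch below. The helper name user_id_for is made up for illustration, and the table layout is assumed from the description.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

def user_id_for(name):
    """Return the primary key for name, inserting the row first if needed."""
    # Step 1: insert if it doesn't exist; do nothing if it does.
    cur.execute('INSERT OR IGNORE INTO User (name) VALUES ( ? )', (name, ))
    # Step 2: look the row up by its logical key to learn the primary key.
    cur.execute('SELECT id FROM User WHERE name = ? ', (name, ))
    # fetchone() returns a one-item row; [0] is the first selected column.
    return cur.fetchone()[0]

first = user_id_for('Chuck')   # newly created id
other = user_id_for('Sally')   # another newly created id
again = user_id_for('Chuck')   # the original id comes back
print(first, other, again)
```

Either way the caller ends up with the key and does not need to care whether the row was new.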
14393
11:17:26,720 --> 11:17:29,720
And then we do the exact same pattern for the course,
14394
11:17:29,720 --> 11:17:31,720
except we're inserting the course title.
14395
11:17:31,720 --> 11:17:36,720
So, that's no big deal.
14396
11:17:36,720 --> 11:17:40,720
And so, we're going to get the user ID, course ID.
14397
11:17:40,720 --> 11:17:43,720
And then what we're going to do is we're going to insert
14398
11:17:43,720 --> 11:17:44,720
or replace.
14399
11:17:44,720 --> 11:17:47,720
So, this is basically if they're,
14400
11:17:47,720 --> 11:17:50,720
remember that this user ID, course ID combination
14401
11:17:50,720 --> 11:17:54,720
is the primary key for this member table.
14402
11:17:54,720 --> 11:17:56,720
If there is a duplicate,
14403
11:17:56,720 --> 11:18:00,720
if this combination is already there,
14404
11:18:00,720 --> 11:18:02,720
this becomes effectively an update statement.
14405
11:18:02,720 --> 11:18:04,720
And we have these two number values.
14406
11:18:04,720 --> 11:18:07,720
Now, what's missing here is the role is not there.
14407
11:18:07,720 --> 11:18:12,720
And so, user ID, course ID, this is the SQL bit.
14408
11:18:12,720 --> 11:18:15,720
And now we have a tuple with two items in it.
14409
11:18:15,720 --> 11:18:18,720
And that's because we have two question marks.
14410
11:18:18,720 --> 11:18:19,720
And then we commit it.
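A short sketch of insert-or-replace against the membership table, with assumed column names and made-up id values. One detail worth knowing: in SQLite, REPLACE removes the conflicting row and inserts a fresh one, so a column left out of the statement (role here) goes back to its default rather than keeping its old value.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Member (
    user_id    INTEGER,
    course_id  INTEGER,
    role       INTEGER,
    PRIMARY KEY (user_id, course_id)
)''')

user_id, course_id = 1, 7   # hypothetical ids from the earlier lookups

# Start with a row that has a role, to show what the replace does to it.
cur.execute('INSERT INTO Member (user_id, course_id, role) VALUES ( ?, ?, 1 )',
            (user_id, course_id))

# Two question marks, so a tuple with two items.
cur.execute('INSERT OR REPLACE INTO Member (user_id, course_id) VALUES ( ?, ? )',
            (user_id, course_id))
conn.commit()

rows = cur.execute('SELECT user_id, course_id, role FROM Member').fetchall()
print(rows)   # still one row for this combination; role was not carried over
```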
14411
11:18:19,720 --> 11:18:21,720
And as I mentioned before,
14412
11:18:21,720 --> 11:18:23,720
sometimes you want to commit every time through.
14413
11:18:23,720 --> 11:18:27,720
The commit is, it turns out that these things are less costly,
14414
11:18:27,720 --> 11:18:30,720
but that's because it's not always writing all the way to disk.
14415
11:18:30,720 --> 11:18:32,720
Whereas when you enter the commit,
14416
11:18:32,720 --> 11:18:34,720
it's going to go and write everything to disk,
14417
11:18:34,720 --> 11:18:36,720
pause until it's complete,
14418
11:18:36,720 --> 11:18:38,720
and your program doesn't continue until then.
14419
11:18:38,720 --> 11:18:42,720
So, sometimes we don't run this every single time through.
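One common way to commit less often is to commit every so many rows and once more at the end. This is only a sketch: the batch size of 10 and the generated names are arbitrary choices for illustration, not something from the lecture.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

names = ['user%d' % i for i in range(25)]   # hypothetical data
commits = 0

for count, name in enumerate(names, start=1):
    cur.execute('INSERT OR IGNORE INTO User (name) VALUES ( ? )', (name, ))
    if count % 10 == 0:      # commit every tenth row instead of every row
        conn.commit()
        commits += 1

conn.commit()                # final commit picks up the leftover rows
commits += 1

total = cur.execute('SELECT COUNT(*) FROM User').fetchone()[0]
print(commits, total)
```

The tradeoff is that a crash between commits can lose the rows that were not yet committed.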
14420
11:18:42,720 --> 11:18:43,720
Okay?
14421
11:18:43,720 --> 11:18:44,720
So, let's just go ahead and run this.
14422
11:18:44,720 --> 11:18:46,720
The only thing we're going to see is the output
14423
11:18:46,720 --> 11:18:49,720
of the name and the title as it's running.
14424
11:18:49,720 --> 11:18:56,720
So, if I do python3 roster.py,
14425
11:18:56,720 --> 11:18:58,720
hopefully I can hit enter.
14426
11:18:58,720 --> 11:18:59,720
So, you'll notice, by the way,
14427
11:18:59,720 --> 11:19:02,720
that this SQLite now exists, right?
14428
11:19:02,720 --> 11:19:04,720
And it has no data in it.
14429
11:19:04,720 --> 11:19:08,720
So, let me see if I can open this database and see it.
14430
11:19:08,720 --> 11:19:10,720
So, you see that there's no data.
14431
11:19:10,720 --> 11:19:11,720
So, where's the code?
14432
11:19:11,720 --> 11:19:15,720
We've run this code, in effect, up to this point.
14433
11:19:15,720 --> 11:19:18,720
So, we've done all the create tables and all that stuff.
14434
11:19:18,720 --> 11:19:20,720
So, the create tables are there.
14435
11:19:20,720 --> 11:19:21,720
So, all this data is here.
14436
11:19:21,720 --> 11:19:23,720
It did it.
14437
11:19:23,720 --> 11:19:25,720
We haven't started putting any data into it yet
14438
11:19:25,720 --> 11:19:26,720
because if we look at browse data,
14439
11:19:26,720 --> 11:19:29,720
we're not finding anything in here.
14440
11:19:29,720 --> 11:19:30,720
Okay?
14441
11:19:30,720 --> 11:19:32,720
There's no data to browse.
14442
11:19:32,720 --> 11:19:34,720
Now, hopefully we won't have locked ourselves
14443
11:19:34,720 --> 11:19:37,720
because we are sitting right here.
14444
11:19:37,720 --> 11:19:39,720
And when I hit enter over here,
14445
11:19:39,720 --> 11:19:40,720
then it's going to go,
14446
11:19:40,720 --> 11:19:42,720
and it's just going to run really fast.
14447
11:19:42,720 --> 11:19:43,720
So, I'll hit enter.
14448
11:19:43,720 --> 11:19:44,720
It'll read it.
14449
11:19:44,720 --> 11:19:46,720
And so, it inserted all of those things.
14450
11:19:46,720 --> 11:19:48,720
And now it's been changed.
14451
11:19:48,720 --> 11:19:50,720
And if I hit refresh over here,
14452
11:19:50,720 --> 11:19:52,720
we will see in the user,
14453
11:19:52,720 --> 11:19:55,720
it just sort of assigned user IDs, right?
14454
11:19:55,720 --> 11:19:57,720
The column's auto-assigned.
14455
11:19:57,720 --> 11:19:59,720
We will find in the course that those courses
14456
11:19:59,720 --> 11:20:01,720
are all auto-assigned.
14457
11:20:01,720 --> 11:20:02,720
There's the courses.
14458
11:20:02,720 --> 11:20:05,720
And there's no duplicates because this is unique, right?
14459
11:20:05,720 --> 11:20:08,720
And so, these are the newly created things.
14460
11:20:08,720 --> 11:20:12,720
But then membership is user ID, course ID.
14461
11:20:12,720 --> 11:20:14,720
And so, again, the primary key,
14462
11:20:14,720 --> 11:20:17,720
as it were, the unique constraint slash primary key
14463
11:20:17,720 --> 11:20:19,720
is the combination of these things.
14464
11:20:19,720 --> 11:20:20,720
And I haven't put anything in role.
14465
11:20:20,720 --> 11:20:22,720
And so, if you scroll through these,
14466
11:20:22,720 --> 11:20:25,720
you'll see all of the users who are members
14467
11:20:25,720 --> 11:20:29,720
of the courses that they're part of, okay?
14468
11:20:29,720 --> 11:20:31,720
So, there you go.
14469
11:20:31,720 --> 11:20:35,720
And I'll leave it up to you to come up with a join.
14470
11:20:35,720 --> 11:20:38,720
I'll leave it up to you to figure out
14471
11:20:38,720 --> 11:20:39,720
how to put the role in.
14472
11:20:39,720 --> 11:20:42,720
But I just wanted to kind of give you
14473
11:20:42,720 --> 11:20:44,720
a bit of a walkthrough of this code base.
14474
11:20:44,720 --> 11:20:48,720
And in particular, the tricks of the uniqueness keys,
14475
11:20:48,720 --> 11:20:51,720
the auto-increment keys, the logical key uniqueness,
14476
11:20:51,720 --> 11:20:53,720
kind of composite primary key,
14477
11:20:53,720 --> 11:20:56,720
and then the trick of insert or ignore.
14478
11:20:56,720 --> 11:20:59,720
And then the quick select that comes right afterwards
14479
11:20:59,720 --> 11:21:04,720
to get the newly generated ID or to get the old ID.
14480
11:21:04,720 --> 11:21:06,720
You can insert or replace,
14481
11:21:06,720 --> 11:21:10,720
which is a combination of an insert and an update.
14482
11:21:10,720 --> 11:21:14,720
So, I hope you found this example useful
14483
11:21:14,720 --> 11:21:25,720
and can apply it and basically create many-to-many tables.
14484
11:21:25,720 --> 11:21:27,720
We are doing some code walkthroughs.
14485
11:21:27,720 --> 11:21:29,720
If you want to follow along with the code,
14486
11:21:29,720 --> 11:21:31,720
you can download the source code
14487
11:21:31,720 --> 11:21:36,720
from the Python for Everybody website, okay?
14488
11:21:36,720 --> 11:21:40,720
So, the code we're playing with today is twfriends.py.
14489
11:21:40,720 --> 11:21:44,720
And this is a step beyond the simple twspider.
14490
11:21:44,720 --> 11:21:46,720
It is a restartable spider.
14491
11:21:46,720 --> 11:21:48,720
But we're going to data model things a little bit differently.
14492
11:21:48,720 --> 11:21:49,720
We're going to have two tables,
14493
11:21:49,720 --> 11:21:53,720
and we're going to have a many-to-many relationship,
14494
11:21:53,720 --> 11:21:56,720
except that it's sort of a many-to-many relationship
14495
11:21:56,720 --> 11:21:59,720
between the same table, which is okay.
14496
11:21:59,720 --> 11:22:05,720
Friends is a, Twitter Friends are a directional relationship.
14497
11:22:05,720 --> 11:22:09,720
And so, we start out here in twfriends.py.
14498
11:22:09,720 --> 11:22:12,720
Remember that the file hidden.py,
14499
11:22:12,720 --> 11:22:14,720
I'll show it to you, but I'm not going to open it
14500
11:22:14,720 --> 11:22:16,720
because I've got my keys and secrets in it.
14501
11:22:16,720 --> 11:22:19,720
So, this hidden.py file, you've got to edit that,
14502
11:22:19,720 --> 11:22:22,720
and you've got to go to apps.twitter.com
14503
11:22:22,720 --> 11:22:24,720
and get your keys and put them in there.
14504
11:22:24,720 --> 11:22:26,720
Otherwise, these things won't work.
14505
11:22:26,720 --> 11:22:29,720
But, if you have Twitter and you set your API keys up
14506
11:22:29,720 --> 11:22:32,720
and you put them in hidden.py, then all these things will work.
14507
11:22:32,720 --> 11:22:34,720
It's kind of fun, actually, and impressive.
14508
11:22:34,720 --> 11:22:37,720
Not hard to do, actually.
14509
11:22:37,720 --> 11:22:41,720
So, twurl, that's my library
14510
11:22:41,720 --> 11:22:44,720
that reads hidden.py and augments the URL
14511
11:22:44,720 --> 11:22:46,720
and does all the OAuth stuff.
14512
11:22:46,720 --> 11:22:48,720
JSON and SSL because Twitter doesn't,
14513
11:22:48,720 --> 11:22:51,720
I mean, because Python doesn't accept any certificates,
14514
11:22:51,720 --> 11:22:55,720
even if they're good certificates, so we kind of crush that.
14515
11:22:55,720 --> 11:22:57,720
Here's our friends list that we're going to hit.
14516
11:22:57,720 --> 11:23:00,720
We're going to make a database, friends.sqlite.
14517
11:23:00,720 --> 11:23:03,720
Now, here we're doing create table if not exists.
14518
11:23:03,720 --> 11:23:05,720
So, what this really is saying is,
14519
11:23:05,720 --> 11:23:07,720
I want this to be a restartable process
14520
11:23:07,720 --> 11:23:09,720
and I don't want to lose the data.
14521
11:23:09,720 --> 11:23:15,720
We're starting out, we do not have SQLite,
14522
11:23:15,720 --> 11:23:18,720
any SQLite files, and so this is going to create the database
14523
11:23:18,720 --> 11:23:21,720
and create these tables, but the second time we run it,
14524
11:23:21,720 --> 11:23:23,720
we're not going to recreate the tables.
14525
11:23:23,720 --> 11:23:25,720
We're going to be able to restart this
14526
11:23:25,720 --> 11:23:31,720
because we're going to run out of rate limit
14527
11:23:31,720 --> 11:23:34,720
before we finish this, so we just have to wait.
14528
11:23:34,720 --> 11:23:36,720
We're going to have a people table,
14529
11:23:36,720 --> 11:23:39,720
and we're going to have a primary key and the name.
14530
11:23:39,720 --> 11:23:41,720
The name is going to be unique,
14531
11:23:41,720 --> 11:23:43,720
and whether or not we've retrieved it,
14532
11:23:43,720 --> 11:23:45,720
and that's kind of from a previous one,
14533
11:23:45,720 --> 11:23:49,720
but then there's the who follows who, the from ID to to ID,
14534
11:23:49,720 --> 11:23:51,720
and so this is a direction,
14535
11:23:51,720 --> 11:23:53,720
and we're going to put a uniqueness constraint in,
14536
11:23:53,720 --> 11:23:56,720
just like we do in many to manys that basically says,
14537
11:23:56,720 --> 11:23:59,720
the combination of from ID and to ID has got to be unique.
14538
11:23:59,720 --> 11:24:01,720
We don't allow ourselves
14539
11:24:01,720 --> 11:24:04,720
to put duplicates of the combination,
14540
11:24:04,720 --> 11:24:06,720
so from ID can be one in many records,
14541
11:24:06,720 --> 11:24:08,720
and to ID can be one in many records,
14542
11:24:08,720 --> 11:24:11,720
but one one is only allowed once,
14543
11:24:11,720 --> 11:24:14,720
and this is the crud we have to do to convince Python
14544
11:24:14,720 --> 11:24:17,720
to accept the Twitter certificate,
14545
11:24:17,720 --> 11:24:20,720
and so this is similar to some of the other stuff that we've done.
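The two tables described above can be sketched like this. The names People, Follows, from_id, to_id, and retrieved are assumed from what is said in the walkthrough; the account name is made up, and an in-memory database stands in for friends.sqlite. Calling create_tables a second time plays the role of restarting the program.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

def create_tables():
    # IF NOT EXISTS makes this safe to run on every start without losing data.
    cur.execute('''CREATE TABLE IF NOT EXISTS People
        (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
    cur.execute('''CREATE TABLE IF NOT EXISTS Follows
        (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

create_tables()
cur.execute("INSERT INTO People (name, retrieved) VALUES ('someaccount', 0)")

create_tables()   # a "second run": the tables exist, so nothing is recreated
kept = cur.execute('SELECT COUNT(*) FROM People').fetchone()[0]

cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (1, 2)')
cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (1, 3)')  # from_id 1 again: fine
cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (2, 1)')  # reverse direction: a different row
cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (1, 2)')  # same combination: ignored
links = cur.execute('SELECT COUNT(*) FROM Follows').fetchone()[0]
print(kept, links)
```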
14546
11:24:20,720 --> 11:24:23,720
We're going to enter a Twitter account or quit,
14547
11:24:23,720 --> 11:24:26,720
and if we hit enter by itself,
14548
11:24:26,720 --> 11:24:29,720
then we will actually go and retrieve the data
14549
11:24:29,720 --> 11:24:32,720
then we will actually go and retrieve a record
14550
11:24:32,720 --> 11:24:34,720
that was not yet retrieved,
14551
11:24:34,720 --> 11:24:39,720
and now we're actually pulling out two values, ID and name,
14552
11:24:39,720 --> 11:24:41,720
and so we will grab,
14553
11:24:41,720 --> 11:24:44,720
fetch one is going to give us a two-tuple basically,
14554
11:24:44,720 --> 11:24:46,720
and we're going to store that in ID and account.
14555
11:24:46,720 --> 11:24:48,720
Of course that's like,
14556
11:24:48,720 --> 11:24:50,720
this is coming back with a two-tuple,
14557
11:24:50,720 --> 11:24:52,720
first of which is the ID from the database.
14558
11:24:52,720 --> 11:24:55,720
Limit one means we're only going to get one of these,
14559
11:24:55,720 --> 11:24:56,720
or zero of these.
14560
11:24:56,720 --> 11:24:57,720
If there are zero of these,
14561
11:24:57,720 --> 11:25:00,720
that means there are no unretrieved Twitter accounts.
14562
11:25:00,720 --> 11:25:01,720
Retrieved equals zero.
14563
11:25:01,720 --> 11:25:02,720
Well, you'll see in a second
14564
11:25:02,720 --> 11:25:06,720
that all the new accounts we put in
14565
11:25:06,720 --> 11:25:08,720
are the ones for which we haven't retrieved,
14566
11:25:08,720 --> 11:25:09,720
and again, given our rate limit,
14567
11:25:09,720 --> 11:25:13,720
we want to know which ones we've retrieved, okay?
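A sketch of picking the next unretrieved account. The helper name next_unretrieved and the sample accounts are made up, and this version checks for None explicitly, which may differ from how the file on screen handles the empty case.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')

def next_unretrieved():
    """Return (id, name) of one account not yet retrieved, or None."""
    cur.execute('SELECT id, name FROM People WHERE retrieved = 0 LIMIT 1')
    row = cur.fetchone()    # LIMIT 1 means one row or no row at all
    if row is None:
        return None         # no unretrieved accounts left
    (id, acct) = row        # two selected columns, so a two-tuple
    return id, acct

empty_result = next_unretrieved()

cur.execute("INSERT INTO People (name, retrieved) VALUES ('already_done', 1)")
cur.execute("INSERT INTO People (name, retrieved) VALUES ('still_waiting', 0)")
found = next_unretrieved()
print(empty_result, found)
```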
14568
11:25:13,720 --> 11:25:18,720
And so what we're going to do next
14569
11:25:18,720 --> 11:25:20,720
is we're going to check to see
14570
11:25:20,720 --> 11:25:24,720
if the person that we just checked,
14571
11:25:24,720 --> 11:25:25,720
which means the length of the account is greater
14572
11:25:25,720 --> 11:25:27,720
than zero, so one was entered,
14573
11:25:27,720 --> 11:25:30,720
we're going to check to see if they're already there, okay?
14574
11:25:30,720 --> 11:25:33,720
And we're going to select ID from people where name equals,
14575
11:25:33,720 --> 11:25:35,720
so that's the one we just entered,
14576
11:25:35,720 --> 11:25:37,720
and we're going to fetch one and grab the first thing
14577
11:25:37,720 --> 11:25:41,720
because we only got one thing in the select statement here.
14578
11:25:41,720 --> 11:25:45,720
And if this person that we just asked to see
14579
11:25:45,720 --> 11:25:47,720
is not in the table,
14580
11:25:47,720 --> 11:25:49,720
that means this is going to fail,
14581
11:25:49,720 --> 11:25:51,720
we're going to do an insert or ignore.
14582
11:25:51,720 --> 11:25:52,720
This or ignore is kind of redundant
14583
11:25:52,720 --> 11:25:54,720
because we just checked to see if it was there,
14584
11:25:54,720 --> 11:25:57,720
but we'll put that in just to be safe,
14585
11:25:57,720 --> 11:25:59,720
and we're going to put the name in
14586
11:25:59,720 --> 11:26:02,720
for the new account that we're looking at,
14587
11:26:02,720 --> 11:26:06,720
and we're indicating that retrieved is zero,
14588
11:26:06,720 --> 11:26:09,720
so that we will know that we haven't retrieved it yet.
14589
11:26:09,720 --> 11:26:11,720
You'll see that we'll update that in a second.
14590
11:26:11,720 --> 11:26:14,720
We commit it so that later selects will see this,
14591
11:26:14,720 --> 11:26:18,720
so that you've got to do the commit.
14592
11:26:18,720 --> 11:26:21,720
This later select wouldn't see the one we just inserted,
14593
11:26:21,720 --> 11:26:24,720
and we're going to ask how many rows were affected,
14594
11:26:24,720 --> 11:26:26,720
and if it's not equal to one,
14595
11:26:26,720 --> 11:26:29,720
then we're going to complain about the insert,
14596
11:26:29,720 --> 11:26:31,720
and we are going to do this thing.
14597
11:26:31,720 --> 11:26:35,720
We're going to ask, hey, remember there was an ID up there?
14598
11:26:35,720 --> 11:26:37,720
Doo doo doo.
14599
11:26:37,720 --> 11:26:39,720
Right here, ID, integer, primary key,
14600
11:26:39,720 --> 11:26:43,720
and we did not insert this here,
14601
11:26:43,720 --> 11:26:45,720
but we want to know what that ID is,
14602
11:26:45,720 --> 11:26:47,720
and every time I was showing you that in lectures,
14603
11:26:47,720 --> 11:26:49,720
I was saying it's really easy in Python to do this,
14604
11:26:49,720 --> 11:26:51,720
and that's what we're saying.
14605
11:26:51,720 --> 11:26:53,720
This cursor did the insert,
14606
11:26:53,720 --> 11:26:55,720
but one of the things that happens is after the insert,
14607
11:26:55,720 --> 11:26:57,720
we're going to grab the last row ID,
14608
11:26:57,720 --> 11:27:03,720
which is the primary key that was assigned by SQL.
14609
11:27:03,720 --> 11:27:06,720
Okay, and so that means that one way or another,
14610
11:27:06,720 --> 11:27:08,720
coming through this code here in line 45,
14611
11:27:08,720 --> 11:27:10,720
one way or another, we're either going to know
14612
11:27:10,720 --> 11:27:13,720
the ID of the user that was there before,
14613
11:27:13,720 --> 11:27:15,720
or we just inserted one,
14614
11:27:15,720 --> 11:27:17,720
and so we're going to know the primary key of the current user,
14615
11:27:17,720 --> 11:27:18,720
and you'll see why we need that.
14616
11:27:18,720 --> 11:27:22,720
So ID is the primary key of the current user
14617
11:27:22,720 --> 11:27:24,720
that we entered right here.
14618
11:27:24,720 --> 11:27:25,720
Okay?
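The look-up-or-insert step with the last row id can be sketched as below. The function name id_for_account is made up, the table layout is assumed, and raising an exception on a failed insert is a choice made for this sketch; the walkthrough prints a complaint instead.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')

def id_for_account(acct):
    """Return the primary key for acct, inserting it as unretrieved if new."""
    cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1', (acct, ))
    try:
        # fetchone() is None when the account is missing, so [0] raises TypeError.
        return cur.fetchone()[0]
    except TypeError:
        cur.execute('INSERT OR IGNORE INTO People (name, retrieved) VALUES ( ?, 0 )',
                    (acct, ))
        conn.commit()
        if cur.rowcount != 1:   # rows affected by the insert
            raise RuntimeError('Error inserting account: ' + acct)
        # The primary key SQLite assigned to the row we just inserted.
        return cur.lastrowid

a = id_for_account('first_account')
b = id_for_account('second_account')
c = id_for_account('first_account')   # already there: same id as before
print(a, b, c)
```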
14619
11:27:25,720 --> 11:27:28,720
And now what we're going to do is do the Twitter URL augment
14620
11:27:28,720 --> 11:27:30,720
with the OAuth and all the keys and the secrets
14621
11:27:30,720 --> 11:27:31,720
and hidden.py.
14622
11:27:31,720 --> 11:27:33,720
Instead, we're going to go through, let's count 1000.
14623
11:27:33,720 --> 11:27:36,720
Let's go count, what the heck, let's go 200,
14624
11:27:36,720 --> 11:27:38,720
up to 200 friends.
14625
11:27:38,720 --> 11:27:39,720
Save.
14626
11:27:39,720 --> 11:27:40,720
No, let's do 100.
14627
11:27:40,720 --> 11:27:41,720
We'll keep it that way.
14628
11:27:41,720 --> 11:27:43,720
And then we're going to retrieve it,
14629
11:27:43,720 --> 11:27:46,720
and we're retrieving the account.
14630
11:27:46,720 --> 11:27:48,720
We're not going to print the nasty URL out.
14631
11:27:48,720 --> 11:27:49,720
We could.
14632
11:27:49,720 --> 11:27:52,720
Then we're going to open the URL with a connection,
14633
11:27:52,720 --> 11:27:53,720
and then we're going to read that,
14634
11:27:53,720 --> 11:27:55,720
and we're going to get the UTF-8 data from this,
14635
11:27:55,720 --> 11:27:57,720
and then we're going to decode that,
14636
11:27:57,720 --> 11:27:59,720
and we're going to have the Unicode data,
14637
11:27:59,720 --> 11:28:02,720
so the data string is an internal Python string
14638
11:28:02,720 --> 11:28:05,720
with all that data representing all the wonderful characters.
14639
11:28:05,720 --> 11:28:08,720
And of course, we're going to ask urlopen
14640
11:28:08,720 --> 11:28:12,720
to give us back the headers as a dictionary using this call,
14641
11:28:12,720 --> 11:28:17,720
and we can see how many we have left for the remaining.
14642
11:28:17,720 --> 11:28:21,720
What's the remaining rate limit that we have.
14643
11:28:21,720 --> 11:28:23,720
So then what we're going to do is parse the data
14644
11:28:23,720 --> 11:28:25,720
with json.loads.
14645
11:28:25,720 --> 11:28:29,720
If, oh wait, I need to continue in here.
14646
11:28:29,720 --> 11:28:31,720
Continue.
14647
11:28:31,720 --> 11:28:33,720
Save.
14648
11:28:33,720 --> 11:28:36,720
If we fail to parse this data, we'll print it out.
14649
11:28:36,720 --> 11:28:38,720
So that means that this died,
14650
11:28:38,720 --> 11:28:41,720
which means it's not syntactically correct JSON, basically.
14651
11:28:41,720 --> 11:28:44,720
And who knows if we're ever going to see that,
14652
11:28:44,720 --> 11:28:47,720
but at least when it blows up, it'll print this data out.
14653
11:28:47,720 --> 11:28:50,720
We'll have to catch it, and then it'll continue.
14654
11:28:50,720 --> 11:28:52,720
Actually, I'll make this a break.
14655
11:28:52,720 --> 11:28:55,720
Because if that's blowing up that bad, we should quit.
14656
11:28:55,720 --> 11:28:59,720
Now, I don't yet know what happens
14657
11:28:59,720 --> 11:29:01,720
when this rate limit says you can't have it.
14658
11:29:01,720 --> 11:29:05,720
But I do know that I expect when it's successful
14659
11:29:05,720 --> 11:29:08,720
that there will be a key of users
14660
11:29:08,720 --> 11:29:11,720
in this outer dictionary that we're going to get.
14661
11:29:11,720 --> 11:29:14,720
And if this outer dictionary,
14662
11:29:14,720 --> 11:29:17,720
if users is not in the parsed dictionary,
14663
11:29:17,720 --> 11:29:19,720
then I'm going to dump out this data
14664
11:29:19,720 --> 11:29:22,720
so that at least I can debug what happens
14665
11:29:22,720 --> 11:29:25,720
when I've got some broken JSON.
14666
11:29:25,720 --> 11:29:28,720
So the difference between this code,
14667
11:29:28,720 --> 11:29:33,720
this code is going to fail when the JSON is syntactically bad,
14668
11:29:33,720 --> 11:29:36,720
meaning a curly brace isn't right or whatever.
14669
11:29:36,720 --> 11:29:39,720
This code will trigger when I get good JSON,
14670
11:29:39,720 --> 11:29:43,720
but I don't have a users key in it.
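The two separate checks (syntactically bad JSON versus good JSON with no users key) can be sketched as a small function. The name classify and the sample strings are made up for illustration; in particular the "errors" payload is only a guess at what a non-success response could look like.

```python
import json

def classify(data):
    """Tell apart unparseable JSON from parseable JSON without a 'users' key."""
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        # Syntactically bad: a curly brace isn't right, or similar.
        return 'unparseable'
    if not isinstance(js, dict) or 'users' not in js:
        # Good JSON, but not the shape we expect on success.
        return 'no users key'
    return 'ok'

good = classify('{"users": [{"screen_name": "someone"}]}')
odd = classify('{"errors": [{"message": "hypothetical error"}]}')
broken = classify('{"users": [')
print(good, odd, broken)
```

In the real loop each non-ok case would also dump the raw data so it can be debugged.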
14671
11:29:43,720 --> 11:29:47,720
So then, once we've retrieved it,
14672
11:29:47,720 --> 11:29:49,720
we're pretty happy with it.
14673
11:29:49,720 --> 11:29:52,720
We're going to update for our account that we're retrieving.
14674
11:29:52,720 --> 11:29:56,720
We're going to set this as one of our retrieved accounts.
14675
11:29:56,720 --> 11:29:59,720
And then what we're going to do is write a loop
14676
11:29:59,720 --> 11:30:03,720
that goes through all the friends of this particular user
14677
11:30:03,720 --> 11:30:05,720
that we're asking and gets their screen name.
14678
11:30:05,720 --> 11:30:07,720
Prints it out.
14679
11:30:07,720 --> 11:30:10,720
And then we're going to check to see
14680
11:30:10,720 --> 11:30:13,720
if this one is already in our people database
14681
11:30:13,720 --> 11:30:15,720
because this is a spider.
14682
11:30:15,720 --> 11:30:17,720
We're grabbing accounts.
14683
11:30:17,720 --> 11:30:20,720
And so we'll do a friend ID
14684
11:30:20,720 --> 11:30:23,720
and do a fetch one, grab the sub-zero thing.
14685
11:30:23,720 --> 11:30:26,720
And if that works, if this person's not in there,
14686
11:30:26,720 --> 11:30:28,720
this fetch one is going to blow up,
14687
11:30:28,720 --> 11:30:30,720
which means we're going to drop down to the except code.
14688
11:30:30,720 --> 11:30:34,720
But if it does work, we have friend ID is,
14689
11:30:34,720 --> 11:30:37,720
you know, they're in there
14690
11:30:37,720 --> 11:30:39,720
and they're already in our database.
14691
11:30:39,720 --> 11:30:41,720
They just weren't retrieved.
14692
11:30:41,720 --> 11:30:44,720
And so now, if the friend ID wasn't there,
14693
11:30:44,720 --> 11:30:47,720
we're going to do an insert into setting retrieved to zero
14694
11:30:47,720 --> 11:30:50,720
and then we're going to commit.
14695
11:30:50,720 --> 11:30:54,720
Now, remember row count is how many rows
14696
11:30:54,720 --> 11:30:56,720
were affected by this last transaction,
14697
11:30:56,720 --> 11:30:58,720
cur.rowcount, and we're going to die.
14698
11:30:58,720 --> 11:31:02,720
If that insert doesn't work, this is unlikely,
14699
11:31:02,720 --> 11:31:05,720
unless somehow we've run out of disk drive or something.
14700
11:31:05,720 --> 11:31:09,720
And we're going to grab the friend ID as the key,
14701
11:31:09,720 --> 11:31:11,720
the last row that was inserted.
14702
11:31:11,720 --> 11:31:12,720
We're only going to insert one row,
14703
11:31:12,720 --> 11:31:14,720
so it's basically the primary key
14704
11:31:14,720 --> 11:31:16,720
of the row that we just inserted.
14705
11:31:16,720 --> 11:31:19,720
So if you look at this code right here,
14706
11:31:19,720 --> 11:31:21,720
it comes out the bottom one way or another
14707
11:31:21,720 --> 11:31:23,720
with friend ID successful.
14708
11:31:23,720 --> 11:31:27,720
Friend ID is either they're already in our database
14709
11:31:27,720 --> 11:31:28,720
or they're not.
14710
11:31:28,720 --> 11:31:30,720
And if we insert them, then we have it.
14711
11:31:30,720 --> 11:31:33,720
And so now, this count new and count old
14712
11:31:33,720 --> 11:31:35,720
is just so I can print out a nice print out.
14713
11:31:35,720 --> 11:31:38,720
Now we are going to insert into the friend table,
14714
11:31:38,720 --> 11:31:40,720
which is called the follows table in this case,
14715
11:31:40,720 --> 11:31:42,720
from ID and to ID.
14716
11:31:42,720 --> 11:31:47,720
Those are the two outward pointing foreign keys.
14717
11:31:47,720 --> 11:31:50,720
And we have the ID of the account
14718
11:31:50,720 --> 11:31:52,720
that we are retrieving the friends of
14719
11:31:52,720 --> 11:31:54,720
and then this particular friend.
14720
11:31:54,720 --> 11:31:56,720
And so we're inserting the connection
14721
11:31:56,720 --> 11:31:58,720
from this person to that person.
14722
11:31:58,720 --> 11:32:00,720
And then we commit it.
14723
11:32:00,720 --> 11:32:03,720
We want to commit these again so that later selects,
14724
11:32:03,720 --> 11:32:05,720
when the loop goes back up,
14725
11:32:05,720 --> 11:32:09,720
later selects get all of that data that's going on.
14726
11:32:09,720 --> 11:32:11,720
So we do want to commit from time to time
14727
11:32:11,720 --> 11:32:13,720
and then we close the cursor at the very end.
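A rough sketch of recording one connection and committing it right away, so that later selects see it. The Follows table layout and the record_follow helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustration only; the lecture uses a file
conn.execute(
    "CREATE TABLE Follows (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))"
)

def record_follow(conn, from_id, to_id):
    """Insert the link from one person to another, then commit."""
    cur = conn.cursor()
    cur.execute(
        "INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (?, ?)",
        (from_id, to_id),
    )
    conn.commit()  # make the row visible to later selects
    cur.close()

record_follow(conn, 1, 2)
record_follow(conn, 1, 2)  # duplicate link, ignored by the UNIQUE constraint
record_follow(conn, 1, 3)
link_count = conn.execute("SELECT COUNT(*) FROM Follows").fetchone()[0]
```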
14728
11:32:13,720 --> 11:32:19,720
So let's run this and see what happens.
14729
11:32:19,720 --> 11:32:29,720
Okay, so python twfriends.py. Oh, of course.
14730
11:32:29,720 --> 11:32:32,720
I am a refugee from Python 2,
14731
11:32:32,720 --> 11:32:35,720
so I always forget to type python3.
14732
11:32:35,720 --> 11:32:38,720
Okay, so we're going to start.
14733
11:32:38,720 --> 11:32:40,720
If we take a look right now,
14734
11:32:40,720 --> 11:32:42,720
I'm going to start another tab over here
14735
11:32:42,720 --> 11:32:46,720
and ls -l *.sqlite.
14736
11:32:46,720 --> 11:32:49,720
Now that sqlite file is there, right?
14737
11:32:49,720 --> 11:32:51,720
And it's actually made the tables.
14738
11:32:51,720 --> 11:32:53,720
If you go up here, it ran all this stuff.
14739
11:32:53,720 --> 11:32:55,720
Create the tables, yada yada,
14740
11:32:55,720 --> 11:32:57,720
and we're sitting right here at this line.
14741
11:32:57,720 --> 11:32:58,720
As a matter of fact, I think,
14742
11:32:58,720 --> 11:33:00,720
without causing too much trouble,
14743
11:33:00,720 --> 11:33:03,720
I can open that database
14744
11:33:03,720 --> 11:33:05,720
and get into this database right here
14745
11:33:05,720 --> 11:33:07,720
and there is no data in the follows table
14746
11:33:07,720 --> 11:33:09,720
and there is no data in the people table.
14747
11:33:09,720 --> 11:33:11,720
It's completely empty, okay?
14748
11:33:11,720 --> 11:33:13,720
So we're waiting for the first one.
14749
11:33:13,720 --> 11:33:17,720
And I'll go with mine, Dr. Chuck.
14750
11:33:17,720 --> 11:33:19,720
So it's retrieving the 100 friends
14751
11:33:19,720 --> 11:33:21,720
and they all were brand new.
14752
11:33:21,720 --> 11:33:24,720
They're all inserted, right?
14753
11:33:24,720 --> 11:33:26,720
And so now if I hit refresh,
14754
11:33:26,720 --> 11:33:31,720
we will see that Dr. Chuck is retrieved.
14755
11:33:31,720 --> 11:33:32,720
Who follows?
14756
11:33:32,720 --> 11:33:34,720
So these are all the people I follow.
14757
11:33:34,720 --> 11:33:36,720
One follows two.
14758
11:33:36,720 --> 11:33:37,720
So if we look at here,
14759
11:33:37,720 --> 11:33:39,720
we see that Dr. Chuck follows Stephanie Teasley.
14760
11:33:39,720 --> 11:33:41,720
Because we grabbed the followers of Dr. Chuck,
14761
11:33:41,720 --> 11:33:43,720
you know, we're gonna have a record
14762
11:33:43,720 --> 11:33:45,720
in all of the follows
14763
11:33:45,720 --> 11:33:47,720
for all the ones that I did, right?
14764
11:33:47,720 --> 11:33:49,720
So these are all the people I followed
14765
11:33:49,720 --> 11:33:52,720
and we put them in, okay?
14766
11:33:52,720 --> 11:33:55,720
So we can go back
14767
11:33:55,720 --> 11:33:58,720
and we can, let's see, grab somebody.
14768
11:33:58,720 --> 11:34:02,720
Let's go grab Stephanie Teasley.
14769
11:34:02,720 --> 11:34:06,720
And let's pull out her friends.
14770
11:34:06,720 --> 11:34:10,720
So we grabbed a hundred of her folks.
14771
11:34:10,720 --> 11:34:11,720
I got 14 left.
14772
11:34:11,720 --> 11:34:13,720
That's my x-rate-limit.
14773
11:34:13,720 --> 11:34:14,720
So I did Stephanie Teasley,
14774
11:34:14,720 --> 11:34:16,720
so let's go back here.
14775
11:34:16,720 --> 11:34:18,720
So you'll notice there's 101.
14776
11:34:18,720 --> 11:34:21,720
There's probably gonna be, oh, 182.
14777
11:34:21,720 --> 11:34:22,720
That's interesting.
14778
11:34:22,720 --> 11:34:24,720
So we've retrieved Dr. Chuck and Stephanie Teasley
14779
11:34:24,720 --> 11:34:27,720
and let's go take a look in the friends table,
14780
11:34:27,720 --> 11:34:30,720
the follows table, okay?
14781
11:34:30,720 --> 11:34:32,720
So we have all the people I follow.
14782
11:34:32,720 --> 11:34:34,720
Now all the people Stephanie follows.
14783
11:34:34,720 --> 11:34:37,720
Okay, so there we go.
14784
11:34:37,720 --> 11:34:39,720
So let's go ahead and do somebody else.
14785
11:34:39,720 --> 11:34:43,720
Let's see, I think we both follow Tim McKay.
14786
11:34:43,720 --> 11:34:50,720
Where's Tim McKay?
14787
11:34:50,720 --> 11:34:52,720
Yeah, let's follow Tim McKay.
14788
11:34:52,720 --> 11:34:53,720
Let's see who Tim follows.
14789
11:34:53,720 --> 11:34:57,720
See if we can get like an overlap.
14790
11:34:57,720 --> 11:34:59,720
Oh, we revisited some.
14791
11:34:59,720 --> 11:35:05,720
Let's see if we can see this in the follows.
14792
11:35:05,720 --> 11:35:07,720
Let's see people.
14793
11:35:07,720 --> 11:35:09,720
So we've got Dr. Chuck retrieved
14794
11:35:09,720 --> 11:35:17,720
and Tim McKay's somewhere down here.
14795
11:35:17,720 --> 11:35:19,720
You know, it might take us a while
14796
11:35:19,720 --> 11:35:23,720
before we get any really good overlaps.
14797
11:35:23,720 --> 11:35:25,720
Let's see.
14798
11:35:25,720 --> 11:35:28,720
Let's do a database call.
14799
11:35:28,720 --> 11:35:35,720
Let's see, let's do a database SQL.
14800
11:35:35,720 --> 11:35:40,720
Select.
14801
11:35:40,720 --> 11:35:46,720
Count.
14802
11:35:46,720 --> 11:35:51,720
Eh.
14803
11:35:51,720 --> 11:35:53,720
Okay, so let's just run this some more.
14804
11:35:53,720 --> 11:35:54,720
It's clearly working.
14805
11:35:54,720 --> 11:35:57,720
Now one thing I can do here is I can hit enter
14806
11:35:57,720 --> 11:36:00,720
and it will just pick one randomly.
14807
11:36:00,720 --> 11:36:03,720
So it grabbed live EDU TV and I can,
14808
11:36:03,720 --> 11:36:05,720
and let's see how many I got left.
14809
11:36:05,720 --> 11:36:06,720
We got 12 left.
14810
11:36:06,720 --> 11:36:10,720
And now I can hit enter again and it picks another one.
14811
11:36:10,720 --> 11:36:12,720
That was the next one.
14812
11:36:12,720 --> 11:36:13,720
I was kind of picking them in order.
14813
11:36:13,720 --> 11:36:14,720
Is it picking them in order?
14814
11:36:14,720 --> 11:36:16,720
Let's go to people.
14815
11:36:16,720 --> 11:36:18,720
Yeah, it's picking these.
14816
11:36:18,720 --> 11:36:20,720
So we can see that it's going to just do
14817
11:36:20,720 --> 11:36:23,720
the first unretrieved person, who's Nancy.
14818
11:36:23,720 --> 11:36:25,720
Let's let it retrieve Nancy.
14819
11:36:25,720 --> 11:36:27,720
So it grabbed Nancy, new.
14820
11:36:27,720 --> 11:36:28,720
So we're finding some.
14821
11:36:28,720 --> 11:36:30,720
And this table's getting really big.
14822
11:36:30,720 --> 11:36:32,720
And so if we look at the people table,
14823
11:36:32,720 --> 11:36:35,720
we now have 455 people.
14824
11:36:35,720 --> 11:36:40,720
And we have 467 following records.
14825
11:36:40,720 --> 11:36:43,720
And so there we go.
14826
11:36:43,720 --> 11:36:44,720
Oops.
14827
11:36:44,720 --> 11:36:45,720
Hit enter.
14828
11:36:45,720 --> 11:36:47,720
It does another one.
14829
11:36:47,720 --> 11:36:48,720
And away we go.
14830
11:36:48,720 --> 11:36:50,720
So you get the idea.
14831
11:36:50,720 --> 11:36:54,720
I can type quit to finish.
14832
11:36:54,720 --> 11:37:00,720
And just to give you a little interesting
14833
11:37:00,720 --> 11:37:02,720
bit of code to show you how to do selects,
14834
11:37:02,720 --> 11:37:04,720
I'm going to do this twjoin.
14835
11:37:04,720 --> 11:37:06,720
Now you'll notice that we're not talking.
14836
11:37:06,720 --> 11:37:08,720
Oh, let's show you one thing.
14837
11:37:08,720 --> 11:37:14,720
ls -l friends*sqlite.
14838
11:37:14,720 --> 11:37:16,720
So this database has it.
14839
11:37:16,720 --> 11:37:20,720
So I can restart this process and run it again.
14840
11:37:20,720 --> 11:37:22,720
And the database is still there.
14841
11:37:22,720 --> 11:37:27,720
And so we just grab Swear Trek.
14842
11:37:27,720 --> 11:37:29,720
And so we can keep doing this.
14843
11:37:29,720 --> 11:37:32,720
And so this data, it keeps extending.
14844
11:37:32,720 --> 11:37:35,720
And so this is a restartable process.
14845
11:37:35,720 --> 11:37:36,720
I can run it.
14846
11:37:36,720 --> 11:37:39,720
And then tell it to grab the next unretrieved one.
14847
11:37:39,720 --> 11:37:42,720
And so away we go, right?
14848
11:37:42,720 --> 11:37:46,720
And so that's part of it.
14849
11:37:46,720 --> 11:37:51,720
So if I run out of my, I've got eight left.
14850
11:37:51,720 --> 11:37:53,720
Oh, how many do I have left, really?
14851
11:37:53,720 --> 11:37:58,720
Let's keep going.
14852
11:37:58,720 --> 11:37:59,720
How many do I got left?
14853
11:37:59,720 --> 11:38:01,720
I got five left.
14854
11:38:01,720 --> 11:38:02,720
Okay.
14855
11:38:02,720 --> 11:38:03,720
Wait.
14856
11:38:03,720 --> 11:38:04,720
Oh, I guess we'll just run it out.
14857
11:38:04,720 --> 11:38:06,720
So I got four left.
14858
11:38:06,720 --> 11:38:08,720
You know what I should do is I can't change the code.
14859
11:38:08,720 --> 11:38:10,720
At least I can't change the code.
14860
11:38:10,720 --> 11:38:13,720
I can stop the code and I can quit the code.
14861
11:38:13,720 --> 11:38:18,720
So what I'm going to do is I'm going to change this code a little bit really quick.
14862
11:38:18,720 --> 11:38:24,720
And I'm going to print the headers are rate limiting at the beginning
14863
11:38:24,720 --> 11:38:31,720
and at the end.
14864
11:38:31,720 --> 11:38:32,720
So now I can run it again.
14865
11:38:32,720 --> 11:38:33,720
I changed the code.
14866
11:38:33,720 --> 11:38:35,720
Hopefully I didn't make a Python error.
14867
11:38:35,720 --> 11:38:37,720
Tell it to go get another one and a Navarro.
14868
11:38:37,720 --> 11:38:41,720
And so I got three left.
14869
11:38:41,720 --> 11:38:42,720
Oops.
14870
11:38:42,720 --> 11:38:46,720
We'll see what happens when I run out of rate limit.
14871
11:38:46,720 --> 11:38:49,720
Run out of rate limit.
14872
11:38:49,720 --> 11:38:51,720
So we have one left.
14873
11:38:51,720 --> 11:38:53,720
Hit enter.
14874
11:38:53,720 --> 11:38:55,720
Hit control K.
14875
11:38:55,720 --> 11:38:57,720
Open source dot org.
14876
11:38:57,720 --> 11:38:58,720
So we have zero left.
14877
11:38:58,720 --> 11:38:59,720
That worked.
14878
11:38:59,720 --> 11:39:01,720
Now let's see what happens.
14879
11:39:01,720 --> 11:39:04,720
I don't know what happens next.
14880
11:39:04,720 --> 11:39:06,720
Oh, we blew up.
14881
11:39:06,720 --> 11:39:07,720
Too many requests.
14882
11:39:07,720 --> 11:39:10,720
Oh, we got an HTTP error 429.
14883
11:39:10,720 --> 11:39:17,720
So that means that, going for Mark Cuban, that was in line 48.
14884
11:39:17,720 --> 11:39:24,720
So the right thing to do would be in line 48.
14885
11:39:24,720 --> 11:39:27,720
We should really put this in a try/except block.
14886
11:39:27,720 --> 11:39:39,720
Try/except block because it gives us an error.
14887
11:39:39,720 --> 11:39:40,720
Print.
14888
11:39:40,720 --> 11:39:42,720
Oh, fiddlesticks.
14889
11:39:42,720 --> 11:39:44,720
How do I print the exception message?
14890
11:39:44,720 --> 11:39:53,720
I always am forgetting. Print "Failed to retrieve".
14891
11:39:53,720 --> 11:39:56,720
So we'll put that in.
14892
11:39:56,720 --> 11:40:04,720
Now if I run it.
14893
11:40:04,720 --> 11:40:13,720
And then I have to put a break here because that's not a good break.
14894
11:40:13,720 --> 11:40:14,720
Failed to retrieve.
14895
11:40:14,720 --> 11:40:15,720
Now I've got to figure out.
14896
11:40:15,720 --> 11:40:20,720
Oh, see, I never know how to print out the error message.
14897
11:40:20,720 --> 11:40:22,720
Yeah.
14898
11:40:22,720 --> 11:40:24,720
So I have to...
14899
11:40:24,720 --> 11:40:28,720
See, that's the weird thing about stuff is that I don't ever remember enough.
14900
11:40:28,720 --> 11:40:34,720
I don't remember the syntax, what I say here, to print the error message out.
14901
11:40:34,720 --> 11:40:37,720
So I'm going to go to Google.
14902
11:40:37,720 --> 11:40:48,720
And I'm going to say, print out the exception message in Python.
14903
11:40:48,720 --> 11:40:50,720
Print out the exception message in Python.
14904
11:40:50,720 --> 11:40:57,720
Oh, Python 3, hello.
14905
11:40:57,720 --> 11:41:03,720
Okay, so let's go find it here in the documentation.
14906
11:41:03,720 --> 11:41:09,720
Except, except.
14907
11:41:09,720 --> 11:41:11,720
Is this it?
14908
11:41:11,720 --> 11:41:28,720
Is this what I say?
14909
11:41:28,720 --> 11:41:34,720
I just want to print out the message.
14910
11:41:34,720 --> 11:41:36,720
Ah, that's it.
14911
11:41:36,720 --> 11:41:39,720
Except.
14912
11:41:39,720 --> 11:41:55,720
Let's try this.
14913
11:41:55,720 --> 11:41:59,720
So this is part of Python programming, is like, for me at least.
14914
11:41:59,720 --> 11:42:06,720
Because I'm just not like a genius expert at this stuff.
14915
11:42:06,720 --> 11:42:09,720
This is one thing I like about Python, is you can guess stuff.
14916
11:42:09,720 --> 11:42:11,720
And sometimes you guess right.
14917
11:42:11,720 --> 11:42:12,720
So there we go.
14918
11:42:12,720 --> 11:42:13,720
We got the error.
14919
11:42:13,720 --> 11:42:14,720
We got the nice little error message.
14920
11:42:14,720 --> 11:42:16,720
And we see error 429, too many requests.
14921
11:42:16,720 --> 11:42:19,720
So that cleans that up nicely.
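A small sketch of the try/except pattern arrived at above: catch the exception as a variable and print it to get the message. The fake_fetch function is a hypothetical stand-in that raises the same kind of error urllib raises on a 429, so the example runs without a network connection or an API account.

```python
import urllib.error
from email.message import Message

def fake_fetch(url):
    # Hypothetical stand-in for urllib.request.urlopen(url) once the
    # rate limit is used up: the server answers 429 Too Many Requests.
    raise urllib.error.HTTPError(url, 429, "Too Many Requests", Message(), None)

def retrieve(url):
    """Return the response, or None after printing why the request failed."""
    try:
        return fake_fetch(url)
    except Exception as err:
        # Printing the exception object shows its message.
        print("Failed to retrieve", err)
        return None

result = retrieve("https://example.com/friends/list.json")
```

In the loop from the lecture, the caller would break out when the request fails instead of carrying on with missing data.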
14922
11:42:19,720 --> 11:42:22,720
So we have run out of requests.
14923
11:42:22,720 --> 11:42:28,720
And on that, it is a good time to say thanks for listening.
14924
11:42:28,720 --> 11:42:35,720
And I hope that you found this valuable.
14925
11:42:35,720 --> 11:42:39,720
Hello, and welcome to our final chapter, retrieving and visualizing data.
14926
11:42:39,720 --> 11:42:44,720
In this chapter, we are going to basically bring this all together.
14927
11:42:44,720 --> 11:42:49,720
Databases, web services, code loops, logic.
14928
11:42:49,720 --> 11:42:55,720
And we're going to solve a problem that is a multi-step data analysis.
14929
11:42:55,720 --> 11:42:58,720
We're going to find some data on the internet.
14930
11:42:58,720 --> 11:43:01,720
Might be HTML, might be an API or whatever.
14931
11:43:01,720 --> 11:43:06,720
And we're going to write a relatively slow process that's going to pull data slowly.
14932
11:43:06,720 --> 11:43:08,720
Because these are all rate limited.
14933
11:43:08,720 --> 11:43:11,720
This is a slow and restartable process.
14934
11:43:11,720 --> 11:43:13,720
So you can start this.
14935
11:43:13,720 --> 11:43:17,720
And what we're going to do is we're going to have a database that's going to hold the data that we're pulling.
14936
11:43:17,720 --> 11:43:22,720
And so this might take several days, actually, if you really have to do it.
14937
11:43:22,720 --> 11:43:24,720
And then you'll build up your data in your database.
14938
11:43:24,720 --> 11:43:28,720
And then what you tend to do is you tend to produce two databases.
14939
11:43:28,720 --> 11:43:38,720
One is kind of a raw database that, you know, all of its data columns are aimed at helping you figure out what you've got to retrieve yet.
14940
11:43:38,720 --> 11:43:39,720
And what you haven't retrieved yet.
14941
11:43:39,720 --> 11:43:42,720
So that's kind of a crawling spidering process.
14942
11:43:42,720 --> 11:43:46,720
And then you find that the data is kind of nasty and ugly.
14943
11:43:46,720 --> 11:43:51,720
And you find that before you're going to do any analysis, you probably want to clean and process it.
14944
11:43:51,720 --> 11:43:55,720
So in a lot of these, you're going to go from a raw database to a clean one.
14945
11:43:55,720 --> 11:43:57,720
And this is going to be really large.
14946
11:43:57,720 --> 11:43:59,720
And this is going to be really small.
14947
11:43:59,720 --> 11:44:02,720
And you're going to do this sort of once, but slowly.
14948
11:44:02,720 --> 11:44:08,720
And you'll do this as many times as you need, changing this program, cleaning the data up over and over and over again.
14949
11:44:08,720 --> 11:44:10,720
And then you'll end up with really clean data.
14950
11:44:10,720 --> 11:44:11,720
And it's relatively small.
14951
11:44:11,720 --> 11:44:17,720
And you might run programs that will loop through this to do visualizations or analysis or some things or whatever.
14952
11:44:17,720 --> 11:44:23,720
And so you'll actually sort of use this database as a source of information.
14953
11:44:23,720 --> 11:44:24,720
OK.
14954
11:44:24,720 --> 11:44:28,720
So that's the basic pattern of what we're going to work with.
14955
11:44:28,720 --> 11:44:31,720
Now, this is what I call personal data mining.
14956
11:44:31,720 --> 11:44:36,720
And if you're going to do this seriously, Python is used in lots of data mining activities.
14957
11:44:36,720 --> 11:44:39,720
But if you're going to do data mining seriously with really, really large data sets,
14958
11:44:39,720 --> 11:44:46,720
we're doing small to medium-sized data sets as you might do sort of for individual personal research
14959
11:44:46,720 --> 11:44:51,720
versus like an organization research where you're processing the logs of a web server or something like that.
14960
11:44:51,720 --> 11:44:54,720
And there's lots and lots of wonderful technology.
14961
11:44:54,720 --> 11:44:58,720
And what's really cool is this technology just keeps getting better and better
14962
11:44:58,720 --> 11:45:04,720
because the whole data mining, data analysis, natural language processing field is just so hot right now.
14963
11:45:04,720 --> 11:45:05,720
It's so awesome.
14964
11:45:05,720 --> 11:45:09,720
We're going to keep it simple and do stuff for ourselves for now.
14965
11:45:09,720 --> 11:45:15,720
And I gave you a bunch of sample code that's going to make it so that you can adapt this sample code
14966
11:45:15,720 --> 11:45:18,720
to solve the problems that you need to solve.
14967
11:45:18,720 --> 11:45:21,720
So like I said, this is more of a programming exercise.
14968
11:45:21,720 --> 11:45:23,720
Data mining might be a lot more complex.
14969
11:45:23,720 --> 11:45:27,720
If you're doing simple research, this might actually model what you do pretty well.
14970
11:45:27,720 --> 11:45:34,720
So the first thing that we're going to do is use Google's JSON API for geocoding.
14971
11:45:34,720 --> 11:45:36,720
And there are two versions of this.
14972
11:45:36,720 --> 11:45:41,720
One version requires a key and one version doesn't require a key.
14973
11:45:41,720 --> 11:45:45,720
Google used to make all this data available for free but with just a rate limit
14974
11:45:45,720 --> 11:45:48,720
but now they're increasingly requiring a key.
14975
11:45:48,720 --> 11:45:52,720
So I give you code in this zip file that kind of does both.
14976
11:45:52,720 --> 11:45:57,720
If you really wanted to do something in production of taking user entered places and names
14977
11:45:57,720 --> 11:46:03,720
and getting precise latitude longitude coordinates so you can produce a nice little Google map like this.
14978
11:46:03,720 --> 11:46:08,720
But since Google has made a rate limited API,
14979
11:46:08,720 --> 11:46:13,720
I've actually pre-spidered a copy of the Google data and I have my own sort of fake Google API
14980
11:46:13,720 --> 11:46:19,720
and so you can do your assignments and test all your code using my fake API
14981
11:46:19,720 --> 11:46:22,720
which has no rate limits and has no problems.
14982
11:46:22,720 --> 11:46:25,720
But it's only a limited set of the data.
14983
11:46:25,720 --> 11:46:32,720
And so this is the basic process and it's one of those things that it follows that basic personal data modeling.
14984
11:46:32,720 --> 11:46:34,720
Personal data mining pattern.
14985
11:46:34,720 --> 11:46:37,720
And so here's this API which is either Google or me.
14986
11:46:37,720 --> 11:46:41,720
I've got my own Dr. Chuck version of this, dr-chuck.net version of this.
14987
11:46:41,720 --> 11:46:45,720
And there is an input queue of the location.
14988
11:46:45,720 --> 11:46:49,720
So this is the user data where they just put in the name of where they think they live.
14989
11:46:49,720 --> 11:46:52,720
University of Toobigan or something.
14990
11:46:52,720 --> 11:46:56,720
And so this is the queue of the things that are to be retrieved.
14991
11:46:56,720 --> 11:47:01,720
And in my case when I built this map for the first time, there was like 15,000.
14992
11:47:01,720 --> 11:47:04,720
And it took me days to get this.
14993
11:47:04,720 --> 11:47:05,720
And so it would stop.
14994
11:47:05,720 --> 11:47:11,720
And so what I would do is I would read the first one into this geoload.py,
14995
11:47:11,720 --> 11:47:13,720
check to see if I already had it in my database.
14996
11:47:13,720 --> 11:47:18,720
If I didn't already have it in the database, I would go to the API, pull the data down and I would put it in the database.
14997
11:47:18,720 --> 11:47:19,720
And then I would go to the next one.
14998
11:47:19,720 --> 11:47:20,720
The next one, the next one.
14999
11:47:20,720 --> 11:47:26,720
And so I might get a thousand in my database and then it blows up or I'm told I can't go any further.
15000
11:47:26,720 --> 11:47:27,720
So I wait 24 hours.
15001
11:47:27,720 --> 11:47:31,720
I start it up and it reads the first thousand and says, oh, they're all in the database already.
15002
11:47:31,720 --> 11:47:34,720
And then it starts at one thousand and one.
15003
11:47:34,720 --> 11:47:36,720
And then it adds that and adds that.
15004
11:47:36,720 --> 11:47:37,720
And then until it stops.
15005
11:47:37,720 --> 11:47:41,720
And so it took me several days of processing to get this data right.
15006
11:47:41,720 --> 11:47:45,720
Now, I didn't have a separate cleaning process because this data is pretty simple.
15007
11:47:45,720 --> 11:47:50,720
I was pulling out the JSON and latitude and longitude, etc.
15008
11:47:50,720 --> 11:47:54,720
And so I didn't have to do two separate processes to clean this data up.
15009
11:47:54,720 --> 11:47:56,720
It was clean enough right as I pulled it.
15010
11:47:56,720 --> 11:47:58,720
Because I was talking to an API.
15011
11:47:58,720 --> 11:48:02,720
If you're talking to the HTML, sometimes it gets nasty and ugly.
15012
11:48:02,720 --> 11:48:05,720
And so then I wrote this program that just reads through it.
15013
11:48:05,720 --> 11:48:12,720
It just does a select and, you know, reads through the stuff and it prints out some summary information and tells you what to do.
15014
11:48:12,720 --> 11:48:19,720
It also prints out and you'll see this pattern because, you know, I'm visualizing using browsers, HTML,
15015
11:48:19,720 --> 11:48:25,720
and this happens to be used in the Google Maps API and putting all the data in a little JavaScript file.
15016
11:48:25,720 --> 11:48:28,720
So these end up being assignment statements in JavaScript.
15017
11:48:28,720 --> 11:48:34,720
You can take a look at that file and all the data shows up as assignment statements in the JavaScript.
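A hedged sketch of how rows might be turned into a JavaScript assignment statement for the browser to load. The variable name myData and the row layout of latitude, longitude, name are assumptions for illustration, not necessarily what the course's dump program writes.

```python
import json

def to_js_assignment(rows, name="myData"):
    """Format rows as a single JavaScript assignment statement."""
    # json.dumps output is also valid JavaScript literal syntax.
    return name + " = " + json.dumps(rows) + ";\n"

js_text = to_js_assignment([[42.28, -83.74, "Ann Arbor, MI"]])
```

Writing js_text to a .js file next to the HTML page lets the page read the data with an ordinary script tag.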
15018
11:48:34,720 --> 11:48:44,720
And then when this HTML loads, it reads this file and puts up all those pins as long as you have access to the in browser JavaScript API.
15019
11:48:44,720 --> 11:48:49,720
So the next thing we're going to talk about is PageRank, which is spidering HTML now.
15020
11:48:49,720 --> 11:48:52,720
We talked a lot about this spider HTML, get some links.
15021
11:48:52,720 --> 11:48:59,720
And so up next, we're going to actually build a real database full featured search engine using PageRank.
15022
11:49:04,720 --> 11:49:06,720
This is another worked code example.
15023
11:49:06,720 --> 11:49:11,720
You can download the sample code zip file if you want to follow along.
15024
11:49:11,720 --> 11:49:16,720
And the code that we're working on today is what I call the geodata code.
15025
11:49:16,720 --> 11:49:25,720
And that is code that is going to pull some locations from this file.
15026
11:49:25,720 --> 11:49:33,720
We're simulating or using the Google Places API to look places up and so we can visualize them on a map.
15027
11:49:33,720 --> 11:49:35,720
And so this is the basic picture.
15028
11:49:35,720 --> 11:49:41,720
If we take a look at this where.data file, it's just a flat file that has a list of organizations.
15029
11:49:41,720 --> 11:49:46,720
And this actually was pulled from one of my MOOC surveys.
15030
11:49:46,720 --> 11:49:52,720
We just let people type in where they went to school and this is just a sample of them.
15031
11:49:52,720 --> 11:49:56,720
So this data is read in by this program geoload.py.
15032
11:49:56,720 --> 11:50:00,720
And if you recall, this Google geodata has rate limits.
15033
11:50:00,720 --> 11:50:03,720
It also has API keys, which we'll talk about in a bit too.
15034
11:50:03,720 --> 11:50:08,720
And so the idea is this is a restartable spider-like process.
15035
11:50:08,720 --> 11:50:14,720
And so we want to be able to run this and have it blow up and run it and start it and not lose what we've got.
15036
11:50:14,720 --> 11:50:19,720
So we're now using a database as well as an API.
15037
11:50:19,720 --> 11:50:25,720
But in order to work around the rate limits of this API, we're going to use the database with a restartable process.
15038
11:50:25,720 --> 11:50:29,720
And then we'll make some sense of this and then we'll visualize this.
15039
11:50:29,720 --> 11:50:34,720
But in the short term, let's start with geoload.py code.
15040
11:50:34,720 --> 11:50:37,720
Geoload.py, take a look here.
15041
11:50:37,720 --> 11:50:42,720
So a lot of this hopefully by now is somewhat familiar to you.
15042
11:50:42,720 --> 11:50:47,720
urllib, json, sqlite3.
15043
11:50:47,720 --> 11:50:53,720
And so I mentioned that the Google APIs, these used to be free and did not require an API key,
15044
11:50:53,720 --> 11:50:58,720
but increasingly they're making you do API keys for especially new ones.
15045
11:50:58,720 --> 11:51:05,720
And so what happens, you can go to your Google Places, go to Google APIs and get an API key.
15046
11:51:05,720 --> 11:51:09,720
And you can put it in here, it'll be this long, big long thing that looks like that.
15047
11:51:09,720 --> 11:51:13,720
And then if you have an API key, you can use the Places API.
15048
11:51:13,720 --> 11:51:19,720
And I've got a copy of a subset, not all of it, a subset of it here at this URL.
15049
11:51:19,720 --> 11:51:26,720
As a matter of fact, you can just go to this URL in a browser.
15050
11:51:26,720 --> 11:51:30,720
And it will tell you a list of the data that it knows about.
15051
11:51:30,720 --> 11:51:41,720
And I made it so that that does the same basic protocol with the address equals as the Google Places API.
15052
11:51:41,720 --> 11:51:46,720
So this will just change how we retrieve the data, either retrieve it from my server.
15053
11:51:46,720 --> 11:51:49,720
Nice thing about my server, it's got no rate limit.
15054
11:51:49,720 --> 11:51:53,720
It's really fast and you're not fighting with Google all the time.
15055
11:51:53,720 --> 11:51:59,720
And it means that perhaps if you're in a country where Google is not well supported, you can use my API.
15056
11:51:59,720 --> 11:52:04,720
And that's really strange that somehow my API is more reliable and available than the Google one.
15057
11:52:04,720 --> 11:52:06,720
But it's true.
15058
11:52:06,720 --> 11:52:08,720
So we're going to make a database.
15059
11:52:08,720 --> 11:52:12,720
We're going to do a create table if not exists, and we'll have some address.
15060
11:52:12,720 --> 11:52:15,720
And we're really just caching the geographical data.
15061
11:52:15,720 --> 11:52:17,720
We're going to cache the JSON.
15062
11:52:17,720 --> 11:52:21,720
One of the things we do when we build these processes is we tend to simplify these things
15063
11:52:21,720 --> 11:52:25,720
and not do all the calculation and parsing the JSON.
15064
11:52:25,720 --> 11:52:30,720
Just load it and get it in and load it and get it in and fill the data up in this database.
15065
11:52:30,720 --> 11:52:33,720
And so that's what we're going to do.
15066
11:52:33,720 --> 11:52:39,720
Because Python doesn't ship with any legitimate certificates, we have to sort of ignore certificate errors.
15067
11:52:39,720 --> 11:52:42,720
We're going to open the file.
15068
11:52:42,720 --> 11:52:49,720
And we're going to loop through it and pull out the address from the file.
15069
11:52:49,720 --> 11:52:56,720
And we're going to select from the geodata where that address is the address.
15070
11:52:56,720 --> 11:52:58,720
Let's move this in a bit.
15071
11:52:58,720 --> 11:53:03,720
And so we're going to do a select and pull out that address.
15072
11:53:03,720 --> 11:53:07,720
And the idea is if it's already in the database, we don't want to do it.
15073
11:53:07,720 --> 11:53:13,720
So we do a fetchone and pull out that first thing, which will be the JSON right there.
15074
11:53:13,720 --> 11:53:15,720
If we get that, we'll continue up.
15075
11:53:15,720 --> 11:53:18,720
Otherwise, we'll keep going.
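A minimal sketch of the check-then-skip loop described above, which is what makes the process restartable. The Locations table, the load function, and fetch_stub are assumptions for illustration; fetch_stub stands in for the real API request so the example runs offline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the lecture uses a file so data survives restarts
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)")

calls = []  # records which addresses actually hit the (fake) API

def fetch_stub(address):
    # Hypothetical stand-in for the web request; returns JSON text.
    calls.append(address)
    return '{"status": "OK"}'

def load(addresses):
    for address in addresses:
        address = address.strip()
        if not address:
            continue
        cur.execute("SELECT geodata FROM Locations WHERE address = ?", (address,))
        if cur.fetchone() is not None:
            print("Found in database", address)
            continue  # already cached, go to the next address
        cur.execute(
            "INSERT INTO Locations (address, geodata) VALUES (?, ?)",
            (address, fetch_stub(address)),
        )
        conn.commit()

load(["Ann Arbor, MI", "Stanford University"])
load(["Ann Arbor, MI", "Stanford University", "MIT"])  # a "restart"
```

On the second run the first two addresses are skipped and only the new one is requested.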
15076
11:53:18,720 --> 11:53:20,720
Pass just means don't blow up.
15077
11:53:20,720 --> 11:53:22,720
So we except and we just do a pass.
15078
11:53:22,720 --> 11:53:24,720
That's like a no-op.
15079
11:53:24,720 --> 11:53:31,720
And we're going to make a dictionary because that's what we do for the key value pairs.
15080
11:53:31,720 --> 11:53:34,720
Everything you've seen so far, I've used constants here.
15081
11:53:34,720 --> 11:53:39,720
But because we may or may not have an API key, query equals and then that's the address.
15082
11:53:39,720 --> 11:53:42,720
And then the key equals and then the API key.
15083
11:53:42,720 --> 11:53:49,720
If you recall, urlencode adds the pluses and question marks and all that nice stuff.
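A short sketch of building the request URL from a dictionary, with the key included only when there is one. The service address is a placeholder and the parameter names address and key are assumptions for illustration.

```python
import urllib.parse

service_url = "https://example.com/geo/json?"  # placeholder, not the real service

def build_url(address, api_key=None):
    """Build the query URL, adding the key only if we have one."""
    params = {"address": address}
    if api_key is not None:
        params["key"] = api_key
    # urlencode escapes spaces as '+' and commas as '%2C'
    return service_url + urllib.parse.urlencode(params)

url_without_key = build_url("Ann Arbor, MI")
url_with_key = build_url("Ann Arbor, MI", "SECRET")
```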
15084
11:53:49,720 --> 11:53:50,720
We're going to retrieve it.
15085
11:53:50,720 --> 11:53:52,720
We're going to read it and decode it.
15086
11:53:52,720 --> 11:53:55,720
Print out how much data we've got.
15087
11:53:55,720 --> 11:53:57,720
And add to the count.
15088
11:53:57,720 --> 11:54:02,720
And then we're going to try to parse that JSON data and print it if something goes wrong.
15089
11:54:02,720 --> 11:54:09,720
And as we've seen, at this top level of this JSON data from this geocoding API is an object,
15090
11:54:09,720 --> 11:54:13,720
which we'll see a little bit of in a bit.
15091
11:54:13,720 --> 11:54:15,720
And it has a status field in it.
15092
11:54:15,720 --> 11:54:19,720
And the status is OK if things went well.
15093
11:54:19,720 --> 11:54:25,720
So if the status is not there, that means our JSON is not well formed or not how we expect it.
15094
11:54:25,720 --> 11:54:32,720
If the status is not OK and not equal to ZERO_RESULTS, then print out failure to retrieve and then quit.
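A hedged sketch of the status check just described. It assumes the top level of the response is an object with a status field and treats OK and ZERO_RESULTS as acceptable; the function name check_geodata is illustrative.

```python
import json

def check_geodata(text):
    """Return True if text is JSON whose status is OK or ZERO_RESULTS."""
    try:
        js = json.loads(text)
    except json.JSONDecodeError:
        return False  # not well-formed JSON
    if not isinstance(js, dict) or "status" not in js:
        return False  # not shaped how we expect
    return js["status"] in ("OK", "ZERO_RESULTS")

good = check_geodata('{"status": "OK", "results": []}')
empty = check_geodata('{"status": "ZERO_RESULTS"}')
limited = check_geodata('{"status": "OVER_QUERY_LIMIT"}')
broken = check_geodata("<html>not json</html>")
```

When the check fails, the loader prints a failure message and stops rather than storing bad data.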
15095
11:54:32,720 --> 11:54:36,720
And then we're simply going to insert this new data that we just put in.
15096
11:54:36,720 --> 11:54:38,720
And then we're going to commit it.
15097
11:54:38,720 --> 11:54:41,720
And every tenth one, this is count mod 10.
15098
11:54:41,720 --> 11:54:43,720
We're going to pause for five seconds.
15099
11:54:43,720 --> 11:54:45,720
And we can hit control C here.
15100
11:54:45,720 --> 11:54:48,720
And then we're going to play the, do the geodump.
15101
11:54:48,720 --> 11:54:49,720
Okay.
15102
11:54:49,720 --> 11:54:51,720
So let's just run this.
15103
11:54:51,720 --> 11:54:54,720
Geodata.
15104
11:54:54,720 --> 11:54:56,720
Python.
15105
11:54:56,720 --> 11:54:58,720
So let's do an ls.
15106
11:54:58,720 --> 11:55:04,720
So we don't have, we do have, let's get rid of from a previous test, geodata.sqlite.
15107
11:55:04,720 --> 11:55:13,720
So we'll start with a fresh, fresh set of data and run python geoload.py.
15108
11:55:13,720 --> 11:55:17,720
Of course, I'm always forever making the mistake of forgetting Python 3.
15109
11:55:17,720 --> 11:55:19,720
So you can see that it's running.
15110
11:55:19,720 --> 11:55:21,720
And it's adding the query.
15111
11:55:21,720 --> 11:55:24,720
And in this case, I don't have the API key.
15112
11:55:24,720 --> 11:55:25,720
And it's putting the pluses in.
15113
11:55:25,720 --> 11:55:28,720
And that's this part here with all the pluses.
15114
11:55:28,720 --> 11:55:30,720
That's the urlencode.
15115
11:55:30,720 --> 11:55:31,720
And you notice it's pausing a bit.
15116
11:55:31,720 --> 11:55:35,720
Now, depending on how fast your net connection is, this may or may not go so fast.
15117
11:55:35,720 --> 11:55:36,720
But this is not that much data.
15118
11:55:36,720 --> 11:55:40,720
So it should, it's like only 2,000, 3,000 characters.
15119
11:55:40,720 --> 11:55:45,720
And so it's working and talking to my server.
15120
11:55:45,720 --> 11:55:47,720
And the interesting thing here is I can blow this up.
15121
11:55:47,720 --> 11:55:49,720
I'm going to hit control C.
15122
11:55:49,720 --> 11:55:51,720
In Windows you'd hit control.
15123
11:55:51,720 --> 11:55:53,720
In Linux you'd hit control C.
15124
11:55:53,720 --> 11:55:55,720
And in Windows I think you'd hit control Z.
15125
11:55:55,720 --> 11:55:57,720
Depending on what shell you're working in.
15126
11:55:57,720 --> 11:55:58,720
But I'm going to hit control C.
15127
11:55:58,720 --> 11:56:00,720
And you see I sort of blew it up, right?
15128
11:56:00,720 --> 11:56:04,720
And that causes a traceback, a keyboard interrupt traceback.
15129
11:56:04,720 --> 11:56:09,720
If I do an ls -l, you can see that now this geodata.sqlite is there.
15130
11:56:09,720 --> 11:56:13,720
Now in the name of restarting, I will restart this.
15131
11:56:13,720 --> 11:56:16,720
And you will see that it checks and skips.
15132
11:56:16,720 --> 11:56:23,720
And so it runs this code right here.
15133
11:56:23,720 --> 11:56:24,720
It grabs it and finds it in the database.
15134
11:56:24,720 --> 11:56:26,720
So you'll see it say found in the database really quick.
15135
11:56:26,720 --> 11:56:27,720
Chop, chop, chop.
15136
11:56:27,720 --> 11:56:28,720
And go really fast.
15137
11:56:28,720 --> 11:56:32,720
And then it'll go back to catching up where it left off.
15138
11:56:32,720 --> 11:56:36,720
And so all those up there, they did not actually re-retrieve it,
15139
11:56:36,720 --> 11:56:38,720
because it knew about those things.
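A rough sketch of that check-and-skip idea, assuming a hypothetical Locations table with address and geodata columns and a helper named already_have (all names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real program would use a file on disk
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)")

def already_have(cur, address):
    """Return True if this address was stored on an earlier run."""
    cur.execute("SELECT geodata FROM Locations WHERE address = ?", (address,))
    return cur.fetchone() is not None

# Pretend a previous run stored one address before it was interrupted.
cur.execute("INSERT INTO Locations VALUES (?, ?)", ("Ann Arbor, MI", "{}"))
conn.commit()

for address in ["Ann Arbor, MI", "Lansing, MI"]:
    if already_have(cur, address):
        print("Found in database", address)
        continue  # skip the slow network request
    print("Would retrieve", address)
```

Because the lookup is local, the already-retrieved addresses fly by and only the new ones cost a network round trip.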
15140
11:56:38,720 --> 11:56:40,720
And so now it's catching up and doing some more,
15141
11:56:40,720 --> 11:56:43,720
and doing some more, and doing some more.
15142
11:56:43,720 --> 11:56:45,720
And then I'll hit control C.
15143
11:56:45,720 --> 11:56:47,720
It has a little counter in here that basically,
15144
11:56:47,720 --> 11:56:50,720
if it hits 200 it stops and you have to restart it.
15145
11:56:50,720 --> 11:56:52,720
You could obviously change this code.
15146
11:56:52,720 --> 11:56:54,720
You could make it so it didn't sleep.
15147
11:56:54,720 --> 11:56:57,720
It doesn't hurt to sleep for like a second after every 100 or so
15148
11:56:57,720 --> 11:56:58,720
if you want.
15149
11:56:58,720 --> 11:57:00,720
You could change that code.
15150
11:57:00,720 --> 11:57:04,720
And now let's just hit control C.
15151
11:57:04,720 --> 11:57:05,720
And blow it up.
15152
11:57:05,720 --> 11:57:07,720
ls -l.
15153
11:57:07,720 --> 11:57:08,720
And there is another bit of code.
15154
11:57:08,720 --> 11:57:12,720
And this code, it's always good to write these really simple things.
15155
11:57:12,720 --> 11:57:16,720
And so now we're going to import sqlite3 and json.
15156
11:57:16,720 --> 11:57:18,720
We're going to connect ourselves up.
15157
11:57:18,720 --> 11:57:22,720
We're going to open the output file, except as UTF-8,
15158
11:57:22,720 --> 11:57:26,720
because we're going to open this with UTF-8.
15159
11:57:26,720 --> 11:57:29,720
And we're going to read through.
15160
11:57:29,720 --> 11:57:35,720
And in this case, we are going to decode.
15161
11:57:35,720 --> 11:57:37,720
We do a SELECT * FROM locations.
15162
11:57:37,720 --> 11:57:43,720
And if you recall, locations has a location and a geodata.
15163
11:57:43,720 --> 11:57:45,720
And so the sub-zero will be the location,
15164
11:57:45,720 --> 11:57:49,720
and the sub-one will be the geodata.
15165
11:57:49,720 --> 11:57:54,720
And we're going to convert it to a string and then parse it.
15166
11:57:54,720 --> 11:57:57,720
If something goes wrong with the JSON, we'll just skip it.
15167
11:57:57,720 --> 11:58:03,720
We'll check to see if we have the status in our JSON.
15168
11:58:03,720 --> 11:58:10,720
Let me run the SQLite browser here.
15169
11:58:10,720 --> 11:58:12,720
File, open database.
15170
11:58:12,720 --> 11:58:16,720
Let's take a look at what's in this database.
15171
11:58:16,720 --> 11:58:17,720
Oh, where are we?
15172
11:58:17,720 --> 11:58:19,720
Code three.
15173
11:58:19,720 --> 11:58:20,720
Geodata.
15174
11:58:20,720 --> 11:58:22,720
geodata.sqlite.
15175
11:58:22,720 --> 11:58:25,720
So this is the data we've got.
15176
11:58:25,720 --> 11:58:28,720
So if you make this a little bigger, if I can, can I make that bigger?
15177
11:58:28,720 --> 11:58:30,720
Yeah, it's not going to show us much.
15178
11:58:30,720 --> 11:58:33,720
So you can see that these are the addresses and the geodata.
15179
11:58:33,720 --> 11:58:35,720
That's just the JSON.
15180
11:58:35,720 --> 11:58:38,720
So that's the JSON that we've got, and it retrieves it.
15181
11:58:38,720 --> 11:58:40,720
And so this is a really simple database.
15182
11:58:40,720 --> 11:58:42,720
It's just a sort of spidering process.
15183
11:58:42,720 --> 11:58:44,720
Run, run, run.
15184
11:58:44,720 --> 11:58:46,720
But now we're going to run the geodump code,
15185
11:58:46,720 --> 11:58:50,720
which is going to read this and dump this stuff out and write where.js,
15186
11:58:50,720 --> 11:58:52,720
so it's going to actually parse this stuff.
15187
11:58:52,720 --> 11:58:56,720
And that's code we've seen before.
15188
11:58:56,720 --> 11:58:57,720
So we're actually reading it.
15189
11:58:57,720 --> 11:58:59,720
And this line goes into the results.
15190
11:58:59,720 --> 11:59:01,720
The results is an array.
15191
11:59:01,720 --> 11:59:04,720
So if we go into results, results is an array.
15192
11:59:04,720 --> 11:59:07,720
We're going to go grab the zeroth item in that array.
15193
11:59:07,720 --> 11:59:11,720
And then we're going to go find geometry.
15194
11:59:11,720 --> 11:59:13,720
And then location.
15195
11:59:13,720 --> 11:59:16,720
And then lat and lng for the latitude and longitude.
15196
11:59:16,720 --> 11:59:22,720
And then we're also going to take the actual address out of the formatted address right here.
15197
11:59:22,720 --> 11:59:28,720
So in this bit of code, we're actually parsing the JSON.
15198
11:59:28,720 --> 11:59:33,720
And we're going to clean things up, get rid of some single quotes.
15199
11:59:33,720 --> 11:59:36,720
This kind of data cleaning is just stuff you figure out after you play with it for a while.
15200
11:59:36,720 --> 11:59:39,720
You realize, oh, my data is ugly or does this.
15201
11:59:39,720 --> 11:59:40,720
And I print it out.
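Walking down into that structure could look something like the following. The sample JSON is made up for illustration and only mimics the shape described here (results, geometry, location, lat, lng, formatted_address):

```python
import json

# Made-up sample in the shape of a geocoding response; not real API output.
sample = """
{
  "status": "OK",
  "results": [
    {
      "formatted_address": "Ann Arbor, MI, USA",
      "geometry": {"location": {"lat": 42.28, "lng": -83.74}}
    }
  ]
}
"""

js = json.loads(sample)
first = js["results"][0]                  # results is an array; take item zero
lat = first["geometry"]["location"]["lat"]
lng = first["geometry"]["location"]["lng"]
where = first["formatted_address"].replace("'", "")  # strip single quotes
print(where, lat, lng)
```

Stripping the single quotes matters because the address ends up inside a quoted string in the JavaScript file.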
15202
11:59:40,720 --> 11:59:42,720
And then I'm going to write this out.
15203
11:59:42,720 --> 11:59:45,720
And I'm going to write it into a JavaScript file.
15204
11:59:45,720 --> 11:59:50,720
And so the JavaScript file is this where.js.
15205
11:59:50,720 --> 11:59:53,720
And I'll show you what it looks like.
15206
11:59:53,720 --> 11:59:55,720
It's going to be overwritten.
15207
11:59:55,720 --> 11:59:57,720
This is the one that came out of the zip file.
15208
11:59:57,720 --> 11:59:59,720
It'll have the latitude, the longitude.
15209
11:59:59,720 --> 12:00:04,720
And we're going to use JavaScript to read this in this where.html file.
15210
12:00:04,720 --> 12:00:08,720
It's going to actually read this right there and pull that data in.
15211
12:00:08,720 --> 12:00:10,720
And that's how we're going to visualize.
15212
12:00:10,720 --> 12:00:16,720
I'm not going to go into great detail on how the visualization happens.
15213
12:00:16,720 --> 12:00:17,720
But that's what's happening.
15214
12:00:17,720 --> 12:00:18,720
And so we're going to write that.
15215
12:00:18,720 --> 12:00:20,720
So we're going to actually write this to a file.
15216
12:00:20,720 --> 12:00:30,720
So let's go ahead and run this code and say python3 geodump.py.
15217
12:00:30,720 --> 12:00:33,720
OK, so it wrote 120 records to where.js.
15218
12:00:33,720 --> 12:00:36,720
So if we look at where.js, this is now the new data
15219
12:00:36,720 --> 12:00:39,720
that I just downloaded moments ago.
15220
12:00:39,720 --> 12:00:51,720
And it says open where.html in a browser.
15221
12:00:51,720 --> 12:00:54,720
Now, for this you'll need the Google Maps API.
15222
12:00:54,720 --> 12:00:57,720
And you might not be able to see this depending on where you're at.
15223
12:00:57,720 --> 12:01:00,720
But here you go with Google Maps locations.
15224
12:01:00,720 --> 12:01:03,720
And I think if you hover over this, you can see.
15225
12:01:03,720 --> 12:01:06,720
And you see there, in that particular one,
15226
12:01:06,720 --> 12:01:11,720
why we had to use the UTF-8 when we wrote the file
15227
12:01:11,720 --> 12:01:13,720
so that we didn't end up with trouble writing the file out.
15228
12:01:13,720 --> 12:01:14,720
And so there you go.
15229
12:01:14,720 --> 12:01:19,720
And so that is a simple visualization.
15230
12:01:19,720 --> 12:01:22,720
And just a simple visualization.
15231
12:01:22,720 --> 12:01:24,720
It wrote this where.js.
15232
12:01:24,720 --> 12:01:27,720
If you are smart with HTML and JavaScript,
15233
12:01:27,720 --> 12:01:31,720
you can look at this where.html file.
15234
12:01:31,720 --> 12:01:35,720
It's really just reading through a bunch of data and putting the points.
15235
12:01:35,720 --> 12:01:36,720
That's all there is.
15236
12:01:36,720 --> 12:01:39,720
But I'm not going to go through that.
15237
12:01:39,720 --> 12:01:42,720
So at least not in this.
15238
12:01:42,720 --> 12:01:45,720
And so I hope that this was useful to you.
15239
12:01:45,720 --> 12:01:51,720
And thanks for watching.
15240
12:01:51,720 --> 12:01:53,720
So now we're going to write a search engine.
15241
12:01:53,720 --> 12:01:54,720
Do some of the things.
15242
12:01:54,720 --> 12:01:55,720
We're going to do page rank.
15243
12:01:55,720 --> 12:01:59,720
And we're going to visualize it in a web browser and show the weights.
15244
12:01:59,720 --> 12:02:02,720
We're really only going to do page rank on one website
15245
12:02:02,720 --> 12:02:05,720
because you want to have more than one page
15246
12:02:05,720 --> 12:02:07,720
that points to a page so that you can figure out
15247
12:02:07,720 --> 12:02:09,720
which pages are more or less important.
15248
12:02:09,720 --> 12:02:10,720
And then visualize it.
15249
12:02:10,720 --> 12:02:13,720
We'll run the page rank algorithm and we'll separately do all this.
15250
12:02:13,720 --> 12:02:16,720
So at this point we're going to do pretty much the web crawling,
15251
12:02:16,720 --> 12:02:18,720
the index building, and the searching.
15252
12:02:18,720 --> 12:02:19,720
We're not going to really search it.
15253
12:02:19,720 --> 12:02:21,720
We're going to visualize the index.
15254
12:02:21,720 --> 12:02:24,720
But you could write a simple program to do searches for keywords
15255
12:02:24,720 --> 12:02:27,720
and figure out which page was the most likely page for a keyword.
15256
12:02:27,720 --> 12:02:30,720
And that would be a fun additional thing to do.
15257
12:02:30,720 --> 12:02:34,720
So the web crawler is this program that hits a page,
15258
12:02:34,720 --> 12:02:37,720
pulls down the HTML, parses the page, looks for links,
15259
12:02:37,720 --> 12:02:41,720
makes a queue of incoming links that are as yet unretrieved.
15260
12:02:41,720 --> 12:02:44,720
And I'm going to do this in a simple SQLite database.
15261
12:02:44,720 --> 12:02:48,720
The database basically starts out with one link as the starting point
15262
12:02:48,720 --> 12:02:50,720
and then it retrieves that page.
15263
12:02:50,720 --> 12:02:53,720
And then you see the database end up with lots of unretrieved pages.
15264
12:02:53,720 --> 12:02:56,720
And then it goes back in and picks a random page and retrieves that one.
15265
12:02:56,720 --> 12:02:58,720
And then it just expands and expands.
15266
12:02:58,720 --> 12:03:01,720
This code that I've built that you're going to play with
15267
12:03:01,720 --> 12:03:04,720
only stays on one website, otherwise it would go crazy.
15268
12:03:04,720 --> 12:03:08,720
And of course, Google doesn't use an SQLite database running on your hard drive.
15269
12:03:08,720 --> 12:03:10,720
But you'll get the idea.
15270
12:03:10,720 --> 12:03:13,720
You'll see this thing exponentially gain links.
15271
12:03:13,720 --> 12:03:17,720
And you'll run it for a while, pull down 1,000 web pages or whatever.
15272
12:03:17,720 --> 12:03:23,720
But of course, make sure that you don't violate any terms and conditions.
15273
12:03:23,720 --> 12:03:27,720
And again, I've got some data sources that you can use.
15274
12:03:27,720 --> 12:03:29,720
And they're not rate limited.
15275
12:03:29,720 --> 12:03:31,720
But you can also use things like Wikipedia,
15276
12:03:31,720 --> 12:03:33,720
which I think they sort of discourage you.
15277
12:03:33,720 --> 12:03:36,720
Or dr-chuck.com, which has no rate limit.
15278
12:03:36,720 --> 12:03:38,720
Or who knows what, right?
15279
12:03:38,720 --> 12:03:39,720
So just be careful.
15280
12:03:39,720 --> 12:03:40,720
Don't do this on Facebook.
15281
12:03:40,720 --> 12:03:42,720
And don't do it on Google.
15282
12:03:42,720 --> 12:03:43,720
Don't get yourself in trouble.
15283
12:03:43,720 --> 12:03:49,720
And if you're using, you know, an internet connection
15284
12:03:49,720 --> 12:03:51,720
where you're paying for bandwidth, be careful.
15285
12:03:51,720 --> 12:03:53,720
So this is the idea of the web crawler.
15286
12:03:53,720 --> 12:03:54,720
And this isn't my picture.
15287
12:03:54,720 --> 12:03:56,720
This is the classic picture of a web crawler.
15288
12:03:56,720 --> 12:04:02,720
Read a page, parse it, take all the URLs and stick them in a queue,
15289
12:04:02,720 --> 12:04:04,720
grab again and again.
15290
12:04:04,720 --> 12:04:07,720
So for us, the scheduler is going to do it as long as you'd say,
15291
12:04:07,720 --> 12:04:10,720
oh, do 100 pages or it runs until it blows up.
15292
12:04:10,720 --> 12:04:14,720
And again, these processes that have the network in the loop,
15293
12:04:14,720 --> 12:04:17,720
it's really important that they behave well when they blow up.
15294
12:04:17,720 --> 12:04:19,720
And that's why databases are so useful.
15295
12:04:19,720 --> 12:04:21,720
Because you can be writing along to the database.
15296
12:04:21,720 --> 12:04:25,720
And some random thing happens and blows your program up, and you just start back up.
15297
12:04:25,720 --> 12:04:28,720
So you're reading these things, you're storing each page,
15298
12:04:28,720 --> 12:04:30,720
building up your storage, et cetera, et cetera.
15299
12:04:30,720 --> 12:04:32,720
So you just keep on doing that.
15300
12:04:32,720 --> 12:04:35,720
And with this program, you'll be able to retrieve some stuff,
15301
12:04:35,720 --> 12:04:37,720
then run the page rank, then you can retrieve some more,
15302
12:04:37,720 --> 12:04:39,720
and then you can run some more page rank.
15303
12:04:39,720 --> 12:04:44,720
And you can kind of see how Google sort of evolves its index over time.
15304
12:04:44,720 --> 12:04:46,720
Of course, we're so much simpler.
15305
12:04:46,720 --> 12:04:49,720
And like I said, be careful when you crawl.
15306
12:04:49,720 --> 12:04:52,720
You're going to run a crawler that just goes as fast as it can.
15307
12:04:52,720 --> 12:04:54,720
But Google doesn't do that.
15308
12:04:54,720 --> 12:04:57,720
It's careful not to overwhelm any websites.
15309
12:04:57,720 --> 12:05:02,720
It's trying to be smart about the use of your bandwidth on your website.
15310
12:05:02,720 --> 12:05:04,720
There is a file.
15311
12:05:04,720 --> 12:05:09,720
Our code won't bother looking at this.
15312
12:05:09,720 --> 12:05:12,720
But there's a file called robots.txt that real web crawlers look at,
15313
12:05:12,720 --> 12:05:15,720
and it gives a list of the things you are allowed to look at
15314
12:05:15,720 --> 12:05:17,720
and not allowed to look at.
15315
12:05:17,720 --> 12:05:19,720
And so if you go to Google and you see a search that says,
15316
12:05:19,720 --> 12:05:23,720
we are not allowed to show you the summary text of this page
15317
12:05:23,720 --> 12:05:26,720
because of the robots.txt, it's there.
15318
12:05:26,720 --> 12:05:31,720
And you can go and you can actually see a robots.txt.
15319
12:05:31,720 --> 12:05:33,720
Just go to any website.
15320
12:05:33,720 --> 12:05:37,720
It's at the top root, blah, blah, blah, blah, blah, slash robots.txt.
15321
12:05:37,720 --> 12:05:38,720
It's not a path.
15322
12:05:38,720 --> 12:05:41,720
It's not slash this, slash that, slash something else, robots.
15323
12:05:41,720 --> 12:05:45,720
It's at the very, very top of a website.
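As a small sketch of how a crawler could honor that file, here is Python's standard urllib.robotparser at work. The page URL and the robots.txt rules are made up, and the rules are fed in directly so the example runs without a network connection:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical page deep inside a site.
page = "https://www.example.com/this/that/something.html"

# robots.txt always lives at the top of the site, whatever the page path is.
robots_url = urljoin(page, "/robots.txt")
print(robots_url)

# Feed made-up rules to the parser instead of downloading them.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

print(rp.can_fetch("*", "https://www.example.com/private/a.html"))
print(rp.can_fetch("*", "https://www.example.com/index.html"))
```

A real crawler would call set_url and read to download the file before asking can_fetch for each URL.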
15324
12:05:45,720 --> 12:05:47,720
The index building uses the page rank algorithm.
15325
12:05:47,720 --> 12:05:49,720
And the whole goal of the page rank algorithm
15326
12:05:49,720 --> 12:05:57,720
is to figure out which pages have the most best links.
15327
12:05:57,720 --> 12:05:59,720
So having the most links is really easy.
15328
12:05:59,720 --> 12:06:01,720
You can just say, how many links go to this?
15329
12:06:01,720 --> 12:06:04,720
But the problem is you've got to figure out the value of those links.
15330
12:06:04,720 --> 12:06:07,720
And how do you figure the value of those links?
15331
12:06:07,720 --> 12:06:11,720
By looking at how many good links come to it.
15332
12:06:11,720 --> 12:06:14,720
So it turns out that it's an infinite problem.
15333
12:06:14,720 --> 12:06:18,720
It's an infinitely difficult problem to use page rank.
15334
12:06:18,720 --> 12:06:20,720
But you can approximate it.
15335
12:06:20,720 --> 12:06:25,720
And what happens is, after a while, it converges to a reasonable value.
15336
12:06:25,720 --> 12:06:27,720
And so we're going to run the search index.
15337
12:06:27,720 --> 12:06:30,720
And each time it runs, you're going to see that it says,
15338
12:06:30,720 --> 12:06:32,720
how much did these numbers change?
15339
12:06:32,720 --> 12:06:35,720
And what happens is, in the beginning, they change very wildly.
15340
12:06:35,720 --> 12:06:37,720
But quickly, they flatten out.
15341
12:06:37,720 --> 12:06:43,720
And the best way to think about the page rank
15342
12:06:43,720 --> 12:06:52,720
is think about how water runs, where you have a small little stream going by a house.
15343
12:06:52,720 --> 12:06:56,720
And sometimes it rains. Sometimes it's dry.
15344
12:06:56,720 --> 12:07:02,720
And sometimes there's like a little lake.
15345
12:07:02,720 --> 12:07:04,720
And the stream is always running.
15346
12:07:04,720 --> 12:07:06,720
And it doesn't go up and it doesn't go down.
15347
12:07:06,720 --> 12:07:08,720
It might go up a little bit if it rains a lot.
15348
12:07:08,720 --> 12:07:10,720
But in general, there's sort of a steady state,
15349
12:07:10,720 --> 12:07:14,720
meaning that whatever water's coming in is about the same as the water going out.
15350
12:07:14,720 --> 12:07:17,720
So we think about this in terms of web pages.
15351
12:07:17,720 --> 12:07:22,720
The value of the links coming in is roughly the same as the value of links going out.
15352
12:07:22,720 --> 12:07:27,720
So when that starts to balance the in and the out value from each of the nodes,
15353
12:07:27,720 --> 12:07:30,720
then you've got something pretty stable.
15354
12:07:30,720 --> 12:07:34,720
And so what Google does is they have a relatively stable assessment
15355
12:07:34,720 --> 12:07:36,720
of goodness and value of pages.
15356
12:07:36,720 --> 12:07:38,720
And they use that to compute page rank.
15357
12:07:38,720 --> 12:07:41,720
And then they throw a few more pages in and it kind of has to adjust for a while,
15358
12:07:41,720 --> 12:07:42,720
but it reconverges.
15359
12:07:42,720 --> 12:07:49,720
And so this is a calculation that generally converges and it doesn't vary wildly.
15360
12:07:49,720 --> 12:07:54,720
And that's why Google's pretty good at kind of arriving at the true value of something.
15361
12:07:54,720 --> 12:07:58,720
So let's take a look at what we're going to do in this application.
15362
12:07:58,720 --> 12:08:04,720
Again, we have a file that is going to spider the web.
15363
12:08:04,720 --> 12:08:06,720
And we only have one database.
15364
12:08:06,720 --> 12:08:09,720
That is, in this one; we'll have two databases in the next one.
15365
12:08:09,720 --> 12:08:13,720
And so this spider is the restartable part.
15366
12:08:13,720 --> 12:08:17,720
And what we actually do is we put one URL in, the starting URL.
15367
12:08:17,720 --> 12:08:22,720
And then spider walks in and asks, are there any unretrieved pages?
15368
12:08:22,720 --> 12:08:24,720
And it does that randomly.
15369
12:08:24,720 --> 12:08:27,720
It sort of picks among the unretrieved pages and says, okay, great.
15370
12:08:27,720 --> 12:08:28,720
I'll go retrieve that page.
15371
12:08:28,720 --> 12:08:30,720
And then I'll parse that page.
15372
12:08:30,720 --> 12:08:33,720
And then I'll put in a bunch of new unretrieved pages.
15373
12:08:33,720 --> 12:08:37,720
Okay, as well as the text of that page and then a bunch of unretrieved pages.
15374
12:08:37,720 --> 12:08:42,720
And then it'll go back up and it'll say, oh, give me one of the unretrieved pages at random.
15375
12:08:42,720 --> 12:08:45,720
And it'll grab the next page and pull that page down and then add to it.
15376
12:08:45,720 --> 12:08:49,720
And so this is like there's a page and then a to-do list.
15377
12:08:49,720 --> 12:08:53,720
And then this one becomes a page and then adds a few more things to the to-do list.
15378
12:08:53,720 --> 12:08:59,720
And so the to-do list or the unretrieved URLs grows very rapidly.
15379
12:08:59,720 --> 12:09:02,720
And the retrieved ones grow sort of as you retrieve them one at a time.
15380
12:09:02,720 --> 12:09:04,720
But you've always got this long list.
15381
12:09:04,720 --> 12:09:07,720
If you have a really short site that only has like two links,
15382
12:09:07,720 --> 12:09:11,720
if you start at dr-chuck.com/page1.htm,
15383
12:09:11,720 --> 12:09:14,720
it'll go to page two and then go back to page one and it'll be out of things.
15384
12:09:14,720 --> 12:09:16,720
It'll have retrieved all of the pages.
15385
12:09:16,720 --> 12:09:20,720
And so if you have a website that has no external links or has very few pages
15386
12:09:20,720 --> 12:09:23,720
and they point to each other, this will run out of things to do.
15387
12:09:23,720 --> 12:09:29,720
But if you go to a page like my blog or the sample stuff that I have up
15388
12:09:29,720 --> 12:09:33,720
for you to spider for testing on dr-chuck.net,
15389
12:09:33,720 --> 12:09:35,720
it'll run for a very long time.
15390
12:09:35,720 --> 12:09:38,720
And you'll have far more pages to retrieve than pages that you have retrieved.
15391
12:09:38,720 --> 12:09:39,720
But that's okay.
15392
12:09:39,720 --> 12:09:41,720
At some point, you can stop this.
15393
12:09:41,720 --> 12:09:43,720
Maybe it stops because you ran out of bandwidth
15394
12:09:43,720 --> 12:09:46,720
or maybe your computer went down or who knows what, right?
15395
12:09:46,720 --> 12:09:47,720
But it's okay.
15396
12:09:47,720 --> 12:09:51,720
This is a restartable process because it always has some pages that are retrieved
15397
12:09:51,720 --> 12:09:52,720
and some unretrieved pages.
15398
12:09:52,720 --> 12:09:53,720
You start it back up.
15399
12:09:53,720 --> 12:09:55,720
It picks randomly from the unretrieved pages.
15400
12:09:55,720 --> 12:10:00,720
The database is the sort of persistent state of your spider
15401
12:10:00,720 --> 12:10:03,720
rather than a bunch of dictionaries or lists inside the Python
15402
12:10:03,720 --> 12:10:06,720
which go away when the program dies.
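Here is one way the to-do list could live in the database, as a sketch only. The Pages table, its columns, and the helper names are assumptions for illustration, and the page content is faked rather than downloaded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real spider would use a file on disk
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS Pages "
    "(id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)"
)

# Seed the database with the single starting URL; html is NULL until retrieved.
cur.execute(
    "INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)",
    ("http://example.com/",),
)

def next_unretrieved(cur):
    """Pick one unretrieved page at random, or None when the list is empty."""
    cur.execute(
        "SELECT id, url FROM Pages WHERE html IS NULL ORDER BY RANDOM() LIMIT 1"
    )
    return cur.fetchone()

def save_page(cur, page_id, html, links):
    """Store the page text and queue any links not already known."""
    cur.execute("UPDATE Pages SET html = ? WHERE id = ?", (html, page_id))
    for link in links:
        cur.execute(
            "INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)", (link,)
        )

page_id, url = next_unretrieved(cur)
save_page(cur, page_id, "<html>...</html>",
          ["http://example.com/a", "http://example.com/b"])
conn.commit()

cur.execute("SELECT COUNT(*) FROM Pages WHERE html IS NULL")
print("Unretrieved:", cur.fetchone()[0])
```

Since retrieved and unretrieved pages are both rows in the same table, restarting is just running the same loop again.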
15403
12:10:06,720 --> 12:10:10,720
And so at some point you have, let's just say, a few hundred pages in here
15404
12:10:10,720 --> 12:10:12,720
and a few thousand unretrieved pages.
15405
12:10:12,720 --> 12:10:14,720
You can run the page rank algorithm.
15406
12:10:14,720 --> 12:10:17,720
And what the page rank algorithm does is it loops through all the pages
15407
12:10:17,720 --> 12:10:19,720
and figures out which pages are linked to which pages
15408
12:10:19,720 --> 12:10:22,720
and then reads the numbers and then updates the numbers
15409
12:10:22,720 --> 12:10:25,720
and then does that some number of times.
15410
12:10:25,720 --> 12:10:27,720
And so this is where the numbers, all the pages,
15411
12:10:27,720 --> 12:10:29,720
sort of start out with goodness of one.
15412
12:10:29,720 --> 12:10:32,720
I think this printout is showing that goodness of one.
15413
12:10:32,720 --> 12:10:34,720
And then it changes.
15414
12:10:34,720 --> 12:10:37,720
And then the goodness goes to, some of the goodness goes up to two.
15415
12:10:37,720 --> 12:10:40,720
Some of the goodness goes to seven and whatever.
15416
12:10:40,720 --> 12:10:43,720
But then it does this over and over and then it uses these numbers
15417
12:10:43,720 --> 12:10:44,720
and then they change again.
15418
12:10:44,720 --> 12:10:47,720
And so there's a number of time steps that this page rank runs.
15419
12:10:47,720 --> 12:10:51,720
And you will see as the page rank runs, when I show you the code,
15420
12:10:51,720 --> 12:10:56,720
you'll see the average sort of change in these numbers
15421
12:10:56,720 --> 12:10:57,720
across all these things.
15422
12:10:57,720 --> 12:11:01,720
And you'll see that the average goes down very rapidly as you get through.
15423
12:11:01,720 --> 12:11:04,720
And so usually with a few hundred or even thousand pages,
15424
12:11:04,720 --> 12:11:07,720
you run like a hundred plus times through this algorithm
15425
12:11:07,720 --> 12:11:09,720
and these numbers have converged.
15426
12:11:09,720 --> 12:11:12,720
And that's when you sort of can begin to trust the numbers.
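To make the convergence idea concrete, here is a simplified sketch of the iteration on a tiny made-up graph of three pages. It is not the course's ranking program; the function rank_step and the graph are illustrative, and every page here has at least one outgoing link so no rank leaks away:

```python
def rank_step(ranks, links):
    """One pass: each page splits its rank evenly among the pages it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, outs in links.items():
        share = ranks[page] / len(outs)
        for dest in outs:
            new_ranks[dest] += share
    return new_ranks

# Made-up link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in links}  # everything starts with a goodness of 1

for i in range(100):
    new_ranks = rank_step(ranks, links)
    avg_change = sum(abs(new_ranks[p] - ranks[p]) for p in ranks) / len(ranks)
    ranks = new_ranks
    if i < 3 or i == 99:
        print(i + 1, avg_change)  # large at first, tiny by the end

print(ranks)
```

The total rank stays the same on every pass; it just flows around until what comes into each page matches what goes out.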
15427
12:11:12,720 --> 12:11:15,720
Now there's this one program called spreset,
15428
12:11:15,720 --> 12:11:17,720
which sets all the pages back to one.
15429
12:11:17,720 --> 12:11:19,720
So you can start this over.
15430
12:11:19,720 --> 12:11:23,720
So if you were to spider for a while, run sprank for a while, play around,
15431
12:11:23,720 --> 12:11:26,720
and then you wanted to spider some more and start it over,
15432
12:11:26,720 --> 12:11:29,720
you could say, oh, let's start the page rank completely over.
15433
12:11:29,720 --> 12:11:34,720
Or you could simply take the new pages and watch it adapt.
15434
12:11:34,720 --> 12:11:37,720
Either way, this is just a way to reset all the pages
15435
12:11:37,720 --> 12:11:41,720
to have sort of their initial value of a goodness of 1.0.
15436
12:11:41,720 --> 12:11:43,720
So at some point you run this.
15437
12:11:43,720 --> 12:11:46,720
This part here runs really slow.
15438
12:11:46,720 --> 12:11:49,720
This part runs super fast, like in the blink of an eye.
15439
12:11:49,720 --> 12:11:52,720
This one is pretty fast.
15440
12:11:52,720 --> 12:11:57,720
And then at some point you've got these pages that have, you know, numbers on them.
15441
12:11:57,720 --> 12:11:59,720
They have values on the pages.
15442
12:11:59,720 --> 12:12:03,720
And there's a couple of programs that allow us to visualize that.
15443
12:12:03,720 --> 12:12:06,720
One is the dump which just reads it and checks to see.
15444
12:12:06,720 --> 12:12:09,720
It shows the new page rank, the old page rank,
15445
12:12:09,720 --> 12:12:13,720
and various other things and shows just a way to dump it.
15446
12:12:13,720 --> 12:12:16,720
And then there's this thing that reads the whole thing.
15447
12:12:16,720 --> 12:12:19,720
You say, I'd like to do the top 25, the best.
15448
12:12:19,720 --> 12:12:22,720
It sorts it by page rank and then produces a JavaScript file.
15449
12:12:22,720 --> 12:12:24,720
It has just the numbers in it.
15450
12:12:24,720 --> 12:12:29,720
And then there is some HTML and a visualization library called D3.js,
15451
12:12:29,720 --> 12:12:33,720
which you can read about, that when the HTML starts it reads this
15452
12:12:33,720 --> 12:12:36,720
and has this nice force-directed layout of the page rank.
15453
12:12:36,720 --> 12:12:41,720
And you can hover over things and you can see what page rank you've got.
15454
12:12:41,720 --> 12:12:47,720
And so that is the page rank algorithm that we're going to do.
15455
12:12:47,720 --> 12:12:50,720
And up next we'll do the largest and most complex of these things,
15456
12:12:50,720 --> 12:12:53,720
and that is the email.
15457
12:12:53,720 --> 12:12:56,720
We're going to spider some email, which is about a gigabyte of data.
15458
12:12:56,720 --> 12:13:02,720
Okay?
15459
12:13:02,720 --> 12:13:04,720
We're doing a bit of code walkthrough, and if you want to,
15460
12:13:04,720 --> 12:13:07,720
you can get to the sample code and download it all
15461
12:13:07,720 --> 12:13:09,720
so that you can walk through the code yourself.
15462
12:13:09,720 --> 12:13:12,720
What we're walking through today is the page rank code.
15463
12:13:12,720 --> 12:13:20,720
And so the page rank code, let me get the picture of the page rank code up here.
15464
12:13:20,720 --> 12:13:22,720
Here's the picture of the page rank code.
15465
12:13:22,720 --> 12:13:29,720
And so the page rank code has five chunks of code that are going to run.
15466
12:13:29,720 --> 12:13:32,720
The first one we're going to look at is the spidering code.
15467
12:13:32,720 --> 12:13:36,720
And then we'll do a separate look at these other guys later.
15468
12:13:36,720 --> 12:13:38,720
So the first one we'll look at is spidering.
15469
12:13:38,720 --> 12:13:41,720
And again, it's sort of the same pattern of we've got some stuff on the web,
15470
12:13:41,720 --> 12:13:43,720
in this case web pages.
15471
12:13:43,720 --> 12:13:47,720
We're going to have a database that sort of just captures the stuff.
15472
12:13:47,720 --> 12:13:50,720
It's not really trying to be particularly intelligent,
15473
12:13:50,720 --> 12:13:52,720
but it is going to parse these with Beautiful Soup
15474
12:13:52,720 --> 12:13:55,720
and add things to the database.
15475
12:13:55,720 --> 12:13:56,720
Okay?
15476
12:13:56,720 --> 12:13:59,720
And so then we'll talk about how we run the page rank algorithm
15477
12:13:59,720 --> 12:14:02,720
and then how we visualize the page rank algorithm in a bit.
15478
12:14:02,720 --> 12:14:06,720
Now, the first thing to notice is that I've got to put,
15479
12:14:06,720 --> 12:14:09,720
I put the Beautiful Soup code in right here.
15480
12:14:09,720 --> 12:14:10,720
Okay?
15481
12:14:10,720 --> 12:14:13,720
So this is, you can get this from the bs4.zip file.
15482
12:14:13,720 --> 12:14:15,720
There might need to be a readme.
15483
12:14:15,720 --> 12:14:17,720
No, but there's a readme somewhere.
15484
12:14:17,720 --> 12:14:20,720
But to use Beautiful Soup, you've got to unzip this bs4.zip,
15485
12:14:20,720 --> 12:14:22,720
or you have to install Beautiful Soup for your stuff.
15486
12:14:22,720 --> 12:14:26,720
So I provide this bs4.zip as a quick and dirty way
15487
12:14:26,720 --> 12:14:34,720
if you can't install something for all of the Python users on your system.
15488
12:14:34,720 --> 12:14:35,720
So that's what it's supposed to look like.
15489
12:14:35,720 --> 12:14:37,720
You're supposed to have it unzipped right here in these files.
15490
12:14:37,720 --> 12:14:39,720
And I don't know what dammit.py means.
15491
12:14:39,720 --> 12:14:41,720
That came from Beautiful Soup.
15492
12:14:41,720 --> 12:14:43,720
If you look, it's in their source code.
15493
12:14:43,720 --> 12:14:45,720
So I'm not swearing.
15494
12:14:45,720 --> 12:14:46,720
It's Beautiful Soup.
15495
12:14:46,720 --> 12:14:47,720
People are swearing.
15496
12:14:47,720 --> 12:14:48,720
I'm sorry.
15497
12:14:48,720 --> 12:14:49,720
I apologize.
15498
12:14:49,720 --> 12:14:50,720
Okay.
15499
12:14:50,720 --> 12:14:52,720
So the code we're going to play with the most is,
15500
12:14:52,720 --> 12:14:55,720
and this first one is called spider.py.
15501
12:14:55,720 --> 12:14:57,720
And, you know, we're going to do databases.
15502
12:14:57,720 --> 12:14:59,720
We're going to read URLs.
15503
12:14:59,720 --> 12:15:02,720
And we're going to parse them with Beautiful Soup.
15504
12:15:02,720 --> 12:15:04,720
Okay?
15505
12:15:04,720 --> 12:15:08,720
And so what we're going to do is we're going to make a file.
15506
12:15:08,720 --> 12:15:10,720
Again, this will make spider.sqlite.
15507
12:15:10,720 --> 12:15:13,720
And here we are in pagerank, and ls -l.
15508
12:15:13,720 --> 12:15:16,720
spider.sqlite is not there.
15509
12:15:16,720 --> 12:15:17,720
So it's going to create the database.
15510
12:15:17,720 --> 12:15:19,720
We do create table if not exists.
15511
12:15:19,720 --> 12:15:21,720
We're going to have an integer primary key
15512
12:15:21,720 --> 12:15:23,720
because we're going to do foreign keys here.
15513
12:15:23,720 --> 12:15:24,720
We're going to have a URL.
15514
12:15:24,720 --> 12:15:29,720
And the URL, which is unique, the HTML, which is unique,
15515
12:15:29,720 --> 12:15:30,720
whether we got an error.
15516
12:15:30,720 --> 12:15:33,720
And then for the second half, when we start doing PageRank,
15517
12:15:33,720 --> 12:15:34,720
we're going to have old rank and new rank.
15518
12:15:34,720 --> 12:15:37,720
Because the way PageRank works is it takes the old rank,
15519
12:15:37,720 --> 12:15:39,720
computes the new rank, and then replaces the old rank
15520
12:15:39,720 --> 12:15:42,720
with the new rank, and then does it over and over again.
15521
12:15:42,720 --> 12:15:46,720
And then we're going to have a many-to-many table,
15522
12:15:46,720 --> 12:15:48,720
which points really back.
15523
12:15:48,720 --> 12:15:50,720
So I call this from ID and to ID.
15524
12:15:50,720 --> 12:15:53,720
We did this with some of the Twitter stuff.
15525
12:15:53,720 --> 12:15:56,720
And then this webs is just in case I have more than one web,
15526
12:15:56,720 --> 12:15:58,720
but that really doesn't make much difference.
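As a rough sketch of the three tables just described: the column names and types here are my assumption from the narration, not copied from spider.py, and which columns carry a uniqueness constraint is a guess. It uses an in-memory database so nothing is written to disk.

```python
import sqlite3

# In-memory database for the sketch; the lecture's script writes spider.sqlite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Column names and constraints are assumed from the narration.
cur.executescript("""
CREATE TABLE IF NOT EXISTS Pages (
    id INTEGER PRIMARY KEY,   -- referenced by Links as a foreign key
    url TEXT UNIQUE,
    html TEXT,                -- NULL means "not retrieved yet"
    error INTEGER,            -- NULL means "no error recorded"
    old_rank REAL,
    new_rank REAL
);
CREATE TABLE IF NOT EXISTS Links (
    from_id INTEGER,          -- page the link appears on
    to_id INTEGER             -- page the link points to
);
CREATE TABLE IF NOT EXISTS Webs (
    url TEXT UNIQUE           -- sites the spider is allowed to stay on
);
""")
conn.commit()

table_names = sorted(
    row[0]
    for row in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(table_names)
```

Because every statement is CREATE TABLE IF NOT EXISTS, running it a second time against the same database leaves existing data alone.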
15527
12:15:58,720 --> 12:16:05,720
Okay, so what we're going to do is we're going to select
15528
12:16:05,720 --> 12:16:08,720
ID, URL from pages where HTML is null.
15529
12:16:08,720 --> 12:16:11,720
This is our indicator that a page has not yet been retrieved.
15530
12:16:11,720 --> 12:16:14,720
And error is null, ordered by random.
15531
12:16:14,720 --> 12:16:17,720
And so this is our way, this long bit of stuff.
15532
12:16:17,720 --> 12:16:20,720
And not all this SQL is completely standard,
15533
12:16:20,720 --> 12:16:23,720
but this order by random is really quite nice in SQLite.
15534
12:16:23,720 --> 12:16:28,720
Limit one is just randomly pick a record in this database
15535
12:16:28,720 --> 12:16:32,720
where this WHERE clause is true, and then pick it randomly.
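Here is a small sketch of that query against a throwaway table. The table layout and the example.com URLs are assumptions for illustration; only the WHERE, ORDER BY RANDOM() and LIMIT 1 pieces come from the narration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
    "html TEXT, error INTEGER)"
)
cur.executemany(
    "INSERT INTO Pages (url, html, error) VALUES (?, ?, ?)",
    [
        ("http://example.com/a", "<html>a</html>", None),  # already retrieved
        ("http://example.com/b", None, None),               # still waiting
        ("http://example.com/c", None, -1),                 # failed earlier
    ],
)

# Pick one page at random among those with no HTML and no recorded error.
cur.execute(
    "SELECT id, url FROM Pages "
    "WHERE html IS NULL AND error IS NULL "
    "ORDER BY RANDOM() LIMIT 1"
)
row = cur.fetchone()
print(row)
```

Only one row qualifies in this tiny table, so the random pick always lands on it; with many unretrieved pages, each run could return a different one. If nothing qualifies, fetchone() returns None.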
15536
12:16:32,720 --> 12:16:34,720
And then we're going to fetch a row.
15537
12:16:34,720 --> 12:16:41,720
And if that row is none, right, we're going to ask for a new web,
15538
12:16:41,720 --> 12:16:45,720
a starting URL, and this is going to fire things up,
15539
12:16:45,720 --> 12:16:47,720
and we're going to insert this new URL.
15540
12:16:47,720 --> 12:16:49,720
Otherwise, we're going to restart.
15541
12:16:49,720 --> 12:16:51,720
We have a row to start with.
15542
12:16:51,720 --> 12:16:53,720
And otherwise, we're going to sort of prime this
15543
12:16:53,720 --> 12:16:58,720
by inserting the URL we start with, insert into it.
15544
12:16:58,720 --> 12:17:00,720
If you enter it, it just goes to drchuck.com,
15545
12:17:00,720 --> 12:17:02,720
which is a fine place to start.
15546
12:17:02,720 --> 12:17:07,720
And then what we do is we, what this does is its page rank,
15547
12:17:07,720 --> 12:17:12,720
is it uses this webs table to limit the links.
15548
12:17:12,720 --> 12:17:16,720
It only does links to the sites that you tell it to do links.
15549
12:17:16,720 --> 12:17:20,720
And probably the best for your page rank is to stick with one site.
15550
12:17:20,720 --> 12:17:23,720
Otherwise, you will just never find the same site again
15551
12:17:23,720 --> 12:17:26,720
if you let this wander the web aimlessly.
15552
12:17:26,720 --> 12:17:29,720
And so I generally run with one web,
15553
12:17:29,720 --> 12:17:32,720
which this should be probably called web sites.
15554
12:17:32,720 --> 12:17:36,720
And I pull in all the data, and I read this in,
15555
12:17:36,720 --> 12:17:39,720
and I just make myself a list of the legit URLs,
15556
12:17:39,720 --> 12:17:41,720
and you'll see how we use that.
15557
12:17:41,720 --> 12:17:45,720
And the web is what are the legit places we're going to go,
15558
12:17:45,720 --> 12:17:48,720
because we're going to go through a loop,
15559
12:17:48,720 --> 12:17:51,720
ask for how many pages,
15560
12:17:51,720 --> 12:17:53,720
and we're going to look for a null page.
15561
12:17:53,720 --> 12:17:57,720
Again, we're using that random order by random limit one.
15562
12:17:57,720 --> 12:18:04,720
And then we're going to have a, we're going to grab one.
15563
12:18:04,720 --> 12:18:08,720
We're going to get the from ID, which is the page we're linking from,
15564
12:18:08,720 --> 12:18:11,720
and then the URL.
15565
12:18:11,720 --> 12:18:13,720
Otherwise, there are no unretrieved pages.
15566
12:18:13,720 --> 12:18:17,720
And so the from ID is when we start adding links
15567
12:18:17,720 --> 12:18:21,720
to our page links, we've got to know the page we started with.
15568
12:18:21,720 --> 12:18:23,720
And that's the primary key.
15569
12:18:23,720 --> 12:18:25,720
We'll see how that primary key is set in a second.
15570
12:18:25,720 --> 12:18:27,720
So otherwise, we have none.
15571
12:18:27,720 --> 12:18:30,720
And we're going to print this from ID,
15572
12:18:30,720 --> 12:18:33,720
the from ID and the URL that we're working with,
15573
12:18:33,720 --> 12:18:37,720
just to make sure we're going to wipe out all of the links.
15574
12:18:37,720 --> 12:18:40,720
Because it's unretrieved, we're going to wipe out from the links.
15575
12:18:40,720 --> 12:18:45,720
The links is the connection table that connects from pages back to pages.
15576
12:18:45,720 --> 12:18:47,720
And so we're going to wipe out.
15577
12:18:47,720 --> 12:18:50,720
So we're going to go grab this URL.
15578
12:18:50,720 --> 12:18:52,720
We're going to read it.
15579
12:18:52,720 --> 12:18:56,720
We're not decoding it because we're using Beautiful Soup,
15580
12:18:56,720 --> 12:19:02,720
which compensates for the UTF-8.
15581
12:19:02,720 --> 12:19:06,720
And so we can ask, this is the HTML error code.
15582
12:19:06,720 --> 12:19:08,720
And we checked 200 is a good error.
15583
12:19:08,720 --> 12:19:12,720
And if we get a bad error, we're going to say this error on page.
15584
12:19:12,720 --> 12:19:14,720
We're going to set that error.
15585
12:19:14,720 --> 12:19:15,720
We're going to take pages.
15586
12:19:15,720 --> 12:19:18,720
That way, we don't retrieve it ever again.
15587
12:19:18,720 --> 12:19:24,720
We basically check to see if the content type is text HTML.
15588
12:19:24,720 --> 12:19:27,720
Remember, in HTTP, you get the content type.
15589
12:19:27,720 --> 12:19:28,720
We only want to retrieve it.
15590
12:19:28,720 --> 12:19:31,720
We only want to look for the links on HTML pages.
15591
12:19:31,720 --> 12:19:33,720
And so we wipe that guy out.
15592
12:19:33,720 --> 12:19:38,720
If we get a JPEG or something like that, we're not going to retrieve JPEG.
15593
12:19:38,720 --> 12:19:40,720
And then we commit and continue.
15594
12:19:40,720 --> 12:19:43,720
So these are kind of like, oh, those are pages we didn't want to mess with.
15595
12:19:43,720 --> 12:19:47,720
And then we print out how many characters we got and parse it.
15596
12:19:47,720 --> 12:19:50,720
And we do this whole thing in a try-except block,
15597
12:19:50,720 --> 12:19:52,720
because a lot of things can go wrong here.
15598
12:19:52,720 --> 12:19:54,720
It's a bit of a long try-except block.
15599
12:19:54,720 --> 12:19:59,720
Keyboard interrupt, that's what happens if I hit CTRL-C at my keyboard
15600
12:19:59,720 --> 12:20:01,720
or CTRL-Z on Windows.
15601
12:20:01,720 --> 12:20:05,720
Some other exception probably means Beautiful Soup blew up
15602
12:20:05,720 --> 12:20:06,720
or something else blew up.
15603
12:20:06,720 --> 12:20:13,720
And so we indicate with the error equals negative 1 for that URL
15604
12:20:13,720 --> 12:20:15,720
so we don't retrieve it again.
15605
12:20:15,720 --> 12:20:21,720
At this point, at line 103, we have got the HTML for that URL.
15606
12:20:21,720 --> 12:20:23,720
And so we're going to insert it in.
15607
12:20:23,720 --> 12:20:25,720
And we're going to set the page rank to 1.
15608
12:20:25,720 --> 12:20:30,720
So the way page rank works is it gives all the pages some normal value.
15609
12:20:30,720 --> 12:20:32,720
And then it alters that.
15610
12:20:32,720 --> 12:20:33,720
We'll see that in a bit.
15611
12:20:33,720 --> 12:20:36,720
So it sets it in with 1.
15612
12:20:36,720 --> 12:20:40,720
We're going to insert or ignore.
15613
12:20:40,720 --> 12:20:44,720
That's just in case the pages is not there.
15614
12:20:44,720 --> 12:20:46,720
And then we're going to do an update.
15615
12:20:46,720 --> 12:20:48,720
And that's kind of doing the same thing twice,
15616
12:20:48,720 --> 12:20:51,720
just sort of doubly making sure if it's already there,
15617
12:20:51,720 --> 12:20:54,720
this or ignore will cause this to do nothing.
15618
12:20:54,720 --> 12:20:56,720
And the update will cause us to retain it.
15619
12:20:56,720 --> 12:20:59,720
And then we commit it so that if we do selects later,
15620
12:20:59,720 --> 12:21:01,720
we get that information.
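The insert-or-ignore followed by an update can be sketched like this. The helper name store_html, the table layout and the example.com URL are assumptions for illustration, not the lecture's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
    "html TEXT, new_rank REAL)"
)

def store_html(cur, url, html):
    """Hypothetical helper: make sure the row exists, then fill in its HTML."""
    # Creates the row if the URL is new; does nothing if it is already there.
    cur.execute(
        "INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES (?, NULL, 1.0)",
        (url,),
    )
    # Runs either way, so the HTML ends up stored in both cases.
    cur.execute("UPDATE Pages SET html = ? WHERE url = ?", (html, url))

store_html(cur, "http://example.com/", "<html>first</html>")
store_html(cur, "http://example.com/", "<html>second</html>")
conn.commit()  # commit so that later SELECTs see the data

rows = cur.execute("SELECT id, url, html, new_rank FROM Pages").fetchall()
print(rows)
```

Calling it twice for the same URL leaves a single row holding the most recent HTML, which is the "doing the same thing twice" safety described above.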
15621
12:21:01,720 --> 12:21:04,720
Now this code is similar.
15622
12:21:04,720 --> 12:21:07,720
Remember, we used Beautiful Soup to pull out all the anchor tags.
15623
12:21:07,720 --> 12:21:08,720
We have a for loop.
15624
12:21:08,720 --> 12:21:10,720
We pull out the href.
15625
12:21:10,720 --> 12:21:13,720
And you'll see this code's a little more complex
15626
12:21:13,720 --> 12:21:15,720
than some of the earlier stuff,
15627
12:21:15,720 --> 12:21:17,720
because it has to deal with the real nastiness
15628
12:21:17,720 --> 12:21:19,720
or imperfection of the web.
15629
12:21:19,720 --> 12:21:21,720
And so we're going to use urlparse,
15630
12:21:21,720 --> 12:21:26,720
which is actually part of the urllib code.
15631
12:21:26,720 --> 12:21:29,720
And that's going to break the URL into pieces.
15632
12:21:29,720 --> 12:21:30,720
Come back.
15633
12:21:30,720 --> 12:21:32,720
We use urlparse.
15634
12:21:32,720 --> 12:21:36,720
We have the scheme, which is HTTP or HTTPS.
15635
12:21:36,720 --> 12:21:39,720
This resolves relative references,
15636
12:21:39,720 --> 12:21:41,720
it resolves relative references
15637
12:21:41,720 --> 12:21:44,720
by taking the current URL and hooking it up.
15638
12:21:44,720 --> 12:21:47,720
urljoin knows about slashes and all those other things.
15639
12:21:47,720 --> 12:21:49,720
We check to see if there's an anchor,
15640
12:21:49,720 --> 12:21:51,720
the pound sign at the end of a URL,
15641
12:21:51,720 --> 12:21:56,720
and we throw away everything from the anchor onward.
15642
12:21:56,720 --> 12:22:00,720
If we have a JPEG or a PNG or a GIF,
15643
12:22:00,720 --> 12:22:01,720
we are going to skip it.
15644
12:22:01,720 --> 12:22:03,720
We don't want to bother with that.
15645
12:22:03,720 --> 12:22:04,720
We're looking through links now.
15646
12:22:04,720 --> 12:22:06,720
We're looking at all the links.
15647
12:22:06,720 --> 12:22:09,720
And if we have a slash at the end,
15648
12:22:09,720 --> 12:22:12,720
we're going to chop off the slash by saying minus one.
15649
12:22:12,720 --> 12:22:15,720
And so this is just kind of nasty choppage
15650
12:22:15,720 --> 12:22:18,720
and throwing away the URLs that we're going through a page,
15651
12:22:18,720 --> 12:22:20,720
and we have a bunch that we don't like
15652
12:22:20,720 --> 12:22:23,720
or we have to clean them up or whatever.
15653
12:22:23,720 --> 12:22:25,720
And now, and we've made them absolute by doing this,
15654
12:22:25,720 --> 12:22:27,720
it's an absolute URL.
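The clean-up steps just described can be sketched as one function. The name clean_link, the exact list of skipped suffixes and the example.com URLs are assumptions; urlparse and urljoin are the real functions from urllib.parse.

```python
from urllib.parse import urljoin, urlparse

# Image types to skip; the exact list is an assumption based on the narration.
SKIPPED_SUFFIXES = (".png", ".jpg", ".gif")

def clean_link(href, page_url):
    """Return an absolute, tidied URL, or None if the link should be skipped."""
    if href is None:
        return None
    # No scheme means a relative reference: hook it onto the current page's URL.
    if len(urlparse(href).scheme) < 1:
        href = urljoin(page_url, href)
    # Throw away the anchor (the pound sign) and everything after it.
    pos = href.find("#")
    if pos > 1:
        href = href[:pos]
    # Skip images.
    if href.endswith(SKIPPED_SUFFIXES):
        return None
    # Chop off a trailing slash.
    if href.endswith("/"):
        href = href[:-1]
    if len(href) < 1:
        return None
    return href

page = "http://example.com/index.html"
cleaned_relative = clean_link("about/#team", page)
cleaned_image = clean_link("logo.png", page)
cleaned_absolute = clean_link("http://example.com/blog/", page)
print(cleaned_relative, cleaned_image, cleaned_absolute)
```

This sketch only handles the cases mentioned above; a real crawl tends to turn up more odd links, which is why this kind of code grows one fix at a time.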
15655
12:22:27,720 --> 12:22:30,720
This is just, you write this slowly but surely
15656
12:22:30,720 --> 12:22:32,720
when your code blows up and you start it over
15657
12:22:32,720 --> 12:22:35,720
and start it over and start over.
15658
12:22:35,720 --> 12:22:38,720
Then what we do is we check to see through all the webs.
15659
12:22:38,720 --> 12:22:41,720
Remember, those were the URLs that we're willing to stay with
15660
12:22:41,720 --> 12:22:43,720
and usually it's just one.
15661
12:22:43,720 --> 12:22:46,720
If this would link off the sites,
15662
12:22:46,720 --> 12:22:48,720
of the sites we're interested in, we're going to skip it.
15663
12:22:48,720 --> 12:22:51,720
We are not interested in links that leave the site.
15664
12:22:51,720 --> 12:22:54,720
So this is like link that left the site, skip it.
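A minimal sketch of that check, assuming the allowed sites are held in a plain list of URL prefixes. The function name stays_on_site and the prefix comparison are my choices for illustration; the lecture's code may test this differently.

```python
def stays_on_site(href, webs):
    """True if href begins with one of the allowed site URLs."""
    return any(href.startswith(web) for web in webs)

# Usually there is just one allowed site.
webs = ["http://example.com"]

inside = stays_on_site("http://example.com/blog", webs)
outside = stays_on_site("http://other.example.org/page", webs)
print(inside, outside)
```

A link for which this returns False is simply skipped and never reaches the pages table.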
15665
12:22:54,720 --> 12:22:58,720
But now we finally here at line 132,
15666
12:22:58,720 --> 12:23:01,720
we are ready to put this into pages,
15667
12:23:01,720 --> 12:23:05,720
URL and the HTML, and it's all good, right?
15668
12:23:05,720 --> 12:23:11,720
And that one's going to be null right there
15669
12:23:11,720 --> 12:23:14,720
because we haven't retrieved the HTML.
15670
12:23:14,720 --> 12:23:18,720
This is null because this is a page we're going to retrieve,
15671
12:23:18,720 --> 12:23:20,720
we're giving the page rank of one,
15672
12:23:20,720 --> 12:23:23,720
and we're giving it no HTML and that way it'll be retrieved.
15673
12:23:23,720 --> 12:23:27,720
And then we commit that, okay?
15674
12:23:27,720 --> 12:23:29,720
And then we want to get the ID.
15675
12:23:29,720 --> 12:23:33,720
So we could have done this with one way or another,
15676
12:23:33,720 --> 12:23:35,720
but we're going to do a select to say,
15677
12:23:35,720 --> 12:23:38,720
hey, what was the ID that either was already there
15678
12:23:38,720 --> 12:23:41,720
or was just created?
15679
12:23:41,720 --> 12:23:44,720
And we grab that with a fetch one and say,
15680
12:23:44,720 --> 12:23:47,720
retrieve to ID, and now we're going to put a link in,
15681
12:23:47,720 --> 12:23:51,720
insert or ignore into links, from ID and to ID,
15682
12:23:51,720 --> 12:23:54,720
which is the primary key of the page
15683
12:23:54,720 --> 12:23:56,720
that we're going through and looking for links.
15684
12:23:56,720 --> 12:24:00,720
To ID is the link that we just created and away we run.
15685
12:24:00,720 --> 12:24:03,720
So it's going to go and go and go and go.
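Putting those last steps together as a sketch: queue the linked page with no HTML, look up its ID, then record the link. The helper name record_link, the table layout and the example.com URLs are assumptions, not the lecture's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
    "html TEXT, new_rank REAL)"
)
cur.execute("CREATE TABLE Links (from_id INTEGER, to_id INTEGER)")

def record_link(cur, from_id, href):
    """Hypothetical helper: queue href for retrieval and link from_id to it."""
    # NULL html marks the page as not yet retrieved; rank starts at 1.
    cur.execute(
        "INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES (?, NULL, 1.0)",
        (href,),
    )
    # The row was either just created or already there; either way, get its id.
    cur.execute("SELECT id FROM Pages WHERE url = ? LIMIT 1", (href,))
    row = cur.fetchone()
    if row is None:
        return None
    to_id = row[0]
    cur.execute(
        "INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)",
        (from_id, to_id),
    )
    return to_id

# The page being parsed gets id 1.
cur.execute(
    "INSERT INTO Pages (url, html, new_rank) VALUES (?, ?, 1.0)",
    ("http://example.com", "<html>home</html>"),
)
first_to_id = record_link(cur, 1, "http://example.com/about")
second_to_id = record_link(cur, 1, "http://example.com/about")
conn.commit()

page_count = cur.execute("SELECT COUNT(*) FROM Pages").fetchone()[0]
print(first_to_id, second_to_id, page_count)
```

Seeing the same link twice reuses the existing page row rather than creating a new one.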
15686
12:24:03,720 --> 12:24:10,720
Let's go look at the create statement up here
15687
12:24:10,720 --> 12:24:13,720
from ID and to ID right there, okay.
15688
12:24:13,720 --> 12:24:19,720
So let's run it.
15689
12:24:19,720 --> 12:24:23,720
Python 3, oops.
15690
12:24:23,720 --> 12:24:32,720
python3 spider.py.
15691
12:24:32,720 --> 12:24:36,720
So it's fresh and so it wants a URL with which to start,
15692
12:24:36,720 --> 12:24:40,720
and I'll just start with my favorite website,
15693
12:24:40,720 --> 12:24:42,720
www.drchuck.com.
15694
12:24:42,720 --> 12:24:45,720
Now this basically, this first one you put in,
15695
12:24:45,720 --> 12:24:49,720
it's going to stay on this website for a while, okay.
15696
12:24:49,720 --> 12:24:52,720
So I'll hit enter and let's just grab like,
15697
12:24:52,720 --> 12:24:55,720
let's grab one page just for yucks.
15698
12:24:55,720 --> 12:25:00,720
Okay, so it grabbed that and it printed out that it got
15699
12:25:00,720 --> 12:25:11,720
8,545 characters and it printed out that it got six links.
15700
12:25:11,720 --> 12:25:21,720
So if I go to this and open database,
15701
12:25:21,720 --> 12:25:28,720
and I go to code 3 and I go to page rank and I look at this,
15702
12:25:28,720 --> 12:25:31,720
oh, let me get out so it closes.
15703
12:25:31,720 --> 12:25:34,720
So notice this SQLite journal,
15704
12:25:34,720 --> 12:25:38,720
that means it's not done closing so I'm going to get out of this
15705
12:25:38,720 --> 12:25:42,720
by pressing enter and so you'll notice now that that journal file went away
15706
12:25:42,720 --> 12:25:45,720
otherwise we would not be getting the final data.
15707
12:25:45,720 --> 12:25:46,720
There we go.
15708
12:25:46,720 --> 12:25:50,720
Okay, so webs, let's take a look at the data.
15709
12:25:50,720 --> 12:25:54,720
Webs has just one URL,
15710
12:25:54,720 --> 12:25:57,720
that's the URLs that we're allowing ourselves to look at.
15711
12:25:57,720 --> 12:25:59,720
You can put more than one in here if you want
15712
12:25:59,720 --> 12:26:01,720
but most people will just leave this as one.
15713
12:26:01,720 --> 12:26:07,720
Pages, so we got this first one and we retrieved this
15714
12:26:07,720 --> 12:26:12,720
as the HTML of it and we found six other URLs in there
15715
12:26:12,720 --> 12:26:14,720
that are drchuck.com URLs.
15716
12:26:14,720 --> 12:26:16,720
There was lots of other URLs in there
15717
12:26:16,720 --> 12:26:21,720
but there were only five other ones that we found.
15718
12:26:21,720 --> 12:26:24,720
And what we'll find is if we go to links,
15719
12:26:24,720 --> 12:26:27,720
we'll see that page one, links to two, links to three,
15720
12:26:27,720 --> 12:26:29,720
links to four, links to five, links to six
15721
12:26:29,720 --> 12:26:31,720
because the links is just a many to many table.
15722
12:26:31,720 --> 12:26:34,720
So page one points to page two,
15723
12:26:34,720 --> 12:26:37,720
page one to three, page one to five, okay?
15724
12:26:37,720 --> 12:26:42,720
So that's what happens when we have the first page.
15725
12:26:42,720 --> 12:26:47,720
So let's retrieve one more page.
15726
12:26:47,720 --> 12:26:51,720
Now it's, we could have started a new crawl
15727
12:26:51,720 --> 12:26:54,720
but we're just gonna, it's gonna stay on drchuck.com
15728
12:26:54,720 --> 12:26:56,720
and I'll just ask for one more page.
15729
12:26:56,720 --> 12:26:58,720
And so now it went and grabbed.
15730
12:26:58,720 --> 12:27:00,720
It randomly picked among these null guys
15731
12:27:00,720 --> 12:27:02,720
and I'm gonna hit enter to close it
15732
12:27:02,720 --> 12:27:04,720
and then I'll refresh this.
15733
12:27:04,720 --> 12:27:09,720
And oh, so it looks like we retrieved OBI sample
15734
12:27:09,720 --> 12:27:11,720
and we didn't get any new links.
15735
12:27:11,720 --> 12:27:14,720
And so the links page, no, we didn't get any new links.
15736
12:27:14,720 --> 12:27:18,720
So that page, whatever that was, OBI sample
15737
12:27:18,720 --> 12:27:19,720
had no external links.
15738
12:27:19,720 --> 12:27:25,720
So let's do another one.
15739
12:27:25,720 --> 12:27:29,720
Oh, one more page.
15740
12:27:29,720 --> 12:27:32,720
So that one had 15 links, so let's take a look now.
15741
12:27:32,720 --> 12:27:35,720
So now we have 15 pages.
15742
12:27:35,720 --> 12:27:38,720
It picked this one to do, right?
15743
12:27:38,720 --> 12:27:40,720
And now it added 15 more pages
15744
12:27:40,720 --> 12:27:44,720
and then if you look at links you will see that page four,
15745
12:27:44,720 --> 12:27:47,720
which is one it just retrieved, links back to page one.
15746
12:27:47,720 --> 12:27:49,720
So now we're seeing this is where the page rank
15747
12:27:49,720 --> 12:27:50,720
is gonna be cool.
15748
12:27:50,720 --> 12:27:56,720
Four links to one, four links to whatever, away we go, right?
15749
12:27:56,720 --> 12:27:59,720
One goes to four, four goes to one.
15750
12:27:59,720 --> 12:28:02,720
I should have probably put a uniqueness constraint on that.
15751
12:28:02,720 --> 12:28:06,720
It's not supposed to duplicate that.
15752
12:28:06,720 --> 12:28:10,720
Okay, so let's run this a bunch of times now.
15753
12:28:10,720 --> 12:28:18,720
So let's just run it 100 times for 100 pages.
15754
12:28:18,720 --> 12:28:21,720
It'll take a minute.
15755
12:28:21,720 --> 12:28:24,720
So you'll see it's like freaking out on certain pages
15756
12:28:24,720 --> 12:28:27,720
and not parsing them.
15757
12:28:27,720 --> 12:28:32,720
It's finding its way into my blog.
15758
12:28:32,720 --> 12:28:34,720
It's finding like 27 links.
15759
12:28:34,720 --> 12:28:39,720
This table is growing wildly at this point.
15760
12:28:39,720 --> 12:28:41,720
It's gonna take us a while before we get to 100.
15761
12:28:41,720 --> 12:28:42,720
It's kind of slow.
15762
12:28:42,720 --> 12:28:45,720
Now the interesting thing is I can hit control C
15763
12:28:45,720 --> 12:28:48,720
at any point in time.
15764
12:28:48,720 --> 12:28:49,720
Right?
15765
12:28:49,720 --> 12:28:51,720
And so that blew up.
15766
12:28:51,720 --> 12:28:53,720
But it's okay because the data is still there
15767
12:28:53,720 --> 12:28:55,720
and if we go back to pages, for example,
15768
12:28:55,720 --> 12:28:58,720
and we refresh our data, we see we got a ton of stuff.
15769
12:28:58,720 --> 12:29:01,720
And this will restart and all the things,
15770
12:29:01,720 --> 12:29:04,720
so if we sort this, I sorted that by HTML,
15771
12:29:04,720 --> 12:29:06,720
you see that there's lots of files that we've got
15772
12:29:06,720 --> 12:29:08,720
and it's never gonna retrieve those again
15773
12:29:08,720 --> 12:29:10,720
because those have HTML.
15774
12:29:10,720 --> 12:29:14,720
So then I can run this thing again and start it up.
15775
12:29:14,720 --> 12:29:17,720
And when I say control C, your computer might go down,
15776
12:29:17,720 --> 12:29:18,720
your network might go down.
15777
12:29:18,720 --> 12:29:21,720
There's all kinds of things that might happen
15778
12:29:21,720 --> 12:29:22,720
and you just pick up where it leaves off.
15779
12:29:22,720 --> 12:29:24,720
It just picks up where it leaves off
15780
12:29:24,720 --> 12:29:26,720
and that's what's nice about this.
15781
12:29:26,720 --> 12:29:27,720
Okay?
15782
12:29:27,720 --> 12:29:32,720
So that's pretty much how this works.
15783
12:29:32,720 --> 12:29:35,720
We've got this part running.
15784
12:29:35,720 --> 12:29:37,720
We're seeing it flow into spider.sqlite.
15785
12:29:37,720 --> 12:29:40,720
We're seeing that we can start this and restart this.
15786
12:29:40,720 --> 12:29:44,720
And so what I'll do is I will come back in the next video
15787
12:29:44,720 --> 12:29:47,720
and show you how all these things work together
15788
12:29:47,720 --> 12:29:51,720
and then how we actually do the page rank.
15789
12:29:51,720 --> 12:29:55,720
So thanks again for listening and see you in the next video.
15790
12:30:03,720 --> 12:30:04,720
We're picking up in the middle here
15791
12:30:04,720 --> 12:30:09,720
where we are running a simple spider that's retrieving data
15792
12:30:09,720 --> 12:30:14,720
and putting it into running this spider.py file
15793
12:30:14,720 --> 12:30:17,720
and it's cruising around and doing things.
15794
12:30:17,720 --> 12:30:19,720
And the beauty of any of these spider processes
15795
12:30:19,720 --> 12:30:23,720
is I can stop any time and just hit control C.
15796
12:30:23,720 --> 12:30:29,720
And so we take a look at the spider.sqlite file
15797
12:30:29,720 --> 12:30:30,720
and retrieve it.
15798
12:30:30,720 --> 12:30:33,720
And it looks like we've got 302 pages.
15799
12:30:33,720 --> 12:30:36,720
I don't know how many we've got retrieved.
15800
12:30:36,720 --> 12:30:37,720
70.
15801
12:30:37,720 --> 12:30:39,720
Okay, there we go.
15802
12:30:39,720 --> 12:30:43,720
We've got about 100.
15803
12:30:43,720 --> 12:30:45,720
Oh wait, I'm looking for the wrong thing.
15804
12:30:45,720 --> 12:30:48,720
No, no, no, no, no.
15805
12:30:48,720 --> 12:30:51,720
Yeah, we've got about 107 pages.
15806
12:30:51,720 --> 12:30:54,720
So what we're going to do now with 107 pages
15807
12:30:54,720 --> 12:30:59,720
is we are going to run the page rank algorithm.
15808
12:30:59,720 --> 12:31:01,720
Okay, so let's take a look at that code.
15809
12:31:01,720 --> 12:31:06,720
So the idea of page rank,
15810
12:31:06,720 --> 12:31:09,720
we're going to run this page rank algorithm.
15811
12:31:09,720 --> 12:31:12,720
The spreset just resets the page rank
15812
12:31:12,720 --> 12:31:15,720
and sprank runs as many iterations of page rank.
15813
12:31:15,720 --> 12:31:20,720
So the basic idea is that if you were to look at the links here,
15814
12:31:20,720 --> 12:31:25,720
we think of page 1 pointing to page 2
15815
12:31:25,720 --> 12:31:28,720
gives some of page 1's love to page 2.
15816
12:31:28,720 --> 12:31:33,720
Page 4 has some value that it gives to page 1.
15817
12:31:33,720 --> 12:31:37,720
You go on and page 2 gives love to page 46
15818
12:31:37,720 --> 12:31:39,720
over and over and over again.
15819
12:31:39,720 --> 12:31:42,720
But the problem is that how good is page 1
15820
12:31:42,720 --> 12:31:47,720
and how much positive karma does it give to page 2?
15821
12:31:47,720 --> 12:31:54,720
And so what happens is we start by giving every page a rank of 1.
15822
12:31:54,720 --> 12:31:57,720
We say, look, everybody starts out equal.
15823
12:31:57,720 --> 12:31:59,720
But then what we do is we divide up
15824
12:31:59,720 --> 12:32:02,720
in one iteration of the page rank algorithm,
15825
12:32:02,720 --> 12:32:06,720
we divide up the goodness of a page across its outbound links
15826
12:32:06,720 --> 12:32:11,720
and then accumulate that and that becomes the next rank.
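One iteration of that idea can be sketched on a tiny made-up graph. The function name pagerank_step and the three-page graph are assumptions for illustration; this is a simplified pass and leaves out anything the real script does beyond splitting and accumulating.

```python
def pagerank_step(prev_ranks, links):
    """One pass: each page splits its rank evenly across its outbound links."""
    # Every page starts the new round at zero and accumulates what it is given.
    next_ranks = {node: 0.0 for node in prev_ranks}
    for node, old_rank in prev_ranks.items():
        # Pages this node links to (ignoring links to itself or unknown pages).
        give_ids = [
            to_id
            for (from_id, to_id) in links
            if from_id == node and to_id != node and to_id in prev_ranks
        ]
        if not give_ids:
            continue
        amount = old_rank / len(give_ids)
        for to_id in give_ids:
            next_ranks[to_id] += amount
    return next_ranks

# Page 1 links to pages 2 and 3; pages 2 and 3 each link back to page 1.
links = [(1, 2), (1, 3), (2, 1), (3, 1)]
ranks = {1: 1.0, 2: 1.0, 3: 1.0}  # everybody starts out equal
ranks = pagerank_step(ranks, links)
print(ranks)
```

After one pass, page 1 has collected a full share from each of pages 2 and 3, while pages 2 and 3 each got half of page 1's rank, so page 1 now ranks highest.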
15827
12:32:11,720 --> 12:32:19,720
So let's take a look at the code for the page rank algorithm.
15828
12:32:19,720 --> 12:32:21,720
So this is pretty simple.
15829
12:32:21,720 --> 12:32:25,720
It only imports sqlite3 because it's really doing everything in the database.
15830
12:32:25,720 --> 12:32:30,720
It's going to be updating these columns right here in the database.
15831
12:32:30,720 --> 12:32:36,720
So we're going to do some things here to speed this up.
15832
12:32:36,720 --> 12:32:41,720
This rank runs, if you're thinking of Google, this rank runs slowly
15833
12:32:41,720 --> 12:32:45,720
and is going to run continuously to keep updating these things.
15834
12:32:45,720 --> 12:32:50,720
So the first thing I do is I read in all of the from IDs from the links.
15835
12:32:50,720 --> 12:32:54,720
Select distinct throws out any duplicates.
15836
12:32:54,720 --> 12:32:58,720
And so I have all the from IDs,
15837
12:32:58,720 --> 12:33:02,720
which are all the pages that have links to other pages
15838
12:33:02,720 --> 12:33:05,720
because all the pages are in pages,
15839
12:33:05,720 --> 12:33:09,720
but in links to have a from ID, you have to also have a to ID.
15840
12:33:09,720 --> 12:33:15,720
And so we're also going to look at the pages that receive page rank
15841
12:33:15,720 --> 12:33:17,720
and we're kind of precaching this stuff.
15842
12:33:17,720 --> 12:33:21,720
So we're going to do a select distinct of from ID and to ID
15843
12:33:21,720 --> 12:33:23,720
and loop through that group of things.
15844
12:33:23,720 --> 12:33:27,720
And we're making a links list here.
15845
12:33:27,720 --> 12:33:31,720
And so we're saying if the from ID is the same as the to ID,
15846
12:33:31,720 --> 12:33:36,720
we're not interested if the from ID is not already in my from IDs
15847
12:33:36,720 --> 12:33:37,720
that I've got.
15848
12:33:37,720 --> 12:33:38,720
I'm going to skip it.
15849
12:33:38,720 --> 12:33:40,720
If the to ID is not in the from ID,
15850
12:33:40,720 --> 12:33:44,720
meaning that this is a to ID that's not also,
15851
12:33:44,720 --> 12:33:47,720
we don't want links that point off to nowhere
15852
12:33:47,720 --> 12:33:49,720
or point to pages that we haven't retrieved yet.
15853
12:33:49,720 --> 12:33:51,720
And that's what this is saying.
15854
12:33:51,720 --> 12:33:53,720
So this is really going to give us,
15855
12:33:53,720 --> 12:33:58,720
it's a filter on the from IDs and the to IDs from the links table
15856
12:33:58,720 --> 12:34:02,720
so that it only are the links that point to another page we've already retrieved.
15857
12:34:02,720 --> 12:34:07,720
And then we're going to keep track of the entire superset of to IDs,
15858
12:34:07,720 --> 12:34:09,720
the destination IDs.
15859
12:34:09,720 --> 12:34:10,720
And I'm just putting these all in lists
15860
12:34:10,720 --> 12:34:13,720
so that I don't have to hit the database so hard.
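That filtering step can be sketched in plain Python, assuming the distinct (from ID, to ID) pairs have already been read out of the links table. The function name usable_links and the sample pairs are made up for illustration, and I use a set for the from IDs where the narration mentions lists.

```python
def usable_links(all_links):
    """Keep only links whose both ends are pages that have outbound links."""
    # Every page that appears as a source has been retrieved and parsed.
    from_ids = {from_id for (from_id, _to_id) in all_links}
    links = []
    to_ids = []
    for from_id, to_id in all_links:
        if from_id == to_id:
            continue  # a page linking to itself is not interesting
        if to_id not in from_ids:
            continue  # points to a page we have not retrieved yet
        links.append((from_id, to_id))
        if to_id not in to_ids:
            to_ids.append(to_id)
    return from_ids, to_ids, links

all_links = [(1, 2), (2, 1), (1, 1), (2, 5), (1, 3), (3, 1)]
from_ids, to_ids, links = usable_links(all_links)
print(from_ids, to_ids, links)
```

The self-link (1, 1) and the link to page 5, which never appears as a source, are both dropped; everything else is held in memory so the ranking loop does not need to query the database again.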
15861
12:34:13,720 --> 12:34:16,720
Okay, so this is getting what's called the strongly connected component,
15862
12:34:16,720 --> 12:34:18,720
meaning that any of these IDs,
15863
12:34:18,720 --> 12:34:23,720
there is a path from every ID to every other ID eventually.
15864
12:34:23,720 --> 12:34:27,720
So that's called the strongly connected component in graph theory.
15865
12:34:27,720 --> 12:34:30,720
Then what we're going to do is we're going to grab the,
15866
12:34:30,720 --> 12:34:34,720
we're going to select new rank from pages
15867
12:34:34,720 --> 12:34:40,720
where for all the from IDs, right?
15868
12:34:40,720 --> 12:34:45,720
And so we're going to have a dictionary that's based on the ID,
15869
12:34:45,720 --> 12:34:49,720
the primary key, that's what node is, equals the rank.
15870
12:34:49,720 --> 12:34:52,720
And so if we look at our database,
15871
12:34:52,720 --> 12:34:57,720
that means that for the part of the strongly connected component in links,
15872
12:34:57,720 --> 12:35:00,720
we're going to grab this number and stick it into a dictionary
15873
12:35:00,720 --> 12:35:06,720
based on the primary key of this,
15874
12:35:06,720 --> 12:35:09,720
based on the primary key, this number right here.
15875
12:35:09,720 --> 12:35:12,720
So we're going to have a dictionary that's this map to that.
15876
12:35:12,720 --> 12:35:15,720
Again, we want to do this as fast as possible.
15877
12:35:15,720 --> 12:35:17,720
Now we're only doing one iteration at the beginning,
15878
12:35:17,720 --> 12:35:21,720
so it asks how many times you want to run it, okay?
15879
12:35:21,720 --> 12:35:25,720
And so we just make an integer of that.
15880
12:35:25,720 --> 12:35:28,720
We check to see if there's any values in there.
15881
12:35:28,720 --> 12:35:31,720
If there are no values, we are bad.
15882
12:35:31,720 --> 12:35:34,720
And now we're going to go for i in range(many).
15883
12:35:34,720 --> 12:35:38,720
This is going to be one to one, so it might run however many times.
15884
12:35:38,720 --> 12:35:44,720
And then what it's going to do is it's going to compute the new page ranks.
15885
12:35:44,720 --> 12:35:49,720
And so what it's really going to do is it's going to take the previous ranks
15886
12:35:49,720 --> 12:35:56,720
and loop through them, and the previous ranks
15887
12:35:56,720 --> 12:36:02,720
is the mapping of primary key to old page rank, okay?
15888
12:36:02,720 --> 12:36:07,720
And for each node, we're going to have total equals total plus old rank,
15889
12:36:07,720 --> 12:36:15,720
and then we're going to set the next ranks to be zero, okay?
15890
12:36:15,720 --> 12:36:18,720
And then what we're going to do is figure out the number of outbound links
15891
12:36:18,720 --> 12:36:25,720
for each page rank item, so node and old rank in the list of the previous ranks.
15892
12:36:25,720 --> 12:36:27,720
These are the IDs we're going to give it to,
15893
12:36:27,720 --> 12:36:33,720
and so for this particular node, we're going to have the outbound links,
15894
12:36:33,720 --> 12:36:40,720
and we're going to go through the links and not link to itself,
15895
12:36:40,720 --> 12:36:42,720
although we made sure that doesn't happen.
15896
12:36:42,720 --> 12:36:46,720
We make sure that this, but then we're going to make a list called give IDs,
15897
12:36:46,720 --> 12:36:52,720
which are the IDs that node is going to share its goodness with.
15898
12:36:52,720 --> 12:36:55,720
And now what we're going to do is we're going to say how much goodness
15899
12:36:55,720 --> 12:37:00,720
are we going to flow outbound based on our previous rank of this particular node
15900
12:37:00,720 --> 12:37:03,720
and the number of outbound links we have.
15901
12:37:03,720 --> 12:37:09,720
So that's how much we're going to give in our outbound links.
15902
12:37:09,720 --> 12:37:13,720
And then what we're doing is all the IDs we're giving it to,
15903
12:37:13,720 --> 12:37:17,720
we started with the next ranks being zero for these folks.
15904
12:37:17,720 --> 12:37:22,720
These are the receiving end, and we're going to add the amount of page rank
15905
12:37:22,720 --> 12:37:24,720
to each one, so whatever this is.
15906
12:37:24,720 --> 12:37:28,720
So we'll go through all of the links,
15907
12:37:28,720 --> 12:37:31,720
give out fractional bits of our current goodness,
15908
12:37:31,720 --> 12:37:34,720
and it's accumulated in each one,
15909
12:37:34,720 --> 12:37:42,720
and so eventually all the incoming links will have granted each new link value.
15910
12:37:42,720 --> 12:37:46,720
Now I'm just going to run through and calculate the new total,
15911
12:37:46,720 --> 12:37:56,720
and this evaporation, the idea is that it has to do with the page rank algorithm
15912
12:37:56,720 --> 12:38:00,720
that there are dysfunctional shapes in which page rank can be trapped,
15913
12:38:00,720 --> 12:38:05,720
and this evaporation is taking a fraction away from everyone
15914
12:38:05,720 --> 12:38:07,720
and giving it back to everybody else.
15915
12:38:07,720 --> 12:38:12,720
And so we add this evaporative factor,
15916
12:38:12,720 --> 12:38:17,720
and then we're going to do some computations just to show some stuff,
15917
12:38:17,720 --> 12:38:24,720
and that is we're calculating the average difference between the page ranks,
15918
12:38:24,720 --> 12:38:26,720
and you'll see this when I start running it,
15919
12:38:26,720 --> 12:38:32,720
and this is going to tell us the stability of the page rank.
15920
12:38:32,720 --> 12:38:36,720
So from one iteration to the next, the more it changes, the less stable it is,
15921
12:38:36,720 --> 12:38:39,720
and you'll see in a sec that these things stabilize,
15922
12:38:39,720 --> 12:38:43,720
and we say what's the average difference in the page ranks per node,
15923
12:38:43,720 --> 12:38:46,720
which is what this is, and that's what we're going to print,
15924
12:38:46,720 --> 12:38:51,720
and now we're going to take the new ranks and make them the old ranks
15925
12:38:51,720 --> 12:38:53,720
and then run the loop again.
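The loop described above can be sketched roughly like this. It is an illustration under assumptions, not the course's exact code: the function name pagerank_step, the list of (from_id, to_id) pairs, and the sample data are made up, and the evaporation step is one simple reading of the description (whatever rank was not passed along gets spread evenly over all pages).

```python
def pagerank_step(prev_ranks, links):
    """Run one in-memory iteration; links is a list of (from_id, to_id) pairs."""
    total = sum(prev_ranks.values())
    next_ranks = {node: 0.0 for node in prev_ranks}

    for node, old_rank in prev_ranks.items():
        # Pages this node points at, skipping self-links and unknown pages.
        give_ids = [to_id for from_id, to_id in links
                    if from_id == node and to_id != node and to_id in prev_ranks]
        if not give_ids:
            continue
        amount = old_rank / len(give_ids)   # share of goodness per outbound link
        for to_id in give_ids:
            next_ranks[to_id] += amount     # accumulates on the receiving end

    # Evaporation: rank that was not handed out is given back evenly.
    new_total = sum(next_ranks.values())
    evap = (total - new_total) / len(next_ranks)
    for node in next_ranks:
        next_ranks[node] += evap

    # Average change per node, used to watch the ranks stabilize.
    avediff = sum(abs(prev_ranks[n] - next_ranks[n]) for n in prev_ranks) / len(prev_ranks)
    return next_ranks, avediff

ranks = {1: 1.0, 2: 1.0, 3: 1.0}        # made-up starting ranks
links = [(1, 2), (2, 1), (3, 1)]        # made-up link pairs
for i in range(3):
    ranks, avediff = pagerank_step(ranks, links)   # new ranks become the old ranks
    print(i + 1, avediff)
```

Because everything lives in dictionaries, the total rank stays the same from one pass to the next and only its distribution changes.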
15926
12:38:53,720 --> 12:38:58,720
So I'm not actually updating the database each time through the page rank iteration,
15927
12:38:58,720 --> 12:39:03,720
but then at the very end I am going to do the update for all of these things
15928
12:39:03,720 --> 12:39:07,720
and update all of the rankings with a new rank.
15929
12:39:07,720 --> 12:39:13,720
So I'm doing an in-memory calculation so that this loop here runs screamingly fast.
15930
12:39:13,720 --> 12:39:17,720
Even if I want to do this loop 100 times or 1000 times,
15931
12:39:17,720 --> 12:39:21,720
it's really all just in-memory data structures.
15932
12:39:21,720 --> 12:39:24,720
Okay, so it's probably easier just for me to show you this.
15933
12:39:24,720 --> 12:39:28,720
The code runs quite simply.
15934
12:39:28,720 --> 12:39:34,720
python3
15935
12:39:34,720 --> 12:39:39,720
sprank.py.
15936
12:39:39,720 --> 12:39:41,720
And so I'm only going to run it for one iteration,
15937
12:39:41,720 --> 12:39:46,720
and that means that this loop here is just going to run one time.
15938
12:39:46,720 --> 12:39:54,720
And so it's going to start with the page ranks of the new rank of one,
15939
12:39:54,720 --> 12:39:58,720
and it's going to just run one iteration and put the rank there.
15940
12:39:58,720 --> 12:40:00,720
Okay, and then update this as well.
15941
12:40:00,720 --> 12:40:05,720
So let's go ahead and run that once for one iteration.
15942
12:40:05,720 --> 12:40:08,720
Okay, and so it ran one iteration,
15943
12:40:08,720 --> 12:40:14,720
and the average change between the previous rank and the new rank is one.
15944
12:40:14,720 --> 12:40:16,720
So it's actually quite crazy.
15945
12:40:16,720 --> 12:40:18,720
So I'm going to refresh here,
15946
12:40:18,720 --> 12:40:21,720
and you'll see that the old rank was one,
15947
12:40:21,720 --> 12:40:26,720
and the new rank went way down, way down, way down, way down,
15948
12:40:26,720 --> 12:40:32,720
down a little bit, down some, up a whole bunch.
15949
12:40:32,720 --> 12:40:33,720
Down, down, up.
15950
12:40:33,720 --> 12:40:35,720
So you see that they went down and up.
15951
12:40:35,720 --> 12:40:39,720
Now the sum of all of these numbers is going to be the same, right?
15952
12:40:39,720 --> 12:40:44,720
Because all it did was like flow it out and recalculate it.
15953
12:40:44,720 --> 12:40:47,720
And so that's what happens with PageRank.
15954
12:40:47,720 --> 12:40:50,720
And so what will happen is if I run one more PageRank iteration,
15955
12:40:50,720 --> 12:40:55,720
this number will, these numbers will be used to compute the new new rank,
15956
12:40:55,720 --> 12:40:57,720
and then these will be copied to the old rank.
15957
12:40:57,720 --> 12:41:00,720
And so you'll see that these will get, they will change again.
15958
12:41:00,720 --> 12:41:05,720
So I'll just run it one more time.
15959
12:41:05,720 --> 12:41:09,720
So I'm going to run one iteration, and then I'm going to hit refresh.
15960
12:41:09,720 --> 12:41:12,720
So you see all these numbers got copied over,
15961
12:41:12,720 --> 12:41:17,720
but now there's a new rank that's computed based on these guys.
15962
12:41:17,720 --> 12:41:19,720
And so they're getting, this one went up.
15963
12:41:19,720 --> 12:41:20,720
This was 0.13.
15964
12:41:20,720 --> 12:41:21,720
That's gone up a little bit.
15965
12:41:21,720 --> 12:41:23,720
This one's gone up some more.
15966
12:41:23,720 --> 12:41:24,720
This one's gone up.
15967
12:41:24,720 --> 12:41:26,720
This one went down, right?
15968
12:41:26,720 --> 12:41:28,720
So this one went down from 6 to 8.
15969
12:41:28,720 --> 12:41:33,720
And you can see that the difference is now the average difference between
15970
12:41:33,720 --> 12:41:39,720
this number and this number across all of them went from 1 point something to 0.41.
15971
12:41:39,720 --> 12:41:41,720
And you'll see that with these very few pages,
15972
12:41:41,720 --> 12:41:47,720
this PageRank converges really quickly, okay?
15973
12:41:47,720 --> 12:41:49,720
So let's run it again.
15974
12:41:49,720 --> 12:41:53,720
And I'll just run 10, and you will watch how this converges, okay?
15975
12:41:53,720 --> 12:41:54,720
So there you go.
15976
12:41:54,720 --> 12:41:56,720
It converges.
15977
12:41:56,720 --> 12:41:59,720
And you're seeing now after like 12 iterations
15978
12:41:59,720 --> 12:42:06,720
that the difference between the old rank and the new rank,
15979
12:42:06,720 --> 12:42:08,720
well, that's because it's that old rank.
15980
12:42:08,720 --> 12:42:11,720
I'll run one more iteration so that you can see.
15981
12:42:11,720 --> 12:42:15,720
So this old rank is less than 0.005.
15982
12:42:15,720 --> 12:42:19,720
And so now you can see that these numbers are sort of stabilizing.
15983
12:42:19,720 --> 12:42:20,720
This is the average.
15984
12:42:20,720 --> 12:42:24,720
That 0.005 number is the average difference between these two things.
15985
12:42:24,720 --> 12:42:27,720
Now, if we're going to pretend to be Google for a moment,
15986
12:42:27,720 --> 12:42:32,720
we can say python3 spider.py.
15987
12:42:36,720 --> 12:42:38,720
So let's just do 10 more pages.
15988
12:42:38,720 --> 12:42:41,720
Now what's going to happen here is these new pages
15989
12:42:41,720 --> 12:42:44,720
are going to have PageRanks of 1, okay?
15990
12:42:44,720 --> 12:42:48,720
So let's get out.
15991
12:42:48,720 --> 12:42:52,720
So if I do a refresh now, and I look at new rank.
15992
12:42:52,720 --> 12:42:55,720
So there's these guys that have high rank.
15993
12:42:55,720 --> 12:42:58,720
What you'll see, I hope, if we, yeah, okay.
15994
12:42:58,720 --> 12:43:00,720
So you see new pages, right?
15995
12:43:00,720 --> 12:43:02,720
These are the new ones that we just retrieved.
15996
12:43:02,720 --> 12:43:05,720
I don't know if they're linked or not, and they all got one.
15997
12:43:05,720 --> 12:43:08,720
So some old pages are way up, 14.
15998
12:43:08,720 --> 12:43:11,720
Some pages, if we go downwards, are way down, right?
15999
12:43:11,720 --> 12:43:13,720
So these are like useless pages.
16000
12:43:13,720 --> 12:43:16,720
They, you know, they point to somewhere, but nobody points to them.
16001
12:43:16,720 --> 12:43:18,720
That's what happens with these PageRanks, okay?
16002
12:43:18,720 --> 12:43:23,720
So what happens is the new records get this 1.0.
16003
12:43:23,720 --> 12:43:29,720
And so if I run the ranking code again, and I run, let's just run five iterations,
16004
12:43:29,720 --> 12:43:33,720
you'll see that the average delta goes up just briefly
16005
12:43:33,720 --> 12:43:36,720
as it sort of assimilates these new pages,
16006
12:43:36,720 --> 12:43:38,720
and then it goes right back down again.
16007
12:43:38,720 --> 12:43:39,720
And so that's what's happening with Google.
16008
12:43:39,720 --> 12:43:42,720
It's sort of running the spider to get more pages,
16009
12:43:42,720 --> 12:43:45,720
then running the PageRank, which gets disturbed a little bit,
16010
12:43:45,720 --> 12:43:47,720
but then it reconverges very rapidly.
16011
12:43:47,720 --> 12:43:50,720
And of course, they've got billions of pages, and we've got hundreds of pages,
16012
12:43:50,720 --> 12:43:52,720
but you get the idea, okay?
16013
12:43:52,720 --> 12:43:56,720
And so I can run PageRank like 100 times,
16014
12:43:56,720 --> 12:43:59,720
and after a while, it just sort of hardly is changing.
16015
12:43:59,720 --> 12:44:03,720
So that's 2.7 times 10 to the negative 10th power.
16016
12:44:03,720 --> 12:44:08,720
So now, you know, let me run it one more time to update the stuff.
16017
12:44:08,720 --> 12:44:13,720
And if I refresh this, you're going to see, look at how stable these numbers are.
16018
12:44:13,720 --> 12:44:19,720
14, 9, 4, 3, 5, 9, 1, 5, 6, 7.
16019
12:44:19,720 --> 12:44:22,720
The difference is they're in the seventh one.
16020
12:44:22,720 --> 12:44:24,720
So that's why this whole PageRank is really cool.
16021
12:44:24,720 --> 12:44:27,720
It seems like it's really chaotic when it first starts out,
16022
12:44:27,720 --> 12:44:30,720
and away you go, okay?
16023
12:44:30,720 --> 12:44:36,720
So that was just this, sprank, right?
16024
12:44:36,720 --> 12:44:40,720
sprank, and spreset, we can look at that code.
16025
12:44:40,720 --> 12:44:42,720
I won't bother running it.
16026
12:44:42,720 --> 12:44:45,720
It just sets the old rank to 1.
16027
12:44:45,720 --> 12:44:46,720
That's it.
16028
12:44:46,720 --> 12:44:47,720
That's as much code as you've got.
16029
12:44:47,720 --> 12:44:50,720
It just starts it and lets it rerun.
16030
12:44:50,720 --> 12:44:54,720
So I'm going to stop now, and I'm going to start a new video,
16031
12:44:54,720 --> 12:44:56,720
where I should talk about this phase here,
16032
12:44:56,720 --> 12:45:06,720
where we're actually going to visualize the PageRank data.
16033
12:45:06,720 --> 12:45:11,720
And what we are in the middle of is we're in the middle of the PageRank code,
16034
12:45:11,720 --> 12:45:14,720
and we just got done running the PageRank,
16035
12:45:14,720 --> 12:45:17,720
and so we have spidered the code.
16036
12:45:17,720 --> 12:45:19,720
We've run PageRank a bunch of times.
16037
12:45:19,720 --> 12:45:22,720
spreset allows us to restart the PageRank algorithm if we want,
16038
12:45:22,720 --> 12:45:24,720
but we're not going to play with that.
16039
12:45:24,720 --> 12:45:27,720
We're just going to play with spdump and spjson and do the visualization,
16040
12:45:27,720 --> 12:45:29,720
which is the fun part.
16041
12:45:29,720 --> 12:45:32,720
So I'll go into spdump.
16042
12:45:32,720 --> 12:45:34,720
So this is simple code,
16043
12:45:34,720 --> 12:45:38,720
because it's really just running a SQL query and then printing stuff out, right?
16044
12:45:38,720 --> 12:45:41,720
So we connect to our database, create a cursor,
16045
12:45:41,720 --> 12:45:44,720
and then just do a select count,
16046
12:45:44,720 --> 12:45:48,720
and we're going to just show the number of links.
16047
12:45:48,720 --> 12:45:51,720
We're going to order by the number of inbound links descending
16048
12:45:51,720 --> 12:45:55,720
so we see the most linked things, and we'll see the top 50 of those.
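A query in that spirit might look like the sketch below. The schema is an assumption for illustration: a Pages table with id and url columns and a Links table with from_id and to_id columns, filled with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT, new_rank REAL);
CREATE TABLE Links (from_id INTEGER, to_id INTEGER);
INSERT INTO Pages VALUES (1, 'a', 1.0), (2, 'b', 1.0), (3, 'c', 1.0);
INSERT INTO Links VALUES (2, 1), (3, 1), (1, 2);
""")

# Count inbound links per page and show the most linked pages first.
rows = conn.execute("""
    SELECT COUNT(Links.from_id) AS inbound, Pages.url
    FROM Pages JOIN Links ON Pages.id = Links.to_id
    GROUP BY Pages.id
    ORDER BY inbound DESC
    LIMIT 50
""").fetchall()

for inbound, url in rows:
    print(inbound, url)
```

Pages with no inbound links drop out of the join, which is fine for a quick helper whose only job is to show the most linked pages.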
16049
12:45:55,720 --> 12:45:56,720
So this is just a sample.
16050
12:45:56,720 --> 12:46:00,720
You'll tend to write little helpers like this that make your life easier
16051
12:46:00,720 --> 12:46:05,720
just to show you the kinds of things that you want, spdump.py.
16052
12:46:05,720 --> 12:46:07,720
And you just kind of test to make sure that it's like,
16053
12:46:07,720 --> 12:46:09,720
oh, this looks right to me.
16054
12:46:09,720 --> 12:46:12,720
And so here is the number of inbound links.
16055
12:46:12,720 --> 12:46:15,720
So that's my blog that has the most inbound links,
16056
12:46:15,720 --> 12:46:18,720
followed by my uncategorized, whatever that is.
16057
12:46:18,720 --> 12:46:23,720
And these are the number of inbound links within my own blog somehow.
16058
12:46:23,720 --> 12:46:29,720
I don't know, because this is not looking at the whole internet at all.
16059
12:46:29,720 --> 12:46:31,720
So there we go.
16060
12:46:31,720 --> 12:46:32,720
So that's spdump.
16061
12:46:32,720 --> 12:46:34,720
Pretty straightforward.
16062
12:46:34,720 --> 12:46:37,720
And now we're going to go through the visualization process.
16063
12:46:37,720 --> 12:46:40,720
And so this is going to look at all that data
16064
12:46:40,720 --> 12:46:43,720
and produce a JavaScript file.
16065
12:46:43,720 --> 12:46:45,720
It's going to write a JavaScript file
16066
12:46:45,720 --> 12:46:49,720
that will then be fed into my visualization using D3.
16067
12:46:49,720 --> 12:46:57,720
And spjson is going to do a big, long join.
16068
12:46:57,720 --> 12:46:59,720
It joins the links with the thing.
16069
12:46:59,720 --> 12:47:01,720
And HTML is not null.
16070
12:47:01,720 --> 12:47:03,720
And error is null.
16071
12:47:03,720 --> 12:47:05,720
You know, order by the number of inbound links.
16072
12:47:05,720 --> 12:47:10,720
So we're looking at the things that have the highest number of inbound links.
16073
12:47:10,720 --> 12:47:14,720
We're going to read all this stuff.
16074
12:47:14,720 --> 12:47:18,720
We're going to read through all those rows
16075
12:47:18,720 --> 12:47:21,720
and pull out the page rank for each one.
16076
12:47:21,720 --> 12:47:24,720
We are looking for the highest and lowest rank
16077
12:47:24,720 --> 12:47:27,720
because these numbers can vary quite widely.
16078
12:47:27,720 --> 12:47:31,720
They go all the way from 0.000 to 20 or 30.
16079
12:47:31,720 --> 12:47:35,720
And so it asks, how many do you want to do?
16080
12:47:35,720 --> 12:47:38,720
So it only does the top, like 20 or something.
16081
12:47:38,720 --> 12:47:41,720
And you'll see why we need that in the visualization.
16082
12:47:41,720 --> 12:47:44,720
And so this is just checking.
16083
12:47:44,720 --> 12:47:46,720
And so we're going to write out a file.
16084
12:47:46,720 --> 12:47:48,720
We'll see what the format of this is.
16085
12:47:48,720 --> 12:47:51,720
It's just a little, it's just a JavaScript file.
16086
12:47:51,720 --> 12:47:54,720
And we're going to write out,
16087
12:47:54,720 --> 12:47:57,720
we're basically normalizing the rank.
16088
12:47:57,720 --> 12:47:59,720
We're subtracting the minimum rank.
16089
12:47:59,720 --> 12:48:03,720
And because we're going to turn this into line weight,
16090
12:48:03,720 --> 12:48:04,720
the thickness of the line,
16091
12:48:04,720 --> 12:48:07,720
and so we're dividing by, you know,
16092
12:48:07,720 --> 12:48:11,720
we're normalizing the rank to be the thickness of the line
16093
12:48:11,720 --> 12:48:16,720
and the size of the ball.
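One way to picture that normalization, as a sketch only: the function name normalize and the 0.0 to 1.0 target range are assumptions, and the real script may scale differently.

```python
def normalize(rank, min_rank, max_rank):
    """Scale rank into the range 0.0 to 1.0 relative to the lowest and highest rank."""
    if max_rank == min_rank:
        return 0.0  # every page has the same rank, so there is nothing to spread out
    return (rank - min_rank) / (max_rank - min_rank)

ranks = [0.5, 5.0, 20.0]  # made-up page ranks
low, high = min(ranks), max(ranks)
weights = [normalize(r, low, high) for r in ranks]
print(weights)
```

A value on a fixed scale like this is easy to turn into a line thickness or a circle size, no matter how widely the raw ranks vary.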
16094
12:48:16,720 --> 12:48:17,720
You'll see all this.
16095
12:48:17,720 --> 12:48:19,720
And so this is really just writing some JavaScript
16096
12:48:19,720 --> 12:48:22,720
with the little strings and stuff like that.
16097
12:48:22,720 --> 12:48:25,720
And then we're going to finish the JavaScript.
16098
12:48:25,720 --> 12:48:27,720
And then we're going to write all the links out.
16099
12:48:27,720 --> 12:48:29,720
So these are the balls that you'll see.
16100
12:48:29,720 --> 12:48:32,720
And this is showing what, this is drawing all the lines.
16101
12:48:32,720 --> 12:48:35,720
And this is again normalizing things for thickness
16102
12:48:35,720 --> 12:48:36,720
and printing these things out.
16103
12:48:36,720 --> 12:48:40,720
Now I don't want to go through this in tremendous detail,
16104
12:48:40,720 --> 12:48:47,720
but so I'll do python spjson.py.
16105
12:48:47,720 --> 12:48:50,720
Let's do the top 20 nodes.
16106
12:48:50,720 --> 12:48:55,720
And if I take a look at this file spider.js,
16107
12:48:55,720 --> 12:48:58,720
you can see that it's some objects that basically
16108
12:48:58,720 --> 12:49:01,720
put the page rank in, which ID it is,
16109
12:49:01,720 --> 12:49:04,720
and that's a way for me to be able to link back and forth.
16110
12:49:04,720 --> 12:49:07,720
Weight is how big the little circle is.
16111
12:49:07,720 --> 12:49:08,720
And then I have the links.
16112
12:49:08,720 --> 12:49:11,720
And I only asked for the top 20.
16113
12:49:11,720 --> 12:49:15,720
And then this is the thickness of the line,
16114
12:49:15,720 --> 12:49:18,720
where the line starts, where the line ends.
16115
12:49:18,720 --> 12:49:24,720
So this is read by this HTML file.
16116
12:49:24,720 --> 12:49:31,720
And it's going to read somewhere this force.js file.
16117
12:49:31,720 --> 12:49:36,720
And my own spider.js code, this is some JavaScript.
16118
12:49:36,720 --> 12:49:40,720
I mean, no, the force.js is the visualization code.
16119
12:49:40,720 --> 12:49:43,720
And this is D3, the visualization library.
16120
12:49:43,720 --> 12:49:46,720
So I'm using this D3.js,
16121
12:49:46,720 --> 12:49:49,720
which is a really great visualization library.
16122
12:49:49,720 --> 12:49:51,720
And this is just drawing the circles
16123
12:49:51,720 --> 12:49:53,720
and making the circles of colors
16124
12:49:53,720 --> 12:49:55,720
and making the circles bigger and smaller
16125
12:49:55,720 --> 12:49:57,720
and then connecting all the lines in between it.
16126
12:49:57,720 --> 12:49:59,720
So this is just there.
16127
12:49:59,720 --> 12:50:01,720
This data feeds that thing.
16128
12:50:01,720 --> 12:50:04,720
And so when we're all done, you simply say open.
16129
12:50:04,720 --> 12:50:05,720
You don't have to do anything.
16130
12:50:05,720 --> 12:50:10,720
Open force.html.
16131
12:50:10,720 --> 12:50:13,720
And so all this beautiful JavaScript stuff is like,
16132
12:50:13,720 --> 12:50:14,720
oh, wow, that's really cool,
16133
12:50:14,720 --> 12:50:16,720
because you can move these things around.
16134
12:50:16,720 --> 12:50:17,720
Whoa.
16135
12:50:17,720 --> 12:50:20,720
You can see the circles are bigger.
16136
12:50:20,720 --> 12:50:21,720
If you hover over it for a while,
16137
12:50:21,720 --> 12:50:24,720
it shows you the big ones.
16138
12:50:24,720 --> 12:50:26,720
You know, you can see these things, and it's kind of cool.
16139
12:50:26,720 --> 12:50:28,720
So I gave you all this force.js
16140
12:50:28,720 --> 12:50:30,720
and force.html.
16141
12:50:30,720 --> 12:50:33,720
And so that kind of visualizes the page rank.
16142
12:50:33,720 --> 12:50:37,720
And you could use this to visualize quite a bit of stuff.
16143
12:50:37,720 --> 12:50:42,720
You know, it'll take you a while to pull down enough data
16144
12:50:42,720 --> 12:50:45,720
from a real website.
16145
12:50:45,720 --> 12:50:47,720
But after you pull down 400 or 500 pages
16146
12:50:47,720 --> 12:50:48,720
if you have some time,
16147
12:50:48,720 --> 12:50:51,720
then the visualization is quite interesting.
16148
12:50:51,720 --> 12:50:54,720
But you can see why we had to pull down several hundred pages
16149
12:50:54,720 --> 12:50:57,720
just to get this much page rank information.
16150
12:50:57,720 --> 12:51:03,720
Okay, so that gives you a sense
16151
12:51:03,720 --> 12:51:08,720
of how to run the page rank code in Python for everybody.
16152
12:51:08,720 --> 12:51:11,720
So thanks for listening.
16153
12:51:11,720 --> 12:51:16,720
The last visualization application
16154
12:51:16,720 --> 12:51:18,720
that we're going to take a look at is mailing lists,
16155
12:51:18,720 --> 12:51:19,720
and that's kind of ironic.
16156
12:51:19,720 --> 12:51:20,720
We started with the mailing lists,
16157
12:51:20,720 --> 12:51:22,720
and we're going to end with the mailing lists.
16158
12:51:22,720 --> 12:51:23,720
The mailing lists, of course,
16159
12:51:23,720 --> 12:51:25,720
are from my open source Sakai project,
16160
12:51:25,720 --> 12:51:28,720
which I love and am very proud of.
16161
12:51:28,720 --> 12:51:30,720
And so what we're going to do
16162
12:51:30,720 --> 12:51:32,720
is we're going to crawl the archive of a mailing list,
16163
12:51:32,720 --> 12:51:34,720
and then we're going to do two visualizations.
16164
12:51:34,720 --> 12:51:36,720
One is an activity visualization,
16165
12:51:36,720 --> 12:51:38,720
and another is a word cloud.
16166
12:51:38,720 --> 12:51:41,720
So probably the more important thing
16167
12:51:41,720 --> 12:51:44,720
is when I do the demonstration of how the software works.
16168
12:51:44,720 --> 12:51:47,720
So this is a large data set, so you've got to be careful.
16169
12:51:47,720 --> 12:51:50,720
This could spider gmane.org,
16170
12:51:50,720 --> 12:51:52,720
which is a very free and friendly archive.
16171
12:51:52,720 --> 12:51:56,720
This data originally came from gmane.org,
16172
12:51:56,720 --> 12:51:58,720
but I've got a copy of it.
16173
12:51:58,720 --> 12:52:01,720
And so gmane.org is not rate limited,
16174
12:52:01,720 --> 12:52:04,720
but if everyone who is watching this
16175
12:52:04,720 --> 12:52:06,720
starts spidering gmane.org at the same time,
16176
12:52:06,720 --> 12:52:07,720
you will crash it.
16177
12:52:07,720 --> 12:52:09,720
It just doesn't have the horsepower
16178
12:52:09,720 --> 12:52:11,720
to give you this data as fast.
16179
12:52:11,720 --> 12:52:14,720
And so I've got something that can give you the data super fast
16180
12:52:14,720 --> 12:52:17,720
and has no rate limit, on a really good server,
16181
12:52:17,720 --> 12:52:19,720
and it's cached all around the world
16182
12:52:19,720 --> 12:52:21,720
using a technology called Cloudflare.
16183
12:52:21,720 --> 12:52:25,720
So please, please, please don't point this at gmane.org.
16184
12:52:25,720 --> 12:52:27,720
Point this at the URL here,
16185
12:52:27,720 --> 12:52:30,720
mbox.dr-chuck.net, et cetera, et cetera.
16186
12:52:30,720 --> 12:52:33,720
And then you can run this as fast as you like.
16187
12:52:33,720 --> 12:52:35,720
Now, another thing to worry about is
16188
12:52:35,720 --> 12:52:39,720
if you have a metered connection.
16189
12:52:39,720 --> 12:52:41,720
So don't do this on a cell phone connection
16190
12:52:41,720 --> 12:52:44,720
because you'll pay thousands of dollars perhaps.
16191
12:52:44,720 --> 12:52:47,720
Make sure you're on a no-cost connection
16192
12:52:47,720 --> 12:52:48,720
before you start running this
16193
12:52:48,720 --> 12:52:50,720
because this is going to pull a lot of data down.
16194
12:52:50,720 --> 12:52:53,720
If you just start this from scratch and you let it run,
16195
12:52:53,720 --> 12:52:56,720
on a super fast connection,
16196
12:52:56,720 --> 12:53:00,720
downloading the whole thing is probably about four hours.
16197
12:53:00,720 --> 12:53:04,720
On my home connection,
16198
12:53:04,720 --> 12:53:07,720
when I had like about a 10 megabit connection,
16199
12:53:07,720 --> 12:53:09,720
it took several days.
16200
12:53:09,720 --> 12:53:12,720
And so just understand that in this one,
16201
12:53:12,720 --> 12:53:15,720
it's both fun to deal with a ton of data,
16202
12:53:15,720 --> 12:53:17,720
and it's scary to deal with a ton of data.
16203
12:53:17,720 --> 12:53:18,720
So this one is big.
16204
12:53:18,720 --> 12:53:22,720
This one is, you'll see the process in action
16205
12:53:22,720 --> 12:53:24,720
because it'll run for a while.
16206
12:53:24,720 --> 12:53:27,720
Everything will take a long time.
16207
12:53:27,720 --> 12:53:30,720
So here's basically the flow of the data
16208
12:53:30,720 --> 12:53:32,720
in this particular one.
16209
12:53:32,720 --> 12:53:34,720
You are going to have the restartable spider
16210
12:53:34,720 --> 12:53:38,720
that talks to the API, mbox.dr-chuck.net,
16211
12:53:38,720 --> 12:53:42,720
which has a scalable copy of all this information.
16212
12:53:42,720 --> 12:53:46,720
And again, it's going to do kind of a raw database,
16213
12:53:46,720 --> 12:53:47,720
not a very clean database.
16214
12:53:47,720 --> 12:53:48,720
It's sort of a mess.
16215
12:53:48,720 --> 12:53:51,720
It's just enough columns to keep track of whether or not
16216
12:53:51,720 --> 12:53:53,720
we've got this page or not.
16217
12:53:53,720 --> 12:53:57,720
And so this has the ones we've retrieved so far.
16218
12:53:57,720 --> 12:54:00,720
And so what gmane does is it sort of scans down
16219
12:54:00,720 --> 12:54:02,720
to see where to retrieve next, gets that,
16220
12:54:02,720 --> 12:54:05,720
and then starts scanning and then adding things here.
16221
12:54:05,720 --> 12:54:07,720
So it just adds it and then it blows up
16222
12:54:07,720 --> 12:54:09,720
and then it comes in again and says,
16223
12:54:09,720 --> 12:54:11,720
okay, I'll start here and then it starts retrieving stuff
16224
12:54:11,720 --> 12:54:14,720
and fills this in, fills this in, fills this in.
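The restart idea can be sketched like this, assuming the raw database keeps one row per retrieved message keyed by a message number (the Messages table, its columns, and the helper name next_start are assumptions for illustration):

```python
import sqlite3

def next_start(conn):
    """Return the message number to fetch next (1 if nothing is stored yet)."""
    row = conn.execute("SELECT MAX(id) FROM Messages").fetchone()
    return 1 if row[0] is None else row[0] + 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, body TEXT)")

first = next_start(conn)      # empty database: start at the beginning
conn.executemany("INSERT INTO Messages VALUES (?, ?)", [(1, "x"), (2, "y")])
resumed = next_start(conn)    # after a crash or restart: pick up where we left off
print(first, resumed)
```

Because the starting point comes from what is already stored, the spider can be stopped and started as often as needed without fetching anything twice.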
16225
12:54:14,720 --> 12:54:16,720
And sometimes you put like a delay in this
16226
12:54:16,720 --> 12:54:19,720
so you don't overwhelm networks, you don't overwhelm servers.
16227
12:54:19,720 --> 12:54:22,720
But basically this is pretty much a raw retrieval
16228
12:54:22,720 --> 12:54:24,720
of the email messages.
16229
12:54:24,720 --> 12:54:26,720
And this file can get rather large.
16230
12:54:26,720 --> 12:54:28,720
This is the one that's greater than a gigabyte.
16231
12:54:28,720 --> 12:54:31,720
Now this data is actually really nasty.
16232
12:54:31,720 --> 12:54:33,720
It's email data.
16233
12:54:33,720 --> 12:54:34,720
The date formats changed.
16234
12:54:34,720 --> 12:54:41,720
This is data that lasted from 2004 to like 2012 or 2013.
16235
12:54:41,720 --> 12:54:45,720
And so this data has got a lot of things wrong with it.
16236
12:54:45,720 --> 12:54:48,720
It even has things where people's email address has changed.
16237
12:54:48,720 --> 12:54:50,720
And so it has this mapping file.
16238
12:54:50,720 --> 12:54:53,720
This comes along with it, this mapping file that says,
16239
12:54:53,720 --> 12:54:56,720
here's this one person and here are the six email addresses
16240
12:54:56,720 --> 12:55:00,720
that they used throughout the life of the project.
16241
12:55:00,720 --> 12:55:03,720
And so there is a relatively complex,
16242
12:55:03,720 --> 12:55:09,720
and so this part here is super slow, very slow.
16243
12:55:09,720 --> 12:55:12,720
This part here is slow.
16244
12:55:12,720 --> 12:55:15,720
But it'll take like, depending on how fast your computer is,
16245
12:55:15,720 --> 12:55:17,720
somewhere between two minutes and ten minutes.
16246
12:55:17,720 --> 12:55:20,720
This first part will take days, perhaps,
16247
12:55:20,720 --> 12:55:22,720
depending on the speed of your network connection.
16248
12:55:22,720 --> 12:55:25,720
And so what gmodel does is it reads through this.
16249
12:55:25,720 --> 12:55:27,720
It actually re-creates, it wipes this out
16250
12:55:27,720 --> 12:55:30,720
and re-creates index.sqlite every time it runs
16251
12:55:30,720 --> 12:55:32,720
so that you can change any number of things,
16252
12:55:32,720 --> 12:55:35,720
you can re-spider things, you can do whatever.
16253
12:55:35,720 --> 12:55:38,720
And often the cleanup, this is one of those cleanup processes,
16254
12:55:38,720 --> 12:55:40,720
and you have to tweak the cleanup process.
16255
12:55:40,720 --> 12:55:43,720
You're like, look at your data, like, oh, the cleanup missed something,
16256
12:55:43,720 --> 12:55:44,720
so I've got to run it again.
16257
12:55:44,720 --> 12:55:48,720
So this produces index.sqlite every time it runs.
16258
12:55:48,720 --> 12:55:50,720
So this is like two to ten minutes.
16259
12:55:50,720 --> 12:55:52,720
gmodel is two to ten minutes.
16260
12:55:52,720 --> 12:55:56,720
And it maps names, and when it's all said and done,
16261
12:55:56,720 --> 12:56:01,720
this is a very small, highly normalized, it's a nice data model.
16262
12:56:01,720 --> 12:56:04,720
This one here, the content.sqlite has an ugly data model.
16263
12:56:04,720 --> 12:56:06,720
Index.sqlite has a pretty data model.
16264
12:56:06,720 --> 12:56:09,720
It's got foreign keys, it's got all this stuff.
16265
12:56:09,720 --> 12:56:12,720
And all those things we talked about in the database where it's efficient.
16266
12:56:12,720 --> 12:56:16,720
And so in your mind, keep track of how fast it is to scan all the data
16267
12:56:16,720 --> 12:56:18,720
in a database with a bad model,
16268
12:56:18,720 --> 12:56:21,720
and then watch when you run like gbasic, which is a scanner,
16269
12:56:21,720 --> 12:56:23,720
or gline, which produces line data, or gword,
16270
12:56:23,720 --> 12:56:25,720
and watch how fast they run.
16271
12:56:25,720 --> 12:56:28,720
They run in like a couple of seconds at the most,
16272
12:56:28,720 --> 12:56:30,720
and this runs in two to ten minutes.
16273
12:56:30,720 --> 12:56:34,720
And the difference is that's because the data is efficiently modeled
16274
12:56:34,720 --> 12:56:35,720
in index.sqlite.
16275
12:56:35,720 --> 12:56:38,720
So you can take a look at that using SQLite browser
16276
12:56:38,720 --> 12:56:40,720
and take a look at the data model.
16277
12:56:40,720 --> 12:56:42,720
And you'll see it looks just like the stuff we talked about
16278
12:56:42,720 --> 12:56:43,720
in the database chapter.
16279
12:56:43,720 --> 12:56:46,720
It's got foreign keys and all those things.
16280
12:56:46,720 --> 12:56:48,720
And so that runs, and you've got this.
16281
12:56:48,720 --> 12:56:51,720
And then we do our visualizations and our analysis
16282
12:56:51,720 --> 12:56:53,720
from this clean version of all the data.
16283
12:56:53,720 --> 12:56:56,720
And so gbasic just loops through and prints some stuff out.
16284
12:56:56,720 --> 12:56:58,720
It's a great way to test things.
16285
12:56:58,720 --> 12:57:00,720
It's a pretty easy to understand program,
16286
12:57:00,720 --> 12:57:01,720
and you could take a look at it.
16287
12:57:01,720 --> 12:57:04,720
Gline does some bucketing and makes some histograms
16288
12:57:04,720 --> 12:57:06,720
to produce a line graph.
16289
12:57:06,720 --> 12:57:09,720
And then gword does a different histogram.
16290
12:57:09,720 --> 12:57:11,720
It does a histogram of word frequency
16291
12:57:11,720 --> 12:57:14,720
and writes that word-frequency data out to gword.js.
16292
12:57:14,720 --> 12:57:21,720
And then we have two HTML files that use the d3.js visualization
16293
12:57:21,720 --> 12:57:24,720
to produce a line and a word chart.
16294
12:57:24,720 --> 12:57:28,720
And so in another video, I will show you how this code works,
16295
12:57:28,720 --> 12:57:31,720
which is probably more useful than this picture.
16296
12:57:31,720 --> 12:57:37,720
But this is a whole bunch of good stuff
16297
12:57:37,720 --> 12:57:39,720
in this particular application.
16298
12:57:39,720 --> 12:57:42,720
And if you really understand everything in here,
16299
12:57:42,720 --> 12:57:44,720
you can build a pretty sophisticated
16300
12:57:44,720 --> 12:57:47,720
data retrieval and analysis pipeline.
16301
12:57:47,720 --> 12:57:49,720
And so that's it.
16302
12:57:49,720 --> 12:57:51,720
Thank you for watching all these lectures,
16303
12:57:51,720 --> 12:57:54,720
and I look forward to seeing you on the net.
16304
12:57:58,720 --> 12:58:00,720
We're doing some code walkthroughs.
16305
12:58:00,720 --> 12:58:02,720
If you want to get the source code,
16306
12:58:02,720 --> 12:58:04,720
you can take a look at the sample code
16307
12:58:04,720 --> 12:58:07,720
and download it and work through it.
16308
12:58:07,720 --> 12:58:12,720
And so what we're working on now is doing some retrieval
16309
12:58:12,720 --> 12:58:15,720
and visualization of email data.
16310
12:58:15,720 --> 12:58:16,720
It's kind of ironic.
16311
12:58:16,720 --> 12:58:24,720
We are going to now look at the email data that we started with.
16312
12:58:24,720 --> 12:58:28,720
It's the same Sakai developer list email data.
16313
12:58:28,720 --> 12:58:32,720
And so there's this service called Gmane.
16314
12:58:32,720 --> 12:58:36,720
And Gmane archives developer lists and various email lists.
16315
12:58:36,720 --> 12:58:39,720
And I've made a copy of their data because all the students
16316
12:58:39,720 --> 12:58:43,720
in my class hitting their server with their API would crush it.
16317
12:58:43,720 --> 12:58:47,720
So in order to be a nice guy, I put up a much more powerful server
16318
12:58:47,720 --> 12:58:51,720
with just the data from this one list.
16319
12:58:51,720 --> 12:58:53,720
And it's about a gigabyte of data,
16320
12:58:53,720 --> 12:58:56,720
so be real careful if you're paying for network.
16321
12:58:56,720 --> 12:58:59,720
So the basic process we're going to go through is we're going to have
16322
12:58:59,720 --> 12:59:03,720
a spidering process that's a simple, restartable,
16323
12:59:03,720 --> 12:59:07,720
focused on the network problems, data pulling,
16324
12:59:07,720 --> 12:59:11,720
to pull content.sqlite, and there's going to be a database there.
16325
12:59:11,720 --> 12:59:13,720
And then we're going to have a cleanup process.
16326
12:59:13,720 --> 12:59:16,720
This database is going to get large, about a gigabyte.
16327
12:59:16,720 --> 12:59:19,720
And then we're going to have a process that kind of grinds through this data.
16328
12:59:19,720 --> 12:59:22,720
It takes a while.
16329
12:59:22,720 --> 12:59:24,720
And so then it's going to read this mapping,
16330
12:59:24,720 --> 12:59:26,720
and I'll show you that when it comes,
16331
12:59:26,720 --> 12:59:29,720
because things like people's names have changed over all these years.
16332
12:59:29,720 --> 12:59:31,720
And it does a cleanup and makes a really nice,
16333
12:59:31,720 --> 12:59:34,720
highly relational version of this data.
16334
12:59:34,720 --> 12:59:36,720
And then we visualize from here.
16335
12:59:36,720 --> 12:59:40,720
And so this could take you several days to finish this.
16336
12:59:40,720 --> 12:59:42,720
This will take like a few minutes to run,
16337
12:59:42,720 --> 12:59:45,720
and then this will just take seconds to run.
16338
12:59:45,720 --> 12:59:50,720
And so this is a multi-step process where if you were doing something
16339
12:59:50,720 --> 12:59:53,720
like running something for two days to produce a visualization,
16340
12:59:53,720 --> 12:59:56,720
and it blew up three-quarters of the way through, it would do you no good.
16341
12:59:56,720 --> 12:59:59,720
And so that's why we break this into simple parts.
16342
12:59:59,720 --> 13:00:03,720
But right now we're just going to focus on this part right here,
16343
13:00:03,720 --> 13:00:10,720
and take a look at the mail bit, and retrieve the mail,
16344
13:00:10,720 --> 13:00:15,720
and then we'll have another video to talk about the rest of this stuff.
16345
13:00:15,720 --> 13:00:21,720
So let's take a look at the code.
16346
13:00:21,720 --> 13:00:24,720
So here is gmane.py.
16347
13:00:24,720 --> 13:00:26,720
That is the basic code.
16348
13:00:26,720 --> 13:00:29,720
And hopefully this stuff is starting to look familiar.
16349
13:00:29,720 --> 13:00:32,720
The thing that's weird here is we've got to do some date-time parsing.
16350
13:00:32,720 --> 13:00:35,720
And there is code that's out there, but you may have to install it.
16351
13:00:35,720 --> 13:00:40,720
And I had to write my code in a way that didn't assume
16352
13:00:40,720 --> 13:00:42,720
that you could install the date-time parser.
16353
13:00:42,720 --> 13:00:45,720
And so it has it, even if that's not there,
16354
13:00:45,720 --> 13:00:47,720
it uses my own date-time parser, and that's what this code is.
16355
13:00:47,720 --> 13:00:50,720
Don't worry too much about that.
16356
13:00:50,720 --> 13:00:55,720
And of course we have to deal with the lack of certificates inside of Python.
16357
13:00:55,720 --> 13:00:59,720
And so we start things out.
16358
13:00:59,720 --> 13:01:02,720
And this is really a simple table.
16359
13:01:02,720 --> 13:01:05,720
We've got a messages table that's got a primary key,
16360
13:01:05,720 --> 13:01:09,720
the email itself, when it was sent, what the subject,
16361
13:01:09,720 --> 13:01:13,720
and the headers, and the body.
16362
13:01:13,720 --> 13:01:17,720
And so what we're going to do is, because we have to pick up where we left off,
16363
13:01:17,720 --> 13:01:25,720
we're going to select the largest primary key from the messages table
16364
13:01:25,720 --> 13:01:27,720
and retrieve that.
16365
13:01:27,720 --> 13:01:31,720
And then we're going to go to the one after that.
16366
13:01:31,720 --> 13:01:35,720
And so we know what the ID is,
16367
13:01:35,720 --> 13:01:38,720
and we're going to pick up where we left off.
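A minimal sketch of the restart logic just described, assuming a table named Messages with an integer primary key called id. The helper name next_start and the in-memory database are illustrative assumptions, not the course's actual code.

```python
import sqlite3

def next_start(conn):
    """Return the id of the first message that has not been stored yet."""
    cur = conn.cursor()
    cur.execute("SELECT max(id) FROM Messages")
    row = cur.fetchone()
    # max(id) comes back as NULL (None in Python) when the table is empty
    if row is None or row[0] is None:
        return 1
    return row[0] + 1

# In-memory database used only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, email TEXT)")
print(next_start(conn))  # nothing stored yet, so start at the first message

conn.execute("INSERT INTO Messages (id, email) VALUES (51, 'a@example.com')")
print(next_start(conn))  # one past the largest id already stored
```

Because the starting point comes from the database itself, stopping and rerunning the spider never needs a separate bookmark file.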
16368
13:01:38,720 --> 13:01:43,720
And so we have a starting point that starts either 0 or 1.
16369
13:01:43,720 --> 13:01:47,720
And we're going to ask how many messages to retrieve.
16370
13:01:47,720 --> 13:01:49,720
We've got some counters.
16371
13:01:49,720 --> 13:01:51,720
And so we're going to say, okay,
16372
13:01:51,720 --> 13:01:56,720
see if select id from messages where id equals whatever that starting is,
16373
13:01:56,720 --> 13:01:58,720
that's the highest number we've seen so far.
16374
13:01:58,720 --> 13:02:03,720
And if row is not none,
16375
13:02:03,720 --> 13:02:07,720
that means we've already retrieved this particular email message.
16376
13:02:07,720 --> 13:02:10,720
Otherwise we're going to keep on going, and we're in good shape.
16377
13:02:10,720 --> 13:02:12,720
And this is one that we want to retrieve.
16378
13:02:12,720 --> 13:02:14,720
And we're subtracting that so we don't.
16379
13:02:14,720 --> 13:02:16,720
And so this is the base URL.
16380
13:02:16,720 --> 13:02:21,720
This is the URL of our API,
16381
13:02:21,720 --> 13:02:26,720
the one that I have a nice copy of all this data on a server
16382
13:02:26,720 --> 13:02:29,720
that's accessible worldwide and won't crash.
16383
13:02:29,720 --> 13:02:33,720
So the format of this is you can say I would like the email address
16384
13:02:33,720 --> 13:02:48,720
from 1 to 2 or from 100, oops, from 102, 101, message 101 to 102.
16385
13:02:48,720 --> 13:02:51,720
And we can just kind of walk through these things.
16386
13:02:51,720 --> 13:02:53,720
So that's the message ID.
16387
13:02:53,720 --> 13:02:59,720
And so if we're going to make the URL,
16388
13:02:59,720 --> 13:03:01,720
we're going to take the base URL,
16389
13:03:01,720 --> 13:03:03,720
add the starting address, and then add plus one.
16390
13:03:03,720 --> 13:03:06,720
So we got the slash at the end of this starting address.
16391
13:03:06,720 --> 13:03:10,720
And so that's how we form those.
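The URL pattern described here (base URL, then the starting message number, a slash, and the starting number plus one) can be sketched as below. The base URL is a placeholder assumption, not the real server address, and message_url is a hypothetical helper name.

```python
def message_url(base_url, start):
    """Build the URL that asks for message `start` up to `start + 1`."""
    return base_url + str(start) + "/" + str(start + 1)

# Placeholder base URL, for illustration only
BASE_URL = "http://example.com/gmane/messages/"

print(message_url(BASE_URL, 101))
```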
16392
13:03:10,720 --> 13:03:14,720
And we're going to retrieve that, and we're going to decode it.
16393
13:03:14,720 --> 13:03:16,720
We've seen this in some other ones.
16394
13:03:16,720 --> 13:03:19,720
We're going to check to see if we got legit data.
16395
13:03:19,720 --> 13:03:24,720
If not, if I got a 404 not found or something else, we're going to quit.
16396
13:03:24,720 --> 13:03:28,720
If someone hits Control-C, or Control-Z,
16397
13:03:28,720 --> 13:03:30,720
we'll get the program interrupt and we'll stop.
16398
13:03:30,720 --> 13:03:38,720
If there's some other problem, we're going to complain and keep going.
16399
13:03:38,720 --> 13:03:40,720
And if we have five failures in a row, we're going to quit,
16400
13:03:40,720 --> 13:03:44,720
but we'll just keep on going because these things do have glitchy bits here.
16401
13:03:44,720 --> 13:03:48,720
And so at this point, if we made it this far, we've retrieved the URL
16402
13:03:48,720 --> 13:03:50,720
and we've got the number of characters we've retrieved.
16403
13:03:50,720 --> 13:03:55,720
And if we get bad data, if it doesn't start with from,
16404
13:03:55,720 --> 13:03:57,720
because this is a mail message,
16405
13:03:57,720 --> 13:04:05,720
and they all start with from space, if it's right, it starts with from space.
16406
13:04:05,720 --> 13:04:09,720
Then what we're going to, we're going to tolerate up to five failures there
16407
13:04:09,720 --> 13:04:12,720
for bad data because it could be bad.
16408
13:04:12,720 --> 13:04:14,720
And then we're going to find a blank line because that's the new line
16409
13:04:14,720 --> 13:04:17,720
at the end of one line and then a blank line.
16410
13:04:17,720 --> 13:04:20,720
And then we're going to take and break this into the headers,
16411
13:04:20,720 --> 13:04:25,720
the mail headers, which is that mail headers is this stuff right here,
16412
13:04:25,720 --> 13:04:27,720
up to but not including the blank line,
16413
13:04:27,720 --> 13:04:32,720
and then the body is everything after that, okay?
16414
13:04:32,720 --> 13:04:36,720
And so we'll just have, break that into pieces.
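A small sketch of the header and body split described above, assuming the message text uses plain "\n" line endings. The function name split_message and the sample message are made up for illustration.

```python
def split_message(text):
    """Split raw mail text into (headers, body) at the first blank line.

    Returns None when the text does not look like a mail message.
    """
    if not text.startswith("From "):
        return None
    # A blank line is the newline ending one line followed by another newline
    pos = text.find("\n\n")
    if pos < 0:
        return None
    headers = text[:pos]      # up to but not including the blank line
    body = text[pos + 2:]     # everything after the blank line
    return headers, body

sample = (
    "From jane@example.com Wed Dec 14 10:15:30 2005\n"
    "From: Jane Doe <jane@example.com>\n"
    "Subject: build question\n"
    "\n"
    "Hello there\n"
)
headers, body = split_message(sample)
print(repr(body))
```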
16415
13:04:36,720 --> 13:04:40,720
Otherwise we'll complain and tolerate up to five failures.
16416
13:04:40,720 --> 13:04:42,720
And then we're going to use a regular expression,
16417
13:04:42,720 --> 13:04:44,720
kind of from the regular expressions chapter,
16418
13:04:44,720 --> 13:04:50,720
to pull out an email address from the from colon line
16419
13:04:50,720 --> 13:04:54,720
somewhere in these headers, from colon right there.
16420
13:04:54,720 --> 13:04:58,720
It's going to go find a less than and then pull, oops, come on,
16421
13:04:58,720 --> 13:05:00,720
pull this stuff out up to it.
16422
13:05:00,720 --> 13:05:03,720
So you got the less than, you got the parenthesis,
16423
13:05:03,720 --> 13:05:06,720
you got one or more non-blank characters followed by the at sign,
16424
13:05:06,720 --> 13:05:08,720
followed by one or more non-blank characters.
16425
13:05:08,720 --> 13:05:11,720
And we'll get back a list of those.
16426
13:05:11,720 --> 13:05:13,720
We should only get one.
16427
13:05:13,720 --> 13:05:16,720
If we find one, we're going to grab the email.
16428
13:05:16,720 --> 13:05:18,720
We're going to strip it and lowercase it.
16429
13:05:18,720 --> 13:05:22,720
And if we got some little nasty less than sign in there,
16430
13:05:22,720 --> 13:05:24,720
we'll tolerate that as well.
16431
13:05:24,720 --> 13:05:26,720
So this is kind of clean up, and you get used to this
16432
13:05:26,720 --> 13:05:28,720
where you're like, oh, how come all those email addresses
16433
13:05:28,720 --> 13:05:32,720
have this other stuff in them?
16434
13:05:32,720 --> 13:05:36,720
And then we also look for it if there are no less than signs.
16435
13:05:36,720 --> 13:05:40,720
And we do this way, this is, and that's different.
16436
13:05:40,720 --> 13:05:43,720
Some mail messages have it this way, and others,
16437
13:05:43,720 --> 13:05:45,720
again, you write this code after you watch it for a while,
16438
13:05:45,720 --> 13:05:48,720
and you're like, oh, it's crapped out and giving me bad stuff.
16439
13:05:48,720 --> 13:05:51,720
And I make them all lower case so they match better
16440
13:05:51,720 --> 13:05:53,720
and I get rid of bad characters.
16441
13:05:53,720 --> 13:05:55,720
Now I got an email address.
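Here is a hedged sketch of the sender extraction walked through above: first try the form with angle brackets, then fall back to a bare address, then strip, lowercase, and drop any stray less-than sign. The exact patterns and the name extract_sender are assumptions written to match the description, not a copy of the course code.

```python
import re

def extract_sender(headers):
    """Pull a cleaned-up sender address out of the mail headers, or None."""
    # Form 1: From: Some Name <user@host>
    found = re.findall(r"\nFrom: .* <(\S+@\S+)>\n", headers)
    if len(found) != 1:
        # Form 2: From: user@host   (no less-than sign at all)
        found = re.findall(r"\nFrom: (\S+@\S+)\n", headers)
    if len(found) != 1:
        return None
    email = found[0].strip().lower()
    # Tolerate a stray less-than sign left in the address
    return email.replace("<", "")

with_brackets = "From x\nFrom: Jane Doe <Jane@Example.COM>\nSubject: t\n"
bare_address = "From x\nFrom: bob@example.org\nSubject: t\n"
print(extract_sender(with_brackets))
print(extract_sender(bare_address))
```

Lowercasing matters later on, when the same person's messages need to be grouped under one address.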
16442
13:05:55,720 --> 13:05:58,720
Then what I do is I look for the date of this.
16443
13:05:58,720 --> 13:06:00,720
So I'm going to graph these by date,
16444
13:06:00,720 --> 13:06:03,720
so I look for this line and use a regular expression
16445
13:06:03,720 --> 13:06:05,720
to pull that out.
16446
13:06:05,720 --> 13:06:08,720
So I'm looking for Date colon, followed by a blank,
16447
13:06:08,720 --> 13:06:13,720
followed by any number of characters, followed by a comma.
16448
13:06:13,720 --> 13:06:16,720
So I'm not interested in this Wednesday bit,
16449
13:06:16,720 --> 13:06:18,720
so I'm skipping that bit right there,
16450
13:06:18,720 --> 13:06:22,720
and going and grabbing everything after that comma space.
16451
13:06:22,720 --> 13:06:26,720
And so it's really here to the end of the line.
16452
13:06:26,720 --> 13:06:27,720
So that's the new line.
16453
13:06:27,720 --> 13:06:29,720
So it's going all the way.
16454
13:06:29,720 --> 13:06:31,720
It's going to pull this bit right here.
16455
13:06:31,720 --> 13:06:33,720
That's the text.
16456
13:06:33,720 --> 13:06:34,720
And this is where we're going to say,
16457
13:06:34,720 --> 13:06:36,720
oh, that's kind of a funky-looking date
16458
13:06:36,720 --> 13:06:38,720
and we want to standardize that date.
16459
13:06:38,720 --> 13:06:43,720
So we're going to, let's see.
16460
13:06:43,720 --> 13:06:46,720
Yeah, we're going to chop it off at the 26th character.
16461
13:06:46,720 --> 13:06:48,720
Apparently, I don't know what the 26th,
16462
13:06:48,720 --> 13:06:50,720
why we care about the 26th character,
16463
13:06:50,720 --> 13:06:52,720
but we chop that off at the 26th character.
16464
13:06:52,720 --> 13:06:54,720
And then we're going to parse it,
16465
13:06:54,720 --> 13:06:57,720
and that's going to give us back a nice clean date,
16466
13:06:57,720 --> 13:06:59,720
sent at date.
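A sketch of the date step, under stated assumptions: the pattern follows the description (Date colon, anything up to a comma, then the rest of the line), and the parsing here uses the standard library's email.utils.parsedate_to_datetime rather than the hand-written parser mentioned in the lecture. One possible reason for the 26-character cut, offered as a guess, is that a string like "14 Dec 2005 10:15:30 -0500" is exactly 26 characters, so anything trailing after the offset gets dropped.

```python
import re
from email.utils import parsedate_to_datetime

def extract_sent_at(headers):
    """Return the Date header as an ISO-format string, or None."""
    found = re.findall(r"\nDate: .*, (.*)\n", headers)
    if len(found) != 1:
        return None
    text = found[0].strip()
    text = text[:26]  # keep day, month, year, time, and numeric offset
    try:
        return parsedate_to_datetime(text).isoformat()
    except (TypeError, ValueError):
        return None   # unparseable date: the caller counts this as a failure

sample_headers = (
    "From x\n"
    "Date: Wed, 14 Dec 2005 10:15:30 -0500 (EST)\n"
    "Subject: t\n"
)
print(extract_sent_at(sample_headers))
```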
16467
13:06:59,720 --> 13:07:01,720
Otherwise, we're going to complain.
16468
13:07:01,720 --> 13:07:02,720
We're going to quit.
16469
13:07:02,720 --> 13:07:03,720
And if we can't parse it,
16470
13:07:03,720 --> 13:07:07,720
then we're going to tolerate five bad email addresses in a row.
16471
13:07:07,720 --> 13:07:09,720
Then we're looking for the subject line
16472
13:07:09,720 --> 13:07:12,720
using another regular expression.
16473
13:07:12,720 --> 13:07:14,720
Subject line, regular expression.
16474
13:07:14,720 --> 13:07:15,720
That's pretty easy.
16475
13:07:15,720 --> 13:07:17,720
Up to, but not including, right?
16476
13:07:17,720 --> 13:07:19,720
There's a blank there.
16477
13:07:19,720 --> 13:07:27,720
It's the subject.
16478
13:07:27,720 --> 13:07:28,720
Let me pull that out.
16479
13:07:28,720 --> 13:07:29,720
We get the subject.
16480
13:07:29,720 --> 13:07:31,720
Now, at this point, we've parsed it
16481
13:07:31,720 --> 13:07:32,720
and we've got good stuff,
16482
13:07:32,720 --> 13:07:33,720
so we reset the fail counter
16483
13:07:33,720 --> 13:07:34,720
because I kept saying,
16484
13:07:34,720 --> 13:07:37,720
if you fail five straight times, you quit.
16485
13:07:37,720 --> 13:07:39,720
And we're going to print it out,
16486
13:07:39,720 --> 13:07:41,720
and then we're just going to insert that stuff.
16487
13:07:41,720 --> 13:07:44,720
We've got the ID of the message,
16488
13:07:44,720 --> 13:07:48,720
which we've got the email address that it came from,
16489
13:07:48,720 --> 13:07:50,720
the time it was sent, the subject,
16490
13:07:50,720 --> 13:07:52,720
and then basically the headers and the body,
16491
13:07:52,720 --> 13:07:53,720
and we're just inserting it.
16492
13:07:53,720 --> 13:07:54,720
And now we're going to say,
16493
13:07:54,720 --> 13:07:56,720
every 50th we're going to commit it,
16494
13:07:56,720 --> 13:07:57,720
so that speeds things up,
16495
13:07:57,720 --> 13:07:59,720
and every 100th we're going to wait a second.
16496
13:07:59,720 --> 13:08:02,720
So that's, you know, count is going up, up, up, up, up,
16497
13:08:02,720 --> 13:08:05,720
and every 50th you'll see it pause,
16498
13:08:05,720 --> 13:08:07,720
and then it will, every 100th,
16499
13:08:07,720 --> 13:08:09,720
it'll pause for a second.
16500
13:08:09,720 --> 13:08:11,720
Mostly that's to let me hit control C
16501
13:08:11,720 --> 13:08:15,720
or to not overload any server.
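The commit-every-50 and pause-every-100 pattern can be sketched like this. The table layout, the function name store_messages, and the injectable pause argument are assumptions for illustration. The final commit after the loop is included so that a short run is not left uncommitted.

```python
import sqlite3
import time

def store_messages(conn, messages, commit_every=50, pause_every=100,
                   pause=time.sleep):
    """Insert (id, email) pairs, committing and pausing periodically."""
    count = 0
    for msg_id, email in messages:
        conn.execute(
            "INSERT OR IGNORE INTO Messages (id, email) VALUES (?, ?)",
            (msg_id, email),
        )
        count += 1
        if count % commit_every == 0:
            conn.commit()   # batching commits is faster than one per row
        if count % pause_every == 0:
            pause(1)        # a chance to hit Control-C; also eases server load
    conn.commit()           # make sure the tail of the run is saved too
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, email TEXT)")
stored = store_messages(conn, [(1, "a@example.com"), (2, "b@example.com")])
print(stored)
```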
16502
13:08:15,720 --> 13:08:17,720
Okay, so that's the simple one.
16503
13:08:17,720 --> 13:08:19,720
The problem is, is this data just gets ugly,
16504
13:08:19,720 --> 13:08:22,720
and so you'll find yourself wanting to reset this
16505
13:08:22,720 --> 13:08:23,720
and start it over.
16506
13:08:23,720 --> 13:08:26,720
This one's going to work, of course,
16507
13:08:26,720 --> 13:08:29,720
but it's, these are hard to build,
16508
13:08:29,720 --> 13:08:32,720
and that's why it's a good idea.
16509
13:08:32,720 --> 13:08:33,720
Oops.
16510
13:08:33,720 --> 13:08:34,720
python3
16511
13:08:36,720 --> 13:08:39,720
gmane.py.
16512
13:08:39,720 --> 13:08:41,720
How many messages?
16513
13:08:41,720 --> 13:08:43,720
Well, let's just do one.
16514
13:08:43,720 --> 13:08:44,720
Choo!
16515
13:08:44,720 --> 13:08:45,720
Okay, so it went and grabbed,
16516
13:08:45,720 --> 13:08:48,720
oh, do I have this already running?
16517
13:08:48,720 --> 13:08:50,720
51 through 52.
16518
13:08:50,720 --> 13:08:52,720
Let me start over.
16519
13:08:52,720 --> 13:08:56,720
ls -l *.sqlite.
16520
13:08:56,720 --> 13:08:57,720
Okay, rm content.
16521
13:08:57,720 --> 13:09:00,720
I must have run it to test it.
16522
13:09:00,720 --> 13:09:02,720
So, let's run it again.
16523
13:09:02,720 --> 13:09:06,720
Python gmane.py and ask for one message.
16524
13:09:06,720 --> 13:09:08,720
Okay, so there we went and got message one
16525
13:09:08,720 --> 13:09:09,720
from one to two.
16526
13:09:09,720 --> 13:09:11,720
We got 226 two characters,
16527
13:09:11,720 --> 13:09:13,720
and we printed out the email address,
16528
13:09:13,720 --> 13:09:15,720
the time we got it after all that hacking,
16529
13:09:15,720 --> 13:09:18,720
and the subject line, and that's what we got.
16530
13:09:18,720 --> 13:09:21,720
So, if we take a look at the database
16531
13:09:21,720 --> 13:09:24,720
and we go into the gmane,
16532
13:09:24,720 --> 13:09:28,720
oh, every time you see the content.sqlite-journal,
16533
13:09:28,720 --> 13:09:31,720
that means it needed to run a commit,
16534
13:09:31,720 --> 13:09:32,720
and it hasn't run a commit,
16535
13:09:32,720 --> 13:09:34,720
but I'll hit enter and that will do the commit,
16536
13:09:34,720 --> 13:09:36,720
and you see that vanish.
16537
13:09:36,720 --> 13:09:40,720
So, now I can open it and I take a look at,
16538
13:09:43,720 --> 13:09:45,720
how come there's no messages?
16539
13:09:45,720 --> 13:09:48,720
Did that one not get stored in there for some reason?
16540
13:09:48,720 --> 13:09:50,720
Hit refresh.
16541
13:09:54,720 --> 13:09:56,720
Huh, let's run it again.
16542
13:09:59,720 --> 13:10:01,720
Maybe it didn't commit.
16543
13:10:01,720 --> 13:10:03,720
Maybe it got a bug in it.
16544
13:10:03,720 --> 13:10:06,720
Let's make a change to the code.
16545
13:10:13,720 --> 13:10:17,720
I'm going to see this connection.commit.
16546
13:10:18,720 --> 13:10:19,720
See that?
16547
13:10:20,720 --> 13:10:22,720
Connection.commit.
16548
13:10:24,720 --> 13:10:26,720
Gonna commit there,
16549
13:10:26,720 --> 13:10:28,720
and the other thing I'm gonna do is,
16550
13:10:28,720 --> 13:10:30,720
every time I stop to read,
16551
13:10:30,720 --> 13:10:33,720
I'm gonna commit right before I read it.
16552
13:10:33,720 --> 13:10:35,720
So, I think we should, I hope that doesn't blow up.
16553
13:10:35,720 --> 13:10:36,720
We'll see.
16554
13:10:36,720 --> 13:10:39,720
So, the idea is, if I wanna stop,
16555
13:10:39,720 --> 13:10:40,720
I wanna commit it.
16556
13:10:40,720 --> 13:10:41,720
So, let's do this.
16557
13:10:41,720 --> 13:10:43,720
Let's do one message,
16558
13:10:44,720 --> 13:10:47,720
and now I should hit, is it committed?
16559
13:10:47,720 --> 13:10:48,720
Now that I've put the commits in,
16560
13:10:48,720 --> 13:10:51,720
I think that it will look better.
16561
13:10:54,720 --> 13:10:55,720
I can hit refresh,
16562
13:10:55,720 --> 13:10:57,720
and so there it is because I committed it,
16563
13:10:57,720 --> 13:11:00,720
and I don't have, yeah, I don't have the journal file,
16564
13:11:00,720 --> 13:11:01,720
so that's good.
16565
13:11:01,720 --> 13:11:03,720
So, that's a good idea to put those commits there.
16566
13:11:03,720 --> 13:11:05,720
So, I'll just leave those commits in.
16567
13:11:05,720 --> 13:11:08,720
When you download it, it'll have those commits in there.
16568
13:11:08,720 --> 13:11:10,720
So, again, I put a commit here,
16569
13:11:10,720 --> 13:11:14,720
and a commit at the very, very end,
16570
13:11:15,720 --> 13:11:18,720
to make sure, and then I, so, I missed that.
16571
13:11:19,720 --> 13:11:21,720
But now we get one, right?
16572
13:11:21,720 --> 13:11:22,720
And so, let's just run it again,
16573
13:11:22,720 --> 13:11:25,720
and you'll see how by selecting the max of the ID,
16574
13:11:25,720 --> 13:11:28,720
it's gonna select the max of this and then add one to it,
16575
13:11:28,720 --> 13:11:30,720
so it does the next one.
16576
13:11:30,720 --> 13:11:33,720
So, if I run it again,
16577
13:11:33,720 --> 13:11:36,720
I say, give me one message, so it goes two to three,
16578
13:11:36,720 --> 13:11:39,720
and give me two messages, right?
16579
13:11:39,720 --> 13:11:42,720
So, I hit enter, and I can do refresh,
16580
13:11:42,720 --> 13:11:45,720
and now you see we've got four messages, okay?
16581
13:11:45,720 --> 13:11:48,720
And so, let's just fire this baby up.
16582
13:11:48,720 --> 13:11:49,720
Tell it to get 100.
16583
13:11:50,720 --> 13:11:53,720
Er, run, run, run, run, run, run, run.
16584
13:11:53,720 --> 13:11:56,720
All right, it just goes and goes,
16585
13:11:56,720 --> 13:11:58,720
and it pauses once in a while to do a commit,
16586
13:11:58,720 --> 13:12:00,720
and if I made a commit every time,
16587
13:12:00,720 --> 13:12:03,720
oop, it just paused there, now it finished.
16588
13:12:03,720 --> 13:12:07,720
So, this'll run, and we will get a bunch of data.
16589
13:12:10,720 --> 13:12:12,720
The problem is, is if I just run this,
16590
13:12:12,720 --> 13:12:15,720
it'll take about five hours, okay,
16591
13:12:15,720 --> 13:12:17,720
to run this and get this all,
16592
13:12:17,720 --> 13:12:18,720
and I've got a really fast connection.
16593
13:12:18,720 --> 13:12:22,720
So, I have got a file that you can download,
16594
13:12:22,720 --> 13:12:25,720
let's go find it, let's see if I can,
16595
13:12:25,720 --> 13:12:28,720
let's see how long it'll take me to download this.
16596
13:12:28,720 --> 13:12:31,720
I've got a file that you can download and save.
16597
13:12:31,720 --> 13:12:36,720
Now, I'm gonna use the command line, curl,
16598
13:12:36,720 --> 13:12:39,720
or wget is another command that we Linux
16599
13:12:39,720 --> 13:12:41,720
and Mac people can use.
16600
13:12:41,720 --> 13:12:43,720
I don't know, you might have to use your browser to do it,
16601
13:12:43,720 --> 13:12:45,720
let's see how long this is gonna take.
16602
13:12:45,720 --> 13:12:49,720
Yum, it's retrieving, minute 30.
16603
13:12:49,720 --> 13:12:54,720
Okay, well, I'll just wait until this comes back.
16604
13:13:14,720 --> 13:13:16,720
Okay, so now that's done.
16605
13:13:16,720 --> 13:13:19,720
I was averaging 10 megabits a second.
16606
13:13:19,720 --> 13:13:23,720
I downloaded about 600 megabytes, 10 megabits a second.
16607
13:13:23,720 --> 13:13:26,720
That will probably be slower for you.
16608
13:13:26,720 --> 13:13:29,720
But, so now if I take a look,
16609
13:13:29,720 --> 13:13:34,720
you're gonna find that that content.sqlite
16610
13:13:34,720 --> 13:13:37,720
is 624 megabytes.
16611
13:13:37,720 --> 13:13:39,720
Now, what happens is I've pre-spidered this,
16612
13:13:39,720 --> 13:13:42,720
and so now if you run gmane.py
16613
13:13:42,720 --> 13:13:45,720
and ask for five more messages,
16614
13:13:45,720 --> 13:13:48,720
it will pick up where I left that one off.
16615
13:13:48,720 --> 13:13:51,720
So it's up to message 59,000.
16616
13:13:51,720 --> 13:13:53,720
And I think that, oh, you saw an error.
16617
13:13:53,720 --> 13:13:54,720
You saw a bug in that one.
16618
13:13:54,720 --> 13:13:56,720
I don't know what's wrong with that one.
16619
13:13:56,720 --> 13:13:58,720
So let's see if, so at this point,
16620
13:13:58,720 --> 13:14:00,720
we're gonna have most of the data.
16621
13:14:00,720 --> 13:14:04,720
It might find its way to the very end.
16622
13:14:04,720 --> 13:14:07,720
Once you get this, it should be not too much more.
16623
13:14:07,720 --> 13:14:13,720
I don't know, maybe it's like 63,000 or something.
16624
13:14:13,720 --> 13:14:16,720
So what we'll do is we will let that run,
16625
13:14:16,720 --> 13:14:20,720
and we will come back when that one's finished
16626
13:14:20,720 --> 13:14:25,720
and run the next phase after it's got all of its data, okay?
16627
13:14:25,720 --> 13:14:27,720
So thanks for listening.
16628
13:14:34,720 --> 13:14:36,720
The work that we're doing right now
16629
13:14:36,720 --> 13:14:38,720
is we are in the process of building
16630
13:14:38,720 --> 13:14:43,720
a retrieval and visualization tool for email data
16631
13:14:43,720 --> 13:14:46,720
that came originally from this website Gmane,
16632
13:14:46,720 --> 13:14:48,720
but I've got my own copy of it.
16633
13:14:48,720 --> 13:14:51,720
And so what we've done before is we ran gmane.py,
16634
13:14:51,720 --> 13:14:55,720
and I grabbed a URL.
16635
13:14:55,720 --> 13:14:59,720
I have a URL that has all this data,
16636
13:14:59,720 --> 13:15:03,720
and I downloaded that, and then I ran gmane again
16637
13:15:03,720 --> 13:15:07,720
to catch up, and so it took quite a bit of catching up.
16638
13:15:07,720 --> 13:15:09,720
But by the time I get to, remember how I said
16639
13:15:09,720 --> 13:15:12,720
it tries to fail five times?
16640
13:15:12,720 --> 13:15:16,720
Well, it ran out of data at 60,421,
16641
13:15:16,720 --> 13:15:19,720
and then it started failing, and then it quit.
16642
13:15:19,720 --> 13:15:22,720
So we pretty much have all of our data now.
16643
13:15:22,720 --> 13:15:28,720
We have finished this process in content.sqlite, okay?
16644
13:15:28,720 --> 13:15:34,720
And if I take a look in the database browser,
16645
13:15:34,720 --> 13:15:38,720
we can see we've got 59,823 email messages.
16646
13:15:38,720 --> 13:15:40,720
And so if I look at any of these things,
16647
13:15:40,720 --> 13:15:44,720
you see the headers, you see the subject line,
16648
13:15:44,720 --> 13:15:47,720
you see the email address, you see the body of it.
16649
13:15:47,720 --> 13:15:52,720
So remember I split the body in half and the headers.
16650
13:15:52,720 --> 13:15:56,720
And so I made this as raw as I possibly could
16651
13:15:56,720 --> 13:15:59,720
because as you saw, I had to spend so much time in the gmane
16652
13:15:59,720 --> 13:16:03,720
just getting the data successfully retrieved.
16653
13:16:03,720 --> 13:16:05,720
And so I don't like cleaning the data up too much.
16654
13:16:05,720 --> 13:16:07,720
And so what we're gonna look at next
16655
13:16:07,720 --> 13:16:11,720
is the data cleaning process, okay?
16656
13:16:11,720 --> 13:16:14,720
And so this is gmodel.py is the code
16657
13:16:14,720 --> 13:16:17,720
we're gonna take a look at now.
16658
13:16:17,720 --> 13:16:22,720
So let's get rid of those guys and look at gmodel.py.
16659
13:16:24,720 --> 13:16:27,720
I don't think I need urllib in this code.
16660
13:16:27,720 --> 13:16:33,720
Do I have any urllib?
16661
13:16:33,720 --> 13:16:36,720
No, so I don't need that, sorry.
16662
13:16:36,720 --> 13:16:38,720
Fix that.
16663
13:16:38,720 --> 13:16:41,720
Okay, so it's gonna read from the database,
16664
13:16:41,720 --> 13:16:43,720
it's gonna use regular expressions,
16665
13:16:43,720 --> 13:16:45,720
and zlib is a way to do some compression.
16666
13:16:45,720 --> 13:16:46,720
And so I'm gonna do, in this one,
16667
13:16:46,720 --> 13:16:48,720
I'm gonna compress some of the data
16668
13:16:48,720 --> 13:16:50,720
to make it so that I have less data to,
16669
13:16:50,720 --> 13:16:52,720
some of the text fields are gonna be compressed.
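For a sense of what zlib compression of a text field looks like, here is a minimal round-trip sketch. The sample text is made up, and how much space is saved depends entirely on the data.

```python
import zlib

# Repetitive text, like quoted mail threads, tends to compress well
body = "Hello Sakai developers,\n" * 50

compressed = zlib.compress(body.encode("utf-8"))         # str -> bytes -> compressed bytes
restored = zlib.decompress(compressed).decode("utf-8")   # and back again

print(len(body), len(compressed))
print(restored == body)
```

The compressed value is bytes rather than text, which is why it has to be stored in a binary column.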
16670
13:16:52,720 --> 13:16:54,720
I wanted to keep these fields uncompressed
16671
13:16:54,720 --> 13:16:56,720
inside of messages.
16672
13:16:56,720 --> 13:17:01,720
And so we have some just cleanup messages
16673
13:17:01,720 --> 13:17:03,720
and cleans things up.
16674
13:17:03,720 --> 13:17:06,720
And it turns out that the way email addresses
16675
13:17:06,720 --> 13:17:11,720
in this particular mail corpus, they changed over time
16676
13:17:11,720 --> 13:17:14,720
and there's certain kinds of things.
16677
13:17:14,720 --> 13:17:16,720
Sometimes gmane.org is the email address
16678
13:17:16,720 --> 13:17:19,720
when people wanna hide their address.
16679
13:17:19,720 --> 13:17:22,720
And I made all kinds of stuff and I split it
16680
13:17:22,720 --> 13:17:24,720
and checked to see if it ended with this.
16681
13:17:24,720 --> 13:17:28,720
And I cleaned up things, just that kind of thing.
16682
13:17:28,720 --> 13:17:32,720
And so I have all kinds of cleanup stuff going on in here.
16683
13:17:32,720 --> 13:17:34,720
And I have this mapping and DNS mapping
16684
13:17:34,720 --> 13:17:36,720
that I'll talk about in a bit
16685
13:17:36,720 --> 13:17:39,720
where organizations sometimes sent email
16686
13:17:39,720 --> 13:17:41,720
with different addresses over time
16687
13:17:41,720 --> 13:17:44,720
and people sent email from different time.
16688
13:17:44,720 --> 13:17:47,720
And we're gonna do the parsing of the date
16689
13:17:47,720 --> 13:17:50,720
and that is the code for that.
16690
13:17:50,720 --> 13:17:53,720
I'm gonna pull out the header information.
16691
13:17:53,720 --> 13:17:58,720
This is sort of borrowed from the other code.
16692
13:17:59,720 --> 13:18:04,720
We'll clean up the email addresses and the domain names.
16693
13:18:04,720 --> 13:18:07,720
And we'll pull the date out, pull the subject out,
16694
13:18:07,720 --> 13:18:10,720
pull out the message ID, various things.
16695
13:18:10,720 --> 13:18:13,720
So here's the main body of the code.
16696
13:18:13,720 --> 13:18:18,720
We're going to go from content.sqlite to index.sqlite.
16697
13:18:18,720 --> 13:18:20,720
And what I'm gonna do every time
16698
13:18:20,720 --> 13:18:22,720
is I'm gonna wipe out index.sqlite
16699
13:18:22,720 --> 13:18:25,720
and drop the messages, senders, subjects, and replies.
16700
13:18:25,720 --> 13:18:27,720
So this is a normalized database
16701
13:18:27,720 --> 13:18:29,720
in that it has foreign keys.
16702
13:18:29,720 --> 13:18:31,720
So there's a messages table here
16703
13:18:31,720 --> 13:18:34,720
with an integer primary key, the GUID for it.
16704
13:18:34,720 --> 13:18:37,720
The GUID stands for globally unique ID,
16705
13:18:37,720 --> 13:18:41,720
sender ID, and it's gonna have a blob.
16706
13:18:41,720 --> 13:18:43,720
These are blobs, binary large objects,
16707
13:18:43,720 --> 13:18:44,720
for the headers and the body
16708
13:18:44,720 --> 13:18:46,720
because I'm gonna compress them in this database
16709
13:18:46,720 --> 13:18:48,720
to make them smaller.
16710
13:18:48,720 --> 13:18:53,720
And then the senders, each sender has a key
16711
13:18:53,720 --> 13:18:57,720
and then each subject line is gonna have a key
16712
13:18:57,720 --> 13:19:02,720
and then replies are a connection from one message to another.
16713
13:19:02,720 --> 13:19:04,720
And so this is like a many to many.
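A minimal sketch of the table layout just described, using an in-memory SQLite database. The four table names come from the walkthrough; the exact column names and types are assumptions for illustration and may differ from the real script.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Column names are assumptions based on the spoken description.
cur.executescript("""
CREATE TABLE senders  (id INTEGER PRIMARY KEY, sender TEXT UNIQUE);
CREATE TABLE subjects (id INTEGER PRIMARY KEY, subject TEXT UNIQUE);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    guid TEXT UNIQUE,
    sent_at INTEGER,
    sender_id INTEGER,   -- foreign key into senders
    subject_id INTEGER,  -- foreign key into subjects
    headers BLOB,        -- compressed, so stored as binary
    body BLOB            -- compressed, so stored as binary
);
CREATE TABLE replies (from_id INTEGER, to_id INTEGER);
""")

# List the tables that now exist.
tables = sorted(row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
```

The replies table holds pairs of message keys, which is what makes it a many-to-many connection between messages.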
16714
13:19:04,720 --> 13:19:07,720
Now, I also have this file called mapping.sqlite
16715
13:19:07,720 --> 13:19:12,720
and so we can take a look at that one, mapping.sqlite.
16716
13:19:12,720 --> 13:19:17,720
And so what happened is this has two tables
16717
13:19:17,720 --> 13:19:19,720
that I deal with by hand.
16718
13:19:19,720 --> 13:19:21,720
And so sometimes in the end,
16719
13:19:21,720 --> 13:19:24,720
this was an email address that mapped to that.
16720
13:19:24,720 --> 13:19:27,720
So Indiana.edu, that's a way to take
16721
13:19:27,720 --> 13:19:29,720
an at's the email address.
16722
13:19:29,720 --> 13:19:31,720
And then these were a bunch of people
16723
13:19:31,720 --> 13:19:35,720
that had email addresses changing throughout the project
16724
13:19:35,720 --> 13:19:38,720
and I sort of kind of mapped them in a way.
16725
13:19:38,720 --> 13:19:40,720
And so this is just sort of like,
16726
13:19:40,720 --> 13:19:42,720
I pull this in really quick
16727
13:19:42,720 --> 13:19:46,720
and I read all this stuff from the DNS mapping
16728
13:19:46,720 --> 13:19:50,720
and I, other than stripping and making this lowercase,
16729
13:19:50,720 --> 13:19:54,720
et cetera, I just am gonna make a dictionary.
16730
13:19:54,720 --> 13:19:57,720
DNS mapping, which is the old name to the new name
16731
13:19:57,720 --> 13:20:00,720
and the email address mapping
16732
13:20:00,720 --> 13:20:02,720
from the old name to the new name
16733
13:20:02,720 --> 13:20:03,720
and I'm using fixsender.
16734
13:20:03,720 --> 13:20:05,720
Fixsender is because the email addresses
16735
13:20:05,720 --> 13:20:08,720
even within gmane were kind of funky.
16736
13:20:08,720 --> 13:20:12,720
So don't worry so much about this.
16737
13:20:12,720 --> 13:20:15,720
Okay, and so now what I'm gonna do is
16738
13:20:15,720 --> 13:20:18,720
I opened up a connection just to read all that stuff in
16739
13:20:18,720 --> 13:20:21,720
and now I'm going to actually open the main content
16740
13:20:21,720 --> 13:20:23,720
and I'm making this a little trickier.
16741
13:20:23,720 --> 13:20:25,720
I open that read only.
16742
13:20:25,720 --> 13:20:28,720
That was so that I could potentially be running the spider
16743
13:20:28,720 --> 13:20:30,720
and running this at the same time.
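Opening a SQLite file read-only can be sketched like this. It assumes the standard `sqlite3` module and a `file:` URI with `mode=ro`; the temporary file name is hypothetical and only stands in for content.sqlite.

```python
import os
import sqlite3
import tempfile

# Build a throwaway database file to stand in for content.sqlite.
path = os.path.join(tempfile.mkdtemp(), "content_demo.sqlite")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
writer.execute("INSERT INTO messages (body) VALUES ('hello')")
writer.commit()
writer.close()

# mode=ro in a file: URI asks SQLite for a read-only connection.
reader = sqlite3.connect("file:" + path + "?mode=ro", uri=True)
rows = reader.execute("SELECT id, body FROM messages").fetchall()

# Any write through this connection is refused.
try:
    reader.execute("INSERT INTO messages (body) VALUES ('nope')")
    write_refused = False
except sqlite3.OperationalError:
    write_refused = True
reader.close()
print(rows, write_refused)
```

Because the reader never takes a write lock, another process can keep adding rows to the same file while this one reads.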
16744
13:20:30,720 --> 13:20:32,720
I get a cursor.
16745
13:20:32,720 --> 13:20:35,720
And so I'm gonna read through,
16746
13:20:35,720 --> 13:20:38,720
so in the content file, this is the big one,
16747
13:20:38,720 --> 13:20:40,720
I'm gonna read through and go through every one
16748
13:20:40,720 --> 13:20:43,720
and write all of these things in.
16749
13:20:43,720 --> 13:20:47,720
And I'm gonna take all the email addresses
16750
13:20:47,720 --> 13:20:51,720
and I'm going to put those in a list.
16751
13:20:52,720 --> 13:20:55,720
So I loaded that, I've got the mappings loaded
16752
13:20:55,720 --> 13:20:59,720
and so now I'm going to go through every single message.
16753
13:20:59,720 --> 13:21:01,720
I got all the senders, all the subjects
16754
13:21:01,720 --> 13:21:04,720
and all the global unique IDs.
16755
13:21:04,720 --> 13:21:06,720
So I read in each message.
16756
13:21:06,720 --> 13:21:10,720
So now I'm going through content one at a time.
16757
13:21:10,720 --> 13:21:14,720
I parse the headers.
16758
13:21:16,720 --> 13:21:21,720
I check to see if the sender's name, email address,
16759
13:21:21,720 --> 13:21:25,720
after it's been cleaned up, is in my mapping.
16760
13:21:25,720 --> 13:21:28,720
mapping.get(sender, sender): the default is I get sender back.
16761
13:21:28,720 --> 13:21:30,720
That's what that's saying.
16762
13:21:30,720 --> 13:21:32,720
Look up sender; if it's in there,
16763
13:21:32,720 --> 13:21:34,720
give me the entry of that key,
16764
13:21:34,720 --> 13:21:37,720
otherwise give me sender back.
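The lookup just described can be sketched with `dict.get` and a default. The addresses and the helper name `canonical` are hypothetical.

```python
# Hypothetical mapping from an old address to the current one.
mapping = {"old.name@example.edu": "new.name@example.org"}

def canonical(sender):
    # dict.get returns the mapped value when the key exists,
    # otherwise the default, which here is the sender itself.
    return mapping.get(sender, sender)

known = canonical("old.name@example.edu")
unknown = canonical("someone@example.com")
print(known, unknown)
```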
16765
13:21:37,720 --> 13:21:41,720
We're gonna print every 250 things we do.
16766
13:21:41,720 --> 13:21:44,720
We'll complain if this is true.
16767
13:21:44,720 --> 13:21:47,720
We're gonna go get the mapping between the senders
16768
13:21:47,720 --> 13:21:49,720
which is a way to look up the primary key.
16769
13:21:49,720 --> 13:21:51,720
I could have done this with a database thing
16770
13:21:51,720 --> 13:21:53,720
but I wanted it to be fast.
16771
13:21:53,720 --> 13:21:55,720
So that's part of the reason I read all these things in
16772
13:21:55,720 --> 13:21:58,720
so I could have those mappings to be really fast.
16773
13:21:58,720 --> 13:22:00,720
You'll see this takes a little while
16774
13:22:00,720 --> 13:22:04,720
even though I got all this stuff cached.
16775
13:22:04,720 --> 13:22:08,720
And so then if I don't have a sender ID,
16776
13:22:08,720 --> 13:22:10,720
meaning that I haven't seen it yet,
16777
13:22:10,720 --> 13:22:13,720
then I'm gonna do an insert or ignore into senders
16778
13:22:13,720 --> 13:22:15,720
and then I'm gonna do a select
16779
13:22:15,720 --> 13:22:18,720
and then you've seen this where I grab the row back
16780
13:22:18,720 --> 13:22:22,720
and I'm really just trying to look at the recently assigned ID
16781
13:22:22,720 --> 13:22:26,720
and then I'm going to not only set the sender ID
16782
13:22:26,720 --> 13:22:29,720
for this iteration loop but I'm also gonna store it
16783
13:22:29,720 --> 13:22:33,720
in the dictionary and so that builds this dictionary up.
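A sketch of the insert-then-select pattern with an in-memory cache, assuming a senders table with a UNIQUE sender column. The helper name `sender_id_for` and the addresses are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE senders (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)")

senders = {}  # in-memory cache: sender address -> primary key

def sender_id_for(sender):
    sender_id = senders.get(sender)
    if sender_id is None:
        # INSERT OR IGNORE does nothing if the UNIQUE sender already exists.
        cur.execute("INSERT OR IGNORE INTO senders (sender) VALUES (?)", (sender,))
        # Read back the key, whether it was just assigned or already there.
        cur.execute("SELECT id FROM senders WHERE sender = ? LIMIT 1", (sender,))
        sender_id = cur.fetchone()[0]
        senders[sender] = sender_id  # remember it for the next message
    return sender_id

first = sender_id_for("a@example.org")
second = sender_id_for("b@example.org")
again = sender_id_for("a@example.org")  # served from the dictionary
print(first, second, again)
```

The third call never touches the database, which is the point of building the dictionary up as the loop runs.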
16784
13:22:33,720 --> 13:22:37,720
And you'll see the same thing is true for subject ID.
16785
13:22:37,720 --> 13:22:39,720
I'm gonna insert it into the subjects table
16786
13:22:39,720 --> 13:22:42,720
and get a primary key if I don't know what it is
16787
13:22:42,720 --> 13:22:44,720
and then I'm gonna put it into,
16788
13:22:44,720 --> 13:22:46,720
not only am I going to put it into the database
16789
13:22:46,720 --> 13:22:50,720
but I'm also gonna put it into my dictionary.
16790
13:22:50,720 --> 13:22:54,720
And the same thing, I guess I didn't do it for the GUID.
16791
13:22:54,720 --> 13:22:56,720
Okay.
16792
13:22:56,720 --> 13:23:00,720
So now what I have is the sender ID and the subject ID
16793
13:23:00,720 --> 13:23:03,720
which are foreign keys into the sender table and the subject table
16794
13:23:03,720 --> 13:23:07,720
and I'm gonna insert the message with the sender ID, subject ID,
16795
13:23:07,720 --> 13:23:09,720
the sent at, headers, and body.
16796
13:23:09,720 --> 13:23:16,720
And the values here are the GUID, sender ID, subject ID, sent at.
16797
13:23:16,720 --> 13:23:19,720
Now this here is zlib.compress.
16798
13:23:19,720 --> 13:23:27,720
So what I'm taking is the message, the header, and the body
16799
13:23:27,720 --> 13:23:31,720
and this little bit ends up with a compressed version of this stuff
16800
13:23:31,720 --> 13:23:33,720
and you'll see it in a second.
16801
13:23:33,720 --> 13:23:35,720
And this keeps the size of these text things
16802
13:23:35,720 --> 13:23:38,720
down at the cost of the computation of,
16803
13:23:38,720 --> 13:23:42,720
we have to, at the cost of the computation
16804
13:23:42,720 --> 13:23:44,720
to compress and decompress when we want to read it.
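The trade-off can be seen in a small round trip with the standard `zlib` module. The sample text is made up, and real message bodies will compress by different amounts.

```python
import zlib

body = "This is the body of a message. " * 50

# zlib works on bytes, so encode the text first.
compressed = zlib.compress(body.encode("utf-8"))

# Reading the text back costs a decompress and a decode.
restored = zlib.decompress(compressed).decode("utf-8")
print(len(body), len(compressed))
```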
16805
13:23:44,720 --> 13:23:46,720
Okay.
16806
13:23:46,720 --> 13:23:52,720
And then I pull the GUIDs out, the ID which is the GUID
16807
13:23:52,720 --> 13:23:56,720
and I pull out the primary key for this thing based on the GUID.
16808
13:23:56,720 --> 13:23:59,720
And I update this dictionary.
16809
13:23:59,720 --> 13:24:00,720
Okay.
16810
13:24:00,720 --> 13:24:04,720
So let me run that code.
16811
13:24:04,720 --> 13:24:06,720
It is doing a lot of cleanup
16812
13:24:06,720 --> 13:24:08,720
and I'll tell you it took me a long time to make this work.
16813
13:24:08,720 --> 13:24:13,720
So just, so this code that I'm running now,
16814
13:24:13,720 --> 13:24:17,720
oh, don't forget to type python3, Chuck.
16815
13:24:17,720 --> 13:24:21,720
So this is gonna run every 250.
16816
13:24:21,720 --> 13:24:23,720
So it did all this precaching.
16817
13:24:23,720 --> 13:24:25,720
So that's how long it takes to do 250.
16818
13:24:25,720 --> 13:24:27,720
Now there's 60,000 in here.
16819
13:24:27,720 --> 13:24:30,720
And so this is really busy.
16820
13:24:30,720 --> 13:24:32,720
The reason it's bouncing back and forth is that
16821
13:24:32,720 --> 13:24:34,720
every time it makes this journal file, that's,
16822
13:24:34,720 --> 13:24:35,720
and then does a commit.
16823
13:24:35,720 --> 13:24:38,720
So you can kind of see that it's,
16824
13:24:38,720 --> 13:24:40,720
it's busy making journal files and committing
16825
13:24:40,720 --> 13:24:44,720
and there's a lot of activity going on here.
16826
13:24:44,720 --> 13:24:48,720
It just so happens that Atom shows me these files.
16827
13:24:48,720 --> 13:24:57,720
Okay, so it finished.
16828
13:24:57,720 --> 13:25:00,720
It took about three minutes to finish that, right?
16829
13:25:00,720 --> 13:25:06,720
And so if we take a look at the size of the files,
16830
13:25:06,720 --> 13:25:09,720
we will see that the index is much smaller.
16831
13:25:09,720 --> 13:25:11,720
It's fully normalized.
16832
13:25:11,720 --> 13:25:13,720
It's still 263 megabytes.
16833
13:25:13,720 --> 13:25:14,720
It's all compressed.
16834
13:25:14,720 --> 13:25:21,720
So let's take a look at that in the browser.
16835
13:25:21,720 --> 13:25:23,720
So it's 200 megabytes.
16836
13:25:23,720 --> 13:25:28,720
But it loads up a lot faster.
16837
13:25:28,720 --> 13:25:30,720
There we go.
16838
13:25:30,720 --> 13:25:33,720
So we have a senders table, right?
16839
13:25:33,720 --> 13:25:36,720
Which is just kind of a many to one table.
16840
13:25:36,720 --> 13:25:40,720
We have a subjects table, which is a many to one table.
16841
13:25:40,720 --> 13:25:43,720
And we have messages, which has foreign keys.
16842
13:25:43,720 --> 13:25:46,720
It takes a little bit to load that up.
16843
13:25:46,720 --> 13:25:51,720
Okay, and so we see the foreign keys for sender and subject
16844
13:25:51,720 --> 13:25:53,720
and that saves us.
16845
13:25:53,720 --> 13:25:55,720
All those foreign keys save us.
16846
13:25:55,720 --> 13:25:58,720
And so we have, you can kind of see that I can't see the headers and the body
16847
13:25:58,720 --> 13:26:00,720
because now they're compressed.
16848
13:26:00,720 --> 13:26:03,720
That saves me a whole bunch of stuff, right?
16849
13:26:03,720 --> 13:26:06,720
It saved me a whole bunch of stuff.
16850
13:26:06,720 --> 13:26:10,720
And so that's what's in that file.
16851
13:26:10,720 --> 13:26:15,720
And that, we've finished this process, okay?
16852
13:26:15,720 --> 13:26:19,720
And we've finished modeling the data and making it really clean.
16853
13:26:19,720 --> 13:26:22,720
And we'll pick back up and the rest of the stuff we will do
16854
13:26:22,720 --> 13:26:26,720
is actually visualizing, pulling data out of index.sqlite.
16855
13:26:26,720 --> 13:26:28,720
The idea is this can be restarted.
16856
13:26:28,720 --> 13:26:30,720
This can be run over and over and over.
16857
13:26:30,720 --> 13:26:32,720
Even though it takes like three minutes to run this,
16858
13:26:32,720 --> 13:26:35,720
that's way better than five hours to run this.
16859
13:26:35,720 --> 13:26:37,720
So three minutes, five hours.
16860
13:26:37,720 --> 13:26:41,720
And then you'll see, and we'll see now reading this is in seconds
16861
13:26:41,720 --> 13:26:45,720
because we got it all nice and normalized in a quite pretty way.
16862
13:26:45,720 --> 13:26:48,720
So I hope this has been useful.
16863
13:26:48,720 --> 13:26:51,720
In the next one, we'll actually do the visualization.
16864
13:26:56,720 --> 13:27:00,720
We are in the process of retrieving data from this gmane server,
16865
13:27:00,720 --> 13:27:03,720
one that I've made a copy of.
16866
13:27:03,720 --> 13:27:08,720
And we have, so far, spidered it all, ended up with 600 megabytes
16867
13:27:08,720 --> 13:27:09,720
of spidered information.
16868
13:27:09,720 --> 13:27:12,720
We have run a rather complex cleanup process
16869
13:27:12,720 --> 13:27:15,720
that you probably don't need to fully understand.
16870
13:27:15,720 --> 13:27:17,720
You can look at it for patterns.
16871
13:27:17,720 --> 13:27:22,720
But in general, the cleanup process will be very sensitive to the data.
16872
13:27:22,720 --> 13:27:27,720
And then we have this index.sqlite, which is 260 megabytes right now.
16873
13:27:27,720 --> 13:27:32,720
And we are going to now do the easy, the fun, easy bits here
16874
13:27:32,720 --> 13:27:35,720
where we're going to run little queries that just pull data out.
16875
13:27:35,720 --> 13:27:37,720
And so these are much simpler.
16876
13:27:37,720 --> 13:27:40,720
So part of what I wrote when I was doing this
16877
13:27:40,720 --> 13:27:45,720
is I wanted to do some simple, basic calculations on the data
16878
13:27:45,720 --> 13:27:48,720
to make sure I really was sort of looking for anomalies, right?
16879
13:27:48,720 --> 13:27:50,720
What was working, what wasn't working.
16880
13:27:50,720 --> 13:27:55,720
So I wrote a series of really simple things like this gbasic,
16881
13:27:55,720 --> 13:27:59,720
the gbasic code, just to give me some basic data, right?
16882
13:27:59,720 --> 13:28:02,720
So I wrote things down and I counted things.
16883
13:28:02,720 --> 13:28:06,720
And so, do I need urllib.request in this one?
16884
13:28:06,720 --> 13:28:07,720
I don't think so.
16885
13:28:07,720 --> 13:28:09,720
Let's fix that bug.
16886
13:28:09,720 --> 13:28:10,720
It's not there.
16887
13:28:10,720 --> 13:28:12,720
No reason to put any of that stuff in there.
16888
13:28:12,720 --> 13:28:17,720
So it just, it reads that index.sqlite, which is our cleaned up data.
16889
13:28:17,720 --> 13:28:20,720
It reads through and makes a dictionary of this pattern.
16890
13:28:20,720 --> 13:28:26,720
You're going to see a lot where I'm going to make a dictionary of ID to senders,
16891
13:28:26,720 --> 13:28:29,720
save myself repeatedly looking at things.
16892
13:28:29,720 --> 13:28:30,720
I'm going to grab the subjects.
16893
13:28:30,720 --> 13:28:31,720
I've cached them all.
16894
13:28:31,720 --> 13:28:37,720
I could have done this all with SQL, but I just wanted to do things faster.
16895
13:28:37,720 --> 13:28:41,720
And now I'm going to go through each of these messages
16896
13:28:41,720 --> 13:28:43,720
and make a dictionary of them.
16897
13:28:43,720 --> 13:28:45,720
I'm going to put a lot of stuff in memory.
16898
13:28:45,720 --> 13:28:47,720
And then I'm going to do some counts.
16899
13:28:47,720 --> 13:28:50,720
I'm going to see who has sent the most, right?
16900
13:28:50,720 --> 13:28:52,720
The organizations.
16901
13:28:52,720 --> 13:28:58,720
And so now I've got to go through all the messages.
16902
13:28:58,720 --> 13:29:04,720
I am not actually, so you'll notice that I'm not selecting the body or the headers here.
16903
13:29:04,720 --> 13:29:08,720
I am just getting sender ID, subject ID.
16904
13:29:08,720 --> 13:29:10,720
I probably could have done this with a join.
16905
13:29:10,720 --> 13:29:11,720
It would have been cleaner.
16906
13:29:11,720 --> 13:29:12,720
You can do that.
16907
13:29:12,720 --> 13:29:14,720
You can make that change.
16908
13:29:14,720 --> 13:29:17,720
Do that with a join so it's cleaner.
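A sketch of what the join version could look like, assuming messages.sender_id points at senders.id. The table contents and column names here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE senders  (id INTEGER PRIMARY KEY, sender TEXT UNIQUE);
CREATE TABLE messages (id INTEGER PRIMARY KEY, sender_id INTEGER);
INSERT INTO senders (id, sender) VALUES (1, 'a@example.org'), (2, 'b@example.com');
INSERT INTO messages (sender_id) VALUES (1), (2), (1);
""")

# One query returns each message with its sender text,
# so no Python-side dictionary of id -> sender is needed.
cur.execute("""
SELECT messages.id, senders.sender
FROM messages JOIN senders ON messages.sender_id = senders.id
ORDER BY messages.id
""")
joined = cur.fetchall()
print(joined)
```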
16909
13:29:17,720 --> 13:29:22,720
And so I'm going through all the messages except not the body.
16910
13:29:22,720 --> 13:29:24,720
So this is going to be really quick.
16911
13:29:24,720 --> 13:29:27,720
And I'm pulling out the sender ID.
16912
13:29:27,720 --> 13:29:29,720
I'm breaking the sender into pieces.
16913
13:29:29,720 --> 13:29:30,720
See, my data is clean now.
16914
13:29:30,720 --> 13:29:33,720
I cleaned it all up in the previous processes.
16915
13:29:33,720 --> 13:29:37,720
And if I don't have two pieces, I continue and I get the domain name.
16916
13:29:37,720 --> 13:29:38,720
So I have the person.
16917
13:29:38,720 --> 13:29:45,720
I'm doing a basic dictionary histogram for the people and the domains.
16918
13:29:45,720 --> 13:29:53,720
And then I'm going to sort them with a sorted.
16919
13:29:53,720 --> 13:29:55,720
And we're going to grab the key.
16920
13:29:55,720 --> 13:29:59,720
We're going to sort it by how many there are, in reverse.
16921
13:29:59,720 --> 13:30:05,720
And then print out the top few of the organizations and the top few of the people.
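The histogram and the reverse sort can be sketched as follows. The sender list is invented and the variable names are assumptions, not necessarily those in gbasic.py.

```python
senders = ["ann@example.edu", "bob@example.com", "ann@example.edu",
           "cy@example.edu", "malformed-address"]

sendcounts = {}  # messages per person
sendorgs = {}    # messages per domain
for sender in senders:
    pieces = sender.split("@")
    if len(pieces) != 2:
        continue  # skip anything that is not name@domain
    dns = pieces[1]
    sendcounts[sender] = sendcounts.get(sender, 0) + 1
    sendorgs[dns] = sendorgs.get(dns, 0) + 1

# Sort the keys by their counts, largest first.
top_people = sorted(sendcounts, key=sendcounts.get, reverse=True)
top_orgs = sorted(sendorgs, key=sendorgs.get, reverse=True)
print(top_people[:2], top_orgs[:2])
```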
16922
13:30:05,720 --> 13:30:06,720
OK?
16923
13:30:06,720 --> 13:30:08,720
So we'll just run that code.
16924
13:30:08,720 --> 13:30:12,720
Python gbasic.py.
16925
13:30:12,720 --> 13:30:14,720
Let's dump out the top 10.
16926
13:30:14,720 --> 13:30:20,720
So we loaded 59,000 messages, 29,000 subjects, and 1,800 senders,
16927
13:30:20,720 --> 13:30:25,720
and figured out the top 10 people and the top 10 organizations.
16928
13:30:25,720 --> 13:30:31,720
And you can write various things like that that just sort of scream through your data
16929
13:30:31,720 --> 13:30:35,720
and it's good to get sanity checking on your data.
16930
13:30:35,720 --> 13:30:36,720
OK?
16931
13:30:36,720 --> 13:30:38,720
So that's gbasic.
16932
13:30:38,720 --> 13:30:43,720
Now I want to do gword.py because that's kind of fun.
16933
13:30:43,720 --> 13:30:45,720
gword.py.
16934
13:30:45,720 --> 13:30:47,720
I don't need urllib.
16935
13:30:47,720 --> 13:30:49,720
Why do I keep putting urllib in all these things?
16936
13:30:49,720 --> 13:30:51,720
So we'll get rid of that.
16937
13:30:51,720 --> 13:30:56,720
So this is really simple because I'm just going to go for the words in the subject line.
16938
13:30:56,720 --> 13:30:59,720
And so I go through index.sqlite.
16939
13:30:59,720 --> 13:31:03,720
I read in all of the subjects.
16940
13:31:03,720 --> 13:31:06,720
And I make a dictionary of those.
16941
13:31:06,720 --> 13:31:09,720
And then I go and find all the subjects.
16942
13:31:09,720 --> 13:31:12,720
And then I'm doing this code right here.
16943
13:31:12,720 --> 13:31:19,720
I'm pulling out the subject based on the message.
16944
13:31:19,720 --> 13:31:22,720
And I'm doing this so that when the subjects are used more than once,
16945
13:31:22,720 --> 13:31:25,720
I count the words more than once.
16946
13:31:25,720 --> 13:31:30,720
str.maketrans, I talked about that in an earlier chapter.
16947
13:31:30,720 --> 13:31:34,720
This basically throws away punctuation and numbers
16948
13:31:34,720 --> 13:31:38,720
so that when I make my words, I don't end up with words that are like dashes.
16949
13:31:38,720 --> 13:31:40,720
It compresses them down.
16950
13:31:40,720 --> 13:31:42,720
Then I strip it.
16951
13:31:42,720 --> 13:31:44,720
I convert everything to lowercase.
16952
13:31:44,720 --> 13:31:47,720
This is basically just to keep too many words from showing up.
16953
13:31:47,720 --> 13:31:48,720
Then I do a split.
16954
13:31:48,720 --> 13:31:51,720
And then I've got counts, a dictionary.
16955
13:31:51,720 --> 13:31:59,720
So this is a no punctuation, no numbers dictionary count.
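A sketch of that word count, assuming a single translation table that deletes both punctuation and digits. The subject lines are invented, and the real script may structure the translate step differently.

```python
import string

subjects = ["Re: Build #42 fails -- help!", "Build fixed", "re: build 42"]

# The third argument to str.maketrans lists characters to delete.
strip_table = str.maketrans("", "", string.punctuation + string.digits)

counts = {}
for subject in subjects:
    # Delete punctuation and digits, trim, and fold to lowercase.
    text = subject.translate(strip_table).strip().lower()
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
print(counts)
```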
16956
13:31:59,720 --> 13:32:04,720
And then I just take the and do a dictionary.
16957
13:32:04,720 --> 13:32:08,720
And then I sort them in reverse order.
16958
13:32:08,720 --> 13:32:13,720
And I figure out what the highest and lowest is by running through a,
16959
13:32:13,720 --> 13:32:20,720
I could have probably done this with a max and a min if I felt like it.
16960
13:32:20,720 --> 13:32:24,720
And so now I have the highest and the lowest.
16961
13:32:24,720 --> 13:32:27,720
I should have done a max and a min on that one.
16962
13:32:27,720 --> 13:32:28,720
Why did I do that?
16963
13:32:28,720 --> 13:32:31,720
But oh well.
16964
13:32:31,720 --> 13:32:34,720
And now I've got to spread out the size.
16965
13:32:34,720 --> 13:32:39,720
And so I'm going to produce this file gword.js, which is needed by the visualization
16966
13:32:39,720 --> 13:32:45,720
because it's going to use d3.js, a word visualizer, and gword.js.
16967
13:32:45,720 --> 13:32:47,720
I have to tell it how big the text is.
16968
13:32:47,720 --> 13:32:50,720
And so I'm doing some text normalization.
16969
13:32:50,720 --> 13:32:52,720
Took me a little experimentation.
16970
13:32:52,720 --> 13:33:02,720
So if I run this now, and I say python gword.js,
16971
13:33:02,720 --> 13:33:07,720
and I say python3 gword.py, which is a lot better.
16972
13:33:07,720 --> 13:33:13,720
Oh, not python.
16973
13:33:13,720 --> 13:33:19,720
Okay, so now I can go look at the gword.js, wherever that is, gword.js.
16974
13:33:19,720 --> 13:33:20,720
Yep.
16975
13:33:20,720 --> 13:33:25,720
And so this is basically, it normalized all the frequencies
16976
13:33:25,720 --> 13:33:27,720
and made it font size.
16977
13:33:27,720 --> 13:33:29,720
These are font sizes now.
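One way to turn counts into font sizes is a linear rescale between the lowest and highest count. This is a sketch under assumptions: the 20 and 80 pixel bounds and the sample counts are made up, and the real gword.py may scale differently.

```python
counts = {"sakai": 120, "build": 70, "error": 20}

# max and min give the range directly, without a hand-written loop.
highest = max(counts.values())
lowest = min(counts.values())

# Assumed font-size range in pixels; the real script may use other bounds.
smallsize, bigsize = 20, 80

sizes = {}
for word, count in counts.items():
    if highest == lowest:
        fraction = 1.0  # all words equally common; avoid dividing by zero
    else:
        fraction = (count - lowest) / (highest - lowest)
    sizes[word] = int(fraction * (bigsize - smallsize) + smallsize)
print(sizes)
```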
16978
13:33:29,720 --> 13:33:34,720
And so this is just the data that's needed by this gword.htm,
16979
13:33:34,720 --> 13:33:40,720
which uses this d3 visualization word cloud code.
16980
13:33:40,720 --> 13:33:44,720
So this pulls in all my data, and then this is just some JavaScript
16981
13:33:44,720 --> 13:33:49,720
that draws the picture on the page.
16982
13:33:49,720 --> 13:33:57,720
And so the easy part now is to just open gword.htm in a browser.
16983
13:33:57,720 --> 13:33:59,720
It just so happens on a Mac I can do this.
16984
13:33:59,720 --> 13:34:05,720
And so that gives me a word cloud based on that data.
16985
13:34:05,720 --> 13:34:07,720
It kind of randomizes it.
16986
13:34:07,720 --> 13:34:08,720
It shows different stuff.
16987
13:34:08,720 --> 13:34:17,720
But it's using this data to generate how big those things are,
16988
13:34:17,720 --> 13:34:21,720
and then using a bit of randomness and simulated annealing to lay it out.
16989
13:34:21,720 --> 13:34:24,720
That's not stuff that we actually have to worry about, okay?
16990
13:34:24,720 --> 13:34:30,720
So that's how we get to the point where we're seeing a word cloud from this.
16991
13:34:30,720 --> 13:34:33,720
Now we're going to do another visualization.
16992
13:34:33,720 --> 13:34:36,720
And this time we're going to do a line visualization.
16993
13:34:36,720 --> 13:34:39,720
And we're going to create a thing called gline.js
16994
13:34:39,720 --> 13:34:42,720
and produce, with another HTML file, we're going to use d3
16995
13:34:42,720 --> 13:34:45,720
and produce that output.
16996
13:34:45,720 --> 13:34:51,720
So let's say goodbye here, goodbye, goodbye, goodbye, goodbye.
16997
13:34:51,720 --> 13:34:57,720
So gline.py, get rid of that file.
16998
13:34:57,720 --> 13:35:02,720
So again, I'm going to preload all of the senders in this case.
16999
13:35:02,720 --> 13:35:04,720
And again, I could have done this with a join.
17000
13:35:04,720 --> 13:35:06,720
Probably should have done this with a join.
17001
13:35:06,720 --> 13:35:13,720
I'm going to preload all the messages, the sender ID, subject ID, etc.
17002
13:35:13,720 --> 13:35:15,720
I'll load those up.
17003
13:35:15,720 --> 13:35:18,720
And now I'm going to read through.
17004
13:35:18,720 --> 13:35:23,720
I'm going to have the sending organizations and the senders.
17005
13:35:23,720 --> 13:35:27,720
And I'm going to accumulate and split the senders.
17006
13:35:27,720 --> 13:35:30,720
And I'm going to have the sending organizations.
17007
13:35:30,720 --> 13:35:32,720
And then I'm going to do a simple dictionary
17008
13:35:32,720 --> 13:35:35,720
as I accumulate the sending organizations
17009
13:35:35,720 --> 13:35:38,720
by splitting the person's name at the at sign.
17010
13:35:38,720 --> 13:35:42,720
And then based on the organization, I accumulate it.
17011
13:35:42,720 --> 13:35:44,720
And then I sort them.
17012
13:35:44,720 --> 13:35:47,720
And I pull out the top ten organizations.
17013
13:35:47,720 --> 13:35:49,720
I print those out.
17014
13:35:49,720 --> 13:35:57,720
And now I'm going to produce, break this down into months.
17015
13:35:57,720 --> 13:36:00,720
And I'll show you what this looks like in a second.
17016
13:36:00,720 --> 13:36:03,720
Let's go to the gline.js.
17017
13:36:03,720 --> 13:36:06,720
So the month looks like this, okay?
17018
13:36:06,720 --> 13:36:07,720
So the month looks like that.
17019
13:36:07,720 --> 13:36:10,720
So that's the first seven characters of the date.
17020
13:36:10,720 --> 13:36:15,720
So if we look at the date, date looks like that.
17021
13:36:15,720 --> 13:36:18,720
The month is the first seven characters.
17022
13:36:18,720 --> 13:36:22,720
And this is the data that I've got to give it.
17023
13:36:22,720 --> 13:36:24,720
We'll clean that up in a second.
17024
13:36:24,720 --> 13:36:28,720
That data will look better in a moment.
17025
13:36:28,720 --> 13:36:30,720
Go back to gline.py.
17026
13:36:30,720 --> 13:36:34,720
And so this is...
17027
13:36:34,720 --> 13:36:37,720
We're doing a...
17028
13:36:37,720 --> 13:36:40,720
The key is a tuple, which is the month,
17029
13:36:40,720 --> 13:36:45,720
and which organization it is that did it.
17030
13:36:45,720 --> 13:36:47,720
And it's only in the top ten organizations.
17031
13:36:47,720 --> 13:36:53,720
And then we're going to do a...
17032
13:36:53,720 --> 13:36:56,720
We're going to basically do a dictionary
17033
13:36:56,720 --> 13:36:58,720
where the key is a tuple.
17034
13:36:58,720 --> 13:37:00,720
And then we're going to sort it.
17035
13:37:00,720 --> 13:37:04,720
Sort by key in this case, not by value.
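The tuple-keyed dictionary can be sketched like this. It assumes dates are stored as text that starts with year and month, such as "2006-01-15T10:00:00"; the messages themselves are invented.

```python
messages = [
    ("2006-01-15T10:00:00", "ann@example.edu"),
    ("2006-01-20T11:30:00", "bob@example.com"),
    ("2006-02-03T09:15:00", "cy@example.edu"),
    ("2006-01-28T16:45:00", "dee@example.edu"),
]

counts = {}
months = []
for sent_at, sender in messages:
    month = sent_at[:7]          # first seven characters, e.g. "2006-01"
    org = sender.split("@")[1]
    key = (month, org)           # the dictionary key is a tuple
    counts[key] = counts.get(key, 0) + 1
    if month not in months:
        months.append(month)

months.sort()
by_key = sorted(counts.items())  # tuples sort by month, then organization
print(by_key)
```

Sorting by key puts the rows in calendar order, which is the order a line chart needs along its x axis.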
17036
13:37:04,720 --> 13:37:05,720
That's...
17037
13:37:05,720 --> 13:37:07,720
And the months is going to sort that.
17038
13:37:07,720 --> 13:37:10,720
And then we're going to write all this data out
17039
13:37:10,720 --> 13:37:12,720
into gline.js.
17040
13:37:12,720 --> 13:37:14,720
So let's go ahead and run this.
17041
13:37:14,720 --> 13:37:17,720
And again, this is just the data that has to be written
17042
13:37:17,720 --> 13:37:21,720
in a way that the JavaScript can understand it.
17043
13:37:21,720 --> 13:37:28,720
python gline, python3 gline.py.
17044
13:37:28,720 --> 13:37:30,720
Okay, so top ten organizations.
17045
13:37:30,720 --> 13:37:32,720
So let's take a look at that JavaScript.
17046
13:37:32,720 --> 13:37:34,720
So this is what it looks like.
17047
13:37:34,720 --> 13:37:39,720
So it just so happens that you got to tell it the...
17048
13:37:39,720 --> 13:37:42,720
These are the data points, these are the lines.
17049
13:37:42,720 --> 13:37:45,720
So this is the year, the line for University of Michigan,
17050
13:37:45,720 --> 13:37:48,720
gmail.com, swinsburg.com.
17051
13:37:48,720 --> 13:37:51,720
So this first column is that line points
17052
13:37:51,720 --> 13:37:53,720
and the next line points.
17053
13:37:53,720 --> 13:37:57,720
So all this code was to get the data in such a way
17054
13:37:57,720 --> 13:38:00,720
that I could produce this JavaScript file.
17055
13:38:00,720 --> 13:38:04,720
Because if I look at gline.htm,
17056
13:38:04,720 --> 13:38:07,720
I need that data in that particular format.
17057
13:38:07,720 --> 13:38:10,720
And I've got all this stuff.
17058
13:38:10,720 --> 13:38:11,720
I make a line chart.
17059
13:38:11,720 --> 13:38:14,720
And I draw it with this data, that data.
17060
13:38:14,720 --> 13:38:16,720
I had to go read all the documentation
17061
13:38:16,720 --> 13:38:19,720
on how to figure this stuff out.
17062
13:38:19,720 --> 13:38:21,720
And that's the data that I'm going to use.
17063
13:38:21,720 --> 13:38:22,720
And I had to figure this out.
17064
13:38:22,720 --> 13:38:24,720
And I had to transform it and make it pretty.
17065
13:38:24,720 --> 13:38:26,720
It took me quite a while to get this to work.
17066
13:38:26,720 --> 13:38:28,720
And this is not a JavaScript class
17067
13:38:28,720 --> 13:38:31,720
nor a how to visualize in D3.
17068
13:38:31,720 --> 13:38:35,720
But basically, we pulled all that stuff in.
17069
13:38:35,720 --> 13:38:40,720
And here's the gline that came from the JavaScript.
17070
13:38:40,720 --> 13:38:42,720
And then it makes an array to data table.
17071
13:38:42,720 --> 13:38:45,720
And then that data table is what gline draws.
17072
13:38:45,720 --> 13:38:57,720
So with no further ado, let's open gline.htm to show that data.
17073
13:38:57,720 --> 13:38:58,720
So there you go.
17074
13:38:58,720 --> 13:39:02,720
That's the Sakai developer participation
17075
13:39:02,720 --> 13:39:09,720
from 2005 through 2015, based on which organizations did
17076
13:39:09,720 --> 13:39:12,720
the most commits in Sakai.
17077
13:39:12,720 --> 13:39:15,720
And so I know that I haven't done all this code full justice.
17078
13:39:15,720 --> 13:39:17,720
There's a lot of code here.
17079
13:39:17,720 --> 13:39:20,720
The fun is just to kind of run it and see it.
17080
13:39:20,720 --> 13:39:23,720
And then when the time comes to come back and see
17081
13:39:23,720 --> 13:39:26,720
the techniques that are used when you're trying to build
17082
13:39:26,720 --> 13:39:29,720
your own visualization pipeline.
17083
13:39:29,720 --> 13:39:33,720
So I hope that you found this useful.
17084
13:39:33,720 --> 13:39:35,720
You know, this is a lot of code.
17085
13:39:35,720 --> 13:39:37,720
Hard to explain in 15, 20 minutes.
17086
13:39:37,720 --> 13:39:40,720
But I hope you take some time and look it over.
17087
13:39:40,720 --> 13:39:43,720
And I hope you found all these videos.
17088
13:39:43,720 --> 13:39:45,720
This is kind of the last walk-through video
17089
13:39:45,720 --> 13:39:47,720
for chapter 16 of the book.
17090
13:39:47,720 --> 13:39:50,720
And so I hope that I will see you on the net.
17091
13:39:50,720 --> 13:39:57,720
Thank you.